

Generating Visualizations From RDF Graphs

Zhuo Ma

U4618536

Supervisors: Tom Gedeon, Armin Haller

COMP8715 Computing Project

Australian National University Semester 1, 2015

29 May, 2015


ACKNOWLEDGEMENT

I would like to express my deepest gratitude to my supervisors, Tom Gedeon and Armin Haller, for their enthusiastic and patient guidance. With their help and support on this research topic, I have gained much new knowledge.

I would also like to thank Dr Weifa Liang for teaching us technical writing skills. In addition, I greatly appreciate the help of PhD student Anila Sahar Butt, and the support of my family and friends.


ABSTRACT

The RDF language has become increasingly significant in research on the Semantic Web. To help users understand this area better, advanced methodologies and tools are required to visualize RDF data in a clear and intuitive way. In this project, we have designed a new method called “Concept-Matching” to visualise RDF graphs, in particular those that contain both schema and data. We processed data from the dbpedia database as an example implementation of this approach, and designed an algorithm to retrieve the required data. Moreover, we designed experiments to test the algorithm’s efficiency and worked on optimizing the algorithm. Based on the results and analysis, we conclude that the new approach can be implemented successfully with the core algorithm.

Keywords: RDF visualization, Semantic Web, Concept-Matching


CONTENTS

Acknowledgement
Abstract
1 Introduction
2 Background
3 Related Work
   3.1 Non-graph based RDF Visualization
   3.2 Graph based RDF Visualization
      3.2.1 WebVOWL: Web-based Visualization of Ontologies
      3.2.2 RDF Graphs with LodLive
      3.2.3 VisualRDF – Visual representation of RDF
      3.2.4 Discussion of Related Work
4 Methodology
   4.1 RDF data structure analysis
   4.2 Retrieve concept-match information
   4.3 Construct Mapping model
      4.3.1 Build Base Layer
      4.3.2 Build higher layers
      4.3.3 Large graph layout - Five point scale
   4.4 Algorithm and Time Complexity
5 Evaluation
   5.1 Experiment Environment and Steps
      5.1.1 Experiment 1
      5.1.2 Experiment 2
   5.2 Testing Environment
   5.3 Experiment Results and Analysis
   5.4 Optimization
6 Conclusion and Future Work
   6.1 Conclusion
   6.2 Future Work
Reference


CHAPTER 1

1 INTRODUCTION

In recent years, data has been gathered from daily life as a general way to represent existing information and knowledge, and is frequently analysed to assist in making future decisions. The analysis of the web of data has attracted both data researchers and users.

The RDF language, as a way to store the web of data, can be used in studies of Semantic Web development. However, because of the complex data structure of RDF, experts, let alone casual users, often have difficulty understanding the details of RDF data and employing the information it provides. From a human perspective, the best and friendliest way to recognize and analyse information provided by the Semantic Web is through visualization and exploration. The purpose of visualization is to transform big data into a visual representation that humans can easily understand and interact with.

RDF visualization plays a key role in helping users to better understand and interact with data. Many previous works visualize RDF data as a tree graph with linked nodes and edges, which will be discussed in more detail in the ‘Related Work’ chapter. While those approaches are well suited to small datasets, they produce very complex graphs that are hard for users to manage and understand when dealing with large datasets that may contain billions of triples. To avoid this issue as ever more data is presented to users, some visualization techniques choose to show only an overall representation of the whole data set. Although this has merits if users want to explore the high-level structure of a vast amount of information, it causes problems when users want to discover detailed data. For instance, if users want to know the label and relations of a resource, the overall representation of the data is not suitable.

To overcome these problems and provide a more intuitive, effective and user-oriented visualization of RDF data, we have developed a new visualization approach called “Concept-Matching”. This approach is combined with a “Compound-Fisheye view” to visualize RDF data as bubbles of different sizes around a main resource. In this paper, we present related work, describe the methodology in detail, and analyze the efficiency of the core algorithm of “Concept-Matching”. In addition, we designed two experiments to test the time complexity of our algorithm, and compared it with the time spent on retrieving real data.


CHAPTER 2

2 BACKGROUND

RDF stands for Resource Description Framework, a data model that uses metadata to store and describe data resources on the web [1]. It stores linked data as triples of class relations and uses URIs (Uniform Resource Identifiers) to identify each relationship and the two ends of its link; it can be, and has been, widely employed for different purposes by users of a variety of skill levels [2].

RDF is widely used to empower linked data in the development of the Semantic Web, an evolution of the World Wide Web proposed in Tim Berners-Lee’s article in 2001 [3]. In the humanities, the word “semantic” refers to the distinctions and similarities between the meanings of words. The term “Semantic Web”, therefore, refers to a web of meanings. The Semantic Web can be considered as a web of data, which provides a common framework for data integration and combination across various applications [4].

The reason for developing the Semantic Web is that different data is stored and controlled by different applications with little communication between them, which leaves the World Wide Web unable to provide precise information in response to users’ semantic requests. For example, consider searching the Internet or an application database with the semantic request “find another person who is called Obama”. The information retrieved will be based on the individual keywords and will not provide precise data.

This result is due to the fact that the search engine cannot understand the user’s request, because the different domains lack integration and consistency. Thus, to deal with more complex search terms, data across different domains needs to be shared and understood; for this reason, the Semantic Web is critical to the Internet revolution.

Figure 1: Evolution on the Web [5]


Figure 1 shows that the Semantic Web, also referred to as Web 3.0, is the evolution and extension of Web 2.0. It aims to link separate data on the Web through URIs (Uniform Resource Identifiers) so that data can be shared and reused, achieving better search results regardless of language [6].

The World Wide Web Consortium (W3C) has developed the Semantic Web Architecture, illustrated in Figure 2, to assist developers in developing the technology.

Figure 2: Semantic Web Architecture in layers [7]

The first layer, Unicode and URI, is used to standardize the languages used on the web and to identify the original web resources, and the higher layers build on the lower layers [8]. The core layer of the Semantic Web is the third one, RDF and RDFS (RDF Schema), which provides the standard format for representing metadata about web resources [8]. Across different websites, data can be merged if it exists in both, and other data can be linked together if it is relevant; eventually this grows into a huge data structure such as an ontology model that is syntactically based on RDF.

Because of such complex data structures in RDF, the visualization of RDF still has problems, as mentioned in the Introduction. The next chapter reviews related work done by others and discusses how our work differs.


CHAPTER 3

3 RELATED WORK

Research on visualizing linked data has become a popular area over the last few years. Numerous research works have considered better and more intuitive visualization of ontologies for end users, whether experts in the Semantic Web or casual users who are interested in this area [9][10][11]. Meanwhile, many visualization tools have been developed to visualize RDF data more effectively; these are generally classified into two groups, non-graph based and graph based approaches [12].

3.1 NON-GRAPH BASED RDF VISUALIZATION

Non-graph based methodology presents data in a logical sequence containing facets, categories and subject descriptions, as in the Haystack [13] and mSpace [14] platforms. In addition, the article ‘The Pathetic Fallacy of RDF’ [15] by Schraefel and Karger argues that these non-graph based approaches perform better than some traditional graph based tools such as RDF-gravity [16] and IsaViz [17], since the graph becomes massive as the size of the RDF data grows. For instance, Figure 3 shows the FOAF (Friend of a Friend) Vocabulary Specification [18] as a table of data.

Figure 3: FOAF Vocabulary specification data table on Protege


Although non-graph based representation may have more merits than graph based representation in some respects, we still believe the future lies in graph-based visualization, which gives users a more intuitive feel for the data. Moreover, graph-based methods give a clear representation of overall structure, interrelationships, patterns and trends [19]. Thus, the next section discusses the benefits and weaknesses of graph based visualization tools in more detail.

3.2 GRAPH BASED RDF VISUALIZATION

As this area has become a popular research topic in recent years, many tools have been developed for RDF visualization, such as WebVOWL [20], LodLive [21], and the JavaScript based VisualRDF [22]. The next few sections show how the FOAF ontology can be viewed with those three tools and explore whether each graph is the best representation.

3.2.1 WEBVOWL: WEB-BASED VISUALIZATION OF ONTOLOGIES

WebVOWL is a standalone application for the user-oriented visualization of ontologies, which is based on web technologies and the D3 visualization library [20][23].

It uses a force-directed algorithm to lay out the data graph and implements the Visual Notation for OWL Ontologies (VOWL), which defines a visual language for ontology visualization [24].

VOWL consists of graphical primitives and a color scheme that form its basic building blocks, which are shown in Figure 4 (a) and (b) [24].

Figure 4: Graphical primitives and Color scheme for VOWL [25]


Graphical Primitives: VOWL uses a set of symbols to depict ontology concepts. Circles represent classes and labeled arrows represent property relations between different resources. An ontology has only two types of property: datatype properties, whose values are usually literals, and object properties, whose values are URIs. Thus, the value of an object property is still drawn as a circle, whereas a datatype is depicted as a rectangle.

Color Scheme: Many studies show that the choice of color can make it easier for users to interact with different elements. For example, the color red is often used to attract attention and is therefore used to illustrate highlighted elements.

Based on the FOAF vocabulary specification, Figure 5 presents the graph visualized with the VOWL notation using a force-directed layout.

Figure 5: Friend of a Friend (FOAF) visualization in WebVOWL

Advantage

• It clearly shows the overall structure of the FOAF ontology.
• The basic details, including information about FOAF, metadata and the graph statistics, can be found on the side bar.
• When any element on the graph is clicked, the corresponding information, such as its name, type, domain and range, is shown under ‘Selection Details’.

Disadvantage

• The graph becomes very complex, and useful information becomes hard to find, as the data grows larger.


3.2.2 RDF GRAPHS WITH LODLIVE

LodLive can parse RDF resources that are stored in a SPARQL endpoint and generate user-oriented graphs by applying a suitable navigation model to the data [26]. The tool uses a JavaScript application layer, without any application server, to browse a SPARQL endpoint; it transforms the results from any configured endpoint into JSON so that they can be parsed by JavaScript and visualized in an HTML5 web page [26].

LodLive is comprised of 5 different components [26]:

• LodLive-core.js: jQuery plug-in
• LodLive-profile.js: JSON configuration map
• HTML5 page
• Few images – sprites
• Some other jQuery public plug-ins

LodLive operations

• First, choose a database and an endpoint, such as the FOAF class, to retrieve the URI and access the details of FOAF.

Figure 6: Single endpoint search panel

• After the endpoint request, JSONP is called to generate a central circle representing the main class, surrounded by many small circles representing its object properties.


Figure 7: Central class with surrounding object properties

• The object properties can be expanded when the user interacts with the small circles to display more data; each new resource is connected to the main class by an arrow representing the value of the given property.

Figure 8: Object Properties in FOAF expanded


Advantage

• Uses a dynamic visual graph to traverse RDF data through user interaction.
• Relations in the linked data are discovered step by step.
• Each resource has a corresponding description that shows its type and comments.
• Inverse relations between resources are shown with arrows going back and forth.

Disadvantage

• Does not visualize the whole FOAF ontology.
• Hard for casual users to understand, since each circle shows a URI rather than a label.
• The graph becomes complex and hard to visualize as more object properties are expanded.

3.2.3 VISUALRDF – VISUAL REPRESENTATION OF RDF

VisualRDF was developed by Alangrafu in 2014 [27]; it uses the D3 JavaScript library [28] for data visualization and ARC2 [29] for parsing RDF.

Operations

• This tool is easy to operate: it only requires users to type a URI

Figure 9: Single URI access panel

• The overall graph of data about FOAF will be generated automatically


Figure 10: Overall FOAF visualization graph

• There is also a function panel provided to help users interact better with the graph

Figure 11: Function panel

• The details of each node are displayed when the mouse is moved over it

Figure 12: Node details graph


Advantage:

• Easy for users to operate.
• Easy to display the basic structure of the linked data model automatically.

Disadvantage:

• Many intersecting lines.
• The relations between classes are vague.
• The graph becomes disordered when dealing with large datasets.

3.2.4 DISCUSSION OF RELATED WORK

Our investigation of related work shows that most visualization tools focus on visualizing the whole ontology, and only a few provide a comprehensive and specific visualization model. All of these tools try to present classes and properties in a clear way.

However, there are also some major deficiencies:

• The graph becomes hard to read as the data grows larger.
• Redundant properties are shown by arrows.
• No clear visualization of individuals.

Most tools have implemented the Visual Information-Seeking Mantra, “overview first, zoom and filter, then details on demand” [30]: they provide users with an overview of the whole ontology and then allow users to explore each node accordingly.

To address these deficiencies, we developed the Concept-Matching approach, which visualizes data as bubbles of different sizes centered around a main class, while the instances are shown by subsequently exploring a bubble in depth. Moreover, to keep the visualization user-oriented, we also apply “Compound-Fisheye Views” [31] on the tree map to visualize large graphs when the RDF graph to be visualized contains a large number of triples. Another important point is that our approach mainly targets users who have little knowledge of ontologies and RDF, whereas most tools are developed for experts.


CHAPTER 4

4 METHODOLOGY

The purpose of our project is to overcome the shortcomings of the tools mentioned in the previous chapter while also finding a new approach to visualizing RDF data. To achieve this, we carried out extensive research, especially on RDF data structure analysis and data visualization approaches. Finally, we came up with the idea of using bubbles to represent different types of data and using the Concept-Matching method to restrict the size and content of the bubbles.

In this chapter, we use the endpoint “Canberra” from the dbpedia database as an example resource to explore our approach. In the first subsection, we analyse the basic RDF data structure, and in the following sections we describe the process of building the visualization model. Figure 13 presents the high-level structure of the methodology in this project.

Figure 13: Structure of the Concept-Matching methodology

4.1 RDF DATA STRUCTURE ANALYSIS

This part explores the basic RDF statement and the RDF model, including objects as both literals and resources, and illustrates how a SPARQL query can be used to find the necessary information.


The RDF Statement – Triples

RDF stores data as triples of Subject, Property and Object; RDF/XML is one syntax for writing these triples down.

For example, the simple sentence “The author of http://www.w3schools.com is Jan Egil Refsnes” has the following triple relation:

Subject (Resource)                     http://www.w3schools.com
Property (Predicate)                   Author
Object (either literal or resource)    “Jan Egil Refsnes”

Table 1: Subject-Property-Object

We adopt another example generated from the resource Canberra in the dbpedia database (http://dbpedia.org/page/Canberra), shown below as some simple RDF statements:

Figure 14: Simple RDF Statement for Canberra


Interpreting these RDF statements

Subject: http://dbpedia.org/page/Canberra

Property:

• dbpedia-owl:country
• dbpedia-owl:populationTotal
• dbpedia-owl:wikiPageID
• dbpedia-owl:date

Object:

• http://dbpedia.org/resource/Australia
• 381488
• 51983
• September 2011

RDF Namespace URIs

Line 4, “xmlns:rdf=http://www.w3.org/1999/02/22-rdf-syntax-ns#”, declares the standard W3C RDF namespace, which indicates that the enclosing document, tagged by “rdf:RDF”, is an RDF document. Moreover, the namespace declaration “xmlns:dbpedia-owl” binds the prefix dbpedia-owl to “http://dbpedia.org/ontology/”, so elements with that prefix belong to this namespace.
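To make the role of these namespace declarations concrete, the following is a minimal Python sketch of prefix expansion. The PREFIXES table and the expand_prefixed_name helper are illustrative names introduced here, not part of any RDF library or of the project's program; only the two namespace URIs are taken from the text above.

```python
# Illustrative prefix table; the two URIs are the namespaces quoted in the text.
PREFIXES = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "dbpedia-owl": "http://dbpedia.org/ontology/",
}


def expand_prefixed_name(prefixed_name: str) -> str:
    """Turn a prefixed name such as 'dbpedia-owl:country' into a full URI."""
    prefix, _, local_name = prefixed_name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local_name


if __name__ == "__main__":
    for name in ("dbpedia-owl:country", "dbpedia-owl:populationTotal"):
        print(name, "->", expand_prefixed_name(name))
```

A real RDF parser resolves prefixes in the same spirit, although it also handles default namespaces and relative URIs, which this sketch ignores.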

RDF Model

The set of statements inside an RDF document can be viewed as a directed labeled graph, since the data is stored as triples. The resources, both subjects and objects, are represented by nodes, and all properties are represented by edges. Thus, the above RDF can be illustrated as in Figure 15.

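The prefixed property names refer to the following full URIs:

• dbpedia-owl:country refers to http://dbpedia.org/ontology/country
• dbpedia-owl:populationTotal refers to http://dbpedia.org/ontology/populationTotal
• dbpedia-owl:wikiPageID refers to http://dbpedia.org/ontology/wikiPageID
• dbpedia-owl:date refers to http://dbpedia.org/ontology/date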


Figure 15: Simple Canberra RDF model

We can see that the graph becomes very difficult to read when it is drawn with fully qualified URIs, so we adopt namespace prefixes in the labels of each node and edge to keep the visualization simple and clear.
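As a rough illustration of the directed labeled graph described above, the sketch below stores the four Canberra statements as an adjacency list in Python. The short node labels “Canberra” and “Australia” and the build_graph helper are assumptions made for this example; the report's own data structure is not shown in the text.

```python
from collections import defaultdict

# The four statements about Canberra, written with prefixed edge labels.
# Node labels are shortened by hand for readability (an assumption of this sketch).
TRIPLES = [
    ("Canberra", "dbpedia-owl:country", "Australia"),
    ("Canberra", "dbpedia-owl:populationTotal", "381488"),
    ("Canberra", "dbpedia-owl:wikiPageID", "51983"),
    ("Canberra", "dbpedia-owl:date", "September 2011"),
]


def build_graph(triples):
    """Map each subject node to a list of (edge label, object node) pairs."""
    graph = defaultdict(list)
    for subject, prop, obj in triples:
        graph[subject].append((prop, obj))
    return dict(graph)


if __name__ == "__main__":
    for edge_label, target in build_graph(TRIPLES)["Canberra"]:
        print(f"Canberra --{edge_label}--> {target}")
```

Each subject becomes a node with outgoing labeled edges, which is the structure that Figure 15 draws.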

SPARQL query

SPARQL (SPARQL Protocol and RDF Query Language) is the W3C-recommended query language for RDF [32]. Like SQL, it uses a SELECT clause to choose which variables are returned and a WHERE clause to specify the patterns that must match in the queried data set.

For example, we can use the following query to return every person’s name in the FOAF database.

Figure 16: Name Return Query on FOAF database


This query searches all the triples in the FOAF database and returns each person’s name. The clause SELECT ?name requests the values of the variable ?name from the matches found by the WHERE clause. The statements inside WHERE are themselves triple patterns: “?person foaf:name ?name” matches every person that has a name, and “?person a foaf:Person” restricts ?person to resources of type foaf:Person, where “a” is shorthand for the type predicate.
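The way the two triple patterns combine can be imitated in a few lines of Python. This is only a sketch of the matching logic over a hand-made list of triples; the sample data and the select_person_names function are hypothetical, and the query quoted in the docstring is reconstructed from the description above, so it may differ from the one in Figure 16.

```python
RDF_TYPE = "rdf:type"  # the predicate that "a" abbreviates in SPARQL

# Hypothetical sample data, not taken from any real FOAF data set.
DATA = [
    ("ex:alice", RDF_TYPE, "foaf:Person"),
    ("ex:alice", "foaf:name", "Alice"),
    ("ex:bob", RDF_TYPE, "foaf:Person"),
    ("ex:bob", "foaf:name", "Bob"),
    ("ex:acme", RDF_TYPE, "foaf:Organization"),
    ("ex:acme", "foaf:name", "Acme"),
]


def select_person_names(triples):
    """Mimic: SELECT ?name WHERE { ?person a foaf:Person . ?person foaf:name ?name }"""
    # Pattern 1: bind ?person to every subject typed as foaf:Person.
    persons = {s for s, p, o in triples if p == RDF_TYPE and o == "foaf:Person"}
    # Pattern 2: keep the foaf:name values whose subject is one of those persons.
    return [o for s, p, o in triples if p == "foaf:name" and s in persons]


if __name__ == "__main__":
    print(select_person_names(DATA))
```

The organization has a name but is not typed as a person, so only the first pattern excludes it; both patterns must match the same ?person for a name to be returned.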

4.2 RETRIEVE CONCEPT- MATCH INFORMATION

In our Concept-Matching visualization approach, we show only the important concepts related to the resource as bubbles around the central class, and we use the number of instances that a concept has to decide the size of its bubble. Thus, to retrieve the concepts most relevant to the resource, we need to retrieve the instance count for each concept and the relations between each concept and its sub concepts.

We choose the endpoint ‘Canberra’ as the resource from the dbpedia database and retrieve its instance counts and concept relations. To accomplish this, we use the Virtuoso SPARQL Query Editor [33] to query the dbpedia database.

To get the properties attached to each type, together with their instance counts, and export them to the file “InstanceCountPerType.csv”, we wrote the SPARQL query shown in Figure 17:

Figure 17: SPARQL query for Instance Count Per Type of Canberra


To retrieve the type and subtype relations among the concepts/types related to Canberra in dbpedia, we wrote a SPARQL query whose results are exported to the file “Concept-Subconcept.csv”, shown as follows:

Figure 18: SPARQL query for concept relations of Canberra

4.3 CONSTRUCT MAPPING MODEL

In this section, we describe a program that scans both the “InstanceCountPerType.csv” dataset (a set of triples) and the “Concept-Subconcept.csv” dataset (a set of concept relations) to find the concepts that are most relevant to the resource “Canberra”, from which the different layers of the “Canberra” visualization are drawn.

To give a better understanding of what the most relevant concept to the resource “Canberra” is, we use a simple example:

Figure 19: Canberra-ANU demo


As the graph above shows, “Canberra” has the school “ANU”, and “ANU” has the concept “University”, which is itself a sub concept of “Organization”. Thus, the most relevant concept to “Canberra” is “University”, the more specific of the two.

The next section explains how to retrieve the most relevant concepts from the two datasets of “Canberra”. The data in the two datasets looks like graph 1 and graph 2 in Appendix A. We separate this process into two stages.

4.3.1 BUILD BASE LAYER

Constructing the base layer requires recursive iteration through the dataset. The overall process model is shown in the following figure:

Figure 20: Process model of building base layer


Process 1 – Filter process

When we analyzed the set of triples, we found many concepts whose URIs come not only from the dbpedia database but also from other sources. Since we deal with dbpedia resources, we filtered out all concepts whose URI does not start with “http://dbpedia.org”. After this process, we have a new dataset that contains only concepts whose URI starts with “http://dbpedia.org”.
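The filter process amounts to a prefix test on each concept URI. The sketch below assumes that each row has the layout (concept URI, property, instance count); that layout, the filter_dbpedia function and the two non-dbpedia sample rows are assumptions for illustration, not data from the project.

```python
DBPEDIA_PREFIX = "http://dbpedia.org"

# Sample rows in an assumed (concept URI, property, instance count) layout.
# The non-dbpedia rows are invented to show what gets removed.
SAMPLE_ROWS = [
    ("http://dbpedia.org/ontology/Person", "birthPlace", 186),
    ("http://schema.org/Person", "birthPlace", 186),
    ("http://xmlns.com/foaf/0.1/Person", "birthPlace", 186),
    ("http://dbpedia.org/ontology/Agent", "birthPlace", 186),
]


def filter_dbpedia(rows):
    """Keep only the rows whose concept URI starts with the dbpedia prefix."""
    return [row for row in rows if row[0].startswith(DBPEDIA_PREFIX)]


if __name__ == "__main__":
    for concept, prop, count in filter_dbpedia(SAMPLE_ROWS):
        print(concept, prop, count)
```

Only the two dbpedia rows survive the filter, and their relative order is preserved for the ranking process that follows.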

Process 2 – Ranking process (deals with the “InstanceCountPerType.csv” dataset)

• Situation 1: ranking by the number of instances

Firstly, we rank the rows by number of instances, from largest to smallest, as shown in Figure 21:

Figure 21: First 10 lines of data in “InstanceCountPerType.csv” dataset

If two concepts have the same number of instances and the same property, those concepts are set aside until their relation has been checked. For instance, the concepts “Agent” and “Person” have the same number of instances, 186, and the same property, “birthPlace”. Thus, the relation between these two concepts is checked, and the program returns the concept that is the sub concept of the other.

• Situation 2: ranking by property

If several concepts have the same number of instances but different properties, as shown in Figure 22, we rank them by property.

Figure 22: Concepts with instance count 92


Thus, two pairs of concepts (Pair 1: “PhysicalEntity” and “CausalAgent”; Pair 2: “Person” and “Agent”) need to have their relations checked, since the concepts in each pair have the same number of instances and the same property.

• Situation 3: more than two concepts with the same number of instances and the same property

If more than two concepts have the same number of instances and the same property, every pair of those concepts needs to have its relation checked.

Figure 23: Concepts with instance count 145

Figure 23 shows that many concepts have the same instance count and property, so each concept needs to be compared with the others. Finally, the program returns any one concept that is a sub concept but not a super concept. For instance, given A->B, C->D, E->F and B->C, where -> stands for “is a sub concept of”, it returns either A or E.
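The rule “a sub concept that is not a super concept” can be expressed as a set difference. The following Python sketch reproduces the A to F example; the function name most_specific_candidates is an assumption, and the sketch returns all candidates in sorted order, whereas the program described above returns any one of them.

```python
def most_specific_candidates(sub_concept_pairs):
    """Return every concept that appears as a sub concept but never as a super concept.

    Each pair is (sub, super), read as "sub is a sub concept of super".
    """
    subs = {sub for sub, _ in sub_concept_pairs}
    supers = {sup for _, sup in sub_concept_pairs}
    return sorted(subs - supers)


if __name__ == "__main__":
    pairs = [("A", "B"), ("C", "D"), ("E", "F"), ("B", "C")]
    print(most_specific_candidates(pairs))  # prints ['A', 'E']
```

The same function covers the two-concept case: for the single pair (“Person”, “Agent”) it yields only “Person”.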

Process 3 – Check concept relations: return the most relevant concept

For each pair of concepts, we scan the concept-relations dataset to check their relation. For instance, in “Situation 1” above, the concepts “Person” and “Agent”, which have the same number of instances and the same property, are waiting to have their relation checked. By searching the concept relations, the program finds that “http://dbpedia.org/ontology/Person” is a sub concept of “http://dbpedia.org/ontology/Agent”, and therefore returns the concept “http://dbpedia.org/ontology/Person”.

Base Layer demo

After applying the steps described above, we obtain a list of triples (concept, property, number of instances). We add the instance counts together when the concepts are the same, and keep a record of their properties. For example, the property “birthPlace” has 186 instances for the concept “Person” and the property “deathPlace” has 124 instances for the concept “Person”. In this way, we can calculate that the concept “Person” has the largest number of instances, so it is shown by the largest bubble, and we still keep a record of its relations.
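The aggregation step can be sketched as a single pass over the (concept, property, count) triples. The aggregate function below is an illustrative assumption about how such a pass might look; the two input rows are the “Person” figures quoted above, so the resulting total of 310 covers only those two properties and is not the full total for “Person” in the dataset.

```python
from collections import defaultdict


def aggregate(rows):
    """Sum instance counts per concept and remember which properties contributed."""
    totals = defaultdict(int)
    properties = defaultdict(list)
    for concept, prop, count in rows:
        totals[concept] += count
        properties[concept].append(prop)
    return dict(totals), dict(properties)


if __name__ == "__main__":
    rows = [
        ("Person", "birthPlace", 186),
        ("Person", "deathPlace", 124),
    ]
    totals, properties = aggregate(rows)
    print(totals["Person"], properties["Person"])
```

Keeping the property list next to the total is what later allows the property arrows to be drawn once a bubble can no longer be expanded.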

The demo of the base layer will look like:


Figure 24: Base layer demo

The filtered concepts are those most relevant and important to “Canberra”, arranged by how many instances they have. The black dots between the concept labels “Organization” and “Dom” indicate that many bubbles are omitted in this demo. At the base layer, if a bubble can still be expanded, we show neither its property arrows nor its instances; these are shown only when the bubble cannot be expanded any further. This is discussed further in the next section.

4.3.2 BUILD HIGHER LAYERS

As mentioned above, if a bubble around “Canberra” in the base layer can be expanded further, it has to go through the process of building its higher layers. To illustrate this, we expand the concept “Person” in the following. When a user clicks the “Person” bubble, its sub concepts are shown as bubbles of various sizes around it. We built a recursive method to expand the higher layers, as described below.

Step 1 – find sub concepts

We use the program to search the concept relations for all the concepts that are sub concepts of “Person”, which are shown in Table 2 below.

SubConcept                                        Concept
http://dbpedia.org/ontology/Artist                http://dbpedia.org/ontology/Person
http://dbpedia.org/ontology/Athlete
http://dbpedia.org/ontology/OfficeHolder
http://dbpedia.org/ontology/Economist
http://dbpedia.org/ontology/MilitaryPerson
http://dbpedia.org/ontology/Politician
http://dbpedia.org/ontology/BeautyQueen
http://dbpedia.org/ontology/SportsManager
http://dbpedia.org/ontology/Model
http://dbpedia.org/ontology/Scientist
http://dbpedia.org/ontology/Cleric
http://dbpedia.org/ontology/Philosopher
http://dbpedia.org/ontology/OrganisationMember
http://dbpedia.org/ontology/Chef
http://dbpedia.org/ontology/Architect

Table 2: SubConcepts of Concept “Person”

Step 2 – find number of instances

Next, we read the dataset “InstanceCountPerType.csv” again to find how many instances those sub concepts have, in order to decide the size of the bubbles around the source “Person”; at the same time, the properties and relations are recorded for the visualization of instances. We then run the program to get the triple relations shown in Appendix B. The total number of instances for each concept is:

Concept/Type                                      Total No of Instance
http://dbpedia.org/ontology/Artist                47
http://dbpedia.org/ontology/Athlete               316
http://dbpedia.org/ontology/OfficeHolder          50
http://dbpedia.org/ontology/Economist             3
http://dbpedia.org/ontology/MilitaryPerson        70
http://dbpedia.org/ontology/Politician            69
http://dbpedia.org/ontology/BeautyQueen           3
http://dbpedia.org/ontology/SportsManager         4
http://dbpedia.org/ontology/Model                 2
http://dbpedia.org/ontology/Scientist             53
http://dbpedia.org/ontology/Cleric                3
http://dbpedia.org/ontology/Philosopher           1
http://dbpedia.org/ontology/OrganisationMember    1
http://dbpedia.org/ontology/Chef                  1
http://dbpedia.org/ontology/Architect             4

Table 3: Total number of instances for sub concepts of concept “Person”
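The text states that the instance counts decide the bubble sizes but does not give the exact mapping. As one possible illustration, the Python sketch below scales the bubble area with the count (radius proportional to the square root of the count). The scaling rule, the function name `bubble_radius` and the radius limits are assumptions, not the project's actual layout code.

```python
import math

def bubble_radius(count, max_count, max_radius=60.0, min_radius=8.0):
    """Scale a bubble so that its area grows with the instance count.

    Assumed rule: radius ~ sqrt(count), clamped to a minimum so that
    concepts with a single instance remain clickable.
    """
    if max_count <= 0:
        return min_radius
    radius = max_radius * math.sqrt(count / max_count)
    return max(min_radius, radius)

# Four rows taken from Table 3.
counts = {"Athlete": 316, "MilitaryPerson": 70, "Artist": 47, "Chef": 1}
largest = max(counts.values())
radii = {name: bubble_radius(n, largest) for name, n in counts.items()}
```

With this rule “Athlete” receives the largest bubble and “Chef” falls back to the minimum radius.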

Step 3 – visualize “Person”

After retrieving those data, the higher layer of the concept “Person” is generated based on the number of instances, which looks like:


Figure 25: Higher layer of “Person” demo

Recursive step

After reaching the second level of the source “Canberra”, we check whether those sub concepts of “Person” can be expanded further. If any concept represented by a bubble can be expanded to the next level, the above process is repeated to determine which concepts are involved, and pointers are used to record the properties and instances.

The instances and properties are shown once a concept has no more sub concepts.
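The recursive step can be sketched in Python as a depth-first walk over a parent-to-children map. This is only an illustration under the assumption that the hierarchy is acyclic; the function name `expand` and the sample map are hypothetical, and only the chain Person, Artist, Writer is taken from the text.

```python
def expand(concept, children, depth=0, out=None):
    """List every concept reachable by expansion, with its layer depth."""
    if out is None:
        out = []
    out.append((depth, concept))
    # A concept without children is a leaf: this is where the instances
    # and properties would be drawn instead of further bubbles.
    for child in children.get(concept, []):
        expand(child, children, depth + 1, out)
    return out

# Hypothetical fragment of the hierarchy.
children = {"Person": ["Artist", "Athlete"], "Artist": ["Writer"]}
layers = expand("Person", children)
```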

Figure 26: Higher level of concept “Artist”


For instance, Figure 26 illustrates the next level of the concept “Artist”, where we found that the concept “Writer” does not have further sub concepts. Thus, when a user clicks on “Writer”, no more sub concepts are shown around the “Writer” bubble; instead, the properties and instances that have the concept “Writer” are shown by arrows and rectangles. The instances are retrieved from the endpoint “Canberra” in the dbpedia database. The graph looks like:

Figure 27: The instances with concept “Writer”

The “Place of Death” relation means that Bryce Courtenay, who was a Writer, died in Canberra. We use an asterisk to represent that more than one instance connects with “Canberra”, and show the instance directly if only one instance exists. This method is briefly explained in [12]. Since we use Compound-Fisheye Views [31], the other bubbles become far smaller than the one the user is focusing on.

4.3.3 LARGE GRAPH LAYOUT - FIVE POINT SCALE

When the data become large, for example with a vast number of concepts, we use a simple ranking method to restrict the graph layout.

1) According to the first letter of each concept’s label, we separate the concepts into five different bubbles, as in the following figure:


Figure 28: Base layer by character

2) When a user clicks the bubble with the label “A-E”, the labels of the concepts that start with A to E are shown respectively.

Figure 29: “A-E” graph expanded

3) If there are still many concepts in the bubble “A” (such as 20), we compare the concepts’ second letters and separate them into another five bubbles. Figure 30 shows the result when a user clicks the bubble “A”:

Figure 30: “A” graph expanded


This graph layout method, combined with the “Compound-Fisheye View” technique, should work properly for visualizing large datasets.
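The three steps of the five-point scale can be sketched in Python as a bucketing function applied to one letter position at a time. The text names only the “A-E” bubble, so the other four letter ranges below are an assumption, as are the names `bucket_for` and `group_labels`.

```python
from collections import defaultdict

# Assumed letter ranges; only "A-E" is named in the text.
RANGES = ["A-E", "F-J", "K-O", "P-T", "U-Z"]

def bucket_for(label, position=0):
    """Pick one of five buckets from the letter at `position` of the label."""
    letter = label[position].upper() if len(label) > position else "A"
    index = (ord(letter) - ord("A")) // 5
    return RANGES[min(max(index, 0), 4)]   # clamp 'Z' and non-letters

def group_labels(labels, position=0):
    """Group labels into buckets; position=1 repeats the split on the 2nd letter."""
    groups = defaultdict(list)
    for label in sorted(labels):
        groups[bucket_for(label, position)].append(label)
    return dict(groups)

labels = ["Artist", "Athlete", "Chef", "Model", "Scientist", "Writer"]
groups = group_labels(labels)
```

Calling `group_labels` again on a crowded bucket with `position=1` corresponds to step 3, where the second letter is compared.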

4.4 ALGORITHM AND TIME COMPLEXITY

Since we are dealing with a huge dataset in RDF, an efficient algorithm has to be designed in order to reduce the time spent finding the data relations. To implement the approach illustrated above on the real RDF dataset “Canberra”, we first considered a list structure inside a hash map. We used the hash key to record the number of instances and lists of concepts as the hash values, and then traversed the “Concept-Subconcept” dataset for each list to find relations. However, when run on the real data, this took a very long time to produce the result. Therefore, we redesigned a completely different algorithm, which is explained in detail in the next part, “Algorithm”.

Algorithm

We tried different ways to reduce the time complexity. Finally, by comparing the efficiency of different algorithms, we designed an algorithm that has the following steps:

• Construct a directed graph from the “Concept-Subconcept” relations (the concept-relation graph).

• Sort the instance data with quick sort, chosen with both time and space complexity in mind.

• Use a Breadth-first search (BFS) algorithm to search the graph for the required concept.

The pseudo code is shown below:

Pseudo-code for ranking instance data

RANKING(Relation_DATA, Instance_DATA, p, r)
    if p < r
        then q ← PARTITION(Relation_DATA, Instance_DATA, p, r)
             RANKING(Relation_DATA, Instance_DATA, p, q-1)
             RANKING(Relation_DATA, Instance_DATA, q+1, r)

RANKING() is a modified quick sort that ranks the given data in a set of triples.

We modified the partition-exchange step of quick sort as follows:

Pseudo-code for partition-exchange instance data

PARTITION(Relation_DATA, Instance_DATA, p, r)
    x ← Instance_DATA[r]
    i ← p-1
    for j ← p to r-1
        do remove ← 0
           if ISGREATER(Relation_DATA, Instance_DATA[j], x, remove)
               then i ← i+1
                    exchange Instance_DATA[i] <-> Instance_DATA[j]
           if remove = 2
               then removeList ← removeList ∪ {Instance_DATA[j]}
               else if remove = 1
                   then removeList ← removeList ∪ {x}
    remove ← 0
    ISGREATER(Relation_DATA, x, Instance_DATA[i+1], remove)
    exchange Instance_DATA[i+1] <-> Instance_DATA[r]
    if remove = 2
        then removeList ← removeList ∪ {Instance_DATA[i+1]}
    return i+1
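A runnable Python sketch of the ranking routine is given below. It follows the same partition-exchange scheme with the last element as pivot, but it is a simplification: the `removeList` bookkeeping is omitted and the comparison is passed in as a function argument (`is_greater`, a hypothetical parameter name) so that the sketch stays self-contained.

```python
def partition(items, lo, hi, is_greater):
    """Partition around items[hi]: higher-ranked items move to the front."""
    pivot = items[hi]
    i = lo - 1
    for j in range(lo, hi):
        if is_greater(items[j], pivot):
            i += 1
            items[i], items[j] = items[j], items[i]
    items[i + 1], items[hi] = items[hi], items[i + 1]
    return i + 1

def ranking(items, lo, hi, is_greater):
    """In-place quicksort producing descending order under `is_greater`."""
    if lo < hi:
        q = partition(items, lo, hi, is_greater)
        ranking(items, lo, q - 1, is_greater)
        ranking(items, q + 1, hi, is_greater)

# Instance totals of six concepts from Table 3.
counts = [47, 316, 50, 3, 70, 69]
ranking(counts, 0, len(counts) - 1, lambda a, b: a > b)
```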

The comparison works in three stages. We first compare the number of instances. If the numbers of instances are the same, we compare the property. Only if the properties are also the same do we use BFS (Breadth-first search) to search the concept-relation graph.

• Firstly, compare the number of instances: return true if data1.no > data2.no, return false if it is less, and move to the next step if they are equal.

• Secondly, compare the property when the number of instances is the same: return false if the properties differ, and go to the next step if they are the same.

• Lastly, compare the concepts’ relation if they have the same property and the same number of instances. BFS is called in this step.

Pseudo-code for instance and property comparison

ISGREATER(Relation_DATA, data1, data2, remove)
    if data1.NoInstance > data2.NoInstance                  ## compare No. of instances
        then return true
    else if data1.NoInstance < data2.NoInstance
        then return false
    else if data1.Property != data2.Property                ## compare property if No. of instances are same
        then return false
    else if BFS(Relation_DATA, data1.Concp, data2.Concp)    ## compare concept if properties are same
        then remove ← 2
             return true
    else return false
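The three-stage comparison can be sketched in Python as follows. The sketch leaves out the `remove` flag, and the concept-relation test is injected as a function (`is_subconcept`, a hypothetical parameter standing in for the BFS call). The `Row` type and the sample rows are illustrative only.

```python
from collections import namedtuple

Row = namedtuple("Row", ["concept", "prop", "count"])

def is_greater(a, b, is_subconcept):
    """Return True when row `a` should be ranked before row `b`."""
    if a.count != b.count:
        return a.count > b.count                  # stage 1: instance count
    if a.prop != b.prop:
        return False                              # stage 2: properties differ
    return is_subconcept(a.concept, b.concept)    # stage 3: concept relation

# Hypothetical relation test standing in for the BFS over the concept graph.
known = {("Writer", "Artist")}

def related(sub, sup):
    return (sub, sup) in known

a = Row("Writer", "deathPlace", 5)
b = Row("Artist", "deathPlace", 5)
c = Row("Artist", "birthPlace", 11)
```

The expensive graph search is reached only when both the count and the property tie, which is why the order of the three stages matters for the running time.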

We modify the Breadth-first search algorithm to search the concept graph in order to find the required concept.

Pseudo-code for searching concept-relation graph

BFS(Relation_DATA, data1, data2)
    for each vertex u ∈ Relation_DATA[G] - {data1}
        do color[u] ← WHITE
    color[data1] ← GRAY
    Q ← ∅
    ENQUEUE(Q, data1)
    while Q != ∅
        do u ← DEQUEUE(Q)
           for each v ∈ Adj[u]
               do if color[v] = WHITE
                      then color[v] ← GRAY
                           if v = data2
                               then return true
                           ENQUEUE(Q, v)
           color[u] ← BLACK
    return false
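The same search can be written as a short Python sketch. A `visited` set plays the role of the vertex colours, and the adjacency map is assumed to point from each sub concept to its concepts. The function name `is_reachable` is hypothetical; the sample edges are those of the example graph in Figure 31.

```python
from collections import deque

def is_reachable(adj, start, target):
    """Breadth-first search from `start`; True if `target` is reached."""
    visited = {start}
    queue = deque([start])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v == target:
                return True
            if v not in visited:      # same role as the WHITE colour test
                visited.add(v)
                queue.append(v)
    return False

# Edges point from sub concept to concept (A->B, B->C, B->D, ...).
adj = {"A": ["B"], "B": ["C", "D"], "C": ["F", "E"],
       "D": ["F"], "E": ["G"], "F": ["G"]}
```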

Complexity Analysis

The first step, building a directed graph from the dataset of Concept-Subconcept relations, costs O(n), where n is the number of lines, since we read this dataset line by line. For each line, an edge is created between two nodes; a node is added before adding the edge if the vertex does not yet exist.
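As an illustration of this single pass, the Python sketch below builds an adjacency map from two-column rows. The file layout (one `subconcept,concept` pair per line, no header) and the function name `build_graph` are assumptions, since the exact format of the dataset is not described here.

```python
import csv
import io

def build_graph(lines):
    """Read 'subconcept,concept' rows into an adjacency map in one pass."""
    adj = {}
    for sub, concept in csv.reader(lines):
        adj.setdefault(sub, []).append(concept)   # edge sub -> concept
        adj.setdefault(concept, [])               # add the vertex if missing
    return adj

# Hypothetical four-line sample in the assumed layout.
sample = io.StringIO("A,B\nB,C\nB,D\nD,F\n")
graph = build_graph(sample)
```

Each line causes a constant number of dictionary operations, which gives the O(n) bound on average.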

For ranking the items “number of instance” and “property” in the dataset InstanceCountPerType, we chose Quicksort after considering both time complexity and space complexity. Quicksort costs O(n log n) on average and O(n^2) in the worst case. Merge sort and Heapsort guarantee O(n log n) time even in the worst case, but Merge sort needs O(n) auxiliary space for its merge buffer in addition to an O(log n) recursion stack. Quicksort sorts in place and needs only the recursion stack, which is O(log n) on average; it can be kept at O(log n) in the worst case by always recursing into the smaller partition first, whereas a naive implementation can reach O(n). Heapsort also sorts in place with O(1) auxiliary space, so the space argument applies mainly against Merge sort. Thus, compared with Merge sort, the use of Quicksort saves a considerable amount of space, especially when dealing with a large dataset.

In each pass of Quicksort, we also need to call Breadth-first search to find the concepts’ relation whenever a concept comparison is needed. The BFS algorithm has a time complexity of O(|V| + |E|) in the worst case, where V is the set of vertices and E is the set of edges. In the “Concept-Subconcept” dataset, V is the set of concepts and E is the set of concept relations.

Therefore, the total time complexity of the algorithm that retrieves the required data is O((|V| + |E|) * n log n).


CHAPTER 5

5 EVALUATION

When we used the real data to test this visualization approach, we found that algorithm efficiency was the most difficult problem to overcome when dealing with a large dataset. To test the usability of the algorithm described above, we designed two controlled experiments. The next few sections explain the details of the experiments, analyse their results and illustrate the ways to optimize the algorithm.

5.1 EXPERIMENT ENVIRONMENT AND STEPS

First, we briefly look at what the graph of “Concept-Subconcept” relations looks like.

When the number of concepts and subconcepts becomes large, the relations become very complex. At the same time, the time consumed by BFS to traverse the graph also grows.

We designed two controlled experiments and used experimental datasets to test the time consumption when increasing the number of concept relations in “Concept-Subconcept” and the number of triples in “InstanceCountPerType” respectively.

Example:

Subconcept    Concept
A             B
B             C
B             D
D             F
C             F
C             E
E             G
F             G

Figure 31: Directed graph


5.1.1 EXPERIMENT 1

We kept the number of triples (concept/type, property and number of instances) in the “InstanceCountPerType” dataset constant at 5000, while continuously increasing the number of data (lines) in the “Concept-Subconcept” dataset, adding 100 data every time from 200 up to 2000. We checked the time cost as the number of relations increased, with the following steps:

1. Build 5000 triples using a recursive function; a sample is shown in Appendix C, Graph 1.

2. Generate different random relations between the concepts from those triples, as shown in Appendix C, Graph 2. In the first trial, we generated 200 relations.

3. Run the program on these two datasets and check the time it costs.

4. Keep the number of triples and increase the number of relations by 200, and record the time consumed.
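How the random relations were generated is not detailed here. As one possible way to produce such experimental data, the Python sketch below draws distinct (subconcept, concept) pairs with a fixed seed so that runs are repeatable. The function name `random_relations`, the concept naming scheme and the seed are all assumptions.

```python
import random

def random_relations(concepts, count, seed=0):
    """Draw `count` distinct (subconcept, concept) pairs without self-loops."""
    max_pairs = len(concepts) * (len(concepts) - 1)
    if count > max_pairs:
        raise ValueError("not enough concepts for that many relations")
    rng = random.Random(seed)          # fixed seed: repeatable datasets
    pairs = set()
    while len(pairs) < count:
        sub, sup = rng.sample(concepts, 2)   # two different concepts
        pairs.add((sub, sup))
    return sorted(pairs)

# Hypothetical concept names; 200 relations as in the first trial.
concepts = ["C%d" % i for i in range(50)]
relations = random_relations(concepts, 200)
```

Pairs drawn this way may form cycles, which is relevant to the vertex colouring discussed in Section 5.3.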

5.1.2 EXPERIMENT 2

We reused the datasets generated in experiment 1 for experiment 2, as follows:

• Keep the number of data (relations / lines) in the “Concept-Subconcept” dataset constant at 1000 different lines.

• Increase the number of triples in the “InstanceCountPerType” dataset in steps of 500, from 500 to 10000.

• Record the time cost for each point (500, 1000, 1500…).

5.2 TESTING ENVIRONMENT

The experiments were run as a Java program on a Mac system with the following specifications.

Hardware / Software        Information
Eclipse Standard           Version 1.0
Java SE Development Kit    Jdk1.7.0_51
OSX Yosemite               Version 10.10.3
Processor                  2.4 GHz Intel Core i5
Memory                     4GB 1333 MHz DDR3
Graphics                   Intel HD Graphics 3000 384 MB

Table 4: Testing environment


5.3 EXPERIMENT RESULTS AND ANALYSIS

In experiment 1, it takes a long time to produce the final results when the number of relations becomes very large. Compared with experiment 1, producing the final results in experiment 2 takes less time even as the data increase.

The results are shown in the following Figures:

The results of the two experiments show that the time cost in experiment 1 increases faster than in experiment 2 as the amount of data increases. In experiment 1, as the number of data (relations / lines) increases, the time cost grows rapidly and non-linearly. In experiment 2, as the number of triples (concept, property and number of instances) increases, the trend of the time cost is approximately linear.

When we test our algorithm with a real dataset, “Canberra” (which contains 5760 triples and 1674 lines of concept relations), the result is shown in Figure 34.

Figure 34: Time cost for running “Canberra”

The time consumed on the real dataset matches the time cost in the experiments.

Figure 32: Result of experiment 1

Figure 33: Result of experiment 2


Although there is some deviation in a few data points, which may be due to CPU load, the trend of the results is still consistent with the time complexity of our algorithm, O((|V|+|E|)*n log n).

We can evaluate the time complexity O((|V|+|E|)*n log n) for each experiment.

• In O((|V|+|E|)*n log n), V is the set of vertices, E is the set of edges and n is the number of triples. The worst case for building the graph is |E| = |V|*(|V|-1)/2.

• For experiment 1, n is a constant C_1, so

$$O\big((|V|+|E|)\, n \log n\big) = O\Big(\Big(|V| + \frac{|V|(|V|-1)}{2}\Big)\, C_1 \log C_1\Big) = O\Big(\frac{1}{2}\big(|V|^2 + |V|\big)\, C_1 \log C_1\Big) = O(M_1 \log C_1),$$

where $M_1 = \frac{1}{2}\big(|V|^2 + |V|\big)\, C_1$.

The time complexity becomes a quadratic function of |V|.

• For experiment 2, the number of relations is a constant, so |V| + |E| = C_2 and

$$O\big((|V|+|E|)\, n \log n\big) = O(C_2 \cdot n \log n).$$

The time complexity becomes a near-linear (n log n) function of n.

Therefore, the time complexity of our algorithm agrees with the measured running times.

Special situation

However, the time consumption is still high, because colouring vertices in BFS takes the most time when searching the directed graph. Colouring vertices in BFS is necessary when dealing with cycles in a directed graph, such as the cycle B-C-F-D-B in Figure 35:

Figure 35: Cycle in the directed graph

If the network contains no relation in which an object is a subset of another object and is at the same time a superset of it, we may skip the vertex-colouring steps. The time cost will then be:


Figure 36: Time cost without using colouring
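Whether colouring can be skipped depends on the graph being free of cycles. One way to check this up front, which is not part of the algorithm described here, is a topological-sort pass (Kahn's algorithm); a Python sketch follows. The function name `has_cycle` is hypothetical, and the direction of the cycle edges (B to C to F to D to B) is assumed from the order in which Figure 35 is described.

```python
from collections import deque

def has_cycle(adj):
    """Kahn's algorithm: a cycle exists if some vertex is never released."""
    indegree = {u: 0 for u in adj}
    for u in adj:
        for v in adj[u]:
            indegree[v] = indegree.get(v, 0) + 1
    queue = deque(u for u, d in indegree.items() if d == 0)
    released = 0
    while queue:
        u = queue.popleft()
        released += 1
        for v in adj.get(u, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                queue.append(v)
    return released < len(indegree)

acyclic = {"A": ["B"], "B": ["C", "D"], "C": ["F", "E"],
           "D": ["F"], "E": ["G"], "F": ["G"]}
cyclic = {"B": ["C"], "C": ["F"], "F": ["D"], "D": ["B"]}
```

The check runs in O(|V| + |E|), so it costs about as much as a single BFS and only needs to be done once per dataset.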

In order to design a robust approach, we considered all possible relations, including cycles. Thus, we pruned the algorithm to reduce the frequent use of BFS and improve its efficiency. The next section illustrates the details of using pruning to optimize our algorithm.

5.4 OPTIMIZATION

We used a pruning approach to reduce the number of search steps, in particular the number of BFS calls. There are three main steps:

1. Create another graph called the “no relation graph” to contain a set of unrelated vertices.

2. Update the relation graph.

3. Return early when a vertex has no adjacent vertex.

Step 1

We created another graph that records each concept as a vertex, with all the concepts that have no relation with it as its adjacent vertices. This graph is filled in each time BFS is called to traverse the directed graph and compare the relation between a pair of concepts. Before calling BFS to check whether two concepts are related, we can then check whether the pair is included in the “no relation graph”. If it appears there as a vertex and an edge, BFS does not need to be called and the program can return that there is no relation between the two concepts.

For example, when we use BFS to search the graph (Figure 31) to check whether C is a sub concept of B, it needs to traverse the vertices in the graph, finally returns False, and generates a “no relation graph” for concept C such as Figure 37.

Figure 37: No relations to concept C


When checking whether C is a sub concept of A, the program does not need to use BFS to search the graph; instead, it checks the “no relation graph” first, finds that A is an adjacent vertex of C there, and returns False. This substantially reduces the number of BFS calls.
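Step 1 behaves like a negative cache. The Python sketch below is a simplified illustration rather than the project's implementation: after a failed search from a concept, every vertex that was not reached is recorded as unrelated to it. The class name `RelationChecker` and the `bfs_calls` counter are hypothetical, and the adjacency map is assumed to list every vertex as a key.

```python
from collections import deque

class RelationChecker:
    """Reachability test that remembers pairs already known to be unrelated."""

    def __init__(self, adj):
        self.adj = adj
        self.no_relation = {}   # concept -> set of concepts it cannot reach
        self.bfs_calls = 0

    def is_subconcept(self, sub, sup):
        if sup in self.no_relation.get(sub, ()):
            return False                      # answered without any search
        self.bfs_calls += 1
        visited, queue = {sub}, deque([sub])
        while queue:
            u = queue.popleft()
            for v in self.adj.get(u, ()):
                if v == sup:
                    return True
                if v not in visited:
                    visited.add(v)
                    queue.append(v)
        # The search ran to completion, so everything unvisited is unreachable.
        unreachable = (set(self.adj) | {sup}) - visited
        self.no_relation.setdefault(sub, set()).update(unreachable)
        return False

adj = {"A": ["B"], "B": ["C", "D"], "C": ["F", "E"],
       "D": ["F"], "E": ["G"], "F": ["G"], "G": []}
checker = RelationChecker(adj)
first = checker.is_subconcept("C", "B")    # full search, fills the cache
second = checker.is_subconcept("C", "A")   # served from the cache
```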

Step 2

When comparing two concepts (A and B) by calling BFS, the program checks whether concept A has a super concept at a distance of more than 1 from A. If it does, the original graph is updated by adding that super concept as an adjacent vertex of concept A.

For example, in Figure 31, when we check whether C is a sub concept of B, BFS is called to search the directed graph and the original graph is updated to:

Figure 38: Updated graph for vertex “C”

Thus, it saves a BFS call for finding a path when the relation between C and G is compared.
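Step 2 can be illustrated with the Python sketch below, which adds a direct edge from the start concept to every super concept found during a search. For simplicity the sketch always traverses everything reachable instead of stopping at the target, which is an assumption made here and not necessarily what the project's program does; the function name `search_and_shortcut` is hypothetical.

```python
from collections import deque

def search_and_shortcut(adj, sub, sup):
    """BFS from `sub`; afterwards link `sub` directly to every ancestor found."""
    visited, queue = {sub}, deque([sub])
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in visited:
                visited.add(v)
                queue.append(v)
    ancestors = visited - {sub}
    direct = adj.setdefault(sub, [])
    for v in sorted(ancestors):
        if v not in direct:
            direct.append(v)      # shortcut edge sub -> v
    return sup in ancestors

adj = {"A": ["B"], "B": ["C", "D"], "C": ["F", "E"],
       "D": ["F"], "E": ["G"], "F": ["G"], "G": []}
found = search_and_shortcut(adj, "C", "B")
```

After this call G is a direct neighbour of C, so a later comparison of C and G needs only an adjacency lookup.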

Step 3

When checking whether concept 1 is a sub concept of concept 2, the program checks whether concept 1 has any adjacent vertex in the directed concept-relation graph. If concept 1 has no adjacent vertex, it has no super concept, so the program returns False directly without calling BFS. For example (based on Figure 31), if the program needs to check whether concept G is a sub concept of F, it returns False directly without calling BFS, because the vertex G has no adjacent vertex in the directed graph, which means that G has no super concept.
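The early exit of Step 3, together with a direct-neighbour test, can be sketched in Python as a cheap pre-check that runs before any search. The function name `precheck` and its three-valued return convention are assumptions made for this illustration.

```python
def precheck(adj, sub, sup):
    """Cheap tests run before any graph search.

    Returns True or False when the answer is already known, or None when a
    full search is still required.
    """
    neighbours = adj.get(sub)
    if not neighbours:
        return False          # no super concept at all, as for vertex G
    if sup in neighbours:
        return True           # direct relation, no search needed
    return None

# Example graph of Figure 31; G has no outgoing edge.
adj = {"A": ["B"], "B": ["C", "D"], "C": ["F", "E"],
       "D": ["F"], "E": ["G"], "F": ["G"], "G": []}
g_under_f = precheck(adj, "G", "F")
```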

These three steps are the most important steps of the pruning methodology, whose core strategy is to reduce the number of BFS calls. Since a BFS search always consumes a large amount of time, reducing the frequent use of BFS contributes greatly to reducing the running time.

We applied the pruning steps to optimize our algorithm. The details of the updated algorithm and the result of testing the “Canberra” dataset are given in Appendix D.


CHAPTER 6

6 CONCLUSION AND FUTURE WORK

6.1 CONCLUSION

This project, generating visualizations from RDF graphs, explores a method to visualize RDF graphs that contain schema and data in particular. We started from scratch and did extensive research on RDF data structures and data visualization. Most previous work on RDF visualization shares the same major defect: the graph becomes disordered and hard to read as the size of the data grows. To overcome those shortcomings, we developed a new approach, “Concept-Matching”, that uses bubbles to represent RDF data and uses the importance of the data to decide the size and position of the bubbles.

In our approach, we found that one of the most difficult things is to implement a highly efficient algorithm to retrieve the data, since RDF datasets are usually very large. We combined a graph layout algorithm, Quicksort and Breadth-first search to improve the efficiency of retrieving data.

From our experiments, we discovered:

• Experiment 1: When the number of triples stays the same, the time cost grows rapidly and non-linearly as the number of concept relations increases. In this situation, the algorithm we developed is only suitable for small datasets and does not work properly for large datasets.

• Experiment 2: When the number of concept relations stays the same, the time cost grows approximately linearly as the number of triples increases. In this situation, the algorithm works properly for both small and large datasets.

We still found the time cost of retrieving data to be quite high, so we applied pruning to reduce the number of BFS calls; as a result, the time consumption was reduced.

In conclusion, although the algorithm is not as fast as we expected, the new approach “Concept-Matching” can still be a good way to visualize large RDF datasets in a clear and intuitive way.

6.2 FUTURE WORK

Although the experiments show that the process of the methodology works properly, we still need to survey various users with HCI experiments. In future work, we would first like to design a range of Human-Computer Interaction experiments to test the usability of the “Concept-Matching” approach and the effectiveness of the graph layout, including the “Five Point Scale” approach. We can mainly focus on casual users and gather more data on their experience of using this method to visualize RDF data.

Moreover, if we ignore cyclic relations, we do not need to colour vertices while calling BFS. In our tests, the running time decreased to less than a few seconds this way. With this in mind, we would like to design specific experiments to test which kinds of data require colouring and which kinds can ignore this relation.

Finally, we would like to develop a visualization tool that implements this approach.


REFERENCE

[1] W3schools. http://www.w3schools.com/webservices/ws_rdf_intro.asp

[2] W3C Semantic Web. http://www.w3.org/RDF/

[3] Semantic Web – part of business world 2010, viewed 15 March 2015, < http://www.semanticweb.rs/Article.aspx?iddoc=32&id=65&lang=2#>

[4] W3C Semantic Web Activity. http://www.w3.org/2001/sw/

[5] Casellas, N 2011, Semantic Enhancement of Legal Information, Legal Information Institute, Cornell University Law School, viewed 16 March 2015, < https://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/>

[6] Coudyzer, E. (2013). First release GLAM sector reference terminologies, viewed 16 March 2015, <http://www.athenaplus.eu/getFile.php?id=187>

[7] Berners-Lee, T, Architecture, W3C, viewed 17 March 2015, < http://www.w3.org/2000/Talks/1206-xml2k-tbl/slide10-0.html >

[8] Obitko, M 2007, Semantic Web Architecture, viewed 16 March 2015, <http://www.obitko.com/tutorials/ontologies-semantic-web/semantic-web-architecture.html >

[9] Dadzie, A & Rowe, M. Approaches to Visualising Linked Data: A Survey, IOS Press, Semantic Web 1-2, 2011.

[10] Geroimenko, V & Chen, C. Visualizing the Semantic Web: XML-Based Internet and Information Visualization. Springer, 2nd edition, 2006.

[11] Janowicz, K., Schlobach, S., Lambrix, P & Hyvonen, E. Knowledge Engineering and Knowledge Management: 19th International Conference, EKAW 2014, Linkoping, Sweden, November 24 – 28, 2014, Proceedings. Springer International Publishing AG, 2015.

[12] Sundara, S., Atre, M., Kolovski, V., Das, S., Wu, Z., Chong, EI & Srinivasan, J. Subsets, Summaries, and Sampling in Oracle. IEEEXplore ICDE Conference, 2010.

[13] Quan, D., Huynh, D & Karger, DR. Haystack: A Platform for Authoring End User Semantic Web Applications. In Proceedings of the 2nd International Semantic Web Conference, 2003, pp. 738- 753.

[14] Schraefel, M., Smith, DA., Owens, A., Russell, Alistair., Harris, C & Wilson, M. The Evolving mSpace Platform: Leveraging the Semantic Web on the Trail of the Memex. Proceedings of the sixteenth ACM conference on Hypertext and hypermedia, 2005, pp. 174-183.

[15] David & Schraefel, The Pathetic Fallacy of RDF, viewed 27 March 2015, < http://swui.semanticweb.org/swui06/papers/Karger/Pathetic_Fallacy.html >

[16] RDF – Gravity. http://semweb.salzburgresearch.at/apps/rdf-gravity/


[17] W3C RDF, IsaViz: A Visual Authoring Tool for RDF. http://www.w3.org/2001/11/IsaViz/

[18] FOAF Vocabulary Specification. http://xmlns.com/foaf/spec/

[19] Collberg, C., Kobourov, S., Nagra, J., Pitts, J & Wampler, K. A System for Graph-Based Visualization of the Evolution of Software. Proceedings of the 2003 ACM symposium on Software visualization, pp. 77

[20] WebVOWL: Web-based Visualization of Ontologies. http://vowl.visualdataweb.org/webvowl.html

[21] Lodlive. http://en.lodlive.it/

[22] Visual RDF. http://graves.cl/visualRDF/?url=http://graves.cl/visualRDF/

[23] Bostock, M., Ogievetsky, V & Heer, J. D3: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, Vol. 17, No. 12, December 2011.

[24] Negru, S., Lohmann, S & Haag, F 2014. VOWL: Visual Notation for OWL Ontologies, viewed 10 April, < http://vowl.visualdataweb.org/v2/>

[25] Lohmann, S., Negru, S., Haag, F & Ertl, T. VOWL 2: User-Oriented Visualization of Ontologies. In Knowledge Engineering and Knowledge Management, Vol. 8876, 2014, pp 266-281

[26] Camarda, DV., Mazzini, S & Antonuccio, A. LodLive, exploring the web of data. In Proceedings of the 8th International Conference on Semantic Systems, I-SEMANTICS’12, 2012, pp. 197-200

[27] GitHub, alangrafu / visualRDF. https://github.com/alangrafu/visualRDF

[28] D3 Data-Driven Documents. http://d3js.org/

[29] GitHub, semsol / arc2. https://github.com/semsol/arc2

[30] Shneiderman, B. The eyes have it: a task by data type taxonomy for information visualizations. In Proceedings 1996 IEEE Symposium on Visual Languages, pp. 336-343.

[31] Abello, J., Kobourov, SG & Yusufov, R. Visualizing Large Graphs with Compound-Fisheye Views and Treemaps. In 12th International Symposium, GD 2004, pp. 431- 441.

[32] Perez, J., Arenas, M & Gutierrez, C. Semantics and Complexity of SPARQL. In The Semantic Web – ISWC, 2006, pp. 30-43

[33] Virtuoso SPARQL Query Editor. http://dbpedia.org/sparql

Appendices

Appendix A

The first 20 lines of data in “InstanceCountPerType.csv” and “Concept-Subconcept” are shown in the following graphs:

Graph 1: “InstanceCountPerType.csv” sample

Graph 2: “Concept-Subconcept.csv” sample


Appendix B

The triple relations retrieved for the sub concepts of the concept “Person” are:

Concept/Type    Property    No Of Instance

http://dbpedia.org/ontology/Artist
    http://dbpedia.org/ontology/birthPlace            11
    http://dbpedia.org/ontology/hometown              8
    http://dbpedia.org/property/placeOfBirth          7
    http://dbpedia.org/property/origin                6
    http://dbpedia.org/property/birthPlace            6
    http://dbpedia.org/ontology/deathPlace            5
    http://dbpedia.org/property/deathPlace            2
    http://dbpedia.org/property/placeOfDeath          2

http://dbpedia.org/ontology/Athlete
    http://dbpedia.org/ontology/birthPlace            117
    http://dbpedia.org/property/birthPlace            94
    http://dbpedia.org/property/placeOfBirth          67
    http://dbpedia.org/ontology/residence             8
    http://dbpedia.org/property/residence             8
    http://dbpedia.org/ontology/deathPlace            4
    http://dbpedia.org/property/deathPlace            4
    http://dbpedia.org/ontology/hometown              3
    http://dbpedia.org/ontology/school                2
    http://dbpedia.org/property/hometown              2
    http://dbpedia.org/ontology/highschool            2
    http://dbpedia.org/ontology/billed                1
    http://dbpedia.org/ontology/formerTeam            1
    http://dbpedia.org/property/fightingOutOf         1
    http://dbpedia.org/property/placeOfDeath          1
    http://dbpedia.org/property/billed                1

http://dbpedia.org/ontology/OfficeHolder
    http://dbpedia.org/ontology/deathPlace            19
    http://dbpedia.org/property/deathPlace            11
    http://dbpedia.org/ontology/birthPlace            7
    http://dbpedia.org/property/placeOfDeath          4
    http://dbpedia.org/property/birthPlace            4
    http://dbpedia.org/ontology/residence             2
    http://dbpedia.org/property/placeOfDeath          2
    http://dbpedia.org/property/residence             1

http://dbpedia.org/ontology/Economist
    http://dbpedia.org/ontology/birthPlace            1
    http://dbpedia.org/property/birthPlace            1
    http://dbpedia.org/property/placeOfBirth          1

http://dbpedia.org/ontology/MilitaryPerson
    http://dbpedia.org/ontology/deathPlace            26
    http://dbpedia.org/property/deathPlace            22
    http://dbpedia.org/property/placeOfDeath          19
    http://dbpedia.org/property/birthPlace            1
    http://dbpedia.org/ontology/restingPlace          1
    http://dbpedia.org/ontology/birthPlace            1

http://dbpedia.org/ontology/Politician
    http://dbpedia.org/ontology/deathPlace            19
    http://dbpedia.org/property/placeOfDeath          16
    http://dbpedia.org/property/deathPlace            13
    http://dbpedia.org/ontology/birthPlace            7
    http://dbpedia.org/property/birthPlace            5
    http://dbpedia.org/property/placeOfBirth          4
    http://dbpedia.org/ontology/residence             2
    http://dbpedia.org/property/residence             1
    http://dbpedia.org/ontology/almaMater             1
    http://dbpedia.org/property/almaMater             1

http://dbpedia.org/ontology/BeautyQueen
    http://dbpedia.org/property/placeOfBirth          1
    http://dbpedia.org/property/birthPlace            1
    http://dbpedia.org/ontology/birthPlace            1

http://dbpedia.org/ontology/SportsManager
    http://dbpedia.org/ontology/birthPlace            2
    http://dbpedia.org/property/birthPlace            1
    http://dbpedia.org/property/placeOfBirth          1

http://dbpedia.org/ontology/Model
    http://dbpedia.org/property/birthPlace            1
    http://dbpedia.org/ontology/birthPlace            1

http://dbpedia.org/ontology/Scientist
    http://dbpedia.org/ontology/deathPlace            16
    http://dbpedia.org/property/deathPlace            13
    http://dbpedia.org/property/placeOfBirth          11
    http://dbpedia.org/ontology/residence             6
    http://dbpedia.org/property/residence             6
    http://dbpedia.org/property/workplaces            1

http://dbpedia.org/ontology/Cleric
    http://dbpedia.org/ontology/deathPlace            1
    http://dbpedia.org/property/placeOfBirth          1
    http://dbpedia.org/property/deathPlace            1

http://dbpedia.org/ontology/Philosopher
    http://dbpedia.org/property/residence             1

http://dbpedia.org/ontology/OrganisationMember
    http://dbpedia.org/property/stadium               1

http://dbpedia.org/ontology/Chef
    http://dbpedia.org/ontology/birthPlace            1

http://dbpedia.org/ontology/Architect
    http://dbpedia.org/ontology/significantBuilding   2
    http://dbpedia.org/ontology/significantProject    1
    http://dbpedia.org/ontology/significantDesign     1

Appendix C

Experiment 1: build the triples, each consisting of “concept”, “property” and “number of instance”. A screen capture of the first few items is shown in the following graph.

Graph 1: number of triples

The generated relations are shown in the following graph (first few items only):

Graph 2: number of relations

Appendix D

Optimized algorithm

After we applied the pruning steps, the algorithm was updated as follows:

Pseudo-code for instance and property comparison

ISGREATER(Relation_DATA, data1, data2, remove)
    if data1.NoInstance > data2.NoInstance                    ## compare No. of instances
        then return true
    else if data1.NoInstance < data2.NoInstance
        then return false
    else if data1.Property != data2.Property                  ## compare property if No. of instances are same
        then return false
    else if ISSUB(Relation_DATA, data1.Concp, data2.Concp)    ## compare concept if properties are same
        then remove ← 2
             return true
    else return false

Check whether concept s is a sub concept of b using the pruning steps

Pseudo-code for finding concept relation

ISSUB(Relation_DATA, s, b)
    if Adj[s] = null OR Adj[b] = null
        then return false
    if b ∈ Adj[s]
        then return true
    if s ∈ no_Relate AND b ∈ no_Relate_Adj[s]
        then return false
    return BFS(Relation_DATA, s, b)


Use the pruning steps inside the BFS search

Pseudo-code for searching concept-relation graph

BFS(Relation_DATA, s, b)
    for each vertex u ∈ Relation_DATA[G] - {s}
        do color[u] ← WHITE
    color[s] ← GRAY
    Q ← ∅
    ENQUEUE(Q, s)
    while Q != ∅
        do u ← DEQUEUE(Q)
           if Adj[u] = null
               then if s ∈ no_Relate
                        then no_Relate_Adj[s] ← no_Relate_Adj[s] ∪ {u}
                        else no_Relate ← no_Relate ∪ {s}
                             no_Relate_Adj[s] ← no_Relate_Adj[s] ∪ {u}
           for each v ∈ Adj[u]
               do if color[v] = WHITE
                      then color[v] ← GRAY
                           if v = b
                               then Adj[s] ← Adj[s] ∪ {v}
                                    return true
                           ENQUEUE(Q, v)
           color[u] ← BLACK
    if Adj[u] = null
        then if s ∈ no_Relate
                 then no_Relate_Adj[s] ← no_Relate_Adj[s] ∪ {b}
                 else no_Relate ← no_Relate ∪ {s}
                      no_Relate_Adj[s] ← no_Relate_Adj[s] ∪ {b}
    return false

After running the updated algorithm, the running time decreased dramatically, as shown in Figure 34.

Figure 34: Testing “Canberra” dataset