
A Metadata-based Recommender System for Statistical Linked Open Data

IT4BI MSc Thesis

Student: Ekaterina Dobrokhotova
Supervisor: Oscar Romero

Advisor: Jovan Varga

Master on Information Technologies for Business Intelligence
Universitat Politècnica de Catalunya

Barcelona
31 July, 2016


A thesis presented by Ekaterina Dobrokhotova in partial fulfillment of the requirements for the MSc degree on

Information Technologies for Business Intelligence


Abstract

In recent years, there have been increasing efforts by the Business Intelligence (BI) and Semantic Web communities to enable On-Line Analytical Processing (OLAP) over Statistical Linked Open Data. Unlike internal sources, where the data organization is generally familiar, Linked Data sources are typically uncontrolled and bring challenges regarding the integrity constraints and data completeness required by the multidimensional (MD) data model needed for OLAP. Moreover, users need guidance and recommendations during the analysis process in order to improve the quality of the user experience.

This thesis presents an approach to assist users in the exploration of Statistical Linked Open Data sources as a step towards next generation BI systems. Building on top of an existing approach for constructing MD dimension hierarchies, we address two data completeness challenges in this context caused by missing links in internal and/or external RDF data sources. Once an OLAP cube is represented in RDF, we develop a recommender system to support users in their analysis. The core of the recommender system is the metadata collected about users, their actions and the MD schemas. Thus, we propose a solution named EXR: Extended Enrichment and Exploration Recommendations for Statistical Linked Open Data, present a prototype and demonstrate the feasibility of our solution. Our approach facilitates the analysis of Statistical Linked Open Data and represents a base for further research towards user assistance in this context.


Contents

Abstract
1 Introduction
2 Related work
3 Prerequisites and problem statement
   3.1 The Enrichment module of the QB2OLAP tool
   3.2 Running example
   3.3 Handling challenges for dimension hierarchy construction
   3.4 Providing metadata-based recommendations for user exploration
4 Proposed solution
   4.1 Overview of EXR
   4.2 E2 Module
      4.2.1 Completing missing data for dimension hierarchy construction
      4.2.2 Completing missing synonyms in external data sources
   4.3 AM and D Repositories
   4.4 Q Module
   4.5 XR Module
5 Formalization
   5.1 Completing missing data for dimension hierarchy construction
   5.2 Handling missing synonyms in external data sources
   5.3 AM Model
   5.4 A metadata-based query recommendation approach for the exploration of Statistical Linked Open Data
6 Implementation and Evaluation
   6.1 Implementation
   6.2 Evaluation of completing the missing level members and linking external dimensional data
   6.3 Evaluation of a metadata-based recommendation system
7 Conclusion and Future Work
A Appendix: SPARQL queries
   A.1 Counting the number of a given property in a dataset
   A.2 Counting the number of level members in a dataset
   A.3 Counting the number of synonyms in a dataset
B Example of an instance of the metadata model
C Evaluation scenario
References


List of Figures

3.1 Example for enriching MD with external data
4.1 EXR Architecture
4.2 BPMN model: Interaction between a user and the EXR system
4.3 The workflow of the Extended QB2OLAP Enrichment Module
4.4 Example of adding a missing level member
4.5 Example of adding a missing synonym
4.6 Example of hierarchy construction for several external data sources
4.7 The workflow of the Querying Module
4.8 Example of a user vector
4.9 Example of a query vector
4.10 Example of the semantic value computation
4.11 The workflow of the Exploration Recommendation Module
5.1 The RDF model of metadata needed for the user assistance
6.1 Percentage of used recommendations by rank


List of Tables

6.1 Evaluation of handling the missing level members
6.2 Evaluation of handling the missing synonyms


1. Introduction

Next generation Business Intelligence (BI) systems are characterized by semi-structured, unstructured and non-controlled (i.e., external) data sources. In these settings, inexperienced users need support to explore data and to be well informed when making decisions. A framework for next generation BI systems is described in [1] and we follow its ideas as guidelines for our research. Its key concept is the fusion cube, an OLAP cube that can be extended both in its schema and its instances on the fly. Furthermore, the user should be assisted in exploring the data, e.g., when performing ad-hoc micro analytics on a piece of data. The supportive backbone of fusion cubes is a smart use of metadata, for example, about the quality, freshness and validity of a data source, or about actions previously performed by the user.

The Semantic Web (SW) and Linked Data [2] can make plenty of data and metadata available. Their set of standards promotes common data formats and exchange protocols on the Web, e.g., the Resource Description Framework (RDF), a framework for representing information. Linked Data has a graph nature: links are made between related resources, for example, links to DBpedia from Eurostat, and they define relationships between these resources. In these settings, multidimensional (MD) data is typically published using the RDF Data Cube vocabulary (QB)¹. Many instances of Linked Open Data (interlinked RDF datasets with open content) can easily be found on the Web, and in this research they exemplify possible datasets for ad-hoc analysis by users.

There are significant efforts towards enabling OLAP-style analysis of Linked Open Data. In this context, the tool proposed in [3] supports OLAP on statistical linked open data such as Eurostat² or World Bank Linked Data³. The tool takes an existing statistical dataset represented with QB and enriches it with additional QB4OLAP [4] metadata to fully support native OLAP analysis [5]. It automatically discovers potential enrichment concepts that are suggested to the user, who selects the ones of her interest, for instance, to add a new level to a hierarchy. The candidate concepts are discovered in the existing dataset. This tool sets a foundation for the construction of MD dimension hierarchies and opens new possibilities for research in this direction, for example, the enrichment of the schema with external data.

Even with the enriched schema, the exploration of Statistical Linked Open datasets can still be tedious. Thus, it is important to consider possibilities for supporting users in this task. According to [6], the user support activities are classified into two categories: the querying assistance, most commonly known as query recommendation, and the visualization assistance. To ease the process of data exploration, the system should be capable of suggesting an entire query or aiding the user in building a query, as well as customising the results according to the user profile and helping with the visualization. Most of these activities are based on the exploitation of metadata, such as queries, preferences, schema information, etc., and these metadata artifacts are referred to as Analytical Metadata (AM), whose taxonomy is proposed in the same paper. A semantic metamodel of the AM framework, SM4AM, is defined in [7], covering the metadata artifacts

¹ https://www.w3.org/TR/vocab-data-cube/
² http://eurostat.linked-statistics.org/
³ http://worldbank.270a.info/


that are needed for user assistance functionalities. SM4AM is created as an RDF formalizationof AM artifacts. This metamodel is used for the metadata representation in our research.

One of the approaches to assist a user in a BI system is to create a recommender system based on the available metadata. In this thesis, we build on top of the QB2OLAP Enrichment module and further improve the possibilities for working with Linked Data sources. Unlike internal sources, where the data organization is generally familiar, Linked Data sources are typically non-controlled and new for the user, who needs guidance and recommendations for their analysis. Thus, once an OLAP cube is represented in QB4OLAP, we can develop a recommender system to support the user in his/her analysis. The recommendations can be based on the AM artifacts, e.g., queries and user characteristics. Furthermore, in the spirit of Linked Data, if they are represented in a semantic-aware fashion, it is easier to share and to link them among different systems. Thus, we believe that the synthesis of state-of-the-art research on the Semantic Web, Linked Open Data and existing recommender approaches brings an innovation to this field of BI. Moreover, with the current research we want to accelerate the realization of the concept of the fusion cube. To summarize, our research contributions are the following:

– An approach to handle data completeness challenges caused by the missing links in internal and/or external RDF data sources.

– An RDF metadata model for the user assistance.

– A metadata-based query recommendation approach for the exploration of Statistical Linked Open Data.

– A prototype named EXR: Extended Enrichment and Exploration Recommendations for Statistical Linked Open Data, as a proof of concept of the previous ideas.

2. Related work

In this section, we present and discuss approaches related to the scope of the thesis. First, we introduce the approaches about next generation BI systems. Then, we discuss the relation of these systems to the Semantic Web and Linked Data. Finally, we elaborate on the existing approaches for user assistance in this context.

In recent years, next generation BI systems have gained the attention of many researchers. As explained in Section 1, we follow the vision of Fusion Cubes [1], and there are more approaches related to this context that we discuss next. For instance, the platform “LiveBI” from [8] declares to unify data and analytics in BI. It explores the possibilities of turning BI systems into near real-time processing platforms. However, it does not consider external data sources without tight integration with internal data. On the other hand, an architecture proposed in [9] for Ad-hoc and Collaborative BI takes into account the option of having a flexible data store and proposes RDF as a solution for sharing and linking data. The collaboration is defined as simultaneous work in the reporting and decision-making phase. However, this research does not consider any recommendation assistance in earlier phases, like the data source choice or the exploration. Another proposal, called “OpenBI” and published in [10], demonstrates the closest vision of the


problem of ad-hoc analytics of Linked Data sources. However, this research puts the accent on the process of selecting a data mining algorithm for discovering knowledge from Linked Data. In general, it can be noticed that there is an important need to consider Linked Data sources and user assistance in the context of next generation BI systems.

Not only the architecture of BI systems is subject to change, but also the way data is analyzed, as the needs of users and the technological opportunities are growing. The user considers different sources and possibly incorporates them with the existing data. The new type of OLAP analysis called Exploratory OLAP, defined in [11], lets users work with external data in the most efficient way. This survey highlights that SW technologies are a promising approach to provide high-quality and user-friendly data analysis in an ad-hoc manner. The previously mentioned QB4OLAP vocabulary [4] and the QB2OLAP tool [12] are an evolution of the idea of Exploratory OLAP in the SW. QB4OLAP extends QB in order to support OLAP models and operators. The QB2OLAP tool practically performs the transformation from QB to QB4OLAP and enables OLAP on existing data.

Several publications have appeared in recent years proposing recommender systems in the settings of Linked Data and the Semantic Web, for example, [13] and [14], where the authors elaborate how to use Linked Open Data to build semantics-aware recommendation systems in domains with typical recommendation items, e.g., movies or books. Both studies focus on providing recommendations based on the similarities between users and/or items. Moreover, the research in [15] adds the concept of “context” to the classical paradigm of user-item recommendations, where context is defined as any additional contextual information. A detailed discussion about recommender systems can be found in the surveys [16] and [17]. Overall, existing research about user assistance for data exploration in the context of Linked Data and next generation BI systems is just in its inception and there are many non-exploited possibilities in this context. For instance, instead of only focusing on data as items for the recommendations, recommender systems can also consider queries. Moreover, the recommendations should be generated not only based on similarities but also on the value that a recommendation brings. This value can be defined based on the domain (e.g., OLAP exploration).

3. Prerequisites and problem statement

In this section, we introduce the prerequisites for understanding our approach and give the problem statement. The prerequisites present the running example and explain the previous work (i.e., the QB2OLAP approach) on which we build. The problem statement is presented as two main tasks with their challenges that we address in our approach.

The QB2OLAP tool brings an innovation to the field of the exploration of Statistical Linked Open datasets. It enables the transformation and enrichment of existing QB¹ datasets into QB4OLAP [4] ones. The enriched datasets can then be queried with a high-level OLAP query language and their schemata can be visually explored. We consider the Enrichment module of the QB2OLAP tool as the target candidate for extension and extend it to provide additional functionalities.

¹ https://www.w3.org/TR/vocab-data-cube/


3.1 The Enrichment module of the QB2OLAP tool

The main goal of the Enrichment module is the semi-automatic generation of the schema that is published as a QB4OLAP graph for further exploration. According to the original paper, this module consists of the Redefinition Phase, the Enrichment Phase and the Triple Generation Phase, the last of which is responsible for generating schema and instance triples.

The Redefinition Phase is responsible for extending the input schema of the QB graph to QB4OLAP semantics by the sequential extraction of measures and redefinition of dimensions. It collects the level members and their properties for further processing.

The Enrichment Phase generates suggestions of new possible levels for an existing schema and enriches the schema based on the user choices. The algorithm for detecting new levels is based on the following idea: we collect all properties and their objects for each level member of the child level and search for functional dependencies by identifying many-to-one cardinalities between level members and property objects. We assume that if all level members have some property X and this property X points to several different objects Yi such that each level member points to exactly one member of Yi, then we can suggest the property X as a potential candidate for being a new parent level. The set of objects Yi will be the level members of the new level. Furthermore, to cope with missing properties, the QB2OLAP Enrichment module uses the hierarchy construction percentage (HCP), a threshold expressing the percentage of the properties that need to satisfy the quasi many-to-one cardinality constraint (e.g., 90%). If the HCP value is less than 100%, it means that some level members may be missing the property or that there is more than one property X object for one level member. If so, the user needs to manually address these cases before performing the enrichment. Thus, our first task in this thesis is to automatically handle these cases whenever possible.
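To make the detection criterion concrete, the following is a minimal sketch (not the actual QB2OLAP implementation, which is not reproduced here) of checking whether a candidate property qualifies as a new parent level; the function name and the assumed data layout are illustrative only.

def is_parent_level_candidate(objects_by_member, hcp=0.9):
    # objects_by_member: dict mapping a level-member IRI to the set of objects
    # reached via the candidate property X (empty set if the property is missing)
    if not objects_by_member:
        return False
    having = {lm: objs for lm, objs in objects_by_member.items() if objs}
    # many-to-one: every member that has the property points to exactly one object
    functional = all(len(objs) == 1 for objs in having.values())
    # the share of members having the property must reach the HCP threshold
    coverage = len(having) / len(objects_by_member)
    return functional and coverage >= hcp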

3.2 Running example

The user is analyzing data from the Food and Agriculture Organization of the United Nations. He has discovered several data sources, like the Regional Commission for Fisheries Capture Production, that might help him answer business questions. The data are published in QB. The user sees that the possibilities for analysis are limited and that it is not possible to analyze the data in the way he is used to, for example, to aggregate data by years or roll up from countries to continents. Luckily, he knows about the EXR tool (Extended Enrichment and Exploration Recommendations for Statistical Linked Open Data) and with its help he can enrich the schema (e.g., with new levels for rollup). Beyond using just local metadata to search for potential enrichment concepts, the user would like to explore some other sources like DBpedia and World Bank. Once the schema is enriched, the user needs to explore the data. When the user starts querying the enriched cube, he is interested in seeing some recommendations on querying. With the growth of the complexity and the number of actions the user needs to perform, user assistance is required.


3.3 Handling challenges for dimension hierarchy construction

We extend the QB2OLAP Enrichment module to automate the handling of the missing level instances needed for the hierarchy construction of MD data on the Semantic Web represented according to the QB4OLAP vocabulary. The fact that some level members for new candidate hierarchy levels might be missing raises two main challenges.

Challenge 1: How to automatically complete the missing parent level members needed for the dimension hierarchy construction?

The detection of the candidate properties directly depends on the quality of the dataset, which ideally does not have any level members missing. This comes from the summarizability condition that, according to [18], requires certain conditions to be met, like completeness, disjointness and type compatibility. While disjointness is to be addressed in the data cleaning stage and type compatibility is to be provided by the user selection of the aggregate function, our approach focuses on providing an automatic solution to guarantee the completeness of the MD schema produced by our tool.

Challenge 2: How to automatically complete the missing synonyms in external data sources?

The second point for extending the QB2OLAP Enrichment module is to add the functionality for linking external dimensional data. In this context, we assume the existence of links in the internal (i.e., local) dataset that point to synonyms in other sources. On this basis, we automate the enrichment of the cube schema with external data sources. Due to the unpredictable quality of external data sources and the varying degree of connectedness between data sources, there is no guarantee that a synonymic link exists in the external dataset for every concept in the internal dataset. Thus, our goal in this thesis is to upgrade the QB2OLAP Enrichment module to automatically handle the cases where level members and/or their synonyms are missing. An example of the cube schema enrichment by adding a new hierarchy level detected in an external data source is presented in Figure 3.1. There is a possible hierarchy with 2 levels, Countries and Government Type, where

Figure 3.1: Example for enriching MD with external data

the child level Countries is presented as a small subset of level members from the FAO (Food and Agriculture Organization) dataset and Government Type is the parent level coming from DBpedia.


3.4 Providing metadata-based recommendations for user exploration

As discussed previously, once the schema is enriched the users still need support to explore the data. Thus, our second task is to collect and use metadata to assist the user. The QB2OLAP Enrichment module generates a schema, and the schema is only one piece of Analytical Metadata. Besides the schema metadata, we collect additional metadata and provide query recommendations based on them. Thus, we need to address the following two challenges.

Challenge 3: What metadata do we want to collect and how to model them?

We want to create a metadata model by instantiating the SM4AM metamodel. In this context, we focus on metadata about the user (User metadata) and metadata about the user queries (Query metadata).

Challenge 4: How to exploit the metadata to provide query recommendations for the user?

After creating a metadata model by instantiating SM4AM, we need to exploit the metadata for the recommendation of queries. We have to consider the fact that our recommended artifact cannot be recommended based on similarity alone: the recommended query must bring a new value to the user.

4. Proposed solution

In this section, we present the architecture of the proposed solution. We first provide an overall solution overview. Then we focus on the details of each module and explain the solutions for the challenges previously stated.

4.1 Overview of EXR

Our solution, “EXR: Extended Enrichment and Exploration Recommendations for Statistical Linked Open Data”, consists of the following components: E2M - the Extended QB2OLAP Enrichment Module, QM - the Querying Module, XRM - the Exploration Recommender Module, AMR - the AM Repository and DR - the Data Repository, as illustrated in Figure 4.1. Users interact with the system via a graphical user interface and our system queries external sources via their SPARQL endpoints.

E2 Module (E2M): the Extended QB2OLAP Enrichment Module. The E2 Module extends the QB2OLAP Enrichment module to enable the handling of the missing level members needed for the hierarchy construction. Furthermore, to search for additional enrichment concepts in external sources, it introduces a mechanism to automatically address the situations of missing level member synonyms.


Figure 4.1: EXR Architecture

Q Module (QM): the Querying Module. This module enables the user to explore (i.e., query) the dataset via a web-based graphical user interface. The Q Module transforms the user interface actions into SPARQL queries and retrieves and presents the results to the user. It also generates query metadata that are stored in AMR and further exploited by the XR Module as explained below.

XR Module (XRM): the Exploration Recommendation Module. It generates recommendations for the user's data exploration by exploiting the metadata from AMR.

AM Repository (AMR) and D Repository (DR): the Analytical Metadata Repository and the Data Repository. AMR is a specialized database for the storage and retrieval, through semantic queries, of triples that belong to the metadata schema, user or query graphs. The Data Repository is responsible for the storage of data triples (statistical observations).

The scenario of user-system interaction, including the user assistance tasks and the points where metadata is collected, is the following: the registration process provides User Metadata, for example, the user's age, profession or position. If the user is already registered, he can just log in to the system. The schema generated in the enrichment process is also stored as Schema Metadata in AMR. Every query that is run by users is collected as Query Metadata in AMR. Based on these, we can generate recommendations on how to query the data. This interaction is illustrated as a BPMN model in Figure 4.2.

Figure 4.2: BPMN model: Interaction between a user and the EXR system


4.2 E2 Module

We next focus on E2M and present the details of its functioning. The E2M workflow is shown in Figure 4.3. The processes (i.e., boxes) in green are the points of extension in E2M with respect to the QB2OLAP Enrichment module.

Figure 4.3: The workflow of Extended QB2OLAP Enrichment Module

The Redefinition Phase is affected in the following manner: in the step of populating the initial level members, for each local level member instance we need to find its synonyms, i.e., concepts that have the same meaning in other data sources. For example, <http://dbpedia.org/resource/Spain> in DBpedia is the synonym of <http://worldbank.270a.info/classification/country/ES> in World Bank Linked Data: both IRIs identify the country Spain in different datasets. In this phase we handle missing synonyms.

The Enrichment Phase collects the level instances and their properties. In order to perform the search for level candidates in external sources, we get external instances, i.e., we get the properties of the synonymic level member instances in the desired external data source. The results from internal and external sources are processed in order to discover the properties that can be used as potential new levels for OLAP analysis. We also collect all missing level members (internal or external) into a set and process them later.

The Triple Generation Phase generates schema and instance triples. This phase is updated to generate artificial IRIs for the missing level members in the hierarchy construction process.

Next, we elaborate on the challenges for performing these tasks and how we solve them.

4.2.1 Completing missing data for dimension hierarchy construction

As mentioned in Challenge 1, an approach for the automatic handling of missing higher level members for hierarchy construction is required. Having missing level members violates the completeness conditions of MD data. To address this, we introduce an artificial level member in the parent level, e.g., <Level_nameOther>, which is an automatically generated IRI.

When a new level is added, the dimension hierarchies are automatically constructed and we assign to the new artificial parent level member those child level members that initially have no corresponding parent level member. For instance, in Figure 4.4 the level member “BT” (Bhutan)


of the level “Countries” has no option to roll up to the level “Regions”. Thus, we add a missing level member <regionOther> to the level “Regions” and map “BT” to this new level member.

Figure 4.4: Example of adding a missing level member

4.2.2 Completing missing synonyms in external data sources

In order to solve Challenge 2, which requires an approach for handling missing synonyms in external data sources, we need to split it into three smaller challenges and tackle each of them.

Challenge 2.1: How to identify synonyms?

Some built-in properties in different ontologies have the following semantic meaning: the property X connects an individual object to the semantically identical individual object in another source. We assume that these properties are used (i.e., exist) in the dataset and that they can be used to identify synonyms between different sources. For example, the property owl:sameAs¹ satisfies these requirements. If <IRIx> owl:sameAs <IRIy>, we know that <IRIx> and <IRIy> are synonyms.
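As an illustration only, a synonym lookup of this kind could be issued against the local SPARQL endpoint as sketched below; the endpoint URL, the use of the SPARQLWrapper client and the DBpedia prefix are assumptions, not part of the EXR prototype description.

from SPARQLWrapper import SPARQLWrapper, JSON

def find_synonyms(endpoint, level_member_iri, external_prefix):
    # Return the owl:sameAs targets of a level member that point into a given
    # external source (e.g., DBpedia); an empty result means a missing synonym.
    query = """
        PREFIX owl: <http://www.w3.org/2002/07/owl#>
        SELECT ?syn WHERE {
            <%s> owl:sameAs ?syn .
            FILTER(STRSTARTS(STR(?syn), "%s"))
        }""" % (level_member_iri, external_prefix)
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    results = client.query().convert()
    return [b["syn"]["value"] for b in results["results"]["bindings"]]

# e.g., find_synonyms("http://localhost:8890/sparql",
#                     "http://worldbank.270a.info/classification/country/ES",
#                     "http://dbpedia.org/")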

Challenge 2.2: How to handle missing properties from the internal dataset to external datasets?

Since we cannot be sure that every concept has properties pointing from the internal to the external datasets, i.e., synonyms in the external dataset, we can have a subset of level instances that do not have synonyms. If so, in order to guarantee completeness, we need to introduce artificial level members like <synonymicIRI_%UniqueID%>. For instance, there is a missing link to DBpedia for Ireland in the exemplary dataset, as presented in Figure 4.5.

Challenge 2.3: How do the artificial synonyms affect the detection of properties for the hierarchy construction?

The artificial synonyms that we generate do not have properties like the synonyms that already exist in the dataset. Thus, this directly affects the discovery of parent level members as follows. During the process of looking for new candidates for the set of synonym IRIs, the HCP, i.e.,

¹ https://www.w3.org/TR/owl-ref/#sameAs-def


Figure 4.5: Example of adding a missing synonym

the hierarchy construction percentage (Section 3.1), should be adjusted according to the synonym identification percentage (SIP). SIP is the minimal percentage of base level members that must have synonyms in the desired external dataset. For example, if the desired HCP is 70% and the percentage of missing synonyms is high, say SIP is 50%, we are not able to meet the HCP level, because in the worst case we already have 50% of the properties missing and cannot reach the 70% HCP threshold. So we need the SIP to be greater than or equal to the HCP, while the algorithm for the discovery of new candidate properties should now take into account an adjusted HCP′ value that considers SIP. The adjusted HCP′ can be calculated as HCP′ = (HCP/SIP) × 100.

We assume that a user can add multiple external data sources. It is important to keep track of synonyms per source in order to build hierarchies properly and according to their sources. For example, for the same set of level members in World Bank Linked Data we might have different sets of synonyms in DBpedia and EuroStat. So the SIP of each external data source per level may vary. The illustration is presented in Figure 4.6.

Figure 4.6: Example of hierarchy construction for several external data sources

4.3 AM and D Repositories

AMR and DR are triplestores, i.e., databases for capturing, storing and querying data in the form of triples. A triple is a data entity composed of three elements, subject-predicate-object,


like “Region is a Level” or “Spain belongs to Europe”. AMR stores the AM graphs, DR stores the data triples (statistical observations). DR uses the RDF vocabulary QB, which explains how the data are stored, but for AM we need to propose our own model.

As an answer to the questions of the third challenge, we propose an RDF metadata model that is created by instantiating the SM4AM metamodel. As mentioned earlier, SM4AM covers the metadata artifacts that are needed for user assistance functionalities and is created as an RDF formalization of AM artifacts. For the sake of query recommendation, we need to instantiate the metamodel with a model and to analyze the analytical artifacts and possible operations in order to identify the artifacts from the SM4AM framework to be used for the recommender system. In Figure 5.1 (see Section 5.3) we present the model of User and Query metadata. Next, we explain the details of our model.

User Metadata. For recommendation needs, the model of User metadata can be extended or changed depending on the application domain. In our model we use a basic set of user characteristics: Age Group, Experience Level, Position and Country.

Query Metadata. Every query can be decomposed into a set of operations that describe this query. Since we are working with queries that have OLAP semantics, we decompose a query into some descriptive features and basic OLAP operations. A query in our model is identified with a date-time timestamp (when this query was posed), a Data Structure Definition link (which schema was used) and descriptive attributes, like which data and schema graphs are used. The basic set of operations that have OLAP semantics and, at the same time, allow to fully reconstruct the query from metadata is the following: Selection, Roll Up, Projection and Change Base.

Schema Metadata. The information about the schema is stored according to the QB4OLAPvocabulary.

4.4 Q Module

In our vision, the Querying Module needs to enable the user to explore data in a convenient manner. This means that a user who is not very experienced in writing OLAP or SPARQL queries should be able to use a web interface to explore data in a way controlled by the interface. In turn, the Querying Module will parse and process the input from the user interface, and in parallel perform two tasks:

1. Generate Query Metadata, i.e., the essential information about the query, which is stored in AMR and later used for the recommendation of new queries.

2. Generate a SPARQL query and retrieve the results from DR.

In the end, the results of the user's request are returned to him. The workflow of the Querying Module is shown in Figure 4.7.
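The exact query generation of QM is not specified in this section; the sketch below shows, under the assumption that the user selects one dimension and one measure in the interface, how such a choice could be turned into a SPARQL aggregation query over QB observations (all IRIs, the endpoint and the SPARQLWrapper client are placeholders).

from SPARQLWrapper import SPARQLWrapper, JSON

def build_cube_query(dataset_iri, dimension_prop, measure_prop, agg="SUM"):
    # one UI action (dimension + measure + aggregate) -> SPARQL over qb:Observation
    return """
        PREFIX qb: <http://purl.org/linked-data/cube#>
        SELECT ?member (%s(?value) AS ?aggValue) WHERE {
            ?obs a qb:Observation ;
                 qb:dataSet <%s> ;
                 <%s> ?member ;
                 <%s> ?value .
        } GROUP BY ?member""" % (agg, dataset_iri, dimension_prop, measure_prop)

def run_query(endpoint, query):
    client = SPARQLWrapper(endpoint)
    client.setQuery(query)
    client.setReturnFormat(JSON)
    return client.query().convert()["results"]["bindings"]

In parallel, the same UI action would be serialized as Query Metadata (task 1 above) before the results are returned to the user.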


Figure 4.7: The workflow of the Querying Module

4.5 XR Module

The main purpose of this module is to generate recommendations (queries) for the user using the metadata repository. Thus, we next explain the process of exploiting AM and then propose our recommendation approach.

To address Challenge 4 and elaborate on how to generate metadata-based query recommendations for the exploration of Statistical Linked Open Data, we need to address two specific challenges as explained next.

Challenge 4.1: How to exploit the metadata from AMR for creating a recommender system?

In classical recommendation systems there are two groups of concepts, users and items (e.g., a species in a biology-related database). Instead, in our approach, we consider users and queries. In order to recommend a query, we can try several approaches: a User-Query approach or a Query-Query approach to recommendation. In both cases the metadata must be represented as a utility matrix, giving for each user-query pair or query-query pair a value that represents the degree of possible preference or interest of that user for that query.

In order to build a utility matrix, we need to represent the user metadata and the query metadata as vectors and later on build the vector matrices.

User metadata can be expressed as a profile vector of User Characteristics from the extracted metadata. Figure 4.8 illustrates an example of a user vector for an abstract user who is 35 years old, from Spain, and works as a Senior BI analyst.

The vector consists of 4 characteristics: Experience, Age Group, Position and Country. Each element takes the value 0 or 1. The length of the vector must be fixed and the order of the elements matters, as each element refers to a value for a specific characteristic. In this example, the value 1 in the third position of the vector means the user has a senior level of experience. Each user is represented with the same vector structure.


Figure 4.8: Example of a user vector

A query vector is a profile vector of the query operations from the extracted metadata. We assume that the length of the query vectors is the same within a single dataset. For instance, we have a dataset with 3 dimensions (Dim1, Dim2, Dim3). The query Qy in Figure 4.9 below shows that the query has 1 Rollup over Dim3, does not have any selections, uses Dim1 and Dim3, has 1 projection over a single Measure1 and uses the SUM aggregate function.

Figure 4.9: Example of a query vector

Thus, the length of the vector is the same within a single dataset and recommendations are generated in the context of this dataset.
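A minimal sketch of how such fixed-layout vectors could be built is given below; the concrete characteristic domains and the exact position layout are assumptions, since only the examples of Figures 4.8 and 4.9 are given.

def one_hot(value, domain):
    # fixed-position 0/1 encoding of one characteristic
    return [1 if value == v else 0 for v in domain]

# assumed characteristic domains; their order defines the vector positions
EXPERIENCE = ["junior", "middle", "senior"]
AGE_GROUPS = ["18-25", "26-35", "36-45", "46+"]

def encode_user(user):
    return one_hot(user["experience"], EXPERIENCE) + one_hot(user["age_group"], AGE_GROUPS)
    # ...the Position and Country blocks would follow the same pattern

def encode_query(q, dims, measures, agg_funcs):
    # operation counts/flags per schema element, in a fixed order
    vec  = [q["rollups"].get(d, 0) for d in dims]                 # Roll Up per dimension
    vec += [q["selections"].get(d, 0) for d in dims]              # Selection per dimension
    vec += [1 if d in q["dims_used"] else 0 for d in dims]        # Change Base (dimensions used)
    vec += [1 if m in q["projections"] else 0 for m in measures]  # Projection per measure
    vec += [1 if f == q.get("agg") else 0 for f in agg_funcs]     # aggregate function
    return vec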

Furthermore, we introduce the semantic value between queries as a solution to the problem of balancing query similarity against the added value that the recommended query brings to the user. We propose to compare query vectors and not only consider how similar they are, but also whether the difference between the queries makes an interesting recommendation for the user, i.e., what the value of the recommendation semantics is.

For example, UserX posed a query Qx, which used 2 dimensions out of the 3 available. Meanwhile, UserY posed a query Qy, which uses all 3 dimensions and 1 rollup. So we presume that Qy is interesting for UserX, because it brings a new value: a new rollup and a new dimension. At the same time, Qx is not interesting for UserY, because it does not bring any new recommendation semantics. Considering the MD semantics of the queries in our system, we consider that the value of the new elements of a candidate query compared with the input query may vary:

– 1st priority: a new projection or a new change base operation is weighted 1.0

– 2nd priority: a new roll up operation or a new aggregate function is weighted 0.5

– 3rd priority: a new selection operation is weighted 0.3

Considering these values, the example in Figure 4.10 illustrates two queries Qx and Qy, and shows how the difference between these vectors is computed and used to calculate the value of the recommendation semantics.


Figure 4.10: Example of the semantic value computation

Challenge 4.2: Which recommendation algorithms can be applied in this context?

Now we focus on how to use the above similarities and recommendation semantics to recommend queries to the user. Figure 4.11 illustrates the workflow of this module. We consider two approaches: User and Query metadata-based recommendations and only Query metadata-based recommendations.

Figure 4.11: The workflow of Exploration Recommendation Module

User and Query metadata-based recommendations. This approach presumes that the recommendation depends on the user who performed a query. Thus, for a given user X we need to find the top-k similar users, i.e., take all user vectors, build a utility matrix of user characteristics and then compute the similarity between them. Then we need to find the top-k queries of similar users, i.e., we select all queries for the dataset, build a matrix of query features for each query, compute a similarity value between the query vectors, compute a recommendation semantics value between the query vectors and generate the recommendation taking both into consideration. The query recommendations are generated by ranking the queries, i.e., combining the values of similarity and recommendation semantics and taking the top-X queries of the ranked list. This


computation can be done statically and refreshed periodically or be generated on the fly.

Query metadata-based recommendations. We assume that recommendations are based only on similar queries, disregarding the user characteristics. We directly take the top-k similar queries, so we select all queries for a dataset, build a matrix of features, compute a similarity value between the query vectors, compute a recommendation semantics value between the query vectors and combine the values. After that we can give recommendations on queries, i.e., rank by the combined value of similarity and semantic value and take the top-X queries of the ranked list.

Query metadata-based recommendations on-the-fly. Additionally, we can presume that the user is interested in recommendations only for the last input query he/she posed. The recommender system takes only the last input query and compares it with all available queries. This can improve the evolution of the user's exploration experience.

We do not stress the usage of a particular similarity function, e.g., cosine similarity, and give the freedom to choose the best one. The comprehensive survey [19] on possible functions offers plenty of options.
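For illustration, the query metadata-based variant described above could be sketched as follows; the helper computations are inlined, and the mixing weight alpha and the vector layout are assumptions, not fixed by the thesis.

import numpy as np

def recommend(q_input, candidate_queries, weights_by_position, alpha=0.5, top_k=5):
    # q_input: vector of the user's input query; candidate_queries: dict id -> vector;
    # weights_by_position[k]: priority weight of the operation encoded at position k
    x = np.asarray(q_input, dtype=float)
    scored = []
    for q_id, vec in candidate_queries.items():
        y = np.asarray(vec, dtype=float)
        a, b = x - x.mean(), y - y.mean()
        denom = np.linalg.norm(a) * np.linalg.norm(b)
        sim = float(a @ b / denom) if denom else 0.0                             # correlation similarity
        sem = sum(w for xk, yk, w in zip(x, y, weights_by_position) if yk > xk)  # new elements, weighted
        scored.append((alpha * sim + (1 - alpha) * sem, q_id))
    return [q_id for _, q_id in sorted(scored, reverse=True)[:top_k]]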

5. Formalization

In this section, we formalize our approach for the handling of missing data and synonyms for dimension hierarchy construction. Further, we describe the proposed RDF model for AM storage and finally we formalize the recommendation algorithms used in the XR Module.

5.1 Completing missing data for dimension hierarchy construction

Primarily, we define the concepts of the MD model for OLAP that are used in the formalization of the solution. It is commonly accepted to organize OLAP data as a hypercube, which allows the analysis of data according to multiple dimensions, i.e., the axes of the hypercube, as defined in [20]. The basic concepts needed to describe our approach are: levels, level members, hierarchies and dimensions. We define them as follows. L (5.1) is the set of all levels and each level l_i is a set of level members lm^i_1 to lm^i_n, i.e., the possible values for the given level. For example, <http://.../Spain> and <http://.../Russia> are level members of the level <http://.../Country>. Levels are organized in hierarchies (5.2), where H is the set of all hierarchies h_1 to h_j and each hierarchy is a subset of L with a partial order between levels, i.e., parent-child relationships, where the first element of the sequence is the base level. We denote by "≺" the partial order between levels. The function (5.3) returns the base level of a given hierarchy. Any partially ordered pair of levels is called a hierarchy step (5.4). A dimension d_i (5.5) is a set of hierarchies h^i_1 to h^i_j with the same base level, and each dimension d_i belongs to the set of all dimensions D. The partial function (5.6) defines a rollup relation between a child level member lm^i_x and a parent level member lm^j_y. For example, it explicitly defines that <http://.../Spain> is a part of <http://.../region/ECS> (Europe and Central Asia).


$$L = \{l_1, \ldots, l_j\}, \ \text{s.t.}\ l_i = \{lm^i_1, \ldots, lm^i_n\} \ \text{and}\ 1 \le i \le j \quad (5.1)$$

$$H = \{h_1, \ldots, h_j\}, \ \text{s.t.}\ h_i = \{l^i_j, \ldots, l^i_m\}, \ \text{s.t.}\ l^i_j \prec l^i_k \prec \ldots \prec l^i_m \ \text{and}\ 1 \le i \le j \quad (5.2)$$

$$f_{base} : h^i_1 \to l^i_1 \quad (5.3)$$

$$hs : l^i_j \prec l^i_k, \ \text{s.t.}\ l^i_j, l^i_k \in h_i \quad (5.4)$$

$$D = \{d_1, \ldots, d_j\}, \ \text{s.t.}\ d_i = \{h^i_1, \ldots, h^i_j\} \ \text{and}\ f_{base}(h^i_1) = \ldots = f_{base}(h^i_j), \ 1 \le i \le j \quad (5.5)$$

$$f_{rollup} : lm^i_x \rightharpoonup lm^j_y, \ \text{s.t.}\ lm^i_x \in l^k_i, \ lm^j_y \in l^k_j \ \text{and}\ l^k_i, l^k_j \in h_k \quad (5.6)$$

As we stated previously, in the setting of Linked Open Data the completeness of MD data cannot be guaranteed. We denote by l_i_miss the subset of child level members that have no corresponding parent level member in the newly added level, l_i_miss ⊂ l_i, and the function (5.6) is undefined for the members of this set. We propose Algorithm 1 to complete the missing data: for each level member lm^i_n from the child level l_i we check if it has a parent level member, i.e., if the rollup function is defined (line 2). If it does not have a parent level member, we add it to l_i_miss (line 3). If l_i_miss is not empty (line 6), we need to create an 'other' artificial level member lm^j_other for the parent level l_j (line 7) and relate all child level members without a parent to this "other" parent level member (lines 8-10).

Algorithm 1: Complete missing level members
Input: l_i and l_j, s.t. l_i ≺ l_j, f_rollup : lm^i_x ⇸ lm^j_y (partial)
Output: f_rollup : lm^i_x → lm^j_y
1  foreach lm^i_n ∈ l_i do
2    if f_rollup(lm^i_n) is undefined then
3      l_i_miss = l_i_miss ∪ {lm^i_n}
4    end
5  end
6  if l_i_miss is not ∅ then
7    l_j = l_j ∪ {lm^j_other}
8    foreach lm^i_n ∈ l_i_miss do
9      f_rollup(lm^i_n) → lm^j_other
10   end
11 end
12 return f_rollup : lm^i_x → lm^j_y
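A direct Python rendering of Algorithm 1 follows; the dictionary-based representation of the rollup function and the placeholder 'Other' IRI are assumptions made for illustration.

def complete_missing_level_members(child_members, parent_members, rollup,
                                   other_iri="http://example.org/level/Other"):
    # rollup: dict child-member IRI -> parent-member IRI (the partial function f_rollup)
    missing = [lm for lm in child_members if lm not in rollup]   # lines 1-5
    if missing:                                                  # line 6
        parent_members.add(other_iri)                            # line 7: l_j = l_j ∪ {lm_other}
        for lm in missing:                                       # lines 8-10
            rollup[lm] = other_iri
    return rollup                                                # line 12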

5.2 Handling missing synonyms in external data sources

We consider that there are IRIs in the internal (i.e., local) dataset InDS that point to the same entities in other sources. This enables the enrichment of the cube schema from external


data sources ExDS_n in the following way. We define the set of synonyms (5.7) as the IRIs in an external dataset that have the same semantics as the IRIs in the internal dataset l_i. Thus, the function f_syn (5.8) maps level members to their synonyms. We use existing properties like skos:exactMatch¹ to identify synonyms. If such built-in properties of different ontologies are not present in the dataset, we may use different approaches to find synonyms, like ontology matching, i.e., the automatic discovery of mappings between related concepts, or entity resolution techniques (see [21] for more details).

$$s_i = \{syn^i_1, \ldots, syn^i_n\} \quad (5.7)$$

$$f_{syn} : lm^i_n \rightharpoonup syn^i_n, \ \text{where}\ lm^i_n \in l_i \ \text{and}\ syn^i_n \in s_i \quad (5.8)$$

Algorithm 2 is proposed for handling the case when some synonyms are missing. If f_syn does not return a synonym (line 2), we generate a new unique IRI and add it to the set of synonyms s_i (line 3). Then we map the level members without synonyms to the new synonyms.

Algorithm 2: Complete missing synonyms
Input: l_i, s_i, f_syn : lm^i_n ⇸ syn^i_n (partial)
Output: f_syn : lm^i_n → syn^i_n
1 foreach lm^i_n ∈ l_i do
2   if f_syn(lm^i_n) is undefined then
3     s_i = s_i ∪ {syn^i_n_newIRI}
4     f_syn(lm^i_n) → syn^i_n_newIRI
5   end
6 end
7 return f_syn : lm^i_n → syn^i_n
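Analogously, a minimal Python rendering of Algorithm 2, with uuid4-based IRIs as one possible way of generating the unique artificial synonyms:

from uuid import uuid4

def complete_missing_synonyms(level_members, synonyms, syn_map,
                              base="http://example.org/synonym/"):
    # syn_map: dict level-member IRI -> synonym IRI (the partial function f_syn)
    for lm in level_members:                 # line 1
        if lm not in syn_map:                # line 2
            new_iri = base + str(uuid4())    # artificial, unique synonym IRI
            synonyms.add(new_iri)            # line 3
            syn_map[lm] = new_iri            # line 4
    return syn_map                           # line 7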

After the execution of Algorithm 2, we start the process of looking for new level candidates in the external data source ExDS_n, taking the set S_n instead of L_n, and run Algorithm 1.

The artificial synonyms (that we generated) do not have properties like the discovered synonyms. Thus, the set of missing rollup properties is not empty and it includes at least all the artificial synonyms and possibly some discovered synonyms without a property to roll up on. This directly affects the detection of new candidate levels, i.e., if the number of missing synonyms is greater than the HCP parameter permits, no new candidates can be found. Thus, this affects the value of HCP, which we introduced earlier in Section 3.1. We introduce the synonym identification percentage (SIP), i.e., the percentage of level members that must have synonyms in the external dataset, as a new parameter and recalculate the HCP parameter as HCP′ (5.9).

$$HCP' = \frac{HCP}{SIP} \times 100, \ \text{where}\ HCP \le SIP \quad (5.9)$$
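For concreteness, a small worked instance of (5.9), assuming a desired HCP of 70% and a SIP of 80% (i.e., 80% of the base level members have synonyms):

$$HCP' = \frac{70}{80} \times 100 = 87.5\%$$

which can be read as: 87.5% of the level members that do have synonyms must satisfy the cardinality constraint, so that the effective coverage over all base level members still reaches the original 70% threshold.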

¹ https://www.w3.org/2009/08/skos-reference/skos.html#exactMatch


5.3 AM Model

The metamodel defined in [6] is a formalization of the AM artifacts needed for user assistance and we use it to define our metadata model. We are interested in the classes of the metamodel related to the following artifacts: a user, a query and a schema. The metamodel level of abstraction requires an instantiation into a model, and our model is presented in Figure 5.1. We show the process of instantiation for a user and a query in AMR.

Figure 5.1: The RDF model of metadata needed for the user assistance

From User Elements to User. In the metamodel, sm4am:User is the main class representing the user as an entity, described with the user characteristics sm4am:UserCharacteristic. This is reflected in our model as follows: mdl:User is an instance of sm4am:User. The set of user characteristics


consists of mdl:Position, mdl:AgeGroup, mdl:ExperienceLevel and mdl:Country, which are instances of sm4am:UserCharacteristic. These characteristics stand for a job position, an age group, a professional experience level and a country of origin, respectively.

From User Action List elements to Query. The class sm4am:UserActions reflects any user action and is organized as a list of elements, i.e., sm4am:UAList or user action list. Our mdl:Query is an instance of sm4am:UAList. The mdl:Query consists of a set of mdl:MDOperations. As MD operations we take the following subset of operations (see [22] for the complete set): mdl:RollUp, mdl:Selection, mdl:ChangeBase and mdl:Projection. These operations are instances of sm4am:ManipulationAction, which captures the actions for data handling, with the following semantics.

– mdl:RollUp stands for an operation of grouping data based on an aggregation hierarchy. The minimal metadata information about this operation is: which dimension, i.e., qb:DimensionProperty, and hierarchy, i.e., qb4o:Hierarchy, are used in a given query, and from which level to which level the granularity was changed, i.e., a set of qb4o:LevelProperty and their order (xsd:int) in the hierarchy.

– mdl:Selection is an operation that defines the selected subset of level members. The minimal metadata information about this operation is: which dimension, i.e., qb:DimensionProperty, and hierarchy, i.e., qb4o:Hierarchy, are used in a given query, on which level the data were analyzed, i.e., qb4o:LevelProperty and its order (xsd:int) in the hierarchy, and which level members, i.e., qb4o:LevelMember, participate in the selection.

– mdl:ChangeBase is an operation which denotes the set of dimensions, i.e., qb:DimensionProperty, and hierarchies, i.e., qb4o:Hierarchy, that are used in the query.

– mdl:Projection is an operation that defines a selected subset of measures, i.e., qb:MeasureProperty. Additionally, as a part of this operation, we store which aggregate function, i.e., qb4o:AggregateFunction, is used in the query.

An example of instantiation from the model level of abstraction is presented in Appendix B.
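As a complement to Appendix B, a minimal rdflib sketch of such an instantiation is shown below; the mdl:/sm4am: namespace IRIs and the linking predicates (hasOperation, performed, etc.) are placeholders, since Figure 5.1 fixes only the class names used here.

from rdflib import Graph, Namespace, Literal, RDF, XSD

MDL   = Namespace("http://example.org/mdl#")      # placeholder namespace
SM4AM = Namespace("http://example.org/sm4am#")    # placeholder namespace
EX    = Namespace("http://example.org/instance#")

g = Graph()
g.add((EX.user1, RDF.type, MDL.User))
g.add((EX.user1, MDL.experienceLevel, Literal("senior")))            # assumed predicate
g.add((EX.query1, RDF.type, MDL.Query))
g.add((EX.query1, MDL.timestamp,
       Literal("2016-07-01T10:00:00", datatype=XSD.dateTime)))       # assumed predicate
g.add((EX.query1, MDL.hasOperation, EX.rollup1))                     # assumed predicate
g.add((EX.rollup1, RDF.type, MDL.RollUp))
g.add((EX.user1, MDL.performed, EX.query1))                          # assumed predicate

print(g.serialize(format="turtle"))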

5.4 A metadata-based query recommendation approach for the exploration of Statistical Linked Open Data

The recommendation problem that we address can be defined as follows. Take U as the set of users and Q as the set of queries that have been performed in our system. We consider queries over the same dataset. The utility function (5.10) measures the usefulness of a query q ∈ Q for a user u ∈ U, where R is a totally ordered set of recommended queries.

$$f_{utility} : U \times Q \to R \quad (5.10)$$

Then, the recommendation problem (5.11) is to find for each user u the query q_max,u ∈ Q that maximizes the utility function f_utility:


$$\forall u \in U, \quad q_{max,u} = \operatorname{argmax}_{q \in Q} f_{utility}(u, q) \quad (5.11)$$

A classical recommender system uses three essential elements: users, items (e.g., movies), and ratings (i.e., the preference of a user for an item). Usually this information is represented all together by means of a user-item ratings matrix. In our case, we do not use explicit ratings, and our items are queries. Thus, we focus on the metadata about users and queries, which we can exploit.

The user vector $\vec{U}_i$ denotes a vector (5.12) of user characteristic values. Semantically, the characteristic values of $\vec{U}_i$ are grouped into characteristics c_i with values cv_1, ..., cv_n. The positions of the vector values for each characteristic are predefined in every user vector.

$$\vec{U}_i = ((c_k), (c_l), \ldots, (c_n)), \ \text{where}\ c_i = (cv_1, \ldots, cv_n),\ c_i\ \text{is a characteristic and}\ cv_i\ \text{is a value} \quad (5.12)$$

The query vector $\vec{Q}_i$ denotes a vector (5.13) of operation values. Our model consists of four possible operations and the aggregate functions. We keep track of whether each possible operation (Rollup, etc.) is performed over each dimension or measure. That means that we build a vector that represents which operation was performed over which exact element of the MD schema, i.e., dimension or measure. Thus, the operation values ov_1, ..., ov_n of $\vec{Q}_i$ are grouped into operations o_i. The positions of the vector values for each operation are predefined in every query vector.

$$\vec{Q}_i = ((o_k), (o_l), \ldots, (o_n)), \ \text{where}\ o_i = (ov_1, \ldots, ov_n),\ o_i\ \text{is an operation and}\ ov_i\ \text{is a value} \quad (5.13)$$

Furthermore, we introduce the value of recommendation semantics, which is computed as presented in Algorithm 3. Each operation o_k has a priority w_k, which is the weight given for each new element of that operation in the vector. For example, we assign weight = 1 to a new projection or a new change base. For two queries Q_x = (x_1, ..., x_i) and Q_y = (y_1, ..., y_i) we compute the value of recommendation semantics semValue of Q_y with respect to Q_x. We compare the elements of both vectors that have the same position k. If the value y_k is greater than the value x_k, then Q_y has a new element (line 3). We identify which operation this position corresponds to with the function getOperationByPosition (line 4), look up the priority of this operation (line 5), and add its weight to the running total semValue.

We generate three types of recommendations:

1. User and Query metadata-based recommendations. This type is based on the following idea: users whose profiles are similar potentially have mutually useful queries.

2. Query metadata-based recommendations. This type is based on the following idea: queries can be potentially useful for a given user regardless of the user's profile.

3. Query metadata-based recommendations on-the-fly. This type is based on the following idea: users can be interested in recommendations related to the last input query.



Algorithm 3: Compute the semantic value between two vectors

Input: Q_x = (x_1, ..., x_i) and Q_y = (y_1, ..., y_i), with i the length of the vectors
Output: semValue

1  semValue = 0;
2  for k = 0; k < i; k++ do
3      if x_k < y_k then
4          O = getOperationByPosition(k)
5          w_k = getPriority(O)
6          semValue = semValue + w_k
7      end
8  end
9  return semValue
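A direct Java transcription of Algorithm 3 could look as follows; the position-to-operation and operation-to-priority lookups are kept abstract as maps, since the corresponding structures of the prototype are not detailed here.

import java.util.Map;

// A minimal sketch of Algorithm 3: the semantic value of query vector qy with respect
// to qx is the summed priority of every operation for which qy adds a new element
// (a position where qy has a value and qx does not).
public class SemanticValue {

    public static double computeSemanticValue(int[] qx, int[] qy,
                                               Map<Integer, String> operationByPosition,
                                               Map<String, Double> priorityByOperation) {
        double semValue = 0.0;
        for (int k = 0; k < qx.length; k++) {
            if (qx[k] < qy[k]) {                                // qy introduces a new element
                String operation = operationByPosition.get(k);  // getOperationByPosition(k)
                double w = priorityByOperation.get(operation);  // getPriority(O)
                semValue += w;
            }
        }
        return semValue;
    }
}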

User and Query metadata-based recommendations. We first find similar users and then find similar queries that belong to the discovered similar users. We divide the process into two steps.

Step 1: The problem of finding similar users can be formalized as follows. We assume the existence of a similarity function that measures how close two user vectors u_m, u_n ∈ U are to each other by comparing their characteristics, denoted as f_sim : U × C → S_usim, where S_usim is a totally ordered set of similarity values between users. The recommendation problem is then to find for each user u the user u_{max,u} ∈ U that maximizes the similarity function f_sim. The similarity function can be any classical similarity function between two vectors, e.g., cosine similarity or correlation, and a top-N approach can then be applied. We use the correlation function (5.14).

\[ CorrSim[u_m, u_n] = \frac{(u_m - Mean[u_m]) \cdot (u_n - Mean[u_n])}{Norm[u_m - Mean[u_m]] \, Norm[u_n - Mean[u_n]]} \tag{5.14} \]
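Formula (5.14) is the correlation of the two mean-centered vectors (a centered cosine similarity). A minimal Java sketch, assuming two vectors of equal length, is:

// Correlation between two equally long vectors, as in (5.14): the dot product of the
// mean-centered vectors divided by the product of their norms.
public class CorrelationSimilarity {

    public static double corrSim(double[] um, double[] un) {
        double meanM = mean(um), meanN = mean(un);
        double dot = 0.0, normM = 0.0, normN = 0.0;
        for (int i = 0; i < um.length; i++) {
            double a = um[i] - meanM;
            double b = un[i] - meanN;
            dot += a * b;
            normM += a * a;
            normN += b * b;
        }
        if (normM == 0.0 || normN == 0.0) {
            return 0.0;  // no variance in one of the vectors; treat as not correlated
        }
        return dot / (Math.sqrt(normM) * Math.sqrt(normN));
    }

    private static double mean(double[] v) {
        double sum = 0.0;
        for (double x : v) sum += x;
        return sum / v.length;
    }
}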

Step 2: For a given user we then take the set of queries Q_filtered that come from similar users. Thus, Q_filtered ⊂ Q, and we need to rank the queries it contains. We represent a query as a vector in m-dimensional space and compute a semantic value between two queries using Algorithm 3.

We use a similarity function that measures how similar two queries q_m, q_n ∈ Q_filtered are by their operations O, denoted as f_query_sim : Q_filtered × O → S_qsim, where S_qsim is a totally ordered set of similarity values between queries. We use correlation (5.14) as the similarity function.

Furthermore, we use a semantic utility function that measures how useful two queries q_m, q_n ∈ Q_filtered are to each other, denoted as f_computeSemanticValue : Q_filtered × Q_filtered → S_semval, where S_semval is a totally ordered set of semantic values.

In order to combine both sets into an ordered result set of query recommendations, we suggest a weighted linear combination of the values S_qsim and S_semval. The query utility function can then be defined as follows: for queries q_m, q_n ∈ Q_filtered,

f_query_utility(q_m, q_n) = w_i × f_query_sim(q_m, q_n) + w_j × f_computeSemanticValue(q_m, q_n),

where w_i and w_j equal 0.5.
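Putting the pieces together, the ranking of a candidate query set against a reference query could be sketched as below. The sketch reuses the correlation and semantic-value sketches above; the QueryMetadata interface is a hypothetical placeholder for access to the Analytical Metadata repository, not the prototype's actual API.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// A rough sketch of ranking candidate queries (e.g., those posed by similar users) against
// a reference query by a 0.5/0.5 linear combination of the correlation similarity (5.14)
// and the semantic value (Algorithm 3), followed by top-k selection.
public class QueryRanker {

    public static class ScoredQuery {
        public final String queryId;
        public final double score;
        public ScoredQuery(String queryId, double score) {
            this.queryId = queryId;
            this.score = score;
        }
    }

    // Hypothetical accessor over the Analytical Metadata repository.
    public interface QueryMetadata {
        double[] vectorOf(String queryId);             // query vector as in (5.13)
        double semanticValue(String from, String to);  // Algorithm 3 over stored vectors
    }

    public static List<ScoredQuery> rank(String referenceQueryId,
                                         List<String> candidateQueryIds,
                                         QueryMetadata metadata,
                                         int topK) {
        double[] reference = metadata.vectorOf(referenceQueryId);
        List<ScoredQuery> scored = new ArrayList<>();
        for (String candidateId : candidateQueryIds) {
            double sim = CorrelationSimilarity.corrSim(reference, metadata.vectorOf(candidateId));
            double sem = metadata.semanticValue(referenceQueryId, candidateId);
            scored.add(new ScoredQuery(candidateId, 0.5 * sim + 0.5 * sem));
        }
        scored.sort(Comparator.comparingDouble((ScoredQuery s) -> s.score).reversed());
        return scored.subList(0, Math.min(topK, scored.size()));
    }
}

The same ranking routine could serve the other two recommendation types by changing only the candidate set and the reference query, e.g., rank(lastQueryId, allQueryIds, metadata, 5) for the on-the-fly recommendations.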

Query metadata-based recommendations. We represent a query profile as a vector in m-dimensional space. We then use the similarity function f_query_sim : Q × O → S_qsim, where S_qsim is a totally ordered set of similarity values between queries, and the semantic utility function over q_m, q_n ∈ Q, f_computeSemanticValue : Q × Q → S_semval, where S_semval is a totally ordered set of semantic values between queries. In order to combine both, we suggest a linear combination of the values S_qsim and S_semval with weight 0.5. In general, the query utility function is defined as follows: for queries q_m, q_n ∈ Q, f_query_utility = w_i × f_query_sim(q_m, q_n) + w_j × f_computeSemanticValue(q_m, q_n).

Query metadata-based recommendations on-the-fly. We compare the entire set of queries Q with the single last input query q_last. We again represent a query profile as a vector in m-dimensional space. We then use the similarity function f_query_sim : Q × O → S_qsim, where S_qsim is a totally ordered set of similarity values between queries, and the semantic utility function over q_last, q_n ∈ Q, f_computeSemanticValue : {q_last} × Q → S_semval, where S_semval is a totally ordered set of semantic values between queries. In order to combine both, we suggest a linear combination of the values S_qsim and S_semval with weight 0.5. In general, the query utility function is defined as follows: for queries q_last, q_n ∈ Q, f_query_utility = w_i × f_query_sim(q_last, q_n) + w_j × f_computeSemanticValue(q_last, q_n).

The result of any recommendation option is a sorted list of queries, and we propose to apply the top-k approach to each of them.

6. Implementation and Evaluation

In this section we explain the implementation details and present the evaluation results. We first evaluate the proposed solution for completing the missing level members and linking external dimensional data. Then we evaluate the recommender system proposed above.

6.1 Implementation

We used the following technology stack for the prototype1. The frontend of the Java EE application is built with the web client technologies JSP, CSS, and HTML. We used the jQuery library to handle events for the interaction with users, such as the enrichment actions, the communication with the user on the querying page, and the refreshing of the results. On top of HTML and CSS we deployed the Bootstrap framework.

For the backend, the server used is Apache Tomcat v7. On top of it we used the Java Servlet technology with Java 8. The Jena 3.1.0 library is responsible for creating and reading RDF graphs and provides the query engine that supports SPARQL query execution. The Analytical Metadata repository and the Data store are hosted by Virtuoso 7.2.0 Open-Source Edition.
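For instance, the recommender reads stored query metadata from the repository through SPARQL issued via Jena. The sketch below shows the kind of call involved; the endpoint URL and the model namespace are illustrative assumptions rather than the prototype's exact configuration.

import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

// A minimal sketch of reading stored query metadata from the Analytical Metadata
// repository over SPARQL (e.g., a Virtuoso endpoint).
public class MetadataRepositoryClient {

    static final String ENDPOINT = "http://localhost:8890/sparql";  // hypothetical endpoint URL

    public static void listQueriesOfUser(String userId) {
        String sparql =
                "PREFIX model: <http://example.org/model#> " +               // illustrative namespace
                "SELECT ?query WHERE { ?query a model:Query ; model:createdBy <" + userId + "> . }";
        QueryExecution exec = QueryExecutionFactory.sparqlService(ENDPOINT, sparql);
        try {
            ResultSet results = exec.execSelect();
            while (results.hasNext()) {
                QuerySolution row = results.nextSolution();
                System.out.println(row.getResource("query").getURI());
            }
        } finally {
            exec.close();
        }
    }
}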

1 The prototype can be accessed via http://quarry.essi.upc.edu:8080/LODex_web/



6.2 Evaluation of completing the missing level members and linking external dimensional data

We need to evaluate two extensions: the new functionality that completes missing level members and the functionality that allows linking external dimensional data. For the evaluation we choose the dataset International tourism, number of arrivals2 from The World Bank Open Data3 (denoted as the Tourism dataset).

Missing parent level members. Our proposed solution for handling missing values can be evaluated in the following manner. We run the enrichment process for the Tourism dataset nine times, each time lowering the parameter HCP by 10%, and we save the obtained schemas in the AMR. In every run we can discover and add new properties (i.e., new candidate parent level members) for the level “refArea”, which represents countries. In particular, we discovered the properties “region” and “income-level” (when HCP = 100%), “lending-type” (when HCP = 80%), and “admin-region” (when HCP = 50%). This dataset has 233 countries (according to the results of the query from Appendix A.2), and in the results we denote this level as the child level. For the evaluation we need to compare the completeness of the MD data before and after the usage of the EXR tool. First we count the number of level properties in the dataset before the usage of the EXR tool with the help of the SPARQL query from Appendix A.1, run against the data source4. Afterwards, we run the same query for all new levels in our AMR and analyze the changes. In Table 6.1 we present the results of the evaluation of the hierarchy construction.

Table 6.1: Evaluation of handling the missing level members

HCP  | Level        | Child level members | Initial number of properties | Added properties | Parent level members | Added parent level members | Initial completeness
100% | region       | 233 | 233 | 0   | 8 | 0 | 100%
100% | income-level | 233 | 233 | 0   | 6 | 0 | 100%
80%  | lending-type | 233 | 202 | 31  | 4 | 1 | 87%
50%  | admin-region | 233 | 132 | 101 | 6 | 1 | 59%

The table shows that the EXR tool does not make any changes for high-quality MD data, such as the levels “region” and “income-level”. For the levels “lending-type” and “admin-region” the tool adds the missing properties (e.g., for “lending-type”, 202 of the 233 child level members initially had the property, i.e., 87% completeness) and adds one new level member to the parent level. Thus, it completes the MD schema by ensuring the many-to-one integrity constraints.

Missing synonyms. To verify the solution for handling missing synonyms, we follow the same approach: we compare the results obtained from the data source with a SPARQL query against the local results. We run the enrichment process for the Tourism dataset once and, during this process, we check how many synonyms for the level “refArea” are found by the tool in four external data sources

2 http://worldbank.270a.info/dataset/ST.INT.ARVL
3 http://data.worldbank.org/
4 http://worldbank.270a.info/sparql



(“dbpedia.org”, “ecb.270a.info”, “eurostat.linked-statistics.org”, “bfs.270a.info”) and compare with the initial settings (see the query presented in Appendix A.3). The results are presented in Table 6.2.

Table 6.2: Evaluation of handling the missing synonyms

Source                         | Num level members | Num synonyms | Num synonyms added | Initial completeness
dbpedia.org                    | 233 | 202 | 31 | 87%
ecb.270a.info                  | 233 | 180 | 53 | 77%
eurostat.linked-statistics.org | 233 | 202 | 31 | 87%
bfs.270a.info                  | 233 | 177 | 56 | 76%

The analysis indicates that the EXR tool successfully adds the missing synonyms and handles this challenge.

6.3 Evaluation of the metadata-based recommender system

For the evaluation of the recommender system, we first present the evaluation settings, then we discuss the system evaluation results based on automatically collected data about user actions. Moreover, we conduct a survey among the participants of the experiment and, based on their responses, we perform an empirical evaluation.

Evaluation settings. We create a possible user scenario for using the EXR tool and set a goal for the analysis, which can be found in Appendix C. The dataset used is http://fao.270a.info/dataset/RECOFI_CAPTURE. On the querying page of the EXR tool we present three types of recommendations: User and Query metadata-based recommendations, Query metadata-based recommendations, and Query metadata-based recommendations on-the-fly. We show the top-5 recommended queries in each category.

The flow of the evaluation process is as follows. First, we create 7 user profiles that could be extracted from LinkedIn5. Moreover, during the experiment, new users register for the evaluation. The set of users U consists of IT4BI6 master students and PhD students from IT4BI-DC7. In total, twelve real users participated in our evaluation.

Next, we generate a small subset of queries according to the scenario's guidelines in order to distinguish between the cases where we face the “cold start” challenge and the cases where a user has already posed some queries. Additionally, during the evaluation process, users add three or more queries, which are stored and participate in future recommendations; thus Q is growing. The total number of queries that participated in the evaluation is 70.

Furthermore, we send the evaluation guideline with the scenario included to all participants. We automatically collect the following data: the recommendations generated for each user, the used recommendations, and the rank of each used recommended query within its category. Additionally,

5 https://www.linkedin.com/, i.e., a business-oriented social networking web-service.
6 http://it4bi.univ-tours.fr/it4bi/
7 https://it4bi-dc.ulb.ac.be/



we ask for feedback in the survey and evaluate the quality of the recommended queries from the users' perspective.

System evaluation. At the end of the experiment we observe that query recommendations were used 82 times by the participants. First we check the preferences regarding recommendation types: which type of recommendations was used more (User and Query recommendations, Query recommendations, or Query recommendations on-the-fly). According to the log of used recommendations, User and Query recommendations were used most frequently, in 37.8% of the cases. Query recommendations on-the-fly were used in 32.9% of the cases and Query recommendations in 29.3% of the cases. We conclude that the first type of recommendations gained more attention from the participants.

Furthermore, we evaluate the users' interest in the ranking of the proposed recommendations, expecting the highest-ranked recommendations to be used more frequently. Indeed, the results show a direct dependency between the rank of a recommendation and the number of uses: the first-ranked queries were used in 36.6% of the cases. We interpret this user behavior as follows: the ranking order in our recommender system generally satisfies user expectations.

Figure 6.1: Percentage of used recommendations by rank

Moreover, we explore the set of queries that participated in the recommendations. It consists of 14 unique queries. We check how many times each of them was used: three queries account for 64.6% of the uses (24, 16, and 13 times), another 5 queries were used 4 times on average, amounting to 28% of the used recommendations, and the remaining queries were used only once. We additionally checked the results and content of the most used queries and found that they all have features that align with the proposed scenario and indeed can help to reach the goal of the analysis.

Empirical evaluation. We conduct an empirical evaluation based on the responses to the survey. First, we asked participants how familiar they are with Semantic Web technologies: the majority has basic knowledge (66%), while 25% of the participants are very familiar. Thus, we can consider the results representative. The tool is generally estimated as useful by all participants (41.5% as “Very useful” and 50% as “Useful”). The users gave good suggestions on how to improve the interface and we plan to address them in the future. The recommender system is generally



considered useful (58%) or of medium usefulness (25%). According to the system evaluation results, the User and Query recommendations received more attention, whereas the survey shows that participants found Query recommendations on-the-fly and User and Query recommendations equally interesting, i.e., 41.7% voted for each of these types. We believe that both categories can be accepted as interesting for users and that the type “Query recommendations” can later be discarded from the prototype. We believe the reason for the low attention to this type is that its recommendation results are the least customized, so participants intuitively found the other types more useful.

7. Conclusion and Future Work

In this thesis, we propose a recommender system to assist a user in the exploration of datasets, especially in the setting of ad-hoc analysis of Statistical Linked Open Data. First, we enable the user to complete the schema by adding missing parent level members. We believe that our approach for completing MD data and our proposal for the enrichment of Linked Open Data with external data sources can be reused in the future. We propose an RDF metadata model for user assistance and exploit the collected metadata. The feasibility of our metadata-based query recommendation approach for the exploration of Statistical Linked Open Data is shown with the prototype named EXR that we developed.

In the future we plan to provide visualization assistance. Moreover, we plan not only to recommend complete queries, but also to help users compose queries (i.e., use query parts). Finally, it would be interesting to investigate the possibility of recommending queries and schemas between datasets that have a similar structure.



A. Appendix SPARQL queries

A.1 Counting the number of a given property in a dataset

PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (count(distinct ?levelMember) as ?count)
WHERE {
  ?observation a qb:Observation .
  ?observation qb:dataSet <http://worldbank.270a.info/dataset/ST.INT.ARVL> .
  ?observation <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?levelMember .
  ?levelMember ?PROPERTY ?o
}

where ?PROPERTY is one of: <http://worldbank.270a.info/property/lending-type>, <http://worldbank.270a.info/property/region>, <http://worldbank.270a.info/property/income-level>, <http://worldbank.270a.info/property/admin-region>

A.2 Counting the number of level members in a dataset

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (count(distinct ?levelMember) as ?count)
WHERE {
  ?observation a qb:Observation .
  ?observation qb:dataSet <http://worldbank.270a.info/dataset/ST.INT.ARVL> .
  ?observation <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?levelMember .
}

A.3 Counting the number of synonyms in a dataset

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
PREFIX qb: <http://purl.org/linked-data/cube#>
SELECT (count(distinct ?outer) as ?count)
WHERE {
  ?observation a qb:Observation .
  ?observation qb:dataSet <http://worldbank.270a.info/dataset/ST.INT.ARVL> .
  ?observation <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?levelMember .
  {
    ?levelMember skos:exactMatch ?outer
    FILTER regex(str(?outer), "EXTERNAL_SOURCE")
  }
  UNION
  {
    ?levelMember owl:sameAs ?outer
    FILTER regex(str(?outer), "EXTERNAL_SOURCE")
  }
}

where EXTERNAL_SOURCE is the domain of an external source such as “dbpedia.org”, “ecb.270a.info”, “eurostat.linked-statistics.org”, or “bfs.270a.info”.



B. Example of an instance of the metadata model

We present an example of a SPARQL query followed by the metadata about this query that is stored using the metadata model. The SPARQL query:

PREFIX qb: <http://purl.org/linked-data/cube#>
PREFIX qb4o: <http://purl.org/qb4olap/cubes#>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

SELECT ?plm1 ?plm12
       (MAX(<http://www.w3.org/2001/XMLSchema#integer>(?m0)) as ?sel)

FROM <http://fao.270a.info/dataset/RECOFI_CAPTURE>
FROM <http://localhost:8890/schemas/Schema_9_20160726_071839747>

WHERE {
  ?o a qb:Observation .
  ?o <http://fao.270a.info/measure/0.1/GENERAL_CONCEPT_SCHEME/OBS_VALUE> ?m0 .
  ?o <http://fao.270a.info/dimension/1.0/CS_FISHSTAT/SPECIES> ?plm0 .
  ?plm0 qb4o:memberOf <http://fao.270a.info/dimension/1.0/CS_FISHSTAT/SPECIES> .
  ?plm0 <http://www.w3.org/2004/02/skos/core#broader> ?plm1 .
  ?plm1 qb4o:memberOf <http://dbpedia.org/ontology/order> .
  ?o <http://fao.270a.info/dimension/0.1/GENERAL_CONCEPT_SCHEME/UN_COUNTRY> ?plm11 .
  ?plm11 qb4o:memberOf <http://fao.270a.info/dimension/0.1/GENERAL_CONCEPT_SCHEME/UN_COUNTRY> .
  ?plm11 <http://www.w3.org/2004/02/skos/core#broader> ?plm12 .
  ?plm12 qb4o:memberOf <http://dbpedia.org/ontology/governmentType> .
}

GROUP BY ?plm1 ?plm12

The metadata:

# QUERY #
<Query_9_20160726_072509989> rdf:type model:Query .
<Query_9_20160726_072509989> model:forDataset <http://fao.270a.info/dataset/RECOFI_CAPTURE> .
<Query_9_20160726_072509989> model:usesSchema <http://localhost:8890/schemas/Schema_9_20160726_071839747> .
<Query_9_20160726_072509989> model:hasTimestamp <20160726_072509989> .
<Query_9_20160726_072509989> model:createdBy <9> .

#USED OPERATIONS#

#CHANGEBASE#
<ChangeBase_9_20160726_072509989> rdf:type model:ChangeBase .
<Query_9_20160726_072509989> model:hasChangeBase <ChangeBase_9_20160726_072509989> .
<ChangeBase_9_20160726_072509989> model:usesDimension <dimension3> .
<dimension3> model:hasHierarchy <ex:hierarchy3> .
<ChangeBase_9_20160726_072509989> model:usesHierarchy <ex:hierarchy3> .
<ChangeBase_9_20160726_072509989> model:usesDimension <dimension2> .
<dimension2> model:hasHierarchy <ex:hierarchy2> .
<ChangeBase_9_20160726_072509989> model:usesHierarchy <ex:hierarchy2> .
<dimension2> model:usesBaseLevel <UN_COUNTRY> .
<dimension3> model:usesBaseLevel <SPECIES> .

#PROJECTIONS#
<Projection_9_20160726_072509989_0> rdf:type model:Projection .
<Query_9_20160726_072509989> model:hasProjection <Projection_9_20160726_072509989_0> .
<Projection_9_20160726_072509989_0> model:usesMeasure <http://fao.270a.info/measure/0.1/GENERAL_CONCEPT_SCHEME/OBS_VALUE> .
<Projection_9_20160726_072509989_0> model:usesAggragateFunction <MAX> .

# ROLLUPS #
<Rollup_9_20160726_072509989_0> rdf:type model:RollUp .
<Query_9_20160726_072509989> model:hasRollUp <Rollup_9_20160726_072509989_0> .
<Rollup_9_20160726_072509989_0> model:forDimension <http://fao.270a.info/dimension/1.0/CS_FISHSTAT/SPECIES> .
<Rollup_9_20160726_072509989_0> model:usesLevel <http://fao.270a.info/dimension/1.0/CS_FISHSTAT/SPECIES> .
<http://fao.270a.info/dimension/1.0/CS_FISHSTAT/SPECIES> model:hasOrder "0"^^<http://www.w3.org/2001/XMLSchema#int> .
<Rollup_9_20160726_072509989_0> model:usesLevel <http://dbpedia.org/ontology/order> .
<http://dbpedia.org/ontology/order> model:hasOrder "1"^^<http://www.w3.org/2001/XMLSchema#int> .

<Rollup_9_20160726_072509989_1> rdf:type model:RollUp .
<Query_9_20160726_072509989> model:hasRollUp <Rollup_9_20160726_072509989_1> .
<Rollup_9_20160726_072509989_1> model:forDimension <http://fao.270a.info/dimension/0.1/GENERAL_CONCEPT_SCHEME/UN_COUNTRY> .
<Rollup_9_20160726_072509989_1> model:usesLevel <http://fao.270a.info/dimension/0.1/GENERAL_CONCEPT_SCHEME/UN_COUNTRY> .
<http://fao.270a.info/dimension/0.1/GENERAL_CONCEPT_SCHEME/UN_COUNTRY> model:hasOrder "0"^^<http://www.w3.org/2001/XMLSchema#int> .
<Rollup_9_20160726_072509989_1> model:usesLevel <http://dbpedia.org/ontology/governmentType> .
<http://dbpedia.org/ontology/governmentType> model:hasOrder "1"^^<http://www.w3.org/2001/XMLSchema#int> .



C. Evaluation scenario

The scenario was published as part of “Guidelines for an evaluation of a Metadata-based Recommender System for Statistical Linked Open Data”:

“Dataset RECOFI Capture Production presents annual statistics allocated by countries, species items, and statistical divisions of capture production in the RECOFI region (Regional Office for Near East and North Africa) for calendar years starting with 1995. The measure of this dataset is nominal catches, in tonnes, e.g., the weight of fish and seafood catch at the time of their capture. Current members of RECOFI are: Bahrain, Iran, Iraq, Kuwait, Qatar, Oman, Saudi Arabia and United Arab Emirates. Each Country Member’s yearly share of contribution is USD 5,000.

Assume that today is 2010 and the RECOFI Commission decided to increase the contribution fee; thus, it asked the members for 500,000$ in total in favor of environmental protection. However, the members of the Commission decided to share the fee proportionally to the annual nominal catch over the last 10 years (1999-2009).

You will need to analyze in which proportion to split the fee between the countries. What was the trend of fish culture production in the decade 1999-2009? Which orders or species of fishes are the most demanded? Please remember your role.”


