
Semantic Web 1 (2016) 1–25
IOS Press

Visualizing Statistical Linked Knowledge for Decision Support

Editor(s): Aba-Sah Dadzie, The Open University, UK; Emmanuel Pietriga, INRIA France & INRIA Chile
Solicited review(s): Luc Girardin, ETH Zurich, Switzerland; Bernhard Schandl, University of Vienna, Austria; Emmanuel Pietriga, INRIA France & INRIA Chile

Adrian M.P. Braşoveanu a,c,*, Marta Sabou b, Arno Scharl a,c, Alexander Hubmann-Haidvogel a,c, and Daniel Fischl a

a Department of New Media Technology, MODUL University Vienna, Am Kahlenberg 1, 1190 Vienna, Austria
E-mail: {adrian.brasoveanu, alexander.hubmann, daniel.fischl}@modul.ac.at
b Christian Doppler Laboratory for Software Engineering Integration for Flexible Automation Systems, Vienna University of Technology, Favoritenstrasse 9-11, 1040 Vienna, Austria
E-mail: [email protected]
c webLyzard technology gmbh, Puechlgasse 2/44, 1190 Vienna, Austria
E-mail: [email protected]

Abstract. In a global and interconnected economy, decision makers often need to consider information from various domains. A tourism destination manager, for example, has to correlate tourist behavior with financial and environmental indicators to allocate funds for strategic long-term investments. Statistical data underpins a broad range of such cross-domain decision tasks. A variety of statistical datasets are available as Linked Open Data, often incorporated into visual analytics solutions to support decision making. What are the principles, architectures, workflows and implementation design patterns that should be followed for building such visual cross-domain decision support systems? This article introduces a methodology to integrate and visualize cross-domain statistical data sources by applying selected RDF Data Cube (QB) principles. A visual dashboard built according to this methodology is presented and evaluated in the context of two use cases in the tourism and telecommunications domains.

Keywords: Linked Data, Information Visualization, Decision Support Systems, RDF Data Cube, Data Analytics

1. Introduction

Decision Support Systems (DSS) are typically customized for specific decisions in a given domain. In a global economy, external events such as financial crises or climate change - through observable consequences like bankruptcies or hurricanes - can render such domain-specific solutions obsolete. Building comprehensive monitoring systems into DSS tools is a potential solution, but one that can be prohibitively expensive and out of reach for smaller companies and research groups. One way to build DSS tools that leverage such cross-domain information is to analyze aggregated representations of events in the form of statistical data. Such an integration helps answer complex questions that require cross-domain data. Drawing on economic and sustainability indicators in conjunction with behavioral data from tourism research, for example, allows answering complex questions such as the following: Do financial crises affect tourist behavior? Do temperature increases in continental Europe change the annual distribution of arrivals? Can the failure of specific stocks - e.g., large tour operators or hotel stocks - predict a sector-wide crisis? Similar questions arise in other sectors as well. A telecommunications analyst, for example, might want to investigate longitudinal data to explain unusual peaks in the number of calls or text messages sent, or better understand how geopolitical trends (migration, aging population, etc.) influence call data patterns.

* Corresponding author. E-mail: [email protected].

1570-0844/16/$27.50 © 2016 – IOS Press and the authors. All rights reserved

Statistical data sources from multiple domains are increasingly available as linked (open) data following the publication of the RDF Data Cube Vocabulary (QB).1 Visualization seems to be the de facto method for making sense of Linked Data (LD), and various approaches have been developed for navigating the data deluge [11], but less effort has been dedicated to integrating visualizations into analytical platforms for answering complex questions similar to the ones discussed earlier. Fox and Hendler [16] argue that integration and reusability are the most important aspects on which visualization designers need to focus for successfully controlling the current data deluge through visualizations. The success of tools and standards like LOD, QB or D3 [4] has greatly simplified data publishing and visualization, but the problems described by Fox and Hendler still persist due to a combination of factors: i) the standards are often taken as mere guidelines and there is a lot of improvisation when publishing datasets; ii) the integration of statistical Linked Data (SLD) is a complex field [56]; iii) projects are mostly focused on creating individual visualizations (line charts, bar charts, etc.) rather than frameworks for integrating multiple types of visualizations; iv) in the context of Big Data, scalability has to be taken into account right from the start when designing new systems.

This article describes the methodology that was used to close some of these gaps and to integrate cross-domain statistical data sources into a visual dashboard that supports a multiple coordinated view approach. A first prototype considered specific project requirements in conjunction with recommendations from Dadzie and Rowe [11] and QB (concepts such as observations and slicing). We then extracted a set of principles and workflows for integrating and visualizing heterogeneous data sources that we later applied to various use cases (e.g., tourism, telecommunications). We iteratively continued to improve and deploy new versions of this technology. The current article focuses on the first two generations.

We present use cases from the tourism and telecommunications domains based on cross-domain datasets from multiple sources including Eurostat2 and the World Bank3, and discuss specific types of tasks that the visual dashboard helps address. The article's main contributions include:

– a set of workflows and visualization principles usable for visualizing datasets in the RDF Data Cube vocabulary (Section 4);

– a collection of visualization scenarios that are useful for multiple use cases (Section 5);

– visual dashboards developed following these principles and scenarios (Section 6 and Section 8).

The remainder of this article is structured as follows: Section 2 offers an introduction to the QB vocabulary and formulates the problem statement; Section 3 describes the current state of the art in statistical LD visualization; Section 4 describes the principles, architecture and workflows we propose to visualize statistical LD using different visual metaphors; Section 5 describes use cases from the tourism and telecommunication domains and how they guided the development of visual tools; Section 6 describes the design, implementation, and usage of a tourism dashboard in line with the use case requirements, which is evaluated in Section 7. The telecommunications dashboard presented in Section 8 builds on recommendations derived from this evaluation. Section 9 summarizes the lessons we learned and outlines future research avenues.

2. Background and Problem Statement

2.1. Background - RDF Data Cube Vocabulary

The RDF Data Cube Vocabulary is a W3C Recommendation for publishing statistical data, supported by industry and academia as evidenced by the increasing number of datasets published using this vocabulary, e.g., the PlanetData datasets4 or the W3C use cases.5 A further advantage of QB is that it is based on a cube model that is compatible with the Statistical Data and Metadata Exchange (SDMX) standard and designed to be general, so that it enables the publishing of different types of multidimensional datasets.

1 www.w3.org/TR/vocab-data-cube
2 eurostat.linked-statistics.org
3 worldbank.270a.info/about.html
4 wiki.planet-data.eu/web/datasets
5 www.w3.org/TR/vocab-data-cube-use-cases

The basic building blocks of the cube model are measures, dimensions and attributes, collectively referred to as components, with the following roles:


– Measure components describe the things or phenomena that are observed or measured (e.g., height, weight, arrivals, bed nights, capacity, number of mobile phone calls).

– Dimension components specify the variables that are important when defining an individual observation for a measurement (e.g., time and space).

– Attributes help interpret the measured values by specifying the units of measurement, but also additional metadata such as the status of the observation (e.g., observed, estimated).

These basic building blocks are then combined into more complex structures such as slices and datasets (a short illustrative sketch follows this list):

– Observations are the atomic data units that represent a concrete measured value for a set of concrete dimension values. Observations correspond to the values from statistical databases. Sometimes observations can also contain multiple measurements related to the same dimensions.

– Slices are groups of observations with several dimensions fixed (e.g., the arrivals of German tourists in Budapest between 2007 and 2013 have only one variable dimension: time).

– Datasets are collections of observations with the same dimensions and measures. Datasets that contain observations grouped into slices across dimensions constitute a cube.

– A Data Structure Definition (DSD) describes a dataset and contains all the required namespaces and components.

– Code lists or dictionaries describe the lists of entities that are repeated through all datasets of a publisher (e.g., countries, units of measurement). They can also be used to describe complex hierarchies (geopolitical, ISO classifications, etc.), and are often described using the SKOS vocabulary.
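To make these constructs concrete, the following minimal sketch parses a single QB observation with rdflib (Python). All ex: URIs, property names and values are hypothetical stand-ins rather than excerpts from an actual dataset:

    from rdflib import Graph

    # A hypothetical QB observation: bed nights of German tourists
    # in Prague for January 2010. All ex: identifiers are illustrative.
    TTL = """
    @prefix qb:     <http://purl.org/linked-data/cube#> .
    @prefix sdmx-a: <http://purl.org/linked-data/sdmx/2009/attribute#> .
    @prefix ex:     <http://example.org/tourism#> .

    ex:obs1 a qb:Observation ;
        qb:dataSet     ex:bedNightsDataset ;  # dataset membership
        ex:market      ex:DE ;                # dimension: market
        ex:destination ex:Prague ;            # dimension: destination
        ex:refPeriod   "2010-01" ;            # dimension: time
        ex:bedNights   123456 ;               # measure
        sdmx-a:unitMeasure ex:nights .        # attribute: unit
    """

    g = Graph().parse(data=TTL, format="turtle")
    print(len(g), "triples")  # dimensions, measure and attribute of ex:obs1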

2.2. Problem Statement

Global economies expose us to various instabilities of non-periodic flows similar to those described by Lorenz [34]. Most domains reflect aggregated patterns of human behavior (finance, telecommunications, tourism, culture, etc.), where small changes of amplitude can lead to instabilities. To design a DSS for such dynamic domains, one needs to understand financial and cultural profiles (migration patterns, financial needs, etc.). Such problems are easier to investigate through the lens of statistics. In fact, an immediate method to reduce the complexity derived from such phenomena is to use large collections of statistical data such as those provided by the World Bank or Eurostat, which are now increasingly available as LD. Such collections help understand macroscopic effects when investigating complex economic, environmental or social phenomena.

By splitting statistical data into cubes of up to three dimensions, the QB vocabulary offers a simple and flexible structure to represent such macroscopic effects. Performing ontology alignment between any two QB datasets is a problem that is usually complicated by a number of factors: lack of DSDs, failure of SPARQL endpoints, errors in the data or DSD, deviations between QB guidelines and actual implementations, etc. Simply gathering a lot of data will not suffice to understand macro trends, however, and visual methods can help reduce data complexities during the decision-making process. Building a visual DSS is the first step towards a full-fledged DSS, but the output of the visualizations (e.g., correlations, patterns) does not necessarily need to be translated into new knowledge (e.g., by creating new annotations or datasets with these correlations). Even without automatic interpretation of the results, the problem remains complicated, as most visualizations are built for simple use cases. What methodology needs to be followed to display multiple coordinated visualizations built from a single query? What are good methods to show both the numeric results of analytic processes and the corresponding visualizations in a unified view?

Building visualizations is a time-consuming process, and the desire to reuse them poses a number of challenges. What are the best design patterns for implementing reusable visualizations? Do the interaction patterns of existing visualizations need to be adapted for new datasets? These questions lead to the main research problem investigated in this article: What are the principles, architectures, workflows and implementation design patterns needed to build a visual DSS that exploits cross-domain information?

3. Related Work

A survey of Semantic DSS [2] contains an overview of the systems and a set of interviews with various research and industry partners. It identifies two main challenges for future Semantic DSS: a) the lack of flexible integration of information (most systems do not integrate text, data and visualization well) and b) numerous issues related to data analysis (cleaning, querying, aggregation, abstraction, etc.) and scalability. The Semantic Web (and Linked Data, by extension) and DSS can be viewed as application areas of Artificial Intelligence (AI), and in many cases the result of research in such an applied field is a system. However, effective AI systems need to use a variety of technologies to deliver their best results. In the Semantic Web, for example, there is an increasing wave of hybridization with Natural Language Processing (NLP), Machine Learning (ML), and Information Retrieval (IR), with even the most popular systems, such as Watson, subscribing to this trend [27,53]. Another possibility is to use Human-Computer Interaction (HCI) techniques such as visualization to navigate the data flow. The remainder of this section is focused on the visualization of Linked Data as a major means of sense-making.

3.1. Linked Data Visualization

We can distinguish two large domains of LD visualization: ontology visualization (TBox visualizations) and instance data visualization (ABox visualizations). Systems that offer both are also possible. In both cases, the main goal of the visualizations is to help understand the relations between the various ontology classes or instances. We also discuss several Linked Data Visualization Models.

Ontology Visualization. The evolution of ontology visualization can be traced through several surveys about the early days of Semantic Web visualization [18,30], the role of ontologies in building user interfaces [38], and OWL visualizations [14]. The paper by Dudas [14] also examines the types of tasks needed in an ontology visualization system and how these tasks are supported in current systems, and showcases an Ontology Visualization Recommender tool. RDF-based declarative languages are now being used to visualize both ontologies and data. The RDFS/OWL Visualization Language (RVL) [39] was designed to create simple mappings between RDFS/OWL and D3.js [4] visualizations. It is a declarative language that allows creating visualizations from both the TBox and the ABox of a dataset. Another development is the visual language VOWL 2 [33], geared towards helping users visualize ontologies.

Instance Data Visualization. The early survey of LD visualization techniques by Dadzie and Rowe [11] predates the release of the RDF Data Cube Vocabulary. It contains the first coherent set of principles for visualizing LD, and divides existing visualization tools into two groups: text-based browsers and LD browsers that offer visualization options. A later survey of LD exploration systems [35] starts with a list of search task characteristics and links them to features already implemented in LD browsers. The survey identifies three types of LD exploration systems: LD browsers, LD recommenders, and LD-based exploratory search systems. It offers a summary of best-practice systems including their IR and HCI features. As in the case of ontology visualization, declarative languages such as RVL [39] can also be used to create LD instance data visualizations.

Linked Data Visualization Models. A number of formal models describe LD visualization workflows, some of them also being associated with prototype implementations. De Vocht's [50] Visual Exploration Workflow is an executable model for visualizing graphs that contains four types of views (overview groups, narrowing views, coordinated views and broadening views). Brunetti's [7] Linked Data Visualization Model (LDVM) extends Chi's data state reference model [10] and consists of a series of transformation stages built on top of RDF and non-RDF data: a) data transformation; b) visualization transformation; c) visual mapping transformation. Helmich [22] implements this model in Payola for visualizing the Czech LOD cloud. Ba-Lam Do's Linked Widgets platform [13] is a pipeline for creating mashups.

All these models and workflows resonate well with Shneiderman's Visual Information Seeking Mantra: overview first, zoom and filter, then details-on-demand [48]. Shneiderman's taxonomy actually goes beyond this mantra and contains additional tasks (relate, history and extract), as well as a list of the quantitative visualization types. A recent list of visualization types can be found in Heer's visualization zoo [20], while an extension of the task type taxonomy for interactive dynamic analysis can be found in Heer and Shneiderman [21]. The updated taxonomy contains twelve types of tasks split into three groups. Data and view specification tasks (visualize, filter, sort, derive) for exploring large datasets tend to focus on the selection of visual encodings rather than the actual visualization. View manipulation tasks (select, navigate, coordinate, organize) are used for highlighting and coordinating interesting items and represent the core tasks described in the original Information Seeking Mantra. Since today's visualizations are typically related to multiple datasets or articles, the last category covers process and provenance tasks (record, annotate, share, guide).


An extensive treatment of the various reusable quantitative visualizations can be found in [54]. The book presents a grammar of graphics that allows building any 2D scientific visualization from a set of simple primitives such as points, lines, scales or shapes. Recent visualization libraries built on top of D3, such as ggD36 or Vega7, follow this philosophy. The next step in the evolution of LD visualization systems is to design pipelines and systems capable of exploiting the dataset structure and the underlying data structures. The next section focuses only on those systems able to visualize QB datasets.

3.2. Statistical Linked Data Visualization

In statistical LD visualization, instances will typically belong to or be associated with QB datasets. If the SLD sources have a more complex structure, the corresponding ontologies or DSDs might need to be visualized as well. There are three types of statistical LD visualization systems:

– tools and packages that offer basic LD visualizations (tables, charts, maps) of QB datasets, with or without aggregations;

– dashboards or complex tools that integrate several visualizations, typically using Multiple Coordinated Views (MCV);

– LD platforms that might contain visualizations.

Basic Visualizations and Aggregations. The LOD2 project developed a Statistical Workbench that reflects various phases of the statistical LOD consumption cycle, e.g., triplification via CSV2DataCube, validation through the RDF Data Cube Validation tool and visualization with CubeViz [15,44]. CubeViz [44] is an RDF Data Cube browser which can be used to query both resources and observations from QB datasets, and display the results in the form of several classic chart types. The OpenCube toolkit offers tools to manage the statistical LOD lifecycle [26] and includes components for Extract-Transform-Load (ETL) operations (Grafter framework), data conversion (TARQL adaptation) and data publishing (D2RQ extensions). It also contains tools for consuming the data: the OpenCube Browser for table-based views, an R package for statistical analysis, a widget for slicing data cubes, a catalog management component and a tool for interactive map-based visualizations. Vital [12] uses visualizations to help in the analysis and debugging process for QB dataset publication. The automated Visualization Wizard described in [36] offers support for vocabulary mappings, considers the possible combinations of dimensions and measures for RDF Data Cubes, and offers a choice between several visualization packages (D3.js [4] and Google Charts). Another paper related to the same project [24] presents the Linked Data Query Wizard, which uses a table-based approach to selecting query results from QB datasets, and classic chart types or mind maps to visualize the results. Ba-Lam Do [13] developed a visualization pipeline focused on creating Linked Widgets like lines, bars, pies, and especially maps, from QB datasets. He also identified two main problems for statistical LD visualizations: a) the challenge of analyzing and aligning multiple datasets, due to the fact that most publishers use the QB vocabulary as a guideline rather than as a specification and almost always come up with some changes to it; b) the challenge of creating tools for consuming statistical LD.

6 benjh33.github.io/ggd3
7 trifacta.github.io/vega

Dashboards. There are several dashboards based on the RDF Data Cube format (QB) and its predecessor, the Statistical Data and Metadata eXchange (SDMX) format - the ISO standard for statistical data representation currently used by large institutions such as the United Nations, Eurostat, the International Monetary Fund, and the World Bank. The dashboards of Jern [25] and Hienert [23] used SDMX, as they were built before 2012. One of the first QB dashboard examples, by Sabol et al. [40], extends earlier work [36,24] and allows brushing over multiple coordinated visualizations. Sabol's paper analyzes two scenarios (search and analysis over LOD, analysis of scientific publications), describes the underlying workflow, and presents the resulting visualizations implemented in the extensions of the Visualization Wizard tool. The framework presented in this article is based on a Multiple Coordinated View (MCV) architecture to synchronise multiple visualizations [45].8 This approach addresses several challenges identified in the Semantic DSS study [2].

Even though not directly related to visualizations, the work of Kämpgen and Harth [28] focused on querying multiple QB datasets via the OLAP4LD framework, which can be used as a starting point for delivering data to complex dashboards.

8 The Media Watch on Climate Change is a news and social media aggregator on climate change and related environmental issues, which serves as a public showcase of the presented MCV approach (www.ecoresearch.net/climate).


Linked Data Platforms (LDPs).9 The idea behind LDPs is the delivery of data in various formats using REST APIs (Application Programming Interfaces following the Representational State Transfer style). Some platforms also allow building full-featured interfaces that can contain maps or pictures, typically using templating solutions such as Velocity.10 Elda,11 Carbon LDP,12 Apache Marmotta,13 Graphity,14 LDP4j [19] and Virtuoso15 are several exponents of this trend. Many of the applications developed following LDP best practices16 include maps or other types of visualizations. The main reason to include these platforms as a separate category is the fact that many of them are used to publish QB datasets.

9 www.w3.org/TR/ldp
10 velocity.apache.org
11 www.github.com/epimorphics/elda
12 www.carbonldp.com
13 marmotta.apache.org
14 www.github.com/graphity
15 virtuoso.openlinksw.com
16 dvcs.w3.org/hg/ldpwg/raw-file/default/ldp-bp/ldp-bp.html

In conclusion, many visualization workflows are geared towards creating simple charts, and little effort (with the exception of MCV dashboards) is dedicated to complex analytic solutions. Without combining datasets and synchronizing multiple visualizations with the integrated repository, it will remain a challenge to clearly present complex use cases like those described in Kämpgen and Harth's work or at the beginning of this article. The next section presents a set of principles and a workflow to support the creation of complex visualisations from statistical linked data.

4. Visualizing Statistical Linked Data

Before describing the design of visualizations following QB principles, this section outlines the workflow of statistical LD visualizations, as well as the tasks and visual metaphors involved.

4.1. RDF Data Cube Visualization Principles

The principles outlined in this section do not fundamentally change those presented by Dadzie and Rowe [11], but extend them in the context of visualizing statistical LD. The goal is to create a set of linked views for analyzing the relations between slices from multiple datasets in order to identify correlations and other patterns. The iterative process of defining these principles started with the design guidelines for QB datasets, which we iteratively expanded and refined during dashboard development. The following list summarizes these principles for visualizing statistical LD.

– Linked Views for Statistical LD. Visualizations should reflect the linked nature of the data and support switching between visualizations when navigating the underlying datasets, or at the very least reflect changes across several visualizations on a single screen using multiple coordinated views technology. We refer to this principle as Linked Visualizations for Linked Data, and encourage it regardless of the nature of the linked datasets to be visualized. Linked visualizations are an obvious choice for statistical LD, as statisticians tend to use multiple graphics to understand statistical phenomena.

– Integrate Data Analysis and Visualizations. Since statistics GUIs like R also tend to integrate code, data and visualizations, we recommend integrating the data analysis and visualization tasks as well. Code should not be integrated, except if the GUI is dedicated to programmers. Supporting views that do not necessarily contain visualizations while displaying slices of datasets is a good way to apply this principle - e.g., a list of top customers can be ranked according to certain criteria, or a table can be used to display the results of a statistical test.

– Visualize Slices Instead of Datasets. When visualizing particular datasets, one needs to take into account their structural characteristics. Since statistical LD datasets will rarely (if ever) be visualized in their entirety, systems require the ability to visualize slices instead of datasets. The RDF Data Cube Vocabulary identifies the dataset itself (qb:dataSet), its structure (qb:structure) and dimensions (qb:dimension), as well as the actual measures (qb:measure) and observations (qb:observation). An observation about bed nights occupied by German tourists in Prague from a tourism dataset, for example, will include dimensions such as market (Germany), destination (Prague) and time interval (January 2010), and measures such as the number of bed nights. This corresponds to the structure of observations reported by statistics agencies and is equally suited for any type of experiment that tracks data over time (psychology, sociology, physics, etc.). Visualizing slices instead of entire datasets in a specific context (together with text or data, for example) also increases the value of the information presented to the user.

– Apply Flexible Mechanisms for Selecting Slices. Slices are collections of observations in which at least one dimension remains fixed, approximating the way humans tend to query datasets, for example: Identify all data about Austria's GDP between 2008 and 2014 (Austria represents the fixed dimension) or Find all observations related to bookings by German tourists in Prague between 2008 and 2014 (Germany and Prague are fixed dimensions). In the second example, it is difficult to predict if the user is actually interested in data related to the German clients, or in data related to people who visited Prague, or both. Therefore the best way to present the results is to take into account both dimensions and provide two separate views, or a single view which is updated whenever the user wants to change the query. Implementing switching mechanisms for the fixed dimensions allows for flexibility in the choice of slices to be visualized (see the query sketch after this list).

– Highlight Particular Observations. The "Highlight links" principle from Dadzie and Rowe's work [11] needs to be extended to take into account the structure of the datasets. When using multidimensional datasets (e.g., tourists visiting a particular destination) we also need to highlight specific (best, worst) observations, not just the links. To differentiate these top observations, they could be aggregated by location and color-coded by performance indicators. Charts could use heatmaps to emphasize importance, while in tables row/column coloring or fonts can achieve a similar effect.

– Normalize Indicator Values. If one wants to plot multiple indicators in order to observe correlations between them, it is not only useful but recommended to normalize their values so that they can all be displayed in the same plot. Normalization is often part of visualization processes, but in the case of statistical indicators it is almost always needed; in fact, without normalization SLD visualization would not be possible at all. Indicator selection remains extremely important: even though after normalization any indicator can be plotted against any other, only combinations that are meaningful for a given use case should be plotted.

– Provide Highly Customizable Temporal Controls. The temporal dimension plays an important role in SLD, therefore temporal controls should be provided, but not at the expense of overcrowding the interface. While this might seem obvious, it is a principle that was overlooked when creating the initial SLD dashboard for the ETIHQ project (see Section 5.1), since time was perceived as important only for slicing datasets. However, to understand data properly, time is essential, as phenomena like seasonality, peaks or valleys can only be understood when using different time perspectives.

– Extract and Share. One of the main principles behind LD is its accessibility in multiple machine-readable formats. A quick way to achieve this through visualizations is to export the data slices into various formats. Customizable image export functions should support the dissemination of new research insights, for example, and reflect the last principle we propose, that of extracting and sharing visual knowledge.
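As an illustration of the slicing and normalization principles, the following sketch selects a slice with SPARQL over a local rdflib graph (market and destination fixed, time free) and then min-max normalizes the resulting series so that it can share a plot with other indicators. All file, property and resource names are hypothetical:

    from rdflib import Graph

    g = Graph().parse("tourism.ttl", format="turtle")  # hypothetical dump

    # Slice: fix market = DE and destination = Prague, leave time free.
    QUERY = """
    PREFIX qb: <http://purl.org/linked-data/cube#>
    PREFIX ex: <http://example.org/tourism#>
    SELECT ?period ?value WHERE {
        ?obs a qb:Observation ;
             ex:market      ex:DE ;
             ex:destination ex:Prague ;
             ex:refPeriod   ?period ;
             ex:bedNights   ?value .
    } ORDER BY ?period
    """
    series = [(str(p), float(v)) for p, v in g.query(QUERY)]

    # Min-max normalization, so heterogeneous indicators fit one plot.
    values = [v for _, v in series]
    lo, hi = min(values), max(values)
    normalized = [(p, (v - lo) / ((hi - lo) or 1.0)) for p, v in series]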

4.2. Architecture and Workflow

Many state-of-the-art applications emphasize automation and reuse - creating, using and replacing visualizations as part of an iterative process. Such a lifecycle can be expressed as a series of visualization pipelines [13,26], which requires developers to follow certain workflows. When visualizing statistical LD, such workflows will necessarily include both LD tasks (selection of indicators, ontology alignment, etc.) and visualization tasks (data wrangling, interaction, etc.).

Building on best-practice examples reported in the literature, we propose a workflow for creating visualisations that follows the logical sequence of developing statistical LD applications. We have closely followed these steps when implementing the dashboards described in Sections 6 and 8.

1. Requirements need to be well understood to produce a good narrative and a first set of visualization ideas (even if there is limited information about what the data looks like at this point). It is recommended to describe these requirements through scenarios or user stories. A scenario offers a high-level view of important application features - including the motivation and research questions, important queries to be explored, example data sources, and the type of visualizations appropriate to address the research questions. Scenarios are a good fit for describing larger applications with multiple types of interaction, many data sources, and most likely significant customization efforts. User stories are a good fit for smaller applications and typically describe only one feature. Ideally, scenarios should be split into several user stories. During requirements collection, special emphasis should be placed on how to link visualizations in the context of an application. Another important point is to understand what kind of workflow best describes a new use case, especially the changes in existing workflows required to add the new features requested by a project or client.

Fig. 1. Generic, reusable workflow for visualizing Statistical Linked Data (QB, SDMX and other formats)

2. Discovery and Indicator Selection. If data sources were already identified in the requirements phase, this step requires selecting the needed indicators and converting them to a format that is easy to use for visualizations (e.g., a flavor of RDF or JSON). The discovery of the right indicators is not necessarily a straightforward process, as one has to identify not only the right indicator, but also the version that is best for the application (e.g., there can be hundreds of variants of GDP indicators published by large statistical data publishers). To get the right version of the indicator one will have to check the name, granularity (yearly, monthly, daily), geolocation (it frequently happens that some indicators are not available for all locations or that the label of the location differs from one dataset to another) or additional clues (Which GDP indicator is needed - GDP, GDP growth or GDP per capita?). In general, this step almost always requires either the help of LD and domain experts (a strategy we deployed in the initial phases of building the ETIHQ dashboard (see Section 5.1), when the experts provided a set of URIs to index), or the creation of an automated tool to discover and import the data (a strategy we started to deploy later in the process, once we had a better understanding of the data). Grouping the indicators according to their provenance and category helped in the later stages of the process. In this stage lies the key to developing multiple types of workflows based on the data provenance and format, as can easily be seen in Figure 1. The current article is mostly focused on the QB workflow. Other RDF workflows need an extra cubification step, while other formats such as Comma-Separated Values (CSV) require both cubification and conversion steps. While we do not explicitly show additional cubifications in Figure 1, the next steps of the workflow assume the data is in QB or a QB-inspired format (e.g., a JSON representation of QB datasets).

3. Ontology and Data Alignment. Running a sequence of SPARQL queries can yield an abundance of data, but to create real value, the various dimensions of the datasets under consideration need to be analyzed, aligned, augmented or aggregated in order to fit particular visualization scenarios. The ontology alignment phase only refers to the analysis and alignment of the data. A number of issues need to be taken into account when performing ontology alignment between QB datasets: fixing or avoiding broken or missing DSDs, failures of SPARQL endpoints, broken dumps, and missing code lists. All of these have to be accounted for in alignment queries or scripts. The strategy we used to ease the ontology alignment process was to examine sets of RDF dumps from large statistical data publishers (e.g., examine several random World Bank or Eurostat QB datasets) and determine upfront if we needed additional data (e.g., code lists). By focusing on the publishers instead of particular datasets, we are able to easily ingest all the datasets from each publisher, since they will all follow the same rules (e.g., the DSDs will follow the same format, the code lists will be similar, etc.).

4. Indicator Storage and Retrieval. Storage addresses the problem of failing SPARQL endpoints and, coupled with effective indexing strategies built on established platforms such as Elasticsearch, Lucene or Sindice, is essential when building IR applications. For publishers with a large number of datasets we index the RDF dumps. We prefer to index triples due to the advantages offered by modern search engines like Elasticsearch (speed, availability, simple document structure). Familiarity plays an important role in this phase, as it is important to choose triple stores and search engines that are known by the development team.

5. Transformation. This is one of the most important steps of the workflow, as it allows specifying data wranglers [29] (scripts that transform data into formats suited for particular visualizations), queries or aggregations. Since the data items were already indexed using a search server, there was no need for data wrangling scripts, as the indexer already performs this mapping function. However, in this step we wrote the queries and aggregations (see the indexing and aggregation sketch after this list). We view the transformation step as the first part of Heer and Shneiderman's data and view specification (filter, derive), even though derive tasks can also appear in subsequent steps [21].

6. Visualization. Once the data is indexed, any query has to lead to at least one visualization. As opposed to approaches that focus on creating a single chart [40], the goal was to generate a set of linked visualizations. This simplifies the process for first-time users, who do not need to choose a particular representation of the data, but can look at the data from different perspectives. Since the data has already been aligned in a previous step, all the visualizations get similar input regardless of provenance. Each visualization module contains all the functionality one would expect from a visualization grammar; we therefore view them as the second part of Heer and Shneiderman's data and view specification (visualize, sort) [21] in our implementation. While it might not seem important at this stage, the taxonomy used in the interfaces (e.g., entries in the menus) needs to be clear enough so that users do not need to experiment too much, otherwise they might perceive learning the tool as a complex task.

7. Interaction. An interaction layer (selections, zoom, pan, transitions, synchronization) is usually built on top of the visualization layer. Some interactive features are already built into most of the visualizations (e.g., selections, tooltips), but the features found in this layer are those that are essential to the global look and feel of the interface. This level corresponds to Heer and Shneiderman's view manipulation [21].

8. Reuse and Sharing. These processes can occur on multiple levels, from the indicators or indexes to specific charts, APIs or entire platforms. Reuse should be an integral part of the design process, parts of it corresponding to process and provenance in Heer and Shneiderman's taxonomy [21]. Users should be able to share both the visual results and the underlying datasets. Chart export (PNG, SVG) and data export (CSV, XLS) make it possible to easily automate reporting (a feature that is essential for business analytics), while also offering users the opportunity to quickly share their findings.

Given the increased specialization of certain layers (e.g., alignment or interaction) and the wide variety of choices when it comes to storage and indexing, one can implement a variety of workflows by using the previous steps as a guideline.
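The storage and transformation steps can be approximated with a few calls against a search server. The sketch below indexes one observation per document and lets the engine compute a yearly aggregation of a slice; the index name, field names and mapping are hypothetical, and the calls assume the elasticsearch-py 8.x client:

    from elasticsearch import Elasticsearch

    es = Elasticsearch("http://localhost:9200")  # assumed local instance

    # Storage step: one observation per document (hypothetical schema;
    # the "period" field is assumed to be mapped as a date).
    doc = {
        "publisher": "TourMIS", "indicator": "bed_nights",
        "market": "DE", "destination": "Prague",
        "period": "2010-01", "value": 123456,
    }
    es.index(index="indicators", document=doc)

    # Transformation step: a yearly aggregation of one slice, computed
    # by the search engine instead of a data wrangling script.
    resp = es.search(
        index="indicators",
        size=0,
        query={"bool": {"filter": [
            {"term": {"indicator": "bed_nights"}},
            {"term": {"destination": "Prague"}},
        ]}},
        aggs={"per_year": {
            "date_histogram": {"field": "period",
                               "calendar_interval": "year"},
            "aggs": {"total": {"sum": {"field": "value"}}},
        }},
    )
    print(resp["aggregations"]["per_year"]["buckets"])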


The next section will introduce several use cases for statistical LD. It will outline the analysis of user requirements, show how to transform these requirements into visualization scenarios, and discuss how to implement these scenarios using the previously mentioned principles and steps.

5. Decision Support Use Cases

The main strength of LD technology lies in the simplified integration of various data sources, either by aligning identical entities (e.g., statistical indicators, people, organizations, locations), or by explicitly stating the relation between similar things (e.g., one statistical indicator being narrower than another).

The following use cases present independent visual analytics platforms for different domains, which integrate statistical linked data with real-time content streams from news and social media channels.

5.1. Tourism Domain

Tourism analytics is a complex field drawing on different statistical data sources (Eurostat, World Bank, etc.) and a wide range of indicators incorporated from these sources (bed nights, arrivals, capacities, etc.). The ETIHQ project [42] investigated such sources and their value for visual tools and real-world decision support scenarios. Typical users in such scenarios are tourism professionals (e.g., DMO managers, travel consultants, researchers) interested in questions related to seasonality, country and city profiles during peak tourist season, points of interest, arrivals in tourism destinations, or the number of occupied bed nights. Tourism professionals will be interested in nuanced answers to such questions in order to better understand why tourists choose a certain destination in a given time period.

Many existing tools do not support the creation of scenarios for visualizing linked models as part of a unified view, or the easy reuse of visualizations by changing their input data. Answering complex questions, however, often requires combining heterogeneous indicators from multiple sources. One has to specify not only the possible combinations of dimensions and measures within a visualization, but also the temporal granularity of the datasets (e.g., monthly, weekly or daily data points), their provenance, or the statistical tests that are needed to validate the underlying models. To combine this heterogeneous data into meaningful visualizations, visual tools need to use MCV or similar design patterns to synchronize multiple visualizations [45].

To specify requirements in line with the scope of the ETIHQ project, we initially conducted structured interviews with colleagues from the Tourism and Service Management and Applied Statistics and Economics departments of MODUL University Vienna, and also conducted a practitioner's survey [43]. The evaluation of the interface described in Section 7 took place after the project had ended - its results were instrumental for designing the telecommunications dashboard prototype, as outlined in the next section.

5.2. Telecommunications Domain

The effective integration of structured and unstructured data from multiple sources, both open and proprietary, is of particular importance in the telecommunications industry. Pursuing such an integrated approach, the ASAP project (see Section 8) collects and annotates the public dialog about regional and national events in the form of Web documents and social media content, and combines the resulting repository with Call Data Records (CDRs) related to voice, SMS and mobile traffic, aggregated and fully anonymized to preserve customer privacy in line with European privacy protection laws.17

A telecommunications analyst who wants to compare CDR data across cities, for example, can use the aggregated representations of online media coverage from the observed regions, and correlate peaks in the number of calls with co-occurring events such as music concerts, sports events and political campaigns. Statistical indicators from the respective cities can help analysts understand related geopolitical trends such as migration, an aging population, or a decreasing Gross Domestic Product (GDP).

To cope with the requirements of such scenarios, a state-of-the-art visualization engine and dashboard needs to include not just a set of appropriate visual methods, but also components that support: (i) the parallel processing of a wide variety of data types, including semantic data types like geographic location, sentiment, timestamp, etc.; (ii) the remix of data from a wide variety of data sources regardless of domain, type (structured or unstructured), or provenance; and (iii) the possibility to extract aggregated statistics of the most important entities, and means to select, sort and summarize the data accordingly.

17 ec.europa.eu/justice/data-protection/article-29/index_en.htm

A.M.P. Brasoveanu et al. / Visualizing Statistical Linked Knowledge 11

5.3. Classification of Decision Support Scenarios

To better structure the use case descriptions, we have devised a theoretical framework (see Table 1) that takes into account the provenance of the indicators and several possible scenario types (ST). These scenario types allow telling different stories and mixing the visualizations according to the hypothesis to be checked, but also with respect to data provenance.

The 1:1 Scenario (one indicator, one source) is the most common case and inspects one indicator from a single source, e.g., showing the TourMIS18 bed nights indicator over a period of time. Showing a single indicator hardly restricts the visualization design space, as arrivals from different markets for the same destinations can be shown via a large number of visual metaphors (line charts, bar charts, pie charts, arc diagrams, hive plots, etc.). Different selections of the same indicator can be displayed on the same graph. By fixing the destination, we can show values for different markets (e.g., United Kingdom and Germany) and answer simple questions (What are the top markets for certain cities?). By fixing the market, we can show values for different destinations (UK arrivals to Vienna vs. Linz vs. Graz) with the goal of comparing destination performance. We can easily ask the same questions at country level instead of city level by using the aggregation operators.

The N:1 Scenario (two or more indicators, same source) allows inspecting multiple indicators from the same source - e.g., by displaying bed nights and arrivals from the same market to a destination, one could infer the percentage of the arriving tourists who slept in hotels. It is rarely used in practice, but it is useful when we need a list of all indicators related to a certain topic from a single source (e.g., we want to know which types of arrival indicators appear in TourMIS - arrivals inside the city, arrivals at city borders, arrivals at hotels, etc.) or when we need correlations between indicators from the same source.

The 1:N Scenario (one indicator, multiple sources) can send the wrong message to the user, but can be of interest for dataset publishers. Inspecting values of the same indicator (e.g., arrivals) from two (or more) data sources is the general use case for this scenario (e.g., comparing arrival indicator values from TourMIS and the World Bank). It must be ensured that the indicator in the two data sources is measured in the same way, i.e., it has the same (or comparable) meaning and the same (or comparable) semantics for its dimensions. This scenario could sometimes lead to problematic cases by suggesting to users that the indicator data from one source is incorrect. This might not even be true, as in some cases there could be differences between the data collection methodologies.

18 www.tourmis.info

The M:N Scenario (multiple indicators, multiple sources) covers the most interesting cases, often addressing interdisciplinary questions such as: How are the arrivals from a certain market influenced by the GDP growth in the market country? Do the CO2 emissions of a destination city affect its arrivals per capita? Varying the settings of an indicator can reveal interesting correlations - e.g., comparing performance on city vs. country levels, or investigating seasonal variations.

From a tourism research perspective, cross-domain indicator comparisons are the most relevant cases. LD technologies support integrated visualizations that are difficult to obtain by means of traditional database systems. When implementing such scenarios it is important that the two indicators are linked based on the value of one of their dimensions that is the same or compatible (e.g., if one has city and the other country data, city data from that country can be added up). Additionally, indicator value ranges should be the same, or compatible in the sense that coarser-grained data can be obtained from finer-grained data by addition (e.g., months add up to years, cities to countries).
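A minimal sketch of this dimension-compatibility requirement for M:N scenarios, assuming two already-extracted indicator tables; pandas is used here purely for illustration, and all values are made up:

    import pandas as pd

    # Hypothetical slices from two publishers.
    arrivals = pd.DataFrame({            # city-level (TourMIS-like)
        "city": ["Vienna", "Vienna", "Linz"],
        "country": ["AT", "AT", "AT"],
        "year": [2013, 2013, 2013],
        "arrivals": [1200, 900, 300],
    })
    gdp = pd.DataFrame({                 # country-level (World Bank-like)
        "country": ["AT"], "year": [2013], "gdp_growth": [0.2],
    })

    # Roll city-level data up to the shared (country, year) granularity...
    national = arrivals.groupby(["country", "year"],
                                as_index=False)["arrivals"].sum()
    # ...then join the two indicators on their compatible dimensions.
    combined = national.merge(gdp, on=["country", "year"])
    print(combined)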

6. Visual Analytics Dashboard

The visual dashboard19 (Figure 2) is a visual semantic DSS that uses multi-domain knowledge in tourism. The dashboard combines information from TourMIS, World Bank and Eurostat. Its design is based on the scenarios discussed in Section 5. It currently allows decision makers to select and concurrently visualize tourism, economic and sustainability indicators, though the set of indicators can be extended to any number of domains of interest for which statistical LD exists. While TourMIS provides European tourism indicators, we select economics and sustainability indicators from the other two sources. Data from TourMIS/ETIHQ rarely overlaps with Eurostat or World Bank data, therefore scenarios that compare the same indicator from multiple sources (1:N scenarios) are not present in this dashboard.

19 etihq.weblyzard.com


Table 1
Overview and examples of decision support scenarios depending on the number of combined data sources and indicators

1 source, 1 indicator - 1:1 Scenario: Inspect one indicator from one source
– e.g., how do the arrivals from UK and JP in Vienna compare?
– e.g., where do more UK tourists arrive when comparing Vienna and Linz?

1 source, 2(+) indicators - N:1 Scenario: Inspect at least two indicators from the same source
– e.g., which percentage of tourists arriving in Vienna actually sleep there? (as a delta between arrivals and bed nights)

2(+) sources, 1 indicator - 1:N Scenario: Inspect one indicator from at least two sources
– e.g., how do arrivals to Vienna compare as recorded in TourMIS and World Bank?
– e.g., is GDP for a specific country (Austria) the same in Eurostat and World Bank?

2(+) sources, 2(+) indicators - M:N Scenario: Contrast at least two indicators from at least two data sources
– e.g., how does the GDP of a market country (e.g., Japan) correlate with arrivals/bed nights in one (or more) cities (e.g., Vienna vs. Amsterdam)?
– e.g., how does tourism impact the environment of the host country?

Our dashboard has two large components:

– An indexer package that implements the Linked Data components and produces an Elasticsearch index.

– A set of reusable visualization components that are linked together to form a dashboard.

After discussing the design of the scenarios that were important for this dashboard, we will examine how each component implements the workflow from Section 4.2.

6.1. The Linked Data Layers

In order to implement the use cases described in Section 5, one needs access to several indicators from various data publishers. The tourism data we used represents the dumps of an updated version of TourMISLOD [41], named ETIHQ, which contains tourism data about arrivals, capacities, bed nights, points of interest and shopping items in QB format. For Eurostat and World Bank data we used dumps of economics and sustainability indicators published in the 270a Linked Dataspaces repositories20. Some details about the publishing process of these RDF dumps can be found in [8,9].

Several issues currently need to be solved by anyone trying to build large-scale RDF or Linked Data visualization engines: (i) SPARQL repositories still have serious availability and scalability issues, and in order to federate data one has to replicate all the needed datasets, vocabularies and Knowledge Bases locally; (ii) somewhat related to the previous issue, in order to be able to run queries against data from multiple sources one either has to perform data matching on the fly or convert those sources to a common format; (iii) the data matching is complicated by the fact that dataset designers do not strictly follow the rules from guidelines; in the case of RDF Data Cubes, we therefore frequently encounter missing code lists / dictionaries or DSDs, and object properties that are labeled heterogeneously [56]; (iv) visualizations are usually realised with JavaScript libraries like D3, which require JSON-based formats in order to quickly process and visualize any type of data. These issues suggested indexing the data rather than providing users with on-the-fly integration and visualization of LD sources, as all operations should take less than a second if they are to be integrated into the portal.

20 www.270a.info

Fig. 2. The ETIHQ Dashboard showing bed nights occupied by German tourists in various European destinations (Budapest, Dublin, Venice), plotted against the GDP growth of Germany

As already noted, the QB standard is currently treated mostly as a recommendation, with some implementations skipping code lists or data structure documents (especially if they only need to publish one dataset) or using their own heterogeneous naming conventions for object properties. As explained in [56], if location is named through geo, location, geolocation and many other names in several datasets, a matching system will have to take into account all these variations. Standardizing naming conventions for RDF Data Cubes would be one way to address some of these issues, but it would only work if the guidelines were strictly enforced (e.g., via custom validation). Perhaps even more troublesome is the fact that some elements of the QB vocabulary, like slices or observation groups, are rarely used. While this perhaps makes sense for slices, which can be created automatically later, behind the existence of an observation group there might be reasons that are not immediately obvious if not documented. For example, medical data from certain months or years can be published as an observation group because there was an epidemic in a set of countries. In such a scenario we face a loss of information if the observation group is missing.
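To illustrate the matching problem, the following is a minimal sketch of how heterogeneous property labels could be mapped to canonical dimension names before datasets are compared; the alias lists are hypothetical examples of the variations reported in [56], not the rules used by our indexer.

```javascript
// Minimal sketch of dimension-label matching: map heterogeneous property
// names to one canonical dimension before comparing datasets. The alias
// lists are hypothetical examples of the variations reported in [56].
const DIMENSION_ALIASES = {
  location: ['geo', 'location', 'geolocation', 'refArea'],
  time: ['time', 'refPeriod', 'date', 'year'],
};

// Reverse lookup: lower-cased alias -> canonical dimension name.
const ALIAS_INDEX = new Map();
for (const [canonical, aliases] of Object.entries(DIMENSION_ALIASES)) {
  for (const alias of aliases) ALIAS_INDEX.set(alias.toLowerCase(), canonical);
}

// Map a property URI such as '...dimension#refArea' to 'location'.
function canonicalDimension(propertyUri) {
  const localName = propertyUri.split(/[#/]/).pop();
  return ALIAS_INDEX.get(localName.toLowerCase()) || null; // null: manual mapping needed
}
```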

While RDF Data Cubes do offer a simple way to merge lots of datasets together and create large data cubes with statistical indicators, this is true only if the data publishers follow the specification point by point. The fact that some components such as code lists are optional and not necessarily well understood leads to additional complexity. We have, for example, encountered several situations related to code lists: (i) they do not exist, which is not a problem since they are optional; (ii) they exist, work fine and respect the specification; (iii) they exist, but contain ambiguous names or URIs (which should not happen), therefore requiring a pre-processing step since it cannot be anticipated what they will contain. In fact, some small changes to the specification of the optional components of RDF Data Cubes and a better validation process for these components are needed before on-the-fly validation and visualization in a reasonable amount of time (sub-second) can be achieved, especially if a system needs to be able to integrate any kind of SLD source. We insist on sub-second loading times to avoid suboptimal user experiences.

We have used several approaches for collecting the data for visualization. One approach was to use federated SPARQL, but quite often it resulted in query timeouts. Another approach was to write SPARQL queries or bash scripts (a combination of cat and grep commands can return all the URIs that respect a certain pattern, for example) and run them against the dumps collected from the three services. As a final solution, we indexed the data using a search server (Elasticsearch) and created an LD indexer API that gets the data from all the data sources. The indexing service we created provides all the functionality for the LD layers we envisioned (Selection of Indicators; Ontology and Data Alignment; Storage and Indexing). As long as the running time of SPARQL queries is several seconds, we prefer the fastest method of indexing data and recomputing the queries each time, instead of classic data cube methods like full or partial materialization of cuboids (slices in the current QB terminology). In fact, even though full or partial materialization of cuboids is certainly useful, it is not always needed. A common use case where it is not really needed is creating a new index on top of the current indexes. Creating derived indicators, a central problem in Statistical Linked Data, requires complex computations; therefore it is generally better to use the full power of a programming language like Java and some of its DSLs instead of plain SPARQL. Since we plan to include derived indicators in a future release, this was an additional reason to consider indexing.
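The dump-filtering approach can be approximated in a few lines; the following Node.js sketch mirrors the cat/grep pipeline mentioned above (file name and URI pattern are placeholders, not the scripts we actually used).

```javascript
// Node.js sketch of the "cat + grep" approach: stream an N-Triples dump and
// keep the lines that match a URI pattern. File name and pattern are
// placeholders rather than the scripts actually used.
const fs = require('fs');
const readline = require('readline');

async function grepDump(path, pattern) {
  const matches = [];
  const rl = readline.createInterface({ input: fs.createReadStream(path) });
  for await (const line of rl) {
    if (pattern.test(line)) matches.push(line); // same effect as `grep`
  }
  return matches;
}

// e.g., all triples of one (hypothetical) Eurostat dataset:
// grepDump('eurostat.nt', /eurostat\.270a\.info\/dataset\/tin00073/).then(console.log);
```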

After indexing the datasets, each observation corresponds to one document. The document structure for a QB observation only takes into account the essential information that needs to exist in a dataset so that it can be visualized: the observation value, the unit of measure, the geographic location (if it exists), and so on. This allows indexing a huge number of datasets from many publishers, thereby enabling the creation of scalable visualization solutions.
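As an illustration, one observation could be indexed as follows; the field names and URIs are assumptions made for this example, not the exact schema of the ETIHQ index.

```javascript
// Sketch of how one QB observation could map to one Elasticsearch document;
// all field names and URIs are assumptions made for this example, not the
// exact schema of the ETIHQ index.
const doc = {
  dataset: 'http://example.org/dataset/bednights',          // hypothetical dataset URI
  observation: 'http://example.org/obs/bednights-de-bud-2013-10',
  value: 123456,
  unit: 'bednights',
  date: '2013-10-01',
  granularity: 'month',                                     // convention added by the indexer
  source: 'http://dbpedia.org/resource/Germany',            // market, after alignment
  target: 'http://dbpedia.org/resource/Budapest',           // destination, after alignment
};

// Index the document via the Elasticsearch REST API.
fetch('http://localhost:9200/observations/_doc', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(doc),
}).then((r) => r.json()).then(console.log);
```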

Since the URIs from Eurostat and World Bank published in the 270 Linked Data Space21 are well designed, all that is needed to discover and select an indicator is an idea about the name, or part of the name, of the desired indicator. If the indicator name and URIs are known, it is sufficient to directly provide the URIs for the new datasets. In the first phase, the indexer will harvest all triples from that location that match the selected criteria (for example, only the data for indicators that correspond to real geographic entities, and no entities that were invented for statistics, like Germany+France or EU-Germany; or only data for the last 10 years).

21www.270a.info
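A harvesting filter of this kind can be expressed as a simple predicate; the patterns below are illustrative guesses at such rules, not the exact criteria used by the indexer.

```javascript
// Sketch of a harvesting predicate that keeps real geopolitical entities and
// recent data while skipping statistical composites; the patterns are
// illustrative guesses, not the indexer's exact rules.
function isRealGeoEntity(label) {
  if (label.includes('+')) return false;    // composites, e.g. 'DE+FR', 'Vienna+Salzburg'
  if (/^EU[- ]/i.test(label)) return false; // aggregates, e.g. 'EU-25', 'EU-Germany'
  return true;
}

function keepObservation(obs, maxAgeYears = 10) {
  const cutoff = new Date().getFullYear() - maxAgeYears;
  return isRealGeoEntity(obs.targetLabel)
      && new Date(obs.date).getFullYear() >= cutoff;
}
```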

A simple process of harvesting the triples that match certain criteria would not have offered enough information for a visualization. Some additional tasks that are performed are usually those related to ontology alignment. One such example is the geospatial alignment performed by the indexer: Geonames [55] and DBpedia [32] URIs are used instead of the names of the actual locations, as the real names of the locations might suffer from various issues such as spelling mistakes, wrong encoding or even different name variants. Another example is the alignment of various units of measurement, which was done using the DSDs (where they were available; otherwise we took no units of measurement into consideration). We have not performed any alignment based on the granularity of the temporal data (month, quarter, year), but instead used a convention: each observation corresponds to a data point in a graphic. The granularity information is added to each observation, and it can be used whenever it is needed (for complex aggregations at query time, for example).
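The geospatial alignment step can be pictured as a normalized lookup from noisy labels to stable URIs; the table below is a toy example (in practice such a mapping would be derived from the Geonames and DBpedia datasets [32,55]).

```javascript
// Toy illustration of the geospatial alignment step: normalize a noisy label
// and look up stable DBpedia/Geonames URIs. In practice such a mapping would
// be derived from the Geonames and DBpedia datasets [32,55].
const GEO_URIS = {
  vienna: { dbpedia: 'http://dbpedia.org/resource/Vienna',
            geonames: 'http://sws.geonames.org/2761369/' },
  wien:   { dbpedia: 'http://dbpedia.org/resource/Vienna',
            geonames: 'http://sws.geonames.org/2761369/' },
};

function alignLocation(rawLabel) {
  // tolerate diacritics, stray whitespace and casing before the lookup
  const key = rawLabel.normalize('NFKD').replace(/[\u0300-\u036f]/g, '')
    .trim().toLowerCase();
  return GEO_URIS[key] || null; // null: flag the label for manual curation
}
```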

When indexing the data, we kept all the information (including the links) from the actual RDF dumps so that any observation or slice can be recreated if needed. From the first set of indexed datasets we have extracted a QB-inspired JSON data format in order to ease the validation of further datasets. The required fields of this format are those expected to be found in any QB dataset (dataset, observation URI, observation value, date, etc.), while optional fields can accommodate dataset-specific information such as geographic location or the unit of measurement.
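A validator for this format only has to check the required fields; the following is a minimal sketch, with field names assumed to match the indexing example above.

```javascript
// Minimal validation sketch for the QB-inspired JSON format: required fields
// must be present, optional fields are passed through. The field names follow
// the description above but are otherwise assumptions.
const REQUIRED = ['dataset', 'observation', 'value', 'date'];
const OPTIONAL = ['unit', 'source', 'target', 'granularity'];

function validateDocument(doc) {
  const missing = REQUIRED.filter((f) => doc[f] === undefined);
  if (missing.length > 0) {
    return { ok: false, error: `missing required fields: ${missing.join(', ')}` };
  }
  const unknown = Object.keys(doc)
    .filter((f) => !REQUIRED.includes(f) && !OPTIONAL.includes(f));
  return { ok: true, warnings: unknown.map((f) => `unknown field: ${f}`) };
}
```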

The added information, such as the granularity or the added URIs, is only used for visualization purposes. One could say that the indexer, in addition to the processing for the LD layers, also provides some of the functionality typically found in a transformation layer.

The first version of the indexer contained small functions that allowed indexing any type of dataset or code list from a certain publisher (e.g., Eurostat, World Bank, TourMIS). This was possible due to the fact that each publisher follows the same style of dataset design for all their datasets (e.g., if they do not use code lists, none of the datasets from that publisher will have references to code lists). Therefore, by writing several lines of code we were able to automatically index hundreds of datasets from a single publisher. While this worked well, we still needed to add new data formats from time to time (date or time formats that were not included in the initial list, for example), as almost each dataset producer only loosely followed the W3C Recommendation when designing RDF Data Cubes and introduced small variations. Due to this fact, a custom API has been developed to provide third-party access (see Section 8).

The functionality for all these layers (selection of indicators, ontology alignment, storage and indexing) is included in a single software package that was initially written in Python and later rewritten in Java for better performance.


6.2. Cross-Domain Visualization Layers

All the visualization layers are grouped in the actual dashboard product. The visualizations were designed taking into account the requirements presented in the previous sections (see Section 5).

Transformation. The transformation layer contains the various queries and aggregations needed to feed the data into particular visualizations. There was no need for a data wrangling component, as the data from the Elasticsearch index was already in the format needed by the visualizations. As mentioned in the previous section, the indexer already performed some of the tasks usually found in this layer.
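A typical transformation-layer operation is turning a monthly slice into yearly totals; under the (assumed) schema from Section 6.1, this could be a single Elasticsearch aggregation query. The snippet uses the aggregation syntax of recent Elasticsearch versions (older versions name the parameter interval rather than calendar_interval).

```javascript
// Sketch of a transformation-layer query: yearly sums for one slice, computed
// with the Elasticsearch aggregation DSL (field and index names follow the
// assumed schema from Section 6.1).
const query = {
  size: 0, // we only need the aggregation buckets, not the raw documents
  query: {
    bool: {
      filter: [
        { term: { dataset: 'http://example.org/dataset/bednights' } },
        { term: { target: 'http://dbpedia.org/resource/Budapest' } },
      ],
    },
  },
  aggs: {
    per_year: {
      date_histogram: { field: 'date', calendar_interval: 'year' },
      aggs: { total: { sum: { field: 'value' } } },
    },
  },
};

fetch('http://localhost:9200/observations/_search', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify(query),
}).then((r) => r.json())
  .then((res) => console.log(res.aggregations.per_year.buckets));
```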

The reusable visualizations are written in JavaScript with jQuery and d3.js, following the conventions of the d3 reusable chart pattern22. All visualizations are presented in a single-screen interface, synchronized using the multiple coordinated views design pattern.

22bost.ocks.org/mike/chart
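The reusable chart pattern wraps a chart in a closure with chainable accessors, so the same component can be configured and applied to many selections. Below is a minimal sketch of such a component, written against the d3 v3 API of the time; it is an illustration of the pattern, not the dashboard's actual chart code.

```javascript
// Minimal instance of the d3 reusable chart pattern, written against the
// d3 v3 API of the time; an illustration, not the dashboard's chart code.
function timeSeriesChart() {
  var width = 720, height = 240;

  function chart(selection) {
    selection.each(function (data) { // data: [{date: Date, value: Number}, ...]
      var x = d3.time.scale()
        .domain(d3.extent(data, function (d) { return d.date; }))
        .range([0, width]);
      var y = d3.scale.linear()
        .domain(d3.extent(data, function (d) { return d.value; }))
        .range([height, 0]);
      var line = d3.svg.line()
        .x(function (d) { return x(d.date); })
        .y(function (d) { return y(d.value); });

      d3.select(this).append('svg')
        .attr('width', width).attr('height', height)
        .append('path').datum(data)
        .attr('d', line)
        .attr('fill', 'none').attr('stroke', 'steelblue');
    });
  }

  // chainable getter/setters make the component configurable and reusable
  chart.width = function (v) { if (!arguments.length) return width; width = v; return chart; };
  chart.height = function (v) { if (!arguments.length) return height; height = v; return chart; };
  return chart;
}

// usage: d3.select('#indicators').datum(series).call(timeSeriesChart().height(300));
```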

The two layers that host the interface components are the visualization and interaction layers. In practice they often cannot be separated, as certain types of interaction are easier to implement directly within a specific visualization than as external modules. Presenting them together also reflects the workflow that needs to be followed when constructing particular dashboard visualizations.

The current dashboard targets analysts and decision makers, as can be inferred directly from the use cases presented in the previous section. A Destination Management Office (DMO) manager who wants to understand the influence of the financial crisis on the traveling behavior of German tourists needs only to add some indicators to a chart, namely the variables she is interested in. Some of the functionalities of the dashboard were created with researchers in mind (e.g., data export).

Adding and Visualizing Slices. The user can start exploring new questions by adding several slices. In order to add a slice, one has to choose a date interval (via the calendar button from Figure 2), then proceed to complete all the needed information about provenance and dimensions in the Advanced Search dialog and add it to the General menu.

A good way to start could be adding an indicator from TourMIS that shows the data slice representing the number of beds reserved by German tourists in Budapest. In the visual interface, an indicator is defined as a slice of data that covers the selected dates and in which the market and the destination are fixed. Pushing the gear icon button in the General pane (Figure 2) will uncover the menu where we select Add topic. A topic corresponds to an indicator, that is, a slice of the data in the respective interval (the time interval of interest must be selected in the upper part of the interface) with market (source) and destination (target) as fixed dimensions. From the same menu we can sort the data from a chart alphabetically or by frequency.

It is recommended to create a meaningful naming convention for the topics / indicators, as shown in Figure 2, because the display space for menus will always be limited. Generally, we recommend that the names consist of the abbreviation of the indicator's data source (TO stands for TourMIS, ES for Eurostat and WB for World Bank), the name of the indicator (i.e., Bednights) and the dimension values that are chosen (in the shown example, these would be DE for Germany and Bud for Budapest). So, for this example indicator we provide the name TO Bednights DE Bud. While some users might not adopt this convention (indeed, some of the users who participated in the evaluation described in the next section have not), it is nevertheless good to provide guidelines about this naming convention in tutorials or user manuals.
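A tiny helper can make such names consistent; the abbreviation table below is hypothetical and would need to cover all sources and dimension values.

```javascript
// Tiny helper illustrating the suggested naming convention; the abbreviation
// table is hypothetical and would need to cover all sources and dimensions.
const SOURCE_ABBR = { TourMIS: 'TO', Eurostat: 'ES', 'World Bank': 'WB' };

function topicName(source, indicator, market, destination) {
  // e.g., topicName('TourMIS', 'Bednights', 'DE', 'Bud') -> 'TO Bednights DE Bud'
  return [SOURCE_ABBR[source] || source, indicator, market, destination].join(' ');
}
```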

Once named, a new indicator (or topic) is added to the right-hand panel of the portal, under the General heading. We then proceed to define the topic. By hovering over the new topic and clicking the gear icon on the right, the chart view in the top-middle pane of the interface is replaced with a dialog that allows defining the topic, as shown in Figure 3. It enables selecting the data source (currently World Bank, Eurostat, TourMIS), indicators (the indicators from the menu), markets and destinations (both can be cities or countries). A description of the selected indicator appears near the Save button. Once the relevant selections have been made, Save will close the dialog box.

The General pane, the Advanced Search dialog, and the date selection mechanisms allow users to carry out most of the operations from the data and view specification layers suggested by Heer and Shneiderman [21]. The General pane allows filtering the indicators and sorting them via the advanced search menu, and triggers the visualizations. By looking at the charts we can also derive new knowledge, which is the main purpose of designing a visual DSS.

As soon as the Advanced Search dialog box is closed, the data related to this topic is retrieved and visualized in the charts view (entitled Indicators). The first time a topic's data is visualized, the corresponding trend line is a dashed line. The current search can also be observed in the Current Search box, under the General menu.

Fig. 3. Advanced search dialog for creating slices, accessible via the topic's gear icon

The newly added topic also triggers various changes in the rest of the interface. The data displayed in the tables (middle pane) changes. This pane creates as many sub-panes as there are dimensions for the visualized indicators. For the presented example, the TourMIS Bednights indicator has two dimensions, namely source and target, so two panes are created corresponding to these dimensions (see the table in Figure 2). The Targets table keeps the source value fixed (Germany) and varies the values for the target cities, thus displaying the number of German tourists going to all European destinations. The table can be sorted based on the value field, allowing to quickly identify the most/least popular destinations for Germans - it appears, for example, that Venice is a very popular destination for German tourists. Similarly, the Source table keeps the target fixed (to Budapest, for example) but varies the source markets, thus allowing detection of the tourist groups that go to Budapest the most/the least. World Bank and Eurostat indicators are from the economic and sustainability areas and therefore have a single dimension, that of the country/city of interest. In this case (as shown in the left side of Figure 4) a single table, called Targets, is created. The Targets table then only contains data about the main markets for the indicator of interest.

A click on the pane name will trigger a change in the Geo Map (right pane of the interface), which displays the tabular data visually. The data for a particular market is summed up (from monthly to yearly data), and a visual representation of the connection between markets and destinations (arrows) is created (bigger arrows mean more tourists in the selected interval). The map from Figure 2 shows various destinations that were top choices for German tourists. For the Eurostat data (Air Transport indicator), the right side of Figure 4 (choropleth map) displays the markets using color coding (darker shades correspond to higher values), and the tooltips contain totals and averages of the selected indicator for the currently hovered country.

Interaction. Since, based on the previous analysis, Budapest does not necessarily stand out as a popular tourist destination for Germans (which is to be expected, given that it is not compared with anything), new topics can be added that contain Bednights of German tourists in other locations (Dublin, Venice, etc.). These new topics are added through the topic definition interface, as explained before.

The previous steps allow exploring the behavior of German tourists in terms of their visitor volume to Budapest and also to other European cities. To understand whether this behavior correlates with the economic situation in Germany, we can continue by selecting an economic indicator as a new topic. A good economic indicator is GDP Growth from the World Bank (displayed as a brown line in Figure 2). Figure 2 superimposes German GDP (from the World Bank) as well as Bednights occupied by German tourists in Dublin, Venice and Budapest, as these indicators have been selected for visualization in the General pane (the color on the right side of a topic corresponds to the graph color in the chart - e.g., light blue for German Bednights to Budapest). The values displayed in the chart are normalized so that it is easy to compare them.
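The text does not specify the dashboard's exact normalization method; a common choice that achieves the described effect is min-max scaling of each series, sketched below.

```javascript
// Sketch of the kind of normalization that makes indicators with different
// units comparable in one chart: min-max scaling of each series to [0, 1].
// (The text does not specify the dashboard's exact normalization method.)
function normalizeSeries(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) return values.map(() => 0.5); // flat series: avoid division by zero
  return values.map((v) => (v - min) / (max - min));
}

// GDP growth in percent and bed nights in absolute counts both end up in [0, 1]:
// normalizeSeries([1.1, -5.6, 4.1]); normalizeSeries([80500, 61200, 95400]);
```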

The resulting chart shows that there is a certain seasonality to the German visits to Budapest. The peak for each year is October (are Germans escaping from Oktoberfest?). By inspecting German arrivals to several locations such as Prague, Dublin, Venice and Budapest, it appears that German tourists seem to be influenced more by the seasonality of the business year (more visits during summer) than by the crisis, as the patterns seem consistent from the end of 2008 to the end of 2014, and therefore unaffected by the slight GDP drop of 2009. Adding more destinations (Copenhagen, Dubrovnik, Venice) confirms our hypothesis of German tourist behavior being influenced by seasonality, as opposed to GDP fluctuations.

Fig. 4. Tabular and geographic map views of Eurostat data

These interconnected tables and charts correspond to Heer and Shneiderman's view manipulation logic [21]. We can select items from the tables and trigger new searches using them as parameters, or select various observations from the line chart and display additional information in tooltips. The geographic map allows users to see summaries of the various destinations visited by tourists from a certain country. Users can organize their workspace as they please, and are able to coordinate the views to explore the data in a meaningful way. Since everything happens on a single screen, navigation is reduced to several clicks in the various views.

Sharing. Pushing the Export button opens a side menu that allows selecting from two groups of options: Chart Data (XLS or CSV formats) and Diagrams (Line Chart, Geographic Map). These allow users to share their work and create guides for their users or clients. The commercial implementation also allows recording and analyzing the Search History (instead of the Current Search available in this version). These options represent our version of Heer and Shneiderman's [21] process and provenance functionality, and are among the most popular features.

6.3. Scalability

It generally takes less than a minute to load two to five typical datasets via the API, and this results in an index with a size of around a hundred MB. An index of 1000 random datasets with data from Eurostat and the World Bank has 27.2 GB and 242 million documents (= data points). We do not index data about composite geographical entities, regardless of whether they are real (like the European Union) or made up for a specific indicator (like DE+FR, EU-25 or Vienna+Salzburg); we index only data that contains real geographical coordinates or corresponds to actual geopolitical entities (countries, regions, cities) or points of interest (e.g., historical sites, parks, museums). Covering the full datasets would likely result in sizes that are up to 1.5 times bigger. Elasticsearch has no latency issues even when hosting indexes several times this size. The initial load of the portal takes about 1.5 to 5 seconds. Each subsequent query and the visualization of results is typically performed in less than a second, regardless of the index size. Future versions will also support the display of composite geographical entities.

Some datasets contain millions of points, as already mentioned, but those millions of points contain all possible slices. When visualizing datasets with the ETIHQ dashboard, we already visualize slices instead of entire datasets (as explained in Section 6.2), thereby restricting the data size to several hundred points at most. This is the main reason why the Advanced Search dialog is necessary before new slices can be visualized. A slice for the GDP of Austria for the last 10 years has at most 10 points (if all the data points are available). Even when visualizing several slices there will be fewer than 120 points. If some of these datasets contain monthly data (e.g., TourMIS), there will be at most 120 points for the last ten years for a single slice and fewer than 1,200 when visualizing 10 such slices (the maximum number of slices that can be visualized with this tool). However, for sources with monthly or daily data, the more the time interval is increased, the greater the chances to end up with a visualization where data is already aggregated (daily data displayed as monthly totals, and monthly data displayed as yearly totals).
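Client-side, this aggregation can be as simple as a reduction over the slice; the following sketch collapses monthly points into yearly totals before plotting (an illustration of the described behavior, not the dashboard's actual code).

```javascript
// Client-side sketch of the aggregation fallback for long time intervals:
// collapse monthly data points into yearly totals before plotting (an
// illustration of the described behavior, not the dashboard's actual code).
function toYearlyTotals(points) { // points: [{date: 'YYYY-MM-DD', value: Number}]
  const totals = new Map();
  for (const p of points) {
    const year = p.date.slice(0, 4);
    totals.set(year, (totals.get(year) || 0) + p.value);
  }
  return [...totals.entries()].map(([year, value]) => ({ date: `${year}-01-01`, value }));
}
```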

7. Dashboard Evaluation

The tourism dashboard prototype showcases cross-domain data analytics functions that are feasible over datasets integrated through Linked Data. An exploratory evaluation of this prototype has been conducted and reported in [43], with the focus of understanding the usefulness of the tool for tourism practitioners. In this section we provide a summary of that evaluation and refer interested readers to [43] for details.

7.1. Evaluation Design

The evaluation goals were (a) identifying potential new functionalities that the tool could enable; (b) assessing the current usability of the prototype and deriving ideas for future usability-level improvements; and (c) obtaining indicative performance values when using this prototype as opposed to current practices for solving cross-domain data analytics tasks.

Participants. Participants were selected from the two major stakeholder groups that could benefit from a cross-domain data analytics infrastructure: researchers in the tourism domain as well as tourism practitioners, working primarily for Destination Management Offices (DMOs). The 16 participants were divided randomly into two groups (Group_A and Group_B), both containing an equal number and mixture of researchers and practitioners, i.e., five practitioners and three researchers per group.

Evaluation Setup. Prior to the evaluation itself, each participant received a tutorial that explained the features of the tool and included practical exercises to ensure a basic familiarity with the tool (e.g., how to create and define indicators, how to visualize and compare their values). Evaluations were performed at the desk of each participant and at a time that best fitted the participant's schedule. This made it possible to maintain a realistic work environment and to avoid bias potentially introduced by requesting the use of a new work environment, such as in lab-based evaluation settings. Additionally, such a setup was the only option that allowed involving DMO employees from across Europe (this design was suitable for an exploratory study, but future baseline comparisons will consider a more controlled lab-based setup). The evaluation included the following four activities:

– Activity 1. Participants performed three tasks using the Dashboard and recorded the time taken to perform each task and the results they reached. See Table 2 for an overview of the tasks as well as their assignment to the two groups. Group_A performed tasks T1-T3 with the Dashboard, while Group_B focused on tasks T4-T6.

– Activity 2. To gather insights about potential new uses of the tool, participants were asked to create and perform two tasks of their own using the Dashboard. They noted the tasks they performed and the insights they gained.

– Activity 3. To collect information about how the evaluated tasks would typically be performed in state-of-the-art settings, participants performed three tasks without using the Dashboard. Participants were allowed to adopt their usual data collection and analytics approach. They noted their findings, the time taken to perform each task, as well as the approach and tools they made use of. In this activity, the roles of the groups were inverted, with Group_A performing tasks T4-T6 without the Dashboard, while Group_B focused on tasks T1-T3.

– Activity 4. To get an insight into the usability of the tool, participants answered the ten questions that make up the System Usability Scale (SUS), the most widely used questionnaire for measuring perceptions of system usability [5]. Additionally, they provided feedback about the most and least useful features of the tool, as well as their recommendations for future extensions of the tool.

Evaluation Tasks. Six tasks were proposed to the participants (Table 2). To measure improvements in terms of time savings and quality of the answers, each task was performed either with the dashboard or using traditional desktop spreadsheet applications. Averages for the time spent on each task as well as the accuracy of the provided answers were measured and compared. Participants were instructed to abandon a task if they failed to complete it within 15 minutes. This facilitated time management and followed similar rules reported in the literature [37], considering that in practice longer tasks would often be abandoned.


Table 2
Assignment of tasks to the two evaluator groups (GrA, GrB); Y = dashboard use; N = other means, e.g. spreadsheet applications

Task 1 (GrA: Y, GrB: N) – In which month was the number of bed nights of tourists from the USA to Germany the highest?

Task 2 (GrA: Y, GrB: N) – Explore possible similarities between the American economy (as indicated by GDP Growth) and the bed nights spent by American tourists in Germany.

Task 3 (GrA: Y, GrB: N) – Continuing your exploration from task 2, compare how Germany and Austria have been affected (in terms of bed nights of American tourists) by the economic situation in the USA.

Task 4 (GrA: N, GrB: Y) – In which month was the arrival of Japanese tourists to Vienna the lowest?

Task 5 (GrA: N, GrB: Y) – Could the GDP of Japan have had any influence on the number of arrivals of Japanese tourists in Vienna?

Task 6 (GrA: N, GrB: Y) – Continuing your exploration from task 5, compare how Budapest and Vienna have been affected (in terms of arrivals of Japanese tourists) by the economic situation in Japan.

The tasks were formulated to cover two exploration scenarios. Tasks 1 to 3 investigate the influence of the American economy on bed nights of American tourists in different European countries. Similarly, tasks 4-6 cover an exploration of how Japanese economic indicators might influence arrivals of Japanese tourists to European cities. Both scenarios were investigated in the same time period (January 2005 - December 2014).

7.2. Evaluation Results

Evaluation results have shown that the ETIHQ Dashboard provides considerable improvements over manual approaches when answering a range of complex questions. By comparing the outcomes of Activities 1 and 3, an average time improvement of 29.76% was obtained (average task execution times were 5.74 minutes without the dashboard and 3.93 minutes with it), while answer quality, in terms of precision, was lower when using manual approaches (under 71%) than when using ETIHQ (over 63% and up to 100%). Based on Activity 2, it became clear that manual approaches to cross-domain analytics are currently the norm; participants also provided ideas for new tasks to be supported by the Dashboard. Details on all these results are available in [43].

A third important aspect of the evaluation was its focus on the usability of the tool and the collection of suggestions for future features, based on the results of Activity 4. We report these findings here and extend them beyond [43].

Based on the responses to the SUS questionnaire (first part of Activity 4), an overall SUS value of 64 was computed, which on the adjective rating scale [1] is satisfactory (OK). While this result indicates that improvements are needed in terms of system design, learnability and usability [6], it is an encouraging result for a prototype, especially if we also factor in that the free-text comments (see below) were very positive, regardless of whether the participants came from academia or industry. The fact that almost all of the participants chose visualizations and data integration as the top features of the ETIHQ dashboard (see Table 3) is a good indicator that they understood the purpose of the dashboard and consider it useful.

The second part of Activity 4 was dedicated to current and future features that users consider useful, collecting the users' answers as free-form text.

Most Useful Tool Features. As expected, the ability to easily visualize slices from multiple data sources in the same chart was considered the most useful feature (see Table 3). Normalization was also appreciated as a good idea, since all users wanted to be able to easily observe correlations. The ability to preview variables or slices before saving them was considered the best feature for exploratory research. The fact that the system automatically creates multiple linked graphics, like the best statistical tools currently available, was also noticed by most of the users. Overall, the users appreciated the ease of use of the system and the advanced customization features. The system's export functionality was considered a nice bonus, and some users (especially researchers) use it as a first step in their data collection strategy for future research articles or projects. We have, however, not received such an article until the date of the current submission. The majority of users appreciated the Advanced Search and menu-based selection mechanisms.

Open Challenges. Search for indicators was considered problematic - while the ability to create customized slices is much appreciated (see Table 3), users would like to perform indicator searches directly, without the Advanced Search dialog. Tables were mentioned as well, but contrary to what some of the survey participants have indicated, they do allow users to sort the data and identify the most important elements. Perhaps without visual highlighting via colored columns or bold typeface, this is not obvious to first-time users. The Missing Data entry in Table 3 might lead to the assumption that datasets were not fully indexed, but this is a problem that stems from the ingested datasets themselves. Future work could address this feedback through additional validation steps, automatically identifying missing values in other sources, or integrating LD Quality Assessment models [57].

Table 3
Aggregated user feedback on specific dashboard features

Positive comments (feature, count, comment):
– Charts (14): Clear information visualization
– Data Integration (14): Seamless integration of data sources
– Comparison (Line Chart) (12): Benchmarking with several crucial indicators
– Slice Creation (5): Option to create and customize variables
– Automatic Plots (4): Automated rendering and scaling of the plots

Negative comments (feature, count, comment):
– Temporal Controls (4): Additional options to select time intervals
– Search for Indicators (4): Inability to search without defining the source and the target
– Country Totals (3): The computation of country totals should be managed automatically
– Missing Data (3): Important countries do not deliver data
– Tables (3): Cannot use the table to identify the most relevant data

Technology Roadmap. Users recommended using domain-specific terminology - e.g., market and destination for common tourism indicators - instead of generic terms such as indicator and source. Some changes to the data aggregation features were suggested to simplify the workflow: aggregated annual data; fewer data items in the top targets or top sources tables; adding controls to change data granularity in the trend charts; and improved time selection mechanisms.

Participants also suggested extending the system by including other data sources: news media, stock exchanges, flight connections, GDP indicators, exchange rates, and events (the latter not being a statistical dataset). Another requested upgrade was additional analytic functions (e.g., calculation of explicit correlations; regression analysis; more data summaries) to complement the current visual comparison and to reduce the need to export data to other tools for detailed mathematical analysis. Other users suggested minor user interface changes such as different fonts or geographic base layers, and an interactive tutorial built directly into the tool.
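As an illustration of the requested "explicit correlation" feature, Pearson's r between two indicator series is sketched below; this is the standard formula, not a feature of the current dashboard.

```javascript
// Illustration of the requested "explicit correlation" feature: Pearson's r
// between two equally long indicator series (a standard formula; not a
// feature of the current dashboard).
function pearson(xs, ys) {
  const n = xs.length;
  const mean = (a) => a.reduce((s, v) => s + v, 0) / n;
  const mx = mean(xs), my = mean(ys);
  let cov = 0, vx = 0, vy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - mx, dy = ys[i] - my;
    cov += dx * dy; vx += dx * dx; vy += dy * dy;
  }
  return cov / Math.sqrt(vx * vy); // r in [-1, 1]
}

// e.g., a value of pearson(gdpGrowthSeries, bednightsSeries) close to 0 would
// support the seasonality interpretation from Section 6.2.
```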

8. Integrated Analytics for Structured and Unstructured Data

The ASAP FP7 Research Project23 develops a dynamic execution framework for scalable data analytics based on structured and unstructured data from a range of sources, open as well as proprietary. This includes automatically annotated news and social media content, statistical indicators from linked data sources, and anonymized Call Data Records (CDRs). The conceptual development of the ASAP dashboard to analyze data across these sources benefited from the evaluation results summarized in the preceding section. The dashboard is implemented as part of the telecommunications use case (Section 5.2), paying particular attention to (i) on-the-fly data composition and visualization, (ii) the display of temporal data and interactive controls to quickly select desired time intervals, and (iii) the provision of open APIs for uploading datasets and annotations, querying the knowledge repository, and embedding visualizations into third-party applications.

Temporal Controls. Addressing user feedback gathered during the evaluation, we have added several new alternatives for temporal slicing. Perhaps most important for SLD is the Granular Overview Overlay described below, as it allows users to understand the temporal distribution of large datasets. It is complemented by a classic timeline and an interactive date range selector, which consists of a time-span bar in conjunction with a sliding window to quickly get a visual overview and select the desired time frame.

23www.asap-fp7.eu


Fig. 5. Prototype of the ASAP dashboard for the integrated analysis of unstructured Web content and structured telecommunications data

Geographic Map. A revised module to render geographic projections in different resolutions benefits from the adaptive context menus shown in Figure 5. The new implementation supports the creation of customized maps and multiple base layers, including a rich set of styling and formatting options for data layers, points of interest, and labels.

Adaptive Tooltips. The statistical visualization components were updated to reflect the design principles of the webLyzard Web intelligence platform [46,47]. Adaptive tooltips and context menus enrich the functionality of the statistical components and ensure a unified user experience. Based on the current context of the analyst, for example in the form of a country shape or point of interest, the tooltip of the geographic map displays filtered information and a context menu to either drill down or extend the search. If the tooltip corresponds to a country, it will display basic information about that country (see Figure 5). Highlighting functions include visual cues to show specific groups of data points when users hover over related components in the ASAP dashboard, or rule-based color coding of these data points. The adaptive tooltips take into account the size of the container and display fewer items if a component is minimized.

Granular Overview Overlay. The temporal distribution of large datasets is shown using a new version of the GROOVE visualization introduced by Lammarsch et al. [31], seen under the line chart in Figure 5. The current prototype comes with integrated focus and context based on granularities and recursive pattern arrangement. It is particularly useful for visualizing daily data from third parties, and for identifying peaks or changes in behavior triggered by various events. Each data element is represented by a square that might be only the size of one pixel, but automatically expands if more space is available. The pixels (or squares) are arranged according to the days of a year, similar to a calendar. The example in Figure 5 shows six calendar months (each row corresponds to a quarter), positioning the days of each month (each day with a white border) inside a larger rectangle (the larger areas without border). Color-coding schemes are used for daily occurrences, sentiment, and peak values. Monthly averages are mapped to the colors of the rectangles that surround the days and represent the months.
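The recursive arrangement can be approximated with a simple layout function: each day becomes a small square, days are grouped into month rectangles, and months into quarter rows. Cell sizes and the exact arrangement in [31] differ; the sketch below only conveys the idea.

```javascript
// Rough sketch of a GROOVE-style calendar layout: each day becomes a small
// square, days are grouped into month rectangles and months into quarter
// rows. Cell sizes and the exact arrangement in [31] differ; this only
// conveys the recursive-pattern idea.
function grooveLayout(dates, cell = 8, monthGap = 4) {
  return dates.map((d) => {
    const month = d.getMonth();        // 0..11
    const row = Math.floor(month / 3); // one row of months per quarter
    const col = month % 3;             // three months per row
    const day = d.getDate() - 1;       // 0-based day of month
    return {
      date: d,
      x: col * (7 * cell + monthGap) + (day % 7) * cell,
      y: row * (5 * cell + monthGap) + Math.floor(day / 7) * cell,
    };
  });
}
// Each returned {x, y} can be rendered as a color-coded square, e.g. with d3.
```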


Data and Visualization Services. The webLyzard API Specification24 bundles together several interfaces to create a uniform framework for the rapid integration of multiple data sources into a scalable visualization processing pipeline, following a Visualization as a Service (VaaS) approach. The goal is to provide a unified interface through which to expose all data and visualization services. The Document API ingests unstructured text data, for example crawled Web pages or digital content from document management systems. The main objects are Documents, Sentences and Annotations. This API can be used for sharing documents regardless of their provenance, as well as annotations from knowledge extraction services including sentiment analysis [51] and named entity recognition [52]. The Statistical Data API ingests structured data following the RDF Data Cube philosophy. This API supports the full workflow presented in Figure 1. The Search API returns a set of query results in the form of unstructured text documents or time series data. The Embeddable Visualization API provides a mechanism to integrate visualizations into third-party applications, typically based on the results of a search query.

9. Conclusion and Future Work

Visual tools for representing statistical linked data (LD) can support policy experts and decision makers in a wide range of domains such as telecommunications, travel and tourism, financial markets, health care services, or sustainable development. Facilitated by the adoption of LD technologies, applications that seamlessly integrate and visualize statistical LD from multiple sources have started to appear. The large-scale integration of statistical LD technologies at the semantic and syntactic level is still at an early stage and calls for improved methods to align, link and visually explore datasets from heterogeneous sources.

This article describes innovations from two European research projects that advance the state of the art in terms of: (1) workflow and design principles to develop statistical LD visualizations for heterogeneous data from multiple sources, (2) use cases and scenarios for visualizing statistical data, (3) visual DSSs that support these scenarios across application domains, and (4) multiple coordinated view technology for LDPs that embeds data analysis and visualization processes in a flexible and reusable framework.

24www.weblyzard.com/api

The presented workflows and principles can also be applied to other types of data: scientific data (often statistical in nature, but not always published using QB standards), personal data (e.g., curricula vitae or FOAF profiles), or historical data (e.g., comparing trends in different historical periods). At this stage, individual research communities develop their own guidelines, which eventually should be merged into a general set of principles for visualizing any kind of LD content.

For the use cases, we acquired statistical indicators from international organizations. One of the main benefits of using such LD sources is access to a large number of indicators (the Eurostat endpoint alone provides 6538 datasets) covering a wide range of topics (telecommunications, economy, ecology, travel and tourism, health, etc.). Publishers typically provide indicators in standard formats such as RDF Data Cube (QB) and Statistical Data and Metadata eXchange (SDMX). The datasets are interlinked, even though their alignment might represent a challenge due to heterogeneous property labels or deviations from the W3C recommendations.

QB datasets and related formats support structured queries that consider complex hierarchies exposed as code lists (e.g., geographical locations), which is not possible when using open data provided in Comma-Separated Values (CSV) format. A side effect of using RDF Data Cube was the creation of a QB-inspired JSON data format for including data into the webLyzard visualization engine that can be reused for any type of statistical dataset. The required fields of this format are those that are generally used to describe datasets and observations with the QB vocabulary (dataset, observation URI, observation value, date, etc.), while optional fields can accommodate dataset-specific information such as geographic location or the unit of measurement.

The validity of the underlying datasets is of particular importance and a fruitful avenue for future research. This article is based on datasets published by trusted third parties, but large-scale LD analytics solutions require verification techniques and certification procedures to address the open nature of LD and ensure data quality, including assessments of veracity and conformity with security standards. Data composition and visualization services will enable users to create new indicators on the fly, compare them with similar values from other indexes [17], and integrate them into complex analytical models - to automatically verify knowledge [49], for example, or to identify rumors spread via social media [3].


Future work will add decision support metrics extracted from news and social media coverage to these analytic models, resulting in novel solutions for the integrated management and analysis of statistical LD. This will not only unlock significant commercial opportunities, but also enable us to better understand (and act upon) systemic issues on a society-wide scale, for example when addressing environmental problems or pursuing sustainable development goals [45].

Acknowledgements

The research presented in this article has received funding from the European Union's 7th Framework Programme for Research, Technology Development and Demonstration as part of the ASAP25 and ETIHQ26 (PlanetData) projects under Grant Agreements No. 619706 and 257641, respectively. The authors would like to thank K. Wöber and I. Önder for their tourism domain expertise, and R. Kamolov, W. Rafelsberger and T. Lammarsch for developing some of the presented visualizations.

25www.asap-fp7.eu
26www.etihq.eu

References

[1] A. Bangor, P. Kortum, and J. Miller. Determining what individual SUS scores mean: Adding an adjective rating scale. Journal of Usability Studies, 4(3):114–123, May 2009.

[2] E. Blomqvist. The Use of Semantic Web Technologies for Decision Support - A Survey. Semantic Web, 5(3):177–201, 2014. 10.3233/SW-2012-0084.

[3] K. Bontcheva, M. Liakata, A. Scharl, and R. Procter. 1st International Workshop on Rumors and Deception in Social Media: Detection, Tracking, and Visualization (RDSM-2015). In A. Gangemi, S. Leonardi, and A. Panconesi, editors, Proceedings of the 24th International Conference on World Wide Web Companion, WWW 2015, Florence, Italy, May 18-22, 2015 - Companion Volume, pages 945–946. ACM, 2015.

[4] M. Bostock, V. Ogievetsky, and J. Heer. D3: Data-Driven Documents. IEEE Transactions on Visualization and Computer Graphics, 17(12):2301–2309, 2011. 10.1109/TVCG.2011.185.

[5] J. Brooke. SUS - A Quick and Dirty Usability Scale. Usability Evaluation in Industry, 189(194):4–6, 1996. 10.1.1.232.5526.

[6] J. Brooke. SUS: A Retrospective. Journal of Usability Studies, 8(2):29–40, 2013.

[7] J. M. Brunetti, S. Auer, R. García, J. Klímek, and M. Necaský. Formal Linked Data Visualization Model. In E. R. Weippl, M. Indrawan-Santiago, M. Steinbauer, G. Kotsis, and I. Khalil, editors, Proceedings of the 15th International Conference on Information Integration and Web-based Applications & Services, IIWAS '13, Vienna, Austria, December 2-4, 2013, pages 309–318. ACM, 2013. 10.1145/2539150.2539162.

[8] S. Capadisli, S. Auer, and A.-C. Ngonga Ngomo. Linked SDMX Data. Semantic Web - Interoperability, Usability, Applicability, 6(2):105–112, 2015. 10.3233/SW-130123.

[9] S. Capadisli, S. Auer, and R. Riedl. Linked Statistical Data Analysis. In S. Capadisli, F. Cotton, R. Cyganiak, A. Haller, A. Hamilton, and R. Troncy, editors, Proceedings of the 1st International Workshop on Semantic Statistics co-located with 13th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 11th, 2013, volume 1549 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.

[10] E. H. Chi. A Taxonomy of Visualization Techniques Using the Data State Reference Model. In IEEE Symposium on Information Visualization, pages 69–75. IEEE, 2000. 10.1109/INFVIS.2000.885092.

[11] A. Dadzie and M. Rowe. Approaches to Visualising Linked Data: A Survey. Semantic Web, 2(2):89–124, 2011. 10.3233/SW-2011-0037.

[12] E. Daga, M. d'Aquin, A. Gangemi, and E. Motta. Early Analysis and Debugging of Linked Open Data Cubes. In S. Capadisli, F. Cotton, A. Haller, A. Hamilton, M. Scannapieco, and R. Troncy, editors, Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 14th International Semantic Web Conference (ISWC 2014), Riva del Garda - Trentino, Italy, October 19th, 2014, volume 1550 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[13] B. Do, T. Trinh, P. Wetz, A. Anjomshoaa, E. Kiesling, and A. M. Tjoa. Widget-based Exploration of Linked Statistical Data Spaces. In M. Helfert, A. Holzinger, O. Belo, and C. Francalanci, editors, DATA 2014 - Proceedings of the 3rd International Conference on Data Management Technologies and Applications, Vienna, Austria, 29-31 August, 2014, pages 282–290. SciTePress, 2014. 10.5220/0005110102820290.

[14] M. Dudás, O. Zamazal, and V. Svátek. Roadmapping and Navigating in the Ontology Visualization Landscape. In K. Janowicz, S. Schlobach, P. Lambrix, and E. Hyvönen, editors, Knowledge Engineering and Knowledge Management. 19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. Proceedings, volume 8876 of Lecture Notes in Computer Science, pages 137–152. Springer, 2014. 10.1007/978-3-319-13704-9.

[15] I. Ermilov, M. Martin, J. Lehmann, and S. Auer. Linked Open Data Statistics: Collection and Exploitation. In P. Klinov and D. Mouromtsev, editors, Knowledge Engineering and the Semantic Web. 4th International Conference, KESW 2013, St. Petersburg, Russia, October 7-9, 2013. Proceedings, volume 394 of Communications in Computer and Information Science, pages 242–249. Springer Berlin Heidelberg, 2013. 10.1007/978-3-642-41360-5_19.

[16] P. Fox and J. Hendler. Changing the Equation on Scientific Data Visualization. Science, 331(6018):705–708, 2011. 10.1126/science.1197654.

[17] J. E. L. Gayo, H. Farhan, J. C. Fernández, and J. M. Á. Rodríguez. Representing Verifiable Statistical Index Computations as Linked Data. In S. Capadisli, F. Cotton, A. Haller, A. Hamilton, M. Scannapieco, and R. Troncy, editors, Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 14th International Semantic Web Conference (ISWC 2014), Riva del Garda - Trentino, Italy, October 19th, 2014, volume 1550 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[18] V. Geroimenko and C. Chen, editors. Visualizing the Semantic Web. XML-Based Internet and Information Visualization. Springer London, 2002. 10.1007/1-84628-290-X.

[19] M. E. Gutiérrez, N. Mihindukulasooriya, and R. García-Castro. LDP4j: A Framework for the Development of Interoperable Read-Write Linked Data Applications. In R. Verborgh and E. Mannens, editors, Proceedings of the ISWC Developers Workshop 2014, co-located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 19, 2014, volume 1268 of CEUR Workshop Proceedings, pages 61–66. CEUR-WS.org, 2014.

[20] J. Heer, M. Bostock, and V. Ogievetsky. A Tour Through the Visualization Zoo. Communications of the ACM, 53(6):59–67, 2010. 10.1145/1755884.1780401.

[21] J. Heer and B. Shneiderman. Interactive Dynamics for Visual Analysis. Communications of the ACM, 55:45–54, 2012. 10.1145/2133416.2146416.

[22] J. Helmich, J. Klímek, and M. Necaský. Visualizing RDF Data Cubes Using the Linked Data Visualization Model. In V. Presutti, E. Blomqvist, R. Troncy, H. Sack, I. Papadakis, and A. Tordai, editors, The Semantic Web: ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers, volume 8798 of Lecture Notes in Computer Science, pages 368–373. Springer International Publishing, 2014. 10.1007/978-3-319-11955-7_50.

[23] D. Hienert, B. Zapilko, P. Schaer, and B. Mathiak. Web-Based Multi-View Visualizations for Aggregated Statistics. CoRR, abs/1110.3126, 2011. arxiv.org/abs/1110.3126.

[24] P. Höfler, M. Granitzer, E. E. Veas, and C. Seifert. Linked Data Query Wizard: A Novel Interface for Accessing SPARQL Endpoints. In C. Bizer, T. Heath, S. Auer, and T. Berners-Lee, editors, Proceedings of the Workshop on Linked Data on the Web co-located with the 23rd International World Wide Web Conference (WWW 2014), Seoul, Korea, April 8, 2014, volume 1184 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[25] M. Jern. Collaborative Web-Enabled GeoAnalytics Applied to OECD Regional Data. In Y. Luo, editor, Cooperative Design, Visualization, and Engineering. 6th International Conference, CDVE 2009, Luxembourg, Luxembourg, September 20-23, 2009. Proceedings, volume 5738 of Lecture Notes in Computer Science, pages 32–43. Springer, 2009. 10.1007/978-3-642-04265-2_5.

[26] E. Kalampokis, A. Karamanou, A. Nikolov, P. Haase, R. Cyganiak, B. Roberts, P. Hermans, E. Tambouris, and K. Tarabanis. Creating and Utilizing Linked Open Statistical Data for the Development of Advanced Analytics Services. In S. Capadisli, F. Cotton, A. Haller, A. Hamilton, M. Scannapieco, and R. Troncy, editors, Proceedings of the 2nd International Workshop on Semantic Statistics co-located with 14th International Semantic Web Conference (ISWC 2014), Riva del Garda - Trentino, Italy, October 19th, 2014, volume 1550 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[27] A. Kalyanpur, B. Boguraev, S. Patwardhan, J. W. Murdock, A. Lally, C. Welty, J. M. Prager, B. Coppola, A. Fokoue-Nkoutche, L. Zhang, Y. Pan, and Z. Qiu. Structured Data and Inference in DeepQA. IBM Journal of Research and Development, 56(3):10, 2012. 10.1147/JRD.2012.2188737.

[28] B. Kämpgen and A. Harth. OLAP4LD - A Framework for Building Analysis Applications Over Governmental Statistics. In V. Presutti, E. Blomqvist, R. Troncy, H. Sack, I. Papadakis, and A. Tordai, editors, The Semantic Web: ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers, volume 8798 of Lecture Notes in Computer Science, pages 389–394. Springer International Publishing, 2014. 10.1007/978-3-319-11955-7_54.

[29] S. Kandel, A. Paepcke, J. Hellerstein, and J. Heer. Wrangler: Interactive Visual Specification of Data Transformation Scripts. In D. S. Tan, S. Amershi, B. Begole, W. A. Kellogg, and M. Tungare, editors, Proceedings of the International Conference on Human Factors in Computing Systems, CHI 2011, Vancouver, BC, Canada, May 7-12, 2011, pages 3363–3372. ACM, 2011. 10.1145/1978942.1979444.

[30] A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis, and E. G. Giannopoulou. Ontology Visualization Methods - A Survey. ACM Computing Surveys, 39(4), 2007. 10.1145/1287620.1287621.

[31] T. Lammarsch, W. Aigner, A. Bertone, J. Gartner, E. Mayr, S. Miksch, and M. Smuc. Hierarchical Temporal Patterns and Interactive Aggregated Views for Pixel-based Visualizations. In 13th International Conference on Information Visualisation, IEEE VIS 2009, pages 44–50. IEEE, 2009. 10.1109/IV.2009.52.

[32] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. N. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web, 6(2):167–195, 2015. 10.3233/SW-140134.

[33] S. Lohmann, S. Negru, F. Haag, and T. Ertl. VOWL 2: User-Oriented Visualization of Ontologies. In K. Janowicz, S. Schlobach, P. Lambrix, and E. Hyvönen, editors, Knowledge Engineering and Knowledge Management. 19th International Conference, EKAW 2014, Linköping, Sweden, November 24-28, 2014. Proceedings, volume 8876 of Lecture Notes in Computer Science, pages 266–281. Springer International Publishing, 2014. 10.1007/978-3-319-13704-9_21.

[34] E. N. Lorenz. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963. 10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.

[35] N. Marie and F. L. Gandon. Survey of Linked Data Based Exploration Systems. In D. Thakker, D. Schwabe, K. Kozaki, R. Garcia, C. Dijkshoorn, and R. Mizoguchi, editors, Proceedings of the 3rd International Workshop on Intelligent Exploration of Semantic Data (IESD 2014) co-located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 20, 2014, volume 1279 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[36] B. Mutlu, P. Höfler, V. Sabol, G. Tschinkel, and M. Granitzer. Automated Visualization Support for Linked Research Data. In S. Lohmann, editor, Proceedings of the Posters and Demonstrations Track, I-SEMANTICS 2013, volume 1026 of CEUR Workshop Proceedings, pages 40–44. CEUR-WS.org, 2013.

[37] F. Osborne, E. Motta, and P. Mulholland. Exploring Scholarly Data with Rexplore. In H. Alani, L. Kagal, A. Fokoue, P. T. Groth, C. Biemann, J. X. Parreira, L. Aroyo, N. F. Noy, C. Welty, and K. Janowicz, editors, The Semantic Web - ISWC 2013 - 12th International Semantic Web Conference, Sydney, NSW, Australia, October 21-25, 2013, Proceedings, Part I, volume 8218 of Lecture Notes in Computer Science, pages 460–477. Springer Berlin Heidelberg, 2013. 10.1007/978-3-642-41335-3_29.

[38] H. Paulheim and F. Probst. Ontology-Enhanced UserInterfaces: A Survey. International Journal on Se-mantic Web and Information Systems, 6(2):36–59, 2010.10.4018/jswis.2010040103.

[39] J. Polowinski. Towards RVL: A Declarative Language forVisualizing RDFS/OWL Data. In D. Camacho, R. Akerkar,and M. D. Rodríguez-Moreno, editors, Proceedings of the3rd International Conference on Web Intelligence, Mining andSemantics, WIMS 2013, Madrid, Spain, June 12-14, 2013.,page 38. ACM, 2013. 10.1145/2479787.2479825.

[40] V. Sabol, G. Tschinkel, E. E. Veas, P. Höfler, B. Mutlu, andM. Granitzer. Discovery and Visual Analysis of LinkedData for Humans. In P. Mika, T. Tudorache, A. Benrstein,C. Welty, C. A. Knoblock, D. Vrandecic, P. T. Groth, N. F.Noy, K. Janowicz, and C. A. Goble, editors, The Semantic Web- ISWC 2014 - 13th International Semantic Web Conference,Riva del Garda, Italy, October 19-23, 2014. Proceedings, PartI, volume 8796 of Lecture Notes in Computer Science, pages309–324. Springer International Publishing Switzerland, 2014.10.1007/978-3-319-11964-9_20.

[41] M. Sabou, I. Arsal, and A. Brasoveanu. TourMISLOD: ATourism Linked Data Set. Semantic Web - Interoperability, Us-ability, Applicability, 4(3):271–276, 2013. 10.3233/SW-2012-0087.

[42] M. Sabou, A. Brasoveanu, and I. Önder. Linked Data forCross-Domain Decision-making in Tourism. In I. Tussya-diah and A. Inversini, editors, Information and Communica-tion Technologies in Tourism 2015. Proceedings of the Inter-national Conference in Lugano, Switzerland, February 3 - 6,2015, pages 197–210. Springer, 2015. 10.1007/978-3-319-14343-9_15.

[43] M. Sabou, I. Önder, A. M. P. Brasoveanu, and A. Scharl.Towards Cross-domain Data Analytics in Tourism: a LinkedData based Approach. Information Technology andTourism, page Forthcoming (Accepted for publication), 2016.10.1007/s40558-015-0049-5.

[44] P. E. R. Salas, M. Martin, F. M. D. Mota, K. Breitman, S. Auer, and M. A. Casanova. Publishing Statistical Data on the Web. In Proceedings of the 6th International IEEE Conference on Semantic Computing, IEEE ICSC 2012, Palermo, Italy, September 19-21, 2012, pages 282–292. IEEE, 2012. 10.1109/ICSC.2012.16.

[45] A. Scharl, D. Herring, W. Rafelsberger, A. Hubmann-Haidvogel, R. Kamolov, D. Fischl, M. Föls, and A. Weichselbraun. Semantic Systems and Visual Tools to Support Environmental Communication. IEEE Systems Journal, page Forthcoming (Accepted 31 July 2015), 2016. 10.1109/JSYST.2015.2466439.

[46] A. Scharl, A. Hubmann-Haidvogel, M. Sabou, A. Weichselbraun, and H.-P. Lang. From Web Intelligence to Knowledge Co-Creation - A Platform to Analyze and Support Stakeholder Communication. IEEE Internet Computing, 17(5):21–29, 2013. 10.1109/MIC.2013.59.

[47] A. Scharl, A. Weichselbraun, M. Göbel, W. Rafelsberger, and R. Kamolov. Scalable Knowledge Extraction and Visualization for Web Intelligence. In 49th Hawaii International Conference on System Sciences (HICSS-2016), pages 3749–3757. IEEE, 2016.

[48] B. Shneiderman. The Eyes Have It: A Task by Data Type Taxonomy for Information Visualizations. In Proceedings of the IEEE Symposium on Visual Languages, IEEE VL 1996, pages 336–343. IEEE, 1996. 10.1109/VL.1996.545307.

[49] T. Tarasova. News Fact-checking: One Practical Application of Linked Statistics. In S. Capadisli, F. Cotton, A. Haller, A. Hamilton, M. Scannapieco, and R. Troncy, editors, Proceedings of the 2nd International Workshop on Semantic Statistics co-located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda - Trentino, Italy, October 19th, 2014, volume 1550 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[50] L. D. Vocht, A. Dimou, J. Breuer, M. V. Compernolle, R. Verborgh, E. Mannens, P. Mechant, and R. V. de Walle. A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open Data. In D. Thakker, D. Schwabe, K. Kozaki, R. Garcia, C. Dijkshoorn, and R. Mizoguchi, editors, Proceedings of the 3rd International Workshop on Intelligent Exploration of Semantic Data (IESD 2014) co-located with the 13th International Semantic Web Conference (ISWC 2014), Riva del Garda, Italy, October 20, 2014, volume 1279 of CEUR Workshop Proceedings. CEUR-WS.org, 2014.

[51] A. Weichselbraun, S. Gindl, and A. Scharl. Enriching Semantic Knowledge Bases for Opinion Mining in Big Data Applications. Knowledge-Based Systems, 69:78–86, 2014. 10.1016/j.knosys.2014.04.039.

[52] A. Weichselbraun, D. Streiff, and A. Scharl. Consolidating Heterogeneous Enterprise Data for Named Entity Linking and Web Intelligence. International Journal on Artificial Intelligence Tools, 24(2):1–31, 2015. 10.1142/S0218213015400084.

[53] C. Welty. Semantic Web and Best Practice in Watson. In S. Coppens, K. Hammar, M. Knuth, M. Neumann, D. Ritze, H. Sack, and M. V. Sande, editors, Proceedings of the Workshop on Semantic Web Enterprise Adoption and Best Practice co-located with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 22, 2013, volume 1106 of CEUR Workshop Proceedings. CEUR-WS.org, 2013.

[54] L. Wilkinson. The Grammar of Graphics. Statistics and Computing. Springer-Verlag New York, Secaucus, NJ, USA, 2005. 10.1007/0-387-28695-0.

[55] M. Yoshioka and N. Kando. Issues for Linking Geographical Open Data of GeoNames and Wikipedia. In H. Takeda, Y. Qu, R. Mizoguchi, and Y. Kitamura, editors, Semantic Technology, Second Joint International Conference, JIST 2012, Nara, Japan, December 2-4, 2012. Proceedings, volume 7774 of Lecture Notes in Computer Science, pages 375–381. Springer, 2013. 10.1007/978-3-642-37996-3_32.

[56] B. Zapilko and B. Mathiak. Object Property Matching Utilizing the Overlap between Imported Ontologies. In V. Presutti, C. d’Amato, F. Gandon, M. d’Aquin, S. Staab, and A. Tordai, editors, The Semantic Web: Trends and Challenges. 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings, volume 8465 of Lecture Notes in Computer Science, pages 737–751. Springer International Publishing, 2014. 10.1007/978-3-319-07443-6_49.

[57] A. Zaveri, A. Rula, A. Maurino, R. Pietrobon, J. Lehmann, and S. Auer. Quality Assessment for Linked Data: A Survey. Semantic Web, 7(1):63–93, 2016. 10.3233/SW-150175.