International Journal of Digital Earth

ISSN: 1753-8947 (Print) 1753-8955 (Online) Journal homepage: http://www.tandfonline.com/loi/tjde20

To cite this article: Tian Zhao, Chuanrong Zhang, Luc Anselin, Weidong Li & Ke Chen (2015) A parallel approach for improving Geo-SPARQL query performance, International Journal of Digital Earth, 8:5, 383-402, DOI: 10.1080/17538947.2014.904012

To link to this article: http://dx.doi.org/10.1080/17538947.2014.904012

Published online: 03 Apr 2014.



A parallel approach for improving Geo-SPARQL query performance

Tian Zhao^a, Chuanrong Zhang^b*, Luc Anselin^c, Weidong Li^b and Ke Chen^a

^a Department of Computer Science, University of Wisconsin–Milwaukee, Milwaukee, WI, USA; ^b Department of Geography & Center of Environmental Sciences and Engineering, University of Connecticut, Storrs, CT, USA; ^c School of Geographical Sciences & Urban Planning, Arizona State University, Tempe, AZ, USA

(Received 24 October 2013; accepted 10 March 2014)

Geospatial Semantic Web promises better retrieval of geospatial information for Digital Earth systems by explicitly representing the semantics of data through ontologies. It also promotes sharing and reuse of geospatial data by encoding it in Semantic Web languages, such as RDF, to form a geospatial knowledge base. For many applications, rapid retrieval of spatial data from the knowledge base is critical. However, spatial data retrieval using the standard Semantic Web query language, Geo-SPARQL, can be very inefficient because the data in the knowledge base are no longer indexed to support efficient spatial queries. While recent research has been devoted to improving query performance on general knowledge bases, it is still challenging to support efficient queries of spatial data with complex topological relationships. This research introduces a query strategy to improve the query performance of geospatial knowledge bases by creating spatial indices on the fly to prune the search space for spatial queries and by parallelizing the spatial join computations within the queries. We focus on improving the performance of Geo-SPARQL queries on knowledge bases encoded in RDF. Our initial experiments show that the proposed strategy can greatly reduce the runtime costs of Geo-SPARQL queries through on-the-fly spatial indexing and parallel execution.

Keywords: Geo-SPARQL; parallel geocomputation; geospatial semantic web

1. Introduction

Digital Earth as a mechanism for integrating data from multiple sources has been put forward for more than 10 years (Gore 1998), and significant progress has been made in implementing Digital Earth systems (Guo 1999; Guo et al. 2009). Semantic interoperability is a core research topic for integrating, interlinking, and retrieving vast geo-referenced, multi-perspective geospatial knowledge through Digital Earth systems. The Geospatial Semantic Web offers semantic interoperability support to Digital Earth systems and extends the Digital Earth vision from a data archive and infrastructure to a knowledge engine, which enables more powerful reasoning and information retrieval from heterogeneous and contradicting conceptual models and scientific data in Digital Earth systems. The Geospatial Semantic Web promises better retrieval of geospatial information for Digital Earth systems by explicitly representing the semantics of data through ontologies, which can be understood and processed by computers. It also promotes sharing and reuse of spatial data for a wide variety of applications by using standardized Semantic Web languages such as RDF to encode spatial data. However, representing structured geospatial data in these languages can result in inefficient data access. One of the main obstacles to efficient and distributed queries on a geospatial knowledge base is the lack of indexing on spatially related data objects. This problem is inherent in the RDF representation of spatial data, which consists of loosely connected data objects related by object properties. Even if spatial objects are originally stored in related database tables, once they are transformed to RDF objects, the spatial indices are lost. It is possible to recreate indices for RDF objects with spatial attributes. However, pre-computing spatial indices does not guarantee performance improvement, since RDF queries are much more flexible than database queries and it is difficult to predict which spatial objects should be indexed and how. Thus, it is necessary to implement suitable extensions of the RDF query engine that create and exploit spatial indices on the fly.

*Corresponding author. Email: [email protected]

© 2014 Taylor & Francis

The Geo-SPARQL protocol was proposed by the Open Geospatial Consortium (OGC) as an extension of SPARQL for querying geographic RDF data. Geo-SPARQL queries are dominated by spatial join operations due to the fine-grained nature of the RDF data model. The lack of spatial indices causes additional performance problems for Geo-SPARQL queries. One reason for the poor performance is the way that spatial attributes are stored in RDF datasets. Spatial attributes are usually stored as string literals that conform to certain formats such as WKT or GML. A naïve implementation of a spatial operator or a filter function in Geo-SPARQL treats its spatial inputs as plain strings and has to parse them to recover spatial contents such as x and y coordinates before any spatial computation. Repeated parsing of the spatial inputs imposes a very large runtime overhead. The second reason for the poor performance is the lack of parallelization. Since spatial objects are not indexed, a Geo-SPARQL query engine cannot partition ontology data into subsets to be processed in parallel. As a result, a Geo-SPARQL query can only be processed as a single-threaded program. Even with pre-computed spatial indices, partitioning spatial ontology data is not easy since the targeted data may not be evenly distributed in the indices.

This research introduces a new parallel approach for improving the query performance of geospatial ontology in a Geo-SPARQL query by separating spatial and non-spatial components. Parallel approaches have long been used in the literature to improve query performance. However, past research on improving query performance using parallelization has been centered on relational databases (e.g. Boral et al. 1990; DeWitt et al. 1986; Kitsuregawa et al. 1983). Optimization techniques for parallel relational databases are not specialized for the triple model of RDF and the triple patterns of SPARQL queries, on which RDF- and SPARQL-specific query engines rely (Groppe and Groppe 2011). Although there are studies that query heterogeneous relational databases using SPARQL and parallel algorithms (e.g. Miao and Wang 2009; Castagna, Seaborne, and Dollin 2009; Karjalainen 2009), parallel relational databases have inherent limitations such as scalability. A SPARQL query can be parallelized by treating each triple statement in the query as a parallel task and joining the results of all the triple statement sub-queries after all the parallel tasks have completed (Groppe and Groppe 2011). Unfortunately, this approach does not work efficiently when spatial predicates exist in the triple statements. There are also studies that propose methods for efficiently parallelizing join queries over RDF data using Map-Reduce systems (e.g. Ravindra, Kim, and Anyanwu 2011; Kim, Ravindra, and Anyanwu 2011; Anyanwu 2013). However, to the best of our knowledge, no study deals with parallelizing spatial join computations to support efficient spatial RDF queries, which is an important issue for the development of a Geospatial Semantic Web (Zhang, Li, and Zhao 2007; Zhao et al. 2008, 2010b; Zhang, Zhao, and Li 2010a, 2010c, 2013).

In this study, we propose a new approach to optimize and parallelize spatial joins in Geo-SPARQL queries. The novel idea of the proposed parallel approach is to separate spatial and non-spatial components in a Geo-SPARQL query. Instead of pre-computing spatial indices for geospatial ontology and implementing spatial extensions of a query engine to use the indices, we propose a query engine strategy that separates spatial components from non-spatial components in a Geo-SPARQL query and processes the spatial sub-queries after the non-spatial sub-queries have been completed. The main benefit of this approach is that the non-spatial sub-queries yield a smaller set of ontology objects, whose spatial attributes can be cached for subsequent spatial computation, including on-the-fly spatial indexing and spatial joins. Since the parsed spatial attributes are cached, the overhead caused by repeated parsing of spatial literal strings can be avoided. We expect that the results of this research will facilitate access to spatial ontology information for multiple users through highly intensive geo-computation processes over a Geospatial Semantic Web, particularly for time-critical applications such as disaster response.

2. A framework for improving Geo-SPARQL query performance

The proposed approach focuses on improving the performance of spatial computations in Geo-SPARQL queries. Geo-SPARQL is a geographic query language for RDF that extends SPARQL with a standard vocabulary for spatial information, query functions for spatial computation, and query rewriting rules to expand feature-feature queries to geometry queries. In our framework, we separate the spatial and non-spatial components of a query. Based on the spatial component of a query, we construct spatial indices for the spatial objects returned from the non-spatial component of the query. For parallel processing, we partition the spatial objects based on the indices and process spatial operations in parallel for each data partition. The results of the parallel computation are compiled as the final result of the original query.

Figure 1 shows the detailed query procedure using the proposed approach. Four major components play important roles in the query procedure: a parser, a static analyzer, a query optimizer, and a spatial query parallelizer. Geo-SPARQL is used as the query language to search for the needed geospatial information from heterogeneous data sources over the Web. The parser, for which we use the SPARQL query engine of the Jena ontology library, converts a Geo-SPARQL query (usually just a few lines long) to an abstract syntax tree (AST). The static analyzer then checks the AST programmatically for potential errors.

The result of the static analyzer is an ordered collection of primitive sub-queries, which the optimizer arranges so that they can be processed more efficiently. The optimizer divides the sub-queries into spatial and non-spatial queries. The non-spatial query component is answered first since it is usually less time consuming, and the non-spatial results are the inputs for answering the spatial query component. The spatial queries are potentially computationally intensive. Therefore, they are parallelized based on spatial indices: they are split into disjoint parallel query tasks, which are answered independently. In the end, the results of the parallel query tasks are integrated to form the final answer to the original Geo-SPARQL query.

We extend the Jena library’s implementation of the SPARQL query engine by processing filtering operations on spatial objects in parallel. While the strategy of parallelization by splitting data using a spatial index is not new, adopting this strategy to support efficient Geo-SPARQL queries has not been done before.

The main advantage of this framework is that it improves the runtime performance of Geo-SPARQL queries by caching the spatial attributes parsed from ontology literals, on-the-fly indexing, and parallel spatial joins. The overall goal is to increase the efficiency of Geo-SPARQL queries so that applications of a Geospatial Semantic Web can process spatial queries within a reasonable amount of time.

In the following sections, we introduce the primary technologies applied in the framework. These include Geo-SPARQL, query rewriting, and parallelization.

2.1. Geo-SPARQL

In our framework, we use the OGC Geo-SPARQL (OGC11-052r4 2012) protocol for querying geospatial data on a Geospatial Semantic Web. Geo-SPARQL represents geospatial data using the RDF (Resource Description Framework) representation. It queries geospatial data by extending the general SPARQL query language to process geospatial data (Battle and Kolas 2012). Figure 2 illustrates the major components of Geo-SPARQL.

Figure 1. The query procedure using the proposed optimization strategy.


The core component defines top-level RDFS/OWL (RDF Schema/Web Ontology Language) classes for spatial objects. The Geometry component describes the geometry vocabulary and non-topological query functions for geometry objects. The Topological vocabulary component expresses RDF properties for asserting topological relations between spatial objects. The Geometry topology component identifies topological query functions. The Query rewrite component defines transformation rules for computing spatial relations between spatial objects based on their associated geometries. Finally, the RDFS entailment component introduces a mechanism for matching implicit RDF triples that are derived based on RDF and RDFS semantics.

Geo-SPARQL defines a small ontology to represent features and geometries. Specifically, geo:SpatialObject and geo:Feature are the two main classes that represent geospatial features. Geometry information is encoded through the single root geometry class geo:Geometry and through the properties geo:hasGeometry and geo:defaultGeometry, which associate geometries with geospatial features. Geo-SPARQL also defines a number of topological and non-topological query predicates and functions to support geospatial data queries. Geo-SPARQL includes a set of terms for topological relations such as geo:equals, geo:disjoint, geo:intersects, geo:touches, geo:crosses, geo:within, geo:contains, and geo:overlaps, which allow users to perform geospatial reasoning and formulate queries based on topological relations between spatial objects. Geospatial reasoning is critical for a Geospatial Semantic Web application such as disaster response. For example, when responders are concerned with damage around a given town such as Mansfield, this allows questions such as ‘Which residential homes are contained within the damaged area of Mansfield?’ to be answered efficiently. This query requires a topological comparison between the geometries of the residential homes and the geometries of the damaged area of Mansfield. The property geo:hasGeometry can be used to connect the features to their geometries, and the topology function geo:within can be used to evaluate the topological relationships. The following lists sample code to carry out the example query:

Figure 2. Major components of Geo-SPARQL.
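For illustration only, a query of this kind might be written along the following lines. This is a sketch, not the authors' listing: the `ex:` namespace, the class names `ex:ResidentialHome` and `ex:DamagedArea`, the property `ex:inTown`, and all variable names are hypothetical. The sketch uses the OGC standard names `geof:sfWithin` and `geo:asWKT`, which are assumed here to correspond to the geo:within relation named in the text. The query is held in a Python string so that it can be handed to whichever Geo-SPARQL endpoint is available.

```python
# Hypothetical Geo-SPARQL query: residential homes within the damaged area of a town.
# All ex: terms are illustrative assumptions and do not come from the article.
HOMES_IN_DAMAGED_AREA = """
PREFIX geo:  <http://www.opengis.net/ont/geosparql#>
PREFIX geof: <http://www.opengis.net/def/function/geosparql/>
PREFIX ex:   <http://example.org/disaster#>

SELECT ?home
WHERE {
  ?home a ex:ResidentialHome ;
        geo:hasGeometry ?homeGeom .
  ?area a ex:DamagedArea ;
        ex:inTown ex:Mansfield ;
        geo:hasGeometry ?areaGeom .
  ?homeGeom geo:asWKT ?homeWkt .
  ?areaGeom geo:asWKT ?areaWkt .
  FILTER (geof:sfWithin(?homeWkt, ?areaWkt))
}
"""


def non_empty_lines(query):
    """Return the query's non-blank lines, stripped, e.g. for logging."""
    return [line.strip() for line in query.splitlines() if line.strip()]
```

The first four triple patterns are non-spatial and only the `FILTER` is spatial, which is the split the framework relies on.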

Geo-SPARQL also supports non-topological query functions such as geof:distance, geof:buffer, geof:convexHull, geof:intersection, geof:union, geof:difference, geof:symDifference, geof:envelope, and geof:boundary. These allow users to make inferences and link multiple datasets together to solve a given problem. For example, in disaster response, consider a scenario where a hurricane struck the Town of Groton. To take immediate rescue actions, the emergency responders need to find evacuation routes. The evacuation routes must not go through possibly flooded areas. So they need to combine data such as transportation road data, non-flooded areas, and political boundaries of the town of Groton to identify potential evacuation routes. The topology relation geo:touches and the non-topology function geof:union can be used to find all route features (?r) that touch the union of the feature non-flooded areas (?flood) and the feature political boundaries of the town of Groton (geo:Groton). The following lists sample code to implement the associated query:

All the ontology classes and functions are derived from OGC standards, which ensures interoperability. Geo-SPARQL allows data to be properly indexed and queried from spatial RDF stores. In addition, it is intended to interoperate with both quantitative and qualitative spatial reasoning systems (Battle and Kolas 2012). With a quantitative spatial reasoning system, Geo-SPARQL explicitly calculates distances and topological relations among concrete geometries of features. With a qualitative geospatial reasoning system, Geo-SPARQL allows RCC-type topological inferences for features where the geometries are either unknown or cannot be made concrete (Grütter and Bauer-Messmer 2007). For example, if there are assertions that a hospital is inside the town of Groton, and Groton is within a flooded area, a qualitative reasoning system should be able to infer through transitivity that the hospital is within a flooded area.
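The transitivity inference in the hospital example can be sketched in a few lines. This is a minimal illustration, assuming that "inside" and "within" denote the same transitive relation and that assertions are plain (inner, outer) pairs; the function name and the identifiers are hypothetical, and a real qualitative reasoner would handle the full set of RCC relations.

```python
def transitive_closure(assertions):
    """Return all (inner, outer) pairs implied by transitivity of 'within'."""
    closure = set(assertions)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                # a within b and b within d implies a within d
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


# Assumed assertions mirroring the example in the text.
asserted = {("hospital", "Groton"), ("Groton", "flooded_area")}
inferred = transitive_closure(asserted)
print(("hospital", "flooded_area") in inferred)  # True
```

No geometry is consulted here: the conclusion follows from the asserted relations alone, which is what distinguishes qualitative from quantitative reasoning.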

2.2. Query rewriting and parallelization

To illustrate our query processing framework, we consider a restricted form of SPARQL query with the syntax shown in Figure 3.

A query Q consists of a set of selection variables v and a triple pattern P, which is a conjunction of a set of triple statements. A triple statement consists of a subject s, a predicate p, and an object o. The subject is either a variable or a URI, and a predicate is a URI, while an object can be a URI, a variable, or a literal. A URI includes a prefix and a short name to identify an ontology resource. In this framework, we only consider the URI that refers to an ontology class or a property. The answer to the query is a set of functions, where a function σ maps each variable to a literal or a URI (Figure 4).

The query to a triple statement P returns a set of σ such that the knowledge base (denoted by K below) entails σ(P).

$$\mathrm{query}(s\ p\ o) = \{\sigma \mid K \models \sigma(s\ p\ o)\} \tag{1}$$

where $K \models s\ p\ o$ means that the ontology $K$ entails the triple $s\ p\ o$.

The query of two triple statements $P.\ P'$ returns the natural join of the answers to $P$ and $P'$.

$$\mathrm{query}(P.\ P') = \{\sigma \cup \sigma' \mid \sigma \in \mathrm{query}(P) \wedge \sigma' \in \mathrm{query}(P') \wedge F(\sigma, \sigma')\} \tag{2}$$

Figure 3. Simplified syntax of Geo-SPARQL.


The function F ensures that the two solutions to a query are consistent.

$$F(\sigma, \sigma') = \text{true} \ \text{ iff } \ \forall\, ?x : (\sigma(?x) = l \wedge \sigma'(?x) = l') \Rightarrow l = l'$$
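The natural join of Equation (2) and the consistency check F can be sketched as follows. This is a minimal illustration, assuming that each solution σ is a Python dictionary from variable names to values; the function names and the sample bindings are hypothetical and are not taken from the Jena implementation.

```python
def compatible(sigma, sigma_prime):
    """F(sigma, sigma'): every shared variable must be bound to the same value."""
    return all(sigma_prime[var] == value
               for var, value in sigma.items() if var in sigma_prime)


def join(solutions_p, solutions_p_prime):
    """Equation (2): take the union of every compatible pair of solutions."""
    return [{**s, **t}
            for s in solutions_p
            for t in solutions_p_prime
            if compatible(s, t)]


# Hypothetical answers to two triple statements that share the variable ?x.
answers_p = [{"?x": "school1", "?g": "POINT(0 0)"},
             {"?x": "school2", "?g": "POINT(5 5)"}]
answers_p_prime = [{"?x": "school1", "?n": "Hill School"},
                   {"?x": "school3", "?n": "Lake School"}]

joined = join(answers_p, answers_p_prime)
```

Only the pair that agrees on `?x` survives, so `joined` holds a single solution binding `?x`, `?g`, and `?n`. The nested loop also shows why joining answers computed over the entire knowledge base is costly.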

Processing triple statements $P$ and $P'$ separately may be very inefficient since the answers to $P$ and $P'$ will be based on the entire knowledge base. However, if we process $P$ first, we may be able to greatly reduce the solution space for $P'$. Therefore, we separate the triple statements of a query into two sets, with one set $P_s$ for spatial queries and one set $P_n$ for non-spatial queries. The separation is based on the predicates (spatial $p_s$ versus non-spatial $p_n$) of the triples. The solution to $P_n$ is found first since it does not involve expensive spatial computations. Note that we use this strategy since we consider the cases where the spatial query components are more computation intensive.

Thus, we redefine the query solution as below: the solution $\mathrm{query}(P_n)$ to the non-spatial query $P_n$ restricts the set of solutions to each spatial triple $s\ p_s\ o$, and the results of the spatial triples are then joined as the final solution. The solution for a spatial triple is defined by $K \models_s s\ p_s\ o$.

$$\mathrm{query}(P_n.\ s\ p_s\ o) = \{\sigma \mid (K \models_s \sigma(\sigma'(s\ p_s\ o))) \wedge \sigma' \in \mathrm{query}(P_n)\} \tag{3}$$

A knowledge base $K$ entails a spatial triple $s\ p_s\ o$ (written as $K \models_s s\ p_s\ o$) if $K$ entails a triple with subject $s$, a triple with object $o$, and $s$ and $o$ satisfy the relation $f_{p_s}(s, o)$, where $f_{p_s}$ implements $p_s$.

$$K \models_s s\ p_s\ o \ \text{ iff } \ (K \models s) \wedge (K \models o) \wedge f_{p_s}(s, o)$$

For queries with multiple spatial triples, we join the result of each spatial query.

$$\mathrm{query}(P_n.\ P_s^1.\ P_s^2) = \mathrm{query}((P_n.\ P_s^1).\ (P_n.\ P_s^2)) \tag{4}$$
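The separation of a query into $P_n$ and $P_s$ and the restriction in Equation (3) can be sketched as below. This is a simplified illustration under several assumptions: triples are plain tuples, the set of spatial predicates is fixed by hand, the solutions to $P_n$ are given directly rather than computed from a knowledge base, and variables are bound straight to parsed points instead of URIs. All names are hypothetical.

```python
import math

# Assumed set of spatial predicates; a real engine would take these from its vocabulary.
SPATIAL_PREDICATES = {"geo:touches", "geo:within", "geo:nearby"}


def split_query(triples):
    """Split triple statements into non-spatial (P_n) and spatial (P_s) sets."""
    non_spatial = [t for t in triples if t[1] not in SPATIAL_PREDICATES]
    spatial = [t for t in triples if t[1] in SPATIAL_PREDICATES]
    return non_spatial, spatial


def answer_spatial(triple, candidates, spatial_fn):
    """Keep only the P_n solutions whose bindings for s and o satisfy f_ps."""
    s, _, o = triple
    return [sigma for sigma in candidates if spatial_fn(sigma[s], sigma[o])]


triples = [("?school", "rdf:type", "ct:School"),
           ("?street", "rdf:type", "ct:Street"),
           ("?street", "geo:nearby", "?school")]
p_n, p_s = split_query(triples)

# Assumed result of answering P_n first; each solution binds both variables.
candidates = [{"?school": (0.0, 0.0), "?street": (3.0, 4.0)},
              {"?school": (0.0, 0.0), "?street": (300.0, 400.0)}]


def nearby(a, b):
    """Stand-in for f_ps: two points are nearby if closer than 200 units."""
    return math.dist(a, b) < 200


result = answer_spatial(p_s[0], candidates, nearby)
```

The spatial function is evaluated only on the candidates that survived the non-spatial sub-query, never on the whole knowledge base.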

To find the solution for a spatial triple $s\ p_s\ o$, we need a strengthened entailment relation $\models_s$ for the knowledge base. In particular, a query engine should implement spatial extensions $f_{p_s}$ to perform the computations specified by the spatial predicate $p_s$. For example, to answer the query of the triple ?x geo:touches ?y, we need to know all the spatial objects in the knowledge base that touch each other. The geo:touches relation is not stored in the knowledge base, nor is it inferable based on the description logic. Thus, we need an extension $f_{\mathrm{geo:touches}}$ to the query engine to implement the predicate geo:touches so that it takes two arguments and returns true if the two arguments are spatial objects that touch each other. In a query engine implementation such as the Jena library, an extension function such as $f_{\mathrm{geo:touches}}$ is called for every possible solution to s and o. This is very inefficient.

The inefficiency comes from several sources. One is that geometries of spatial objects are stored as string literals, so that each time the objects are passed to the extension function, the geometry literal has to be parsed. This is very costly for spatial joins. For example, if we want to find pairs of objects that touch each other and there are $N$ objects, then we have to make $\frac{N(N-1)}{2}$ calls to $f_{\mathrm{geo:touches}}$, and the geometry literals are re-parsed in every one of these calls although there are only $N$ distinct literals. The extension function itself is stateless and is unable to cache the literals that it has parsed. Thus, to avoid this source of inefficiency, we need to avoid passing geometry literals to the extension function altogether. Instead, we cache the parsed values of the geometry literals, and the spatial functions take only these parsed values as inputs.

Figure 4. Definition of query solution.
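The effect of caching parsed geometries can be illustrated with a small counting experiment. This is a sketch under simplifying assumptions: geometries are WKT points of the form `POINT(x y)`, the cache is Python's `functools.lru_cache`, and `same_location` is a hypothetical stand-in for a real spatial predicate such as $f_{\mathrm{geo:touches}}$.

```python
from functools import lru_cache
from itertools import combinations

parse_calls = 0  # counts how often a literal is actually parsed


@lru_cache(maxsize=None)
def parse_point(wkt):
    """Parse a 'POINT(x y)' literal; repeated requests are served from the cache."""
    global parse_calls
    parse_calls += 1
    x, y = wkt[wkt.index("(") + 1:wkt.index(")")].split()
    return float(x), float(y)


def same_location(a, b):
    """Stand-in spatial predicate that works on parsed values, not on strings."""
    return a == b


literals = [f"POINT({i} {i})" for i in range(100)]  # N = 100 distinct literals
pair_calls = 0
for w1, w2 in combinations(literals, 2):
    pair_calls += 1
    same_location(parse_point(w1), parse_point(w2))

print(pair_calls, parse_calls)  # 4950 100
```

With $N = 100$ the predicate is still called $\frac{100 \cdot 99}{2} = 4950$ times, but each literal is parsed only once. Without the cache, every one of those calls would parse both of its arguments again.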

Another source of inefficiency is the way RDF queries are processed. In calculating the entailment of a knowledge base, the query engine conducts an exhaustive search to find the triples that are answers to a triple query statement. Spatial indices are not considered in SPARQL queries. For a Geo-SPARQL query, it is possible to have pre-computed spatial indices for certain spatial objects so that the extension functions can take advantage of the indices to avoid a linear search. For example, if we have indexed the geo-names in a knowledge base using an R-tree, then a query such as ?x geo:nearby (43, −88) can be answered much more efficiently by searching the index. However, this pre-computed spatial index approach only works in limited cases, for example when the object of the triple query is a literal. For a query such as ?x geo:nearby ?y, the index may not be useful if the knowledge base contains multiple types of spatial objects. The reason is that the definition of the nearby relation depends on the geometries being compared, while the available indices may be suitable for locating objects near a point but not near a line or a polygon. To allow more efficient queries and yet remain flexible, we create spatial indices for spatial objects on the fly. The actual indexing depends on the spatial predicates and the geometry types of the spatial objects. For instance, to find the streets near several high schools, we can index the streets based on x-coordinates so that each street is indexed to the closest high school based on the x-coordinates. After indexing, we can greatly reduce the number of calls to $f_{\mathrm{geo:nearby}}$. Since our indexing is on the fly and specific to each spatial predicate, we can implement it as a pre-processing function associated with $f_{\mathrm{geo:nearby}}$. The knowledge base itself does not need to include any spatial indices. Note that even though the current experiment defines a specific indexing strategy for lines near points, we can generalize this to every combination of extension functions and their input types.
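One simple way to build an index on x-coordinates on the fly is sketched below. It is not the authors' indexing scheme: instead of assigning each street to its closest school, this sketch sorts streets by the smallest x-coordinate of their vertices and uses binary search to discard streets that cannot lie within a given radius of a school. The function names and the sample data are hypothetical.

```python
from bisect import bisect_right


def build_x_index(streets):
    """Sort streets by the smallest x-coordinate of their vertices."""
    entries = sorted((min(x for x, _ in pts), max(x for x, _ in pts), sid)
                     for sid, pts in streets.items())
    min_xs = [entry[0] for entry in entries]
    return min_xs, entries


def candidates_near(x, radius, index):
    """Street ids whose x-extent overlaps the interval [x - radius, x + radius]."""
    min_xs, entries = index
    # Streets that start to the right of x + radius cannot be within the radius.
    upper = bisect_right(min_xs, x + radius)
    return [sid for _, max_x, sid in entries[:upper] if max_x >= x - radius]


# Hypothetical poly-lines given as lists of (x, y) vertices.
streets = {"elm": [(0, 0), (100, 0)],
           "oak": [(150, 50), (400, 50)],
           "far": [(5000, 0), (5200, 0)]}
index = build_x_index(streets)
print(candidates_near(120, 200, index))  # ['elm', 'oak']
```

Only the surviving candidates need to be passed to the exact distance function, which is how an index of this kind reduces the number of calls to $f_{\mathrm{geo:nearby}}$. The filter is conservative: it may keep streets that are too far away in the y-direction, but it never drops a street that is actually within the radius.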

The last source of inefficiency is the lack of parallelism in processing Geo-SPARQL queries. Even though we can answer triple statements in parallel and join their results, the performance gain is not significant since the runtime of the triple queries is often very uneven. In some cases, answering all triple statements sequentially may even be faster than answering each triple statement in parallel if the triple statements are highly correlated. Thus, in this framework, we answer the non-spatial queries first and use their solutions to trim the solution space for the spatial queries.

We focus on performance improvement for each spatial triple. Given a triple $s\ p_s\ o$, if either s or o is a literal, then the query can be answered efficiently without parallelization. When both s and o are variables, we first create indices for the spatial objects that are in any potential solution σ to s and o, then we partition the solutions into k portions $\sigma_1, \sigma_2, \ldots, \sigma_k$, and finally we compute $f_{p_s}(\sigma_i(s), \sigma_i(o))$ in parallel based on the partitions. The results of the parallel threads are then aggregated to form the final solution to $s\ p_s\ o$. For multiple spatial triple statements, we can execute them in parallel as well and join the results of each individual query. However, the performance gain will be query dependent and is not as predictable as parallelizing the query of an individual triple statement.
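The partition-and-aggregate step can be sketched with a thread pool. This is an illustration only, written with Python's `concurrent.futures` rather than the JDK threading library, and it partitions the candidate pairs into contiguous chunks instead of using a spatial index. The function names and the sample data are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def partition(items, k):
    """Split items into at most k contiguous portions of nearly equal size."""
    size = max(1, math.ceil(len(items) / k))
    return [items[i:i + size] for i in range(0, len(items), size)]


def evaluate(portion, spatial_fn):
    """Apply the spatial function to every (s, o) pair of one portion."""
    return [(s, o) for s, o in portion if spatial_fn(s, o)]


def parallel_spatial_join(pairs, spatial_fn, k=4):
    """Evaluate f_ps over k portions in parallel and aggregate the results."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        parts = pool.map(lambda p: evaluate(p, spatial_fn), partition(pairs, k))
        return [hit for part in parts for hit in part]


def nearby(a, b):
    """Stand-in for f_ps: two points are nearby if closer than 200 units."""
    return math.dist(a, b) < 200


# Hypothetical candidate solutions: one school paired with points along a line.
pairs = [((0, 0), (x, 0)) for x in range(0, 1000, 50)]
hits = parallel_spatial_join(pairs, nearby)
```

Because `pool.map` returns results in submission order, the aggregated list preserves the order of the input pairs. In CPython, threads show the structure of the computation but do not speed up CPU-bound pure-Python code; a process pool would be needed for real gains there.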


To summarize, in this framework, we introduce improvements to an ontology query engine by (1) separating the non-spatial queries from the spatial queries, (2) caching the inputs to spatial extension functions to avoid repeated parsing of geometry literals, (3) creating indices for certain inputs of spatial extensions on the fly, and (4) parallelizing the spatial joins using data partitioned based on the spatial indices, where a spatial join is represented by a triple of the form $s\ p_s\ o$ in the query and $p_s$ is a predicate corresponding to a topological relation. We report some of the experimental results of our preliminary implementation in the next section.

3. Performance evaluation

To share the heterogeneous data of GIS databases at the semantic level, we published the GIS data, originally in Shapefile format, as WFS services across different sources. We then converted spatial features from the WFS services into JSON (JavaScript Object Notation) files. The JSON files were then converted through our Java-based converter into RDF files, which were loaded into memory through the Jena library API.

Under this environment we conducted a limited set of experiments on parallelizing Geo-SPARQL queries using the RDF files. The experiments were run on a workstation with a dual-core Intel Core i5-3320M CPU at 2.6 GHz with four hyper-threads. The allocated memory of the workstation was 2 GB. The queries were executed both sequentially and in parallel using a Java program. The standard JDK threading library was used to develop the parallel programs. The parallel program used a shared memory model in which all threads have access to the same memory, so the experiments were run on a single machine. To use clustered machines, a distributed memory model such as Map-Reduce would have to be used. This is achievable but not necessary to demonstrate the speed-up in this study.

Here we show some of the evaluation results of two experiments that we conducted on datasets of two different sizes: a small dataset for New Haven, CT and a larger dataset for the coastal region of Connecticut. The first experiment was conducted on the small dataset, which consists of two map layers, one for schools and one for streets, as illustrated in Figure 5. It contains 54 schools, shown as red squares, and 3449 streets, shown as blue lines.

In both experiments, we executed the query Q1 to select the nearby streets of each school.

The query was executed both sequentially and in parallel. We used the Jena library API to load the ontology that contains the spatial features of New Haven and to answer the Geo-SPARQL queries. The schools are point features while the streets are poly-line features. To decide whether there is a ‘nearby’ relation between a point and a poly-line, we rewrote the triple ?street geo:nearby ?school to the triples ?school ct:geom ?g1 and ?street ct:geom ?g2, and the statement filter(distance(?g1, ?g2) < 200) as above. We added an extension to the Jena API to compute the distances between the spatial objects.
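The distance test in the rewritten filter can be illustrated with a small, self-contained sketch. It is not the authors' Jena extension: the class name `NearbyDistance`, the planar (Euclidean) distance, and the representation of a poly-line as a list of `{x, y}` vertices are all assumptions made for illustration, and the threshold is taken to be in the units of the coordinate system.

```java
import java.util.List;

/** Hypothetical helper (not from the paper): planar point-to-poly-line distance. */
final class NearbyDistance {

    /** Distance from point (px, py) to the segment (ax, ay)-(bx, by). */
    static double pointToSegment(double px, double py,
                                 double ax, double ay, double bx, double by) {
        double dx = bx - ax, dy = by - ay;
        double lenSq = dx * dx + dy * dy;
        if (lenSq == 0.0) {
            // Degenerate segment: both endpoints coincide.
            return Math.hypot(px - ax, py - ay);
        }
        // Projection parameter of the point onto the segment, clamped to [0, 1].
        double t = ((px - ax) * dx + (py - ay) * dy) / lenSq;
        t = Math.max(0.0, Math.min(1.0, t));
        return Math.hypot(px - (ax + t * dx), py - (ay + t * dy));
    }

    /** Minimum distance from a point to a poly-line given as {x, y} vertices. */
    static double pointToPolyline(double px, double py, List<double[]> line) {
        if (line.isEmpty()) {
            throw new IllegalArgumentException("empty poly-line");
        }
        if (line.size() == 1) {
            return Math.hypot(px - line.get(0)[0], py - line.get(0)[1]);
        }
        double best = Double.POSITIVE_INFINITY;
        for (int i = 0; i + 1 < line.size(); i++) {
            double[] a = line.get(i), b = line.get(i + 1);
            best = Math.min(best, pointToSegment(px, py, a[0], a[1], b[0], b[1]));
        }
        return best;
    }

    /** The 'nearby' test of the rewritten filter: distance below a threshold. */
    static boolean nearby(double px, double py, List<double[]> line, double threshold) {
        return pointToPolyline(px, py, line) < threshold;
    }
}
```

With a threshold of 200, `nearby` plays the role of `filter(distance(?g1, ?g2) < 200)` for one school and street pair.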

The workflow of query rewriting and parallelization is illustrated in Figure 1, where the static analyzer rewrites the original query to expand some triple statements to more primitive triple and filter statements using the pre-defined query rewriting rules. The primitive queries are grouped into spatial and non-spatial sub-queries. The solutions to the non-spatial sub-queries are indexed and partitioned for answering the spatial sub-queries in parallel. The query rewriting rules are pre-defined logic inference rules, with each rule consisting of a head and a set of conditions. For example, the rewriting rule for the nearby relation is defined as follows:

This rule can be applied to ?school geo:nearby ?street since it can be unified with the head of the rule ?x geo:nearby ?y with the most general unifier σ = {?x ↦ ?school, ?y ↦ ?street}. We then apply σ to the conditions of the rule and add them to the original query. Note that the definition of the nearby relation is subjective and our interpretation of the relation is manifested in the filter function of the rewriting rule, which is subject to user modification.
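The unification step can be sketched as follows. This is a simplified illustration rather than the authors' static analyzer: the class `NearbyRewrite`, the encoding of statements as token arrays, and the exact rule conditions (inferred from the rewriting described in the text) are assumptions. The matching is one-way (rule-head variables are bound to query terms), which is enough for a head such as ?x geo:nearby ?y, and the fresh variables ?g1 and ?g2 are not renamed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of applying a rewriting rule to a triple statement. */
final class NearbyRewrite {

    /** Rule head, encoded as {subject, predicate, object}. */
    static final String[] NEARBY_HEAD = {"?x", "geo:nearby", "?y"};

    /** Rule conditions (assumed encoding): two triples and one filter. */
    static final List<String[]> NEARBY_CONDITIONS = Arrays.asList(
            new String[] {"?x", "ct:geom", "?g1"},
            new String[] {"?y", "ct:geom", "?g2"},
            new String[] {"filter", "distance", "?g1", "?g2", "<", "200"});

    /** Matches a rule head against a query triple; returns the substitution or null. */
    static Map<String, String> unify(String[] head, String[] triple) {
        Map<String, String> sigma = new LinkedHashMap<>();
        for (int i = 0; i < 3; i++) {
            String h = head[i], t = triple[i];
            if (h.startsWith("?")) {
                // A head variable must be bound consistently everywhere it occurs.
                String bound = sigma.putIfAbsent(h, t);
                if (bound != null && !bound.equals(t)) {
                    return null;
                }
            } else if (!h.equals(t)) {
                return null; // constants (e.g. the predicate) must agree
            }
        }
        return sigma;
    }

    /** Applies the substitution to every token of every condition. */
    static List<String[]> apply(Map<String, String> sigma, List<String[]> conditions) {
        List<String[]> out = new ArrayList<>();
        for (String[] c : conditions) {
            String[] r = new String[c.length];
            for (int i = 0; i < c.length; i++) {
                r[i] = sigma.getOrDefault(c[i], c[i]);
            }
            out.add(r);
        }
        return out;
    }
}
```

Unifying the head with ?school geo:nearby ?street yields σ = {?x ↦ ?school, ?y ↦ ?street}, and applying σ to the conditions produces the statements that are added to the query.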

Figure 5. The map layers of schools and streets in New Haven, CT.

To parallelize the query, we group the query statements into two sub-queries: the first sub-query consists of the triple statements such as ‘?school rdf:type ct:school’ and the second sub-query consists of the filter statements. In this experiment, the first sub-query was executed in parallel by sending triples of the same subject variable to the same thread. The final results of the triple threads were aggregated as a set of solutions to the variables (?school, ?g1, ?street, ?g2). The results from the triple sub-queries were used for processing the filter sub-query.

We executed the filter sub-query in parallel by dividing the inputs into equal proportions and sending them to each thread. For this sub-query, we divided the inputs into p equally sized blocks, where each block contains N/p streets, N is the total number of streets, and p is the number of threads. While the number of threads to execute a triple sub-query is bounded by the number of the subject variables involved, the number of threads to execute the filter sub-query is not bounded. We observed that the runtime of the Geo-SPARQL query may be dominated by the filter sub-query. Therefore, it is possible to reduce the execution time of a query by increasing the number of threads for executing the filter functions. The performance gain of the parallel execution is limited by the underlying parallel architecture and the overhead of threading.
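The block partitioning described above can be sketched with the standard JDK concurrency utilities. This is an illustrative sketch, not the authors' program: the class `BlockParallelFilter` and its methods are hypothetical, the filter is passed in as a generic predicate, and p is assumed to be at least 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Predicate;

/** Hypothetical sketch: evaluate a filter over N inputs using p equally sized blocks. */
final class BlockParallelFilter {

    /** Splits [0, n) into p contiguous blocks whose sizes differ by at most one. */
    static List<int[]> blocks(int n, int p) {
        List<int[]> out = new ArrayList<>();
        int base = n / p, extra = n % p, start = 0;
        for (int i = 0; i < p; i++) {
            int size = base + (i < extra ? 1 : 0);
            out.add(new int[] {start, start + size});
            start += size;
        }
        return out;
    }

    /** Runs the predicate on each block in its own task and concatenates the hits in order. */
    static <T> List<T> filter(List<T> inputs, Predicate<T> keep, int p) {
        ExecutorService pool = Executors.newFixedThreadPool(p);
        try {
            List<Future<List<T>>> futures = new ArrayList<>();
            for (int[] b : blocks(inputs.size(), p)) {
                List<T> block = inputs.subList(b[0], b[1]);
                Callable<List<T>> task = () -> {
                    List<T> hits = new ArrayList<>();
                    for (T item : block) {
                        if (keep.test(item)) {
                            hits.add(item);
                        }
                    }
                    return hits;
                };
                futures.add(pool.submit(task));
            }
            List<T> result = new ArrayList<>();
            for (Future<List<T>> f : futures) {
                try {
                    result.addAll(f.get()); // waits for the block to finish
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException(e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException(e.getCause());
                }
            }
            return result;
        } finally {
            pool.shutdown();
        }
    }
}
```

Because the blocks are contiguous and the results are collected in submission order, the output preserves the input order regardless of how the threads are scheduled.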

Figure 6 shows the runtime statistics of the query Q1, where the sequential query time is over 4 seconds. If we divided the query into a sub-query of triple statements and a sub-query of filter statements and executed the two sub-queries in sequence, then the total runtime was reduced substantially to 488–759 ms (milliseconds) depending on the number of threads used to run the filter statement. We explain the two components of the runtime cost:

(1) The triple sub-queries were run on two threads only and took 56 ms in total. Since the number of schools is much smaller than the number of streets, the thread for querying streets took 54 ms while the thread for querying schools took only 6 ms.

(2) We ran the filter sub-query using 1, 2, 4, and 8 threads; the average runtime of each thread decreased as expected, and the total runtime of the sub-query was 703, 460, 432, and 434 ms, respectively. The lack of performance gain after 4 threads may be due to the fact that the experiment was run on a CPU with 2 cores and 4 hyper-threads. As we increased the number of Java threads, the runtime per thread decreased as expected but there was no further performance gain after the number of threads exceeded four. The reason that we ran the filter sub-query using one thread is that we wanted to obtain a measure of the overhead associated with parallel execution of the filter sub-query. Note that in all parallel runs, the triple sub-queries were run in parallel and the parsed geometries of streets and schools were cached. Thus, even with a single thread, the query ran much faster than it did before optimization.

Figure 6. Runtime (milliseconds) of Q1 to find the nearby streets of each school in New Haven.
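The reported range of 488–759 ms for the New Haven experiment is consistent with adding the triple sub-query time to each filter sub-query time. The symbols below are introduced here for illustration only; the numbers are those reported above.

```latex
\begin{aligned}
T_{\text{total}}(p) &= T_{\text{triple}} + T_{\text{filter}}(p) \\
T_{\text{total}}(1) &= 56 + 703 = 759 \text{ ms} \\
T_{\text{total}}(4) &= 56 + 432 = 488 \text{ ms} \\
\frac{T_{\text{filter}}(1)}{T_{\text{filter}}(4)} &= \frac{703}{432} \approx 1.63
\end{aligned}
```

The filter sub-query thus speeds up by a factor of about 1.63 from one to four threads, while the larger share of the improvement over the sequential query time of over 4 seconds comes from separating the sub-queries and caching the parsed geometries.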

In the second experiment, we ran the query Q1 on a much larger dataset that includes all schools in Connecticut and the streets in the coastal regions of Connecticut. As illustrated in Figure 7, there are 967 schools, shown as green triangles, and 66,435 streets, shown as red lines. We did not conduct the experiment on all streets (199,636) in Connecticut. The size of the dataset was constrained only by the memory size of our workstation since we used an in-memory RDF model. This limitation would not exist if we used a file-based RDF model.

Figure 7. The maps of the schools in CT and streets in coastal regions of CT.

This experiment demonstrated the dramatic difference between the original Geo-SPARQL query and the optimized query. Figure 8 shows the runtime statistics of the query Q1, where the original Geo-SPARQL query took about 365,797 ms (or over 6 minutes). If we divided the query into a sub-query of triple statements and a sub-query of filter statements and executed the two sub-queries in sequence, then the total runtime was reduced substantially to 1119–4157 ms depending on the number of threads used for the filter statement. We summarize the components of the runtime cost below:

(1) The triple sub-queries were run on two threads only and took 756 ms in total.

(2) The cost of parsing the outputs of the triple sub-queries is 314 ms.

(3) The filter sub-query was run on 1, 2, 4, and 10 threads and the total runtime of the sub-query is 3097, 172, 139, and 179 ms, respectively. The cost of on-the-fly indexing (if applicable) is included in this cost component. Note that the total cost goes up slightly when we use 10 threads; this is due to the constraint of the 4 physical threads.

We observed that the cost of the triple sub-queries was a large portion of the total query time. The triple sub-query for schools took 396 ms while the triple sub-query for streets took 745 ms. Since they were run in parallel, the total time for the triple sub-queries is 746 ms. In addition, the cost of parsing the results of the triple sub-queries was significant. Note that the geometries of the schools and the streets in the RDF model are represented as string literals of the forms Point (x, y) and LineString (x1 y1, x2 y2, …, xn yn).

The extension function for calculating object distances needs to parse the geometry literals into coordinates. In the optimized implementation, the literals were parsed only once and then cached in object arrays. In the original query engine, the geometry literal had to be parsed each time the distance function was called. The overhead of parsing is one of the reasons for the poor performance of the original implementation. Note that even though we can represent the geometries of streets and schools as RDF instances with their coordinates as typed literals, which may reduce the cost of parsing, it is difficult to define an extension function to operate on these RDF instances with a lower runtime overhead.
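The parse-once-and-cache idea can be sketched as below. This is an assumption-laden illustration, not the authors' code: the class `GeometryCache` is hypothetical, and the parser simply reads every number between the parentheses and pairs them up, which accepts both literal forms mentioned above. The sketch is not thread-safe; in a parallel setting the cache would need to be filled before the worker threads start, or be backed by a concurrent map.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: parse each geometry literal once and reuse the coordinates. */
final class GeometryCache {
    private final Map<String, double[][]> parsed = new HashMap<>();
    private int parseCount = 0;

    /** Parses literals such as "Point(1.5, 2.5)" or "LineString(0 0, 3 4)" into {x, y} pairs. */
    static double[][] parse(String literal) {
        int open = literal.indexOf('(');
        int close = literal.lastIndexOf(')');
        if (open < 0 || close < open) {
            throw new IllegalArgumentException("not a geometry literal: " + literal);
        }
        // Split on commas and whitespace, then pair consecutive numbers.
        String[] nums = literal.substring(open + 1, close).trim().split("[,\\s]+");
        if (nums.length % 2 != 0) {
            throw new IllegalArgumentException("odd number of coordinates: " + literal);
        }
        double[][] coords = new double[nums.length / 2][2];
        for (int i = 0; i < coords.length; i++) {
            coords[i][0] = Double.parseDouble(nums[2 * i]);
            coords[i][1] = Double.parseDouble(nums[2 * i + 1]);
        }
        return coords;
    }

    /** Returns the cached coordinates, parsing the literal only on first use. */
    double[][] get(String literal) {
        return parsed.computeIfAbsent(literal, k -> {
            parseCount++;
            return parse(k);
        });
    }

    /** Number of literals actually parsed so far. */
    int parseCount() {
        return parseCount;
    }
}
```

A distance function that calls `get` instead of `parse` pays the parsing cost once per distinct literal rather than once per school and street pair.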

Figure 8. Runtime (milliseconds) of the query Q1 applied to Connecticut using the optimized query engine. The runtime of Q1 with the naïve implementation was over 365,797 ms (not shown).

When we calculated the distance between every pair of schools and streets using one thread, the runtime of the filter sub-query, as shown in Figure 8, was 3097 ms. However, not all pairs of schools and streets need to be compared since we only need to check the schools and streets that are relatively close to each other. Thus, we applied spatial indexing to the inputs of the filter sub-query so that the inputs could be divided in a way that results in a more efficient parallel execution.

We first divided the schools into several sets based on their x-coordinates. For each street s, we checked if the x-coordinate of s falls into the range of the x-coordinates of all schools in a set (we extended the range with a buffer on each end to include all possible nearby streets). If it did, then we calculated the distances between s and each school in the set. In this experiment, we divided the schools into 10 sets, where each set contains schools that are close to each other. This roughly reduced the number of comparisons between schools and streets by a factor of 10. In fact, the runtime statistics show that the filter sub-query time is 172 ms with 2 threads and 139 ms with 4 threads. When we increased the number of threads further, the performance decreased slightly. When we increased the number of sets of schools, the overhead of dividing up the sets and partitioning the streets became significant and we did not have much performance gain.
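The x-coordinate partitioning can be sketched as follows. The details are assumptions made for illustration, not the authors' implementation: the class `XRangePartition` is hypothetical, the schools are sorted by x and cut into k groups of nearly equal size, and a street is treated as a candidate for a group when the x-extent of its vertices overlaps the group's buffered x-range. For the pruning to be safe, the buffer should be at least the distance threshold used in the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: group schools by x-coordinate and prune street comparisons. */
final class XRangePartition {
    final List<double[][]> groups = new ArrayList<>(); // each group holds {x, y} schools
    final double[] lo; // buffered lower x-bound per group
    final double[] hi; // buffered upper x-bound per group

    XRangePartition(double[][] schools, int k, double buffer) {
        if (schools.length == 0) {
            throw new IllegalArgumentException("no schools");
        }
        double[][] sorted = schools.clone();
        Arrays.sort(sorted, Comparator.comparingDouble(s -> s[0]));
        int n = sorted.length;
        k = Math.max(1, Math.min(k, n)); // never more groups than schools
        lo = new double[k];
        hi = new double[k];
        for (int g = 0; g < k; g++) {
            int start = (int) ((long) g * n / k);
            int end = (int) ((long) (g + 1) * n / k);
            double[][] group = Arrays.copyOfRange(sorted, start, end);
            groups.add(group);
            lo[g] = group[0][0] - buffer;
            hi[g] = group[group.length - 1][0] + buffer;
        }
    }

    /** Indices of the groups whose buffered x-range overlaps the street's x-extent. */
    List<Integer> candidateGroups(double[][] street) {
        double minX = Double.POSITIVE_INFINITY, maxX = Double.NEGATIVE_INFINITY;
        for (double[] v : street) {
            minX = Math.min(minX, v[0]);
            maxX = Math.max(maxX, v[0]);
        }
        List<Integer> out = new ArrayList<>();
        for (int g = 0; g < lo.length; g++) {
            if (maxX >= lo[g] && minX <= hi[g]) {
                out.add(g);
            }
        }
        return out;
    }
}
```

Only the schools in the candidate groups need a distance computation for a given street, and each group with its candidate streets forms an independent unit of work that can be handed to a separate thread.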

Lastly, the total runtime of the query with on-the-fly indexing was dominated by the triple sub-queries, which could not be improved further without partitioning the RDF model. We plan to examine how to improve the performance of the triple sub-queries in a future study.

4. Discussion

In the experiments, the ontology data were parsed by the Jena library when they were loaded. However, the data were loaded only once, and once they were loaded as an ontology model (as a memory model in the experiments), there was no more parsing of the original data. One problem with ontology representation of spatial data is that there is no good way of representing spatial objects such as line strings as ontology instances. For example, if a street has many segments and each segment is a line, then we are faced with the choice of representing the street as a typed literal (e.g. a sequence of line objects) or just as an un-parsed string. The application programming interface support for composite and typed literals is poor in SPARQL engines such as the Jena library, and there are no suitable ways to go through the individual elements of a literal with a complex structure. Thus, we were forced to represent the street objects as string literals, which required repeated parsing each time a spatial join was applied to streets.

The results are limited since they were based on the use of a small- and a medium-sized spatial dataset. Nevertheless, we are able to gain some insights into how the execution of Geo-SPARQL queries can be improved through parallel processing. Note that the purpose of our experiments is to evaluate the performance gain of the proposed optimization algorithms rather than to evaluate a particular parallel platform. The proposed algorithm in this study could be used on different architectures, which include not only the shared memory architecture (as in our case) but also the distributed memory architecture (such as Map-Reduce on clusters or Amazon Web Services). Increasing memory size, using persistent RDF models, or adopting a distributed memory model could accommodate much larger datasets than those used in the experiments. The current experiments, however, still sufficiently demonstrate that the proposed strategy is able to drastically improve the performance of Geo-SPARQL queries.

From the runtime statistics, we learned that the biggest performance gain was achieved by separating the triple statements from the filtering statements that involved spatial computations, which tended to be very costly. Spatial indexing can help partition the inputs to the filter functions for parallel processing. This provided an additional performance gain since the inputs to the filter functions are often Cartesian products of several sets of geometries, which can be very large. Partitioning the inputs to the filter functions can reduce the input size and runtime costs significantly.

We also learned that the performance of answering the triple statements may be improved with parallelization, though the performance gain may not be significant compared to that of the filter statements. By spatial indexing and by increasing the number of threads used in processing the filter statements, we were able to reduce the total runtime subject to the limitation of the underlying architecture. In this experiment, the CPU has four hyper-threads, which limited the performance gain to that of four threads. When we increased the number of threads to more than four, the average runtime per thread was reduced but the total runtime remained stable. However, this is not the performance bottleneck because after we reduced the runtime cost of the filter sub-query, the runtime of the triple sub-queries dominated the total runtime again. It is not straightforward to increase the performance of the triple sub-queries through parallelization since we are only able to assign at most one thread to one triple statement. To further improve the performance of the triple sub-queries, we need to partition the RDF model, which is an issue that needs further research.

Ontology data allow flexible queries, but an ontology query is very inefficient because of the lack of indexing. Indexing cannot be easily done on loosely connected ontology data without knowing what the query is. A big portion of the performance gain in the experiments was actually from the spatial indexing itself, which does not require parallelization. The indexing was done on-the-fly and not before. Note that the ontology contains a collection of triples and there is no good way to index them before knowing what the query is. While the ideas of data parallel execution and partitioning spatial objects based on spatial indices are not new, there is no study in the literature that applies these strategies to spatial ontology data for efficient spatial ontology queries. Moreover, although there is similarity between SQL filters and SPARQL filters, the ways that the filters are implemented are quite different. This study is the first explorative study on this issue.

4.1. Other query processing strategies

Query answering with large spatial datasets represented in Semantic Web languages consumes a large amount of physical memory and is computationally intensive. As a result, distributed query processing strategies such as those using Map-Reduce frameworks to improve query performance become attractive options. The Map-Reduce framework is a programming model for processing and generating large datasets (Dean and Ghemawat 2004). In the Map-Reduce model, programs written in a functional style are automatically parallelized and executed on a large cluster of commodity machines. The Map-Reduce concept was first proposed by Google to support its large distributed computing across a large number of machines on its huge databases (Dean and Ghemawat 2004). Recently, a number of studies have addressed the implementation of distributed SPARQL query engines using Map-Reduce implementations such as the Hadoop framework (e.g. Husain et al. 2009; Choi et al. 2009; Kulkarni 2010).

Hadoop is one of the most popular implementations of the Map-Reduce model. Although Hadoop provides a simple and powerful mechanism to implement distributed applications while hiding details such as instantiation of jobs in a cluster, fault tolerance, or data distribution, the binary relational operators such as join, Cartesian product, and set operations are difficult to implement in a pure Hadoop framework (Mazumdar 2011). Currently, Hadoop supports only partition parallelism in which a single operator executes on different partitions of data across the nodes. As a result, the existing Hadoop-based systems with the relational style join operators translate multi-join query plans into a linear execution plan with a sequence of multiple Map-Reduce cycles. This significantly increases the overall communication and I/O overhead involved in RDF graph processing on Map-Reduce platforms (Ravindra, Kim, and Anyanwu 2011). In addition, files on Hadoop currently cannot be modified randomly, which may limit many features such as update operations for RDF applications (Sun and Jin 2010).

Increasingly, distributed systems such as Cloud Computing (Cui, Wu, and Zhang 2010; Liu et al. 2009; Yang et al. 2011a, 2011b; Yang, Xu, and Nebert 2013; Liu et al. 2013; Huang et al. 2013; Wen et al. 2013; Kim and Tsou 2013; Yue et al. 2013), cyber-GIS (Wang 2010; Wang et al. 2013), or spatial cyber-infrastructure (Wright and Wang 2011) have been suggested as a solution to overcome the scalability and performance problems of Web-based GIS systems. Cloud computing is a recent paradigm developed to search, access, and utilize large volumes of geospatial data for many geospatial science applications. Hadoop, as an emerging cloud computing tool, is supported by many cloud service providers such as Amazon.

The proposed parallel query processing strategy may be adapted to run in a Map-Reduce framework. The non-spatial query component of a query can be readily executed as data parallel map tasks, the results of which can be aggregated as a reduce task. After indexing and partitioning of the results of the non-spatial queries, the spatial query component can be executed as data parallel map tasks as well. Using a Map-Reduce framework, we should be able to overcome the restriction of limited memory in shared memory platforms so that we can handle much larger datasets. However, the potential performance improvement remains to be investigated considering the overhead of a Map-Reduce framework.

Lastly, the technology of GPU (Graphics Processing Unit) computing is not a good choice for our experiments because of the difficulty in interacting with the GPU model (copying back and forth between global CPU memory and GPU memory). In addition, GPU memory is also too limited for each GPU unit to handle the parallel tasks of spatial queries.

4.2. Advantages

Geospatial data are increasingly being made available on a Geospatial Semantic Web using the Resource Description Framework (RDF). The OGC Geo-SPARQL aims to access and query geospatial data represented by RDF over the Geospatial Semantic Web. While Geo-SPARQL enables users to retrieve more precisely the data they need based on the semantics associated with these data, slow performance is an important issue for retrieving data from big geospatial datasets. To facilitate real-time spatial queries over the Geospatial Semantic Web, we proposed a new optimization strategy for improving the runtime performance of Geo-SPARQL queries, which can accurately retrieve the data by explicitly representing semantics of data during the search process. The proposed approach has several advantages.

Firstly, by separating spatial from non-spatial query components, we are able to reduce the size of the input dataset for the spatial query component, which tends to be time consuming. A naive implementation of an ontology query engine would process the triple statements in sequence, which may cause spatial queries to run on an unnecessarily large dataset.

Secondly, by caching the geometries that are inputs to the spatial extension functions of the query engine, we are able to drastically reduce the cost associated with parsing geometry literals. This cost is significant in queries that involve spatial joins of large sets of spatial objects.

Finally, by creating an index for spatial objects on-the-fly, we are able to greatly reduce the sets of geometry inputs to the spatial extension functions of the ontology query engine. This means that we are able to answer a computation-intensive spatial ontology query in a reasonable amount of time.

5. Conclusion

Obtaining spatial information quickly from disparate sources is a critical need for a Geospatial Semantic Web application. Although advances in the Geospatial Semantic Web facilitate geospatial data sharing at the semantic level, performance issues still hamper efficient and effective utilization of spatial information. This study proposes a new optimization strategy for improving query performance in the Geospatial Semantic Web. It uses on-the-fly spatial indexing to partition spatial objects referenced in an ontology knowledge base, followed by the parallel execution of spatial joins, thereby improving the performance of Geo-SPARQL queries.

The initial experimental results show that the proposed approach can greatly reduce the runtime cost of Geo-SPARQL queries compared with that of a straightforward implementation of an ontology query engine. As future work, we will extend the proposed strategy to distributed platforms using cloud-based web services and cluster platforms. Further study is needed to improve the proposed approach. For example, the current partitioning scheme does not consider the distribution of the spatial data. How can we produce a balanced partitioning for a skewed distribution of the data? Similarly, how can we build a cloud service or cyber-infrastructure based on the proposed framework? How scalable will the system be when we add different types of data? These are some examples of the challenges we face in further improving the proposed approach for Geospatial Semantic Web applications.

Acknowledgment

Anselin’s research was supported in part by award OCI-1047916, SI2-SSI from the US National Science Foundation.

References

Anyanwu, K. 2013. “A Vision for SPARQL Multi-Query Optimization on Map-Reduce.” ICDE Workshops, pp. 25–26, Brisbane, Australia.

Battle, R., and D. Kolas. 2012. “Enabling the Geospatial Semantic Web with Parliament and Geo-SPARQL.” http://www.semantic-web-journal.net/sites/default/files/swj176_3.pdf.

Boral, H., W. Alexander, L. Clay, G. Copeland, S. Danforth, M. Franklin, B. Hart, M. Smith, and P. Valduriez. 1990. “Prototyping Bubba: A Highly Parallel Database System.” IEEE Transactions on Knowledge and Data Engineering 2 (1): 4–24, March.

Castagna, P., A. Seaborne, and C. Dollin. 2009. “A Parallel Processing Framework for RDF Design and Issues.” http://www.hpl.hp.com/techreports/2009/HPL-2009-346.pdf.

Choi, H., J. Son, Y. Cho, M. K. Sung, and Y. D. Chung. 2009. “SPIDER: A System for Scalable, Parallel/Distributed Evaluation of Large-scale RDF Data.” Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM 09), ACM, pp. 2087–2088. doi:10.1145/1645953.1646315.

Cui, D., Y. Wu, and Q. Zhang. 2010. “Massive Spatial Data Processing Model Based on Cloud Computing Model.” In Proceedings of the Third International Joint Conference on Computational Sciences and Optimization, IEEE Computer Society, Los Alamitos, CA, pp. 347–350, 28–31 May. Anhui: Huangshan.

Dean, J., and S. Ghemawat. 2004. “Map-Reduce: Simplified Data Processing on Large Clusters.” OSDI ’04: 6th Symposium on Operating Systems Design and Implementation, pp. 137–149, Berkeley, CA. https://www.usenix.org/legacy/events/osdi04/tech/full_papers/dean/dean.pdf.

DeWitt, D. J., R. H. Gerber, G. Graefe, M. L. Heytens, and K. B. Kumar. 1986. “GAMMA – A High Performance Dataflow Database Machine.” Kyoto: Very Large Data Bases (VLDB).

Gore, A. 1998. “The Digital Earth: Understanding Our Planet in the 21st Century.” Presented at the Californian Science Center, Los Angeles, CA, January 31.

Groppe, J., and S. Groppe. 2011. “Parallelizing Join Computations of SPARQL Queries for Large Semantic Web Databases.” Proceedings of the 2011 ACM Symposium on Applied Computing (SAC ’11), pp. 1681–1686. http://dl.acm.org/citation.cfm?doid=1982185.1982536.

Grütter, R., and B. Bauer-Messmer. 2007. “Combining Owl with RCC for Spatioterminological Reasoning on Environmental Data.” http://sunsite.informatik.rwth-aachen.de/Publications/CEUR-WS/Vol-258/paper17.pdf.

Guo, H. D. 1999. Earth Observation and Digital Earth. Beijing: Chinese Science Press.

Guo, H., X. Fan, and C. Wang. 2009. “A Digital Earth Prototype System: DEPS/CAS.” International Journal of Digital Earth 2 (1): 315.

Huang, Q., C. Yang, K. Benedict, S. Chen, A. Rezgui, and J. Xie. 2013. “Utilize Cloud Computing to Support Dust Storm Forecasting.” International Journal of Digital Earth 6 (4): 338–355. doi:10.1080/17538947.2012.749949.

Husain, M. F., P. Doshi, L. Khan, and B. Thuraisingham. 2009. “Storage and Retrieval of Large RDF Graph Using Hadoop and Map-Reduce.” In CloudCom 2009, Beijing, China, LNCS 5931, edited by M. G. Jaatun, G. Zhao, and C. Rong, 680–686.

Karjalainen, M. 2009. “Uniform Query Processing in a Federation of RDFS and Relational Resources.” In Proceedings of the 2009 International Database Engineering and Applications Symposium, pp. 315–320. Calabria, Italy. http://dl.acm.org/citation.cfm?id=1620469.

Kim, H., P. Ravindra, and K. Anyanwu. 2011. “From SPARQL to Map-Reduce: The Journey Using a Nested TripleGroup Algebra.” PVLDB 4 (12): 1426–1429.

Kim, I.-H., and M.-H. Tsou. 2013. “Enabling Digital Earth Simulation Models Using Cloud Computing or Grid Computing – Two Approaches Supporting High-Performance GIS Simulation Frameworks.” International Journal of Digital Earth 6 (4): 383–403. doi:10.1080/17538947.2013.783125.

Kitsuregawa, M., H. Tanaka, and T. Motooka. 1983. “Application of Hash to Data Base Machine and Its Architecture.” New Generation Computing 1 (1): 63–74. doi:10.1007/BF03037022.

Kulkarni, P. 2010. “Distributed SPARQL Query Engine Using Map-Reduce.” http://www.inf.ed.ac.uk/publications/thesis/online/IM100832.pdf.

Liu, Y., W. Guo, W. Jiang, and J. Gong. 2009. “Research of Remote Sensing Service Based on Cloud Computing Mode.” Application Research of Computers 26 (9): 3428–3431.

Liu, Y., A. Y. Sun, K. Nelson, and W. E. Hipke. 2013. “Cloud Computing for Integrated Stochastic Groundwater Uncertainty Analysis.” International Journal of Digital Earth 6 (4): 313–337. doi:10.1080/17538947.2012.687778.

Mazumdar, P. 2011. “Complex SPARQL Query Engine for Hadoop Map-Reduce.” www.csi.ucd.ie/files/u1450/SM_Query_RDf.ps.

Miao, Z., and J. Wang. 2009. “Querying Heterogeneous Relational Database Using SPARQL.” In Eighth IEEE/ACIS International Conference on Computer and Information Science, Shanghai, China.

OGC 11-052r4. 2012. “OGC Geo-SPARQL – A Geographic Query Language for RDF Data.” http://www.opengis.net/doc/IS/Geo-SPARQL/1.0.

Ravindra, P., H. Kim, and K. Anyanwu. 2011. “An Intermediate Algebra for Optimizing RDF Graph Pattern Matching on Map-Reduce.” http://link.springer.com/chapter/10.1007%2F978-3-642-21064-8_4.

Sun, J., and Q. Jin. 2010. “Scalable RDF Store Based on HBase and Map-Reduce.” In Advanced Computer Theory and Engineering (ICACTE), 2010 3rd International Conference on, vol. 1, pp. V1-633, V1-636, August 20–22, Chengdu, China.

Wang, S. 2010. “A Cyber GIS Framework for the Synthesis of Cyber Infrastructure, GIS, and Spatial Analysis.” Annals of the Association of American Geographers 100 (3): 535–557. doi:10.1080/00045601003791243.

Wang, S., L. Anselin, B. Badhuri, C. Crosby, M. Goodchild, Y. Liu, and T. Nyerges. 2013. “Cyber GIS Software: A Synthetic Review and Integration Roadmap.” International Journal of Geographical Information Science 27 (11): 2122–2145. doi:10.1080/13658816.2013.776049.

Wen, Y., M. Chen, G. Lu, H. Lin, L. He, and S. Yue. 2013. “Prototyping an Open Environment for Sharing Geographical Analysis Models on Cloud Computing Platform.” International Journal of Digital Earth 6 (4): 356–382. doi:10.1080/17538947.2012.716861.

Wright, D. J., and S. Wang. 2011. “The Emergence of Spatial Cyber Infrastructure.” Proceedings of the National Academy of Sciences 108 (14): 5488–5491. doi:10.1073/pnas.1103051108.

Yang, C., M. Goodchild, Q. Huang, D. Nebert, R. Raskin, Y. Xu, M. Bambacus, and D. Fay. 2011b. “Spatial Cloud Computing: How can the Geospatial Sciences Use and Help Shape Cloud Computing?” International Journal of Digital Earth 4 (4): 305–329. doi:10.1080/17538947.2011.587547.

Yang, C., H. Wu, Q. Huang, Z. Li, and J. Li. 2011a. “Using Spatial Principles to Optimize Distributed Computing for Enabling the Physical Science Discoveries.” Proceedings of the National Academy of Sciences 108 (14): 5498–5503. doi:10.1073/pnas.0909315108.

Yang, C., Y. Xu, and D. Nebert. 2013. “Redefining the Possibility of Digital Earth and Geosciences with Spatial Cloud Computing.” International Journal of Digital Earth 6 (4): 297–312. doi:10.1080/17538947.2013.769783.

Yue, P., H. Zhou, J. Gong, and L. Hu. 2013. “Geoprocessing in Cloud Computing Platforms – A Comparative Analysis.” International Journal of Digital Earth 6 (4): 404–425. doi:10.1080/17538947.2012.748847.

Zhang, C., W. Li, and T. Zhao. 2007. “Geospatial Data Sharing Based on Geospatial Semantic Web Technologies.” Journal of Spatial Science 52 (2): 35–49. doi:10.1080/14498596.2007.9635121.

Zhang, C., T. Zhao, and W. Li. 2010a. “Automatic Search of Geospatial Features for Disaster and Emergency Management.” International Journal of Applied Earth Observation and Geoinformation 12 (6): 409–418. doi:10.1016/j.jag.2010.05.004.

Zhang, C., T. Zhao, and W. Li. 2010c. “A Framework for Geospatial Semantic Web Based Spatial Decision Support System.” International Journal of Digital Earth 3 (2): 111–134. doi:10.1080/17538940903373803.

Zhang, C., T. Zhao, and W. Li. 2013. “Towards Improving Query Performance of Web Feature Services (WFS) for Disaster Response.” ISPRS International Journal of Geo-Information 2 (1): 67–81. doi:10.3390/ijgi2010067.

Zhang, C., T. Zhao, W. Li, and J. P. Osleeb. 2010b. “Towards Logic-based Geospatial Feature Discovery and Integration using Web Feature Service and Geospatial Semantic Web.” International Journal of Geographical Information Science 24 (6): 903–923. doi:10.1080/13658810903240687.

Zhao, T., C. Zhang, M. Wei, and Z.-R. Peng. 2008. “Ontology-Based Geospatial Data Query and Integration.” Lecture Notes in Computer Science LNCS5266: Geographic Information Science 5266: 370–392.
