Spatial Data Processing Frameworks - A Literature Survey
Ayman Zeidan
Department of Computer Science
The Graduate Center of the City University of New York
365 5th Ave, New York, NY 10016
Graduate Committee
Dr. Huy T. Vo, The City College of New York, New York, NY 10031
Dr. Feng Gu, The College of Staten Island, New York, NY 10314
Dr. Kaliappa Ravindran, The City College of New York, New York, NY 10031
ABSTRACT
Location-based applications and services have become an integral part of our lives. These
applications extract meaningful information through the analysis of large location-tagged (spatial)
datasets that are generated at an unprecedented rate. The term “Information Explosion” is often used to
describe the sheer amount of data that is being made available to individuals, businesses, and other
entities. Traditional computing and database systems fall short when it comes to the efficient han-
dling of truly large datasets. Consequently, several high-performance parallel computing systems
were developed with the goal of providing quick, accurate, and scalable solutions. Unfortunately,
today’s state-of-the-art parallel processing systems are generic systems and are not well suited to
perform efficient processing of large spatial datasets. Therefore, specialized frameworks are needed
to empower these systems and improve spatial data processing. Instead of building parallel
processing systems from the ground up, such frameworks build on existing, stable systems like
Apache Hadoop and Apache Spark.
TABLE OF CONTENTS
ABSTRACT
LIST OF TABLES
LIST OF ILLUSTRATIONS
1 Introduction
2 Challenges with spatial data analysis
2.1 The portion that is spatial data
2.2 Different shapes and sizes
2.3 Data at rest
2.4 Operations
2.5 Usability and Integration
2.6 Indexing
2.7 Temporal Aspect
2.8 Interactive vs Batch
2.9 Scalability
2.10 Reliability
3 Spatial Data Analysis Frameworks
3.1 Hadoop-Based Frameworks
3.1.1 Esri GIS Tools for Hadoop
3.1.2 Hadoop-GIS
3.1.3 SpatialHadoop
3.1.4 Features and Performance Summary
3.2 Spark-Based Frameworks
3.2.1 SpatialSpark
3.2.2 GeoSpark
3.2.3 LocationSpark
3.2.4 STARK
3.2.5 Simba
3.2.6 Features and Performance Summary
4 Future Work
5 Summary
LIST OF TABLES
TABLE 1: Feature comparison of Hadoop-based frameworks
TABLE 2: Feature comparison of Spark-based frameworks
LIST OF ILLUSTRATIONS
FIGURE 1: Selected Spatial Data Geometric Shapes
FIGURE 2: Architecture of Esri GIS Tools for Hadoop[9]
FIGURE 3: Architecture of Hadoop-GIS[30]
FIGURE 4: Architecture of SpatialHadoop[37]
FIGURE 5: GeoSpark Architecture[61]
FIGURE 6: LocationSpark Architecture[54]
FIGURE 7: STARK Architecture[44]
FIGURE 8: Simba Architecture (orange shaded boxes)[58]
1 Introduction
Nearly every computer application generates some form of data. Depending on the application,
this data can come from different sources like runtime log files, activity recordings, features usage,
weather sensors, and/or driving habits. Over time, this data accumulates and grows to the point
where it is too big to extract meaningful information using traditional techniques or using a single
machine (node). When a dataset grows in size in a short amount of time, it is referred to as Big
Data. The Oxford English Dictionary defines Big Data as “Extremely large datasets that may be
analyzed computationally to reveal patterns, trends, and associations, especially relating to human
behavior and interactions.” While the exact definition of big data is still somewhat subjective,
there is broad agreement on the 3Vs of big data: Volume, Variety, and Velocity. Some extend
these to include a fourth V, the value that can be extracted from the data.
A significant portion of the data collected contains a spatial component that indicates the physical
location where the data point was collected. This is crucial since knowing the data’s physical loca-
tion increases the chance of extracting additional valuable information that would not be available
otherwise. As a result, interest in collecting and analyzing big spatial data has increased
significantly. In 2014, Facebook, the world’s largest social networking site, announced that its data
warehouse stores 300 petabytes of data, with 600 terabytes of new data arriving daily [26]. Facebook
uses this data to improve features and produce targeted ads that are tailored to individual users. The
data collected is tagged with various information including the location from where the user logged
in, accessed a page, clicked on an ad, and the device’s specifications[27]. A Boeing 787 Dreamliner
airplane can generate as much as 500 gigabytes of flight data by collecting location-tagged infor-
mation from engines, sensors, fuel tanks, crew and passenger activities, and weather sensors[28].
Some of the data is analyzed on board during the flight to aid the pilot and crew while the rest is
analyzed later to improve future flight experiences. Google houses one of the world’s largest data
centers, processes over 40,000 queries per second, and can process over 20 terabytes of raw web
data[29]. Many of Google’s services are free, but the data collected from the users’ interactions
with these services is analyzed to produce targeted ads and develop and improve company services.
Many other examples exist that show how and why companies collect, store, and analyze data
including spatial data. However, the ultimate goal is the same: unlock hidden values in these
datasets to improve and invent. Data analysis must be done quickly and sometimes it is needed in
real-time. Analyzing spatial data differs from analyzing other types of data due to a number of
challenges. Mainly, the analysis must take into consideration the spatial attributes of the data, the
different shapes and sizes (e.g. Point, Polygon, LineString), and the different operations (e.g.
Cluster, Distance, Join, kNN).
To that end, specialized systems were invented to support spatial data. For a long
time, relational database systems (RDBMS) were a viable and attractive option for spatial data
management. RDBMS like Oracle[17], PostGIS[18], and SQL Server[24] offer support to spatial
data objects and operations. However, their capabilities are limited by the size and shape of the
dataset. More recently, parallel processing on commodity hardware gained popularity for being
inexpensive, easy to use, easy to maintain, and highly scalable.
Two of the most popular processing frameworks are Apache Hadoop[36] and Apache Spark[63]. They differ
from each other in design and programming model, but they are both designed to handle generic
datasets. While they can be used to process spatial datasets, results are achieved with considerable
time and resource cost. Mainly, this is due to the lack of recognition and treatment of spatial
data. To that effect, spatial frameworks were designed to work with Hadoop and Spark and make
them recognize the different shapes and operations of spatial data. For Apache Hadoop, Esri GIS
tools for Hadoop[9], Hadoop-GIS[30], and SpatialHadoop[37] utilize one or more layers from the
Hadoop ecosystem to add spatial support. For Apache Spark, SpatialSpark[60], GeoSpark[61],
LocationSpark[54], STARK[44], and Simba[58] utilize Spark’s ecosystem to add spatial support.
All of these frameworks are similar in the sense that they allow for performing spatial operations
against spatial objects. However, a number of drawbacks exist and as of the time of this writing, no
single framework offers support to all spatial objects and operations. In section 2 we discuss some
of the challenges researchers face when designing Hadoop and Spark spatial frameworks. In section
3 we survey a number of Hadoop-based (section 3.1) and Spark-based (section 3.2) frameworks.
2 Challenges with spatial data analysis
Analyzing spatial data differs from analyzing other types of data due to a number of factors like shapes, sizes, and
operations. Extracting the spatial attribute from a dataset requires knowledge of the underlying
structure. Information about where the data is stored, how it is stored, and the frequency of
updates can affect the analysis process. Spatial objects can take on multiple different shapes and
sometimes the dataset can be a hybrid of these shapes.
Any spatial data analysis system must be able to ultimately produce meaningful and accurate
results in a timely manner. The scalability of the system is affected by decisions like what data
should be kept in memory and in what shape should non-spatial data make it to the output. The
system should also be optimized for streaming and/or batch processing. Usability of the system is
also important as it will determine whether experts and/or non-experts can use it.
2.1 The portion that is spatial data
Spatial data is often part of a bigger picture. Depending on the analysis being done, the spatial
attribute may be relevant at first, towards the end, or in between. For example, Twitter[25]
data (called tweets) have been used for sentiment analysis (opinion mining) to predict election
results[46, 56], foresee future stock prices[35], or collect product reviews[45]. Such an analysis can
skip the spatial attribute of these tweets and produce meaningful results. However, taking into
consideration the geographic locality of a tweet can significantly improve the results. It is more
meaningful to consider tweets of users from certain cities (e.g. New York, London) who are reviewing
products that are only available in those cities.
Moreover, after the spatial analysis step is completed, the data can undergo additional non-spatial
analysis. This naturally requires that the spatial processing system preserve any non-spatial data
that was originally present. Doing so affects performance, and frameworks like Hadoop-GIS and
SpatialSpark do not allow non-spatial data to be carried through the computation steps.
Figure 1: Selected Spatial Data Geometric Shapes. Panels: (a) Point, (b) LineSegment,
(c) MultiLineSegment, (d) LineString, (e) MultiLineString, (f) Polygon, (g) Polygon, (h) MultiPolygon.
2.2 Different shapes and sizes
One of the major challenges with spatial data is that it can take different shapes, sizes, and
dimensions. Moreover, spatial objects can be expressed in different coordinate systems, such as
Global Positioning System (GPS) coordinates or planar coordinates. The Open Geospatial
Consortium[6] (OGC) aims at creating an open standard for geospatial content and services and offers
certification of systems. By doing so, a system’s interoperability is increased, vendors’ confidence is improved, and users
are assured that multiple systems can work together. Figure 1 depicts some of the most common
forms of spatial data that a system should support.
Point: A Point object (Figure 1a) is the simplest spatial object and is the basic building object
of other spatial objects. A Point consists of two coordinates like longitude and latitude or (x, y).
A single location on a map like a restaurant or a subway station can be represented as a Point.
LineSegment: A LineSegment object (Figure 1b) consists of two points with the first marking
the beginning of a segment and the last marking its end. If a spatial object consists of two or more
unconnected LineSegments, a MultiLineSegment (Figure 1c) object is used to represent its shape.
A straight road can be represented as a LineSegment object.
LineString: A LineString object (Figure 1d) consists of two or more connected LineSegments
that do not form a closed shape. The endpoint of the first LineSegment marks the beginning of the
second one. If a spatial object consists of two or more unconnected LineStrings, a MultiLineString
(Figure 1e) object is used to represent its shape. A city road that consists of multiple segments is
represented as a LineString object.
Polygon: A Polygon object (Figure 1f) consists of multiple LineStrings that must form a closed
shape. A Polygon can also contain another Polygon (Figure 1g). If a spatial object consists of two
or more Polygons (Figure 1h), a MultiPolygon object is used to represent it. Countries,
states, city blocks, and water ponds can all be represented using a Polygon object.
These shapes can be further used to form even more complex objects like MultiPoint, Circle, and
Curve. Moreover, the dataset can be heterogeneously composed of different shapes which further
complicates the analysis process. A good system must recognize spatial shapes for what they are in
order to produce fast and meaningful results. A common approach to handling unsupported shapes
is to compute the object’s envelope, also called its Minimum Bounding Rectangle (MBR): the
smallest axis-aligned rectangle that fully encompasses the object.
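The envelope computation described above can be sketched in a few lines of Python. This is a minimal illustration and not code from any surveyed framework; the names compute_mbr and mbrs_intersect and the tuple layout (min_x, min_y, max_x, max_y) are assumptions made for this example.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]
MBR = Tuple[float, float, float, float]  # assumed layout: (min_x, min_y, max_x, max_y)


def compute_mbr(vertices: Iterable[Point]) -> MBR:
    """Return the smallest axis-aligned rectangle enclosing all vertices."""
    xs, ys = zip(*vertices)
    return (min(xs), min(ys), max(xs), max(ys))


def mbrs_intersect(a: MBR, b: MBR) -> bool:
    """True when two rectangles share at least one point (touching counts)."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


# A triangle stands in for any unsupported shape given by its vertices.
triangle = [(0.0, 0.0), (4.0, 1.0), (2.0, 3.0)]
triangle_mbr = compute_mbr(triangle)
```

Because an MBR usually covers more area than the shape it encloses, an intersection test on MBRs can report pairs whose exact shapes do not intersect, which is why a later exact comparison is usually needed.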
2.3 Data at rest
Data can be stored (disk, tape) in many different shapes and formats. A processing system must
account for these variations without limiting its capabilities. This goes beyond whether the data
is encrypted and/or compressed. A spatial processing system is no different and should account
for datasets that contain one or multiple geometry types, that are uniformly distributed or skewed,
and so on. Determining the type(s) of spatial data can become expensive; the system can either make
certain assumptions about the data and load it automatically, or rely on the user to preload the
data. Both options are offered in [61, 44, 58]: users can manually parse and load the dataset, or
request that the framework automatically read and parse it if it is in a supported format like WKT12,
CSV11, or GeoJSON13. However, automatic loading is restricted to the supported formats and does not
allow for non-spatial data. In general, there are three different classifications of data (including spatial):
Structured: This type of data is well formed with a clear data model. The fields’ types are known
and decide how they are stored (integral, currency, point, polygon . . . ). It has the advantage
of being easily generated, stored, queried, and analyzed. Structured data is ideal for use in a
traditional RDBMS like MySQL, Oracle, and Microsoft SQL Server. However, even with a clear
structure, these RDBMSs are limited by the amount of data they can store and process in a timely
manner. In essence, one can only scale up/out so much before queries become noticeably
slow[43]. NoSQL systems can also be used to process large datasets, although they are
not yet mature enough, mainly because they try to merge RDBMS and distributed storage features
into one system.
Unstructured: Unlike structured data, unstructured data is characterized by not having a clear
data model. Its field types are hard to discern, and sometimes it is impossible to assign them a
type. Most of the data that is being generated nowadays is unstructured. Documents like photos,
videos, web server logs, word documents, spreadsheets, and PDFs are some of the examples of
unstructured data. Storing them in a traditional RDBMS is near impossible or impractical at
best. As a result, they are mostly stored as either text or binary files that should be indexed and
analyzed. Often, the analysis occurs multiple times with the original files kept intact. Systems like
Hadoop[4], MapReduce[36], Spark[3], and Impala[12] were all designed to help with the management
of unstructured data. Their techniques vary and include simple distributed storage, key-value
storage, document storage, and wide-column storage[40].
Semi-structured: A hybrid form of the two preceding types is a semi-structured dataset. It is
not as well formed as structured data but a partial data model can be discerned. Depending on
the data itself, semi-structured data can be managed using structured or unstructured techniques.
However, traditional RDBMSs tend to lag when the size of the dataset exceeds a certain threshold.
As a result, distributed systems like the ones mentioned for unstructured data are better suited.
Examples of semi-structured data include E-mails, Metadata of documents (word, spreadsheet files,
PDFs . . . ), and media-file properties (time, location, size . . . ).
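To make the loading step discussed in this section concrete, the following Python sketch detects whether a text record is a WKT point, a GeoJSON Feature, or a CSV row, and keeps any non-spatial attributes next to the coordinates. It is an illustration under stated assumptions, not the loader of any surveyed framework: the function name parse_point_record, the "x,y,extra..." CSV layout, and the col2, col3 attribute names are all hypothetical, and only Point geometries are handled.

```python
import json
import re
from typing import Any, Dict, Tuple

_NUM = r"(-?\d+(?:\.\d+)?)"
# Matches records such as "POINT (-73.98 40.75)"; exponent notation is not handled.
_WKT_POINT = re.compile(rf"^POINT\s*\(\s*{_NUM}\s+{_NUM}\s*\)$", re.IGNORECASE)


def parse_point_record(line: str) -> Tuple[float, float, Dict[str, Any]]:
    """Parse one text record into (x, y, non_spatial_attributes)."""
    text = line.strip()
    match = _WKT_POINT.match(text)
    if match:
        # Plain WKT carries geometry only, so there are no extra attributes.
        return float(match.group(1)), float(match.group(2)), {}
    if text.startswith("{"):
        # Assumed to be a GeoJSON Feature with a Point geometry.
        feature = json.loads(text)
        x, y = feature["geometry"]["coordinates"][:2]
        return float(x), float(y), dict(feature.get("properties") or {})
    # Fallback: assumed CSV layout "x,y,extra1,extra2,...".
    fields = text.split(",")
    extras = {f"col{i}": value for i, value in enumerate(fields[2:], start=2)}
    return float(fields[0]), float(fields[1]), extras


wkt_row = parse_point_record("POINT (-73.98 40.75)")
csv_row = parse_point_record("3.5,4.5,pizza")
```

Returning the attributes alongside the coordinates is one way a loader could preserve non-spatial data for later analysis steps, at the cost of carrying more data through the computation.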
2.4 Operations
Analyzing spatial data means performing different operations that depend on the spatial objects
and desired results. Implementing these operations must take into consideration the object’s type,
the mixture of the objects, and how to carry non-spatial information through the computation
steps (if any).
Currently, no system implements all the different possible combinations. Instead, some systems
like [61, 54, 44, 58] are designed so that they can be extended with additional
support. Unsupported objects can be converted to generic Rectangle objects by computing their
MBRs as done in [9, 30, 37, 60]. The more operations a system supports, the more attractive it
becomes; therefore, third-party libraries like Java Topology Suite (JTS)[15] or Geometry
Engine - Open Source (GEOS)[8] are utilized to broaden the feature set. It is crucial to find a
lowest common denominator such that more operations are supported with minimal code, without
affecting performance or increasing the system’s complexity. Some of the most common spatial
operations are:
Range: In a range operation, the input is two sets of spatial objects S and R. The output is a
set of all records in S that overlap R. Systems like [58] will implement this operation with a single
dataset and a range query, while others will allow for two large datasets [61, 54, 44] or one small
and one large [30, 60].
Contains: In a contains operation, the input is two spatial objects O1 and O2. The output is true
if O2 contains O1, and false otherwise. Systems like [61, 60, 44] implement this operation for objects
like Point and Polygon, while [58, 30, 9] do not offer it.
k Nearest Neighbor (kNN): kNN queries can take on different shapes like distance kNN and
kNN join. The simplest form of a kNN operation consists of an input set of Point objects, P , a set
of spatial objects S, and an integer value k. The output consists of all of the elements of P along
with the nearest k objects from the set S for each element in P . In [58], kNN join works only when
one of the datasets is a set of Point objects. Others like [61, 54, 44] allow for polygons and generic
rectangle objects but only as a distance kNN join.
Join: In a join operation, the input is two sets of spatial objects S and R and a spatial predicate.
The output is the set of all pairs (s, r) such that s ∈ S, r ∈ R, and the predicate is true for
(s, r). The predicate can be one of many, like equals, larger than, within distance, or overlaps.
Due to the large range of possible predicates, only a small number is usually supported. In [60] a
spatial join is implemented for intersects and within; in [61] only contains and intersects are
supported. A more flexible approach is taken in [44], where the predicate is a user-submitted
function, with some sample functions (e.g. distance) already implemented.
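The four operations defined above can be written as brute-force, single-machine Python functions. The sketch only pins down the input and output of each operation; it is not how the surveyed frameworks implement them, since those partition and index the data first. All function names are assumptions made for this example, points are (x, y) tuples, and the contains test uses standard ray casting, whose behavior for points exactly on the polygon boundary is left unspecified here.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


def range_query(points: Sequence[Point], window: Rect) -> List[Point]:
    """Range: every point that falls inside the query rectangle."""
    min_x, min_y, max_x, max_y = window
    return [p for p in points if min_x <= p[0] <= max_x and min_y <= p[1] <= max_y]


def polygon_contains(polygon: Sequence[Point], point: Point) -> bool:
    """Contains: ray casting, counting edge crossings to the right of the point."""
    x, y = point
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal ray
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside


def knn_join(queries: Sequence[Point], candidates: Sequence[Point], k: int):
    """kNN join: each query point paired with its k nearest candidates."""
    return [(q, sorted(candidates, key=lambda c: math.dist(q, c))[:k]) for q in queries]


def spatial_join(left: Sequence[Point], right: Sequence[Point],
                 predicate: Callable[[Point, Point], bool]):
    """Join: all pairs (s, r) for which the user-supplied predicate holds."""
    return [(s, r) for s in left for r in right if predicate(s, r)]


def within_distance(limit: float) -> Callable[[Point, Point], bool]:
    """Sample predicate in the spirit of a user-submitted function."""
    return lambda s, r: math.dist(s, r) <= limit


shops = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
```

Passing the predicate as a function keeps spatial_join independent of any fixed list of predicates, which mirrors the design choice the text attributes to [44].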
2.5 Usability and Integration
From a researcher’s point of view, making a spatial system easy to use and integrate is rarely the
primary focus. Once the processing system’s features are implemented and stable, ease of use may
be taken into consideration, but mainly to allow end-users from a non-computer science background
to interact with the system.
Since currently no system can cover all spatial objects, researchers aim at creating an easy-to-
extend system. This allows others to write code to extend the functionality of the system to include
new objects and operations. Some of the techniques used are supporting standards like OGC[6]
as partially done in [9, 59, 37], Structured Query Language (SQL) as done in [58], extending
existing high-level languages like Hive1[55, 1] as done in [9, 30], Pig-Latin2[50, 2] as done in [37],
or integrating with APIs like Spark’s RDD3[62, 19] as done in [61, 54, 44], DataFrame4[31, 21],
and/or Spark SQL module5[31, 21] as done in [58].
A closely related topic to usability is integration. A processing system should allow for easy
integration of input and output with other tasks. Spatial data is not always the first or final step.
Often a spatial operation will start after some other operation and/or end before another task
starts. For example, consider the problem of finding all restaurants along a driving route. While a
spatial query is needed to find the restaurants (Points) around the different streets (LineSegment),
1 Hive is a SQL-like language for working with data stored on HDFS
2 Pig-Latin is an abstraction layer intended to simplify MapReduce programming using easy-to-understand idioms
3 Resilient Distributed Datasets (RDDs) are read-only in-memory distributed data structures
4 A DataFrame is a dataset that resembles a table in an RDBMS
5 Spark SQL is a Spark module for working with structured data in a SQL-like manner
the search results may be further refined by looking at attributes like the type of food served or
whether the restaurant is drive-thru or sit-down. Most systems surveyed here allow for some form of
integration by either writing their output to HDFS for other tasks to use [9, 30, 37] or through RDD
transformations as done in [61, 54, 44].
2.6 Indexing
Spatial indexing is a step in spatial data processing that is commonly used to speed up operations.
In most cases, spatial data is not pre-indexed; therefore, a spatial processing system can offer an
implementation of one or more indexing techniques. These indexes can be built on the fly (live)
and optionally written to disk (persisted) in order to be used in subsequent processing tasks.
In a distributed spatial system, there are usually two levels of spatial indexing, global and local.
Global indexing is used to perform an initial grouping of data based on their spatial relationship;
this step is also referred to as partitioning because it partitions the data across the processing
nodes. Depending on the processing system, the global index can be kept on the master node or
broadcast to all processing nodes []. The global index can be built by either sampling the dataset
to avoid processing large amounts of data, or, if feasible, through a full scan of the dataset(s). All
of the frameworks that we have surveyed employ sampling techniques when partitioning the data.
The global index can also be used to exclude data that does not contribute to the output, thus
improving the overall performance. Local indexing, on the other hand, is utilized on the processing
nodes after the data has been partitioned. This speeds up processing since only the subset of the
data on the local machine is considered. The results obtained through the local index can
be further refined by calculating the exact relationship between objects.
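The two-level scheme above can be sketched on a single machine in Python. The example assumes point data and a uniform grid as the global index, with the grid extent estimated from a random sample; the class GridPartitioner and the functions partition and range_query are hypothetical names, and the per-partition scan merely stands in for a real local index.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Cell = Tuple[int, int]


class GridPartitioner:
    """Global index: a uniform grid whose extent is estimated from a sample."""

    def __init__(self, sample: List[Point], cells_per_side: int):
        xs, ys = zip(*sample)
        self.min_x, self.min_y = min(xs), min(ys)
        self.n = cells_per_side
        # Fall back to 1.0 if all sampled points share a coordinate.
        self.cell_w = (max(xs) - self.min_x) / self.n or 1.0
        self.cell_h = (max(ys) - self.min_y) / self.n or 1.0

    def _clamp(self, value: float, origin: float, size: float) -> int:
        # Points outside the sampled extent land in the nearest border cell.
        return min(self.n - 1, max(0, int((value - origin) // size)))

    def cell_of(self, p: Point) -> Cell:
        return (self._clamp(p[0], self.min_x, self.cell_w),
                self._clamp(p[1], self.min_y, self.cell_h))

    def cells_overlapping(self, window: Rect) -> List[Cell]:
        c0, r0 = self.cell_of((window[0], window[1]))
        c1, r1 = self.cell_of((window[2], window[3]))
        return [(c, r) for c in range(c0, c1 + 1) for r in range(r0, r1 + 1)]


def partition(points: List[Point], grid: GridPartitioner) -> Dict[Cell, List[Point]]:
    """Group points by grid cell; each cell plays the role of one partition."""
    parts: Dict[Cell, List[Point]] = defaultdict(list)
    for p in points:
        parts[grid.cell_of(p)].append(p)
    return parts


def range_query(parts: Dict[Cell, List[Point]], grid: GridPartitioner, window: Rect) -> List[Point]:
    hits = []
    for cell in grid.cells_overlapping(window):   # global step: skip irrelevant partitions
        for p in parts.get(cell, []):             # local step: scan one partition
            if window[0] <= p[0] <= window[2] and window[1] <= p[1] <= window[3]:
                hits.append(p)                    # refinement: exact test
    return hits


random.seed(7)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1000)]
grid = GridPartitioner(random.sample(points, 50), cells_per_side=4)
parts = partition(points, grid)
hits = range_query(parts, grid, (10.0, 10.0, 30.0, 40.0))
```

Clamping out-of-extent points into border cells is one simple way to cope with the fact that a sample may not cover the full extent of the data; real frameworks may handle this differently.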
Many indexing techniques exist and can even be used in conjunction with one another. Which
technique to use depends on factors like on-disk or in-memory, type of dataset, and/or speed of
construction/retrieval. Some of the most popular spatial indexing structures are:
Grid: A grid[52] spatial index is an indexing structure for organizing spatial objects. A number
of different types of grids exist, but in its simplest form, a 2-D rectangle is divided into a number
of contiguous, equal-sized cells; other variants allow cells of different sizes. Each cell is
assigned a unique ID. Spatial objects are assigned to one or more cells in the grid depending on their
shape. Grid indexing is useful when the data is uniformly distributed.
R-Tree: An R-Tree[42] is a height-balanced tree used for indexing multidimensional objects like
Points, Rectangles, and Polygons. Objects are inserted into the tree with their MBRs, and nearby
MBRs are grouped together in order to expedite searching. R-Trees are popular in spatial data
processing, but because they index MBRs rather than exact shapes, they produce approximate results
and usually require a finer comparison step.
R*-Tree: An R*-Tree[33] is a variation of an R-Tree that tries to minimize the MBR coverage and
overlap. As a result, its querying times are faster but update times are slower. An R*-Tree is better
suited when the tree is queried more often than it is modified.
Binary Space Partitioning: Binary Space Partitioning (BSP) is a method of recursively dividing
a space into two parts. A record of how the space is divided is kept in a tree
data structure that represents the BSP. It was invented mainly for computer graphics rendering, but
it can also be used for indexing spatial objects and is the basis of structures like the Quadtree
and the K-D Tree.
Quadtree: A Quadtree[39] is a variation of a BSP where each node has zero (leaf) or exactly four
child nodes. A 2-D space is recursively divided into four regions until the regions satisfy a certain
condition (e.g. divide until each region has 0 or 1 points).
K-D Tree: A K-D Tree[34] is a variation of BSP and a generalization of the binary search tree to
k dimensions. Each node has zero (leaf) or exactly two child nodes. A 2-D space is recursively
divided into two until the regions satisfy a certain condition (e.g. divide until each region has
1 point). Unlike a Quadtree, a K-D Tree splits a node into two according to some statistic of the
node’s data, such as the mean.
Space-Filling Curves: A Space-Filling Curve (SFC) is a continuous line that passes through every
cell of a 2-D space, which gives each cell a position along the line. One of the most popular SFCs
is the Hilbert Curve, whose indexes provide an initial clustering of spatial objects. The precision
of the curve can be increased with the number of iterations, n; however, this may decrease the
performance of the index.
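Two of the structures above can be illustrated with short Python sketches: a point Quadtree that splits a region once it holds more than a given number of points, and the standard iterative mapping from a grid cell (x, y) to its position on a Hilbert curve. The class and function names, the capacity and max_depth parameters, and the depth cap itself are assumptions made for this example (the cap only prevents endless splitting when duplicate points are inserted).

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)


class Quadtree:
    def __init__(self, bounds: Rect, capacity: int = 1, depth: int = 0, max_depth: int = 8):
        self.bounds, self.capacity = bounds, capacity
        self.depth, self.max_depth = depth, max_depth
        self.points: List[Point] = []
        self.children: Optional[List["Quadtree"]] = None  # None marks a leaf

    def insert(self, p: Point) -> bool:
        min_x, min_y, max_x, max_y = self.bounds
        if not (min_x <= p[0] <= max_x and min_y <= p[1] <= max_y):
            return False
        if self.children is None:
            if len(self.points) < self.capacity or self.depth >= self.max_depth:
                self.points.append(p)
                return True
            self._split()
        # any() stops at the first child that accepts the point.
        return any(child.insert(p) for child in self.children)

    def _split(self) -> None:
        min_x, min_y, max_x, max_y = self.bounds
        mid_x, mid_y = (min_x + max_x) / 2, (min_y + max_y) / 2
        quadrants = [(min_x, min_y, mid_x, mid_y), (mid_x, min_y, max_x, mid_y),
                     (min_x, mid_y, mid_x, max_y), (mid_x, mid_y, max_x, max_y)]
        self.children = [Quadtree(q, self.capacity, self.depth + 1, self.max_depth)
                         for q in quadrants]
        for old in self.points:  # push existing points down into the new leaves
            any(child.insert(old) for child in self.children)
        self.points = []

    def count(self) -> int:
        if self.children is None:
            return len(self.points)
        return sum(child.count() for child in self.children)


def hilbert_index(order: int, x: int, y: int) -> int:
    """Position of cell (x, y) on a Hilbert curve over a 2^order by 2^order grid."""
    n = 1 << order
    d, s = 0, n >> 1
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:  # rotate or flip the quadrant so the sub-curve lines up
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s >>= 1
    return d


tree = Quadtree((0.0, 0.0, 8.0, 8.0))
inserted = [tree.insert(p) for p in [(1.0, 1.0), (7.0, 7.0), (1.0, 7.0)]]
```

Here order plays the role of the iteration count n in the text: each additional iteration quadruples the number of cells and therefore the range of index values.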
2.7 Temporal Aspect
Much of the spatial data collected has a time component which indicates the time that the data was
collected (timestamp). In some analyses, taking the timestamp into consideration produces more
meaningful results because the analysis can focus on, for example, more recent recordings.
Adding temporal support to spatial data querying comes with a set of unique challenges. For ex-
ample, each spatial object must be able to hold information about its own timestamp. Partitioning
the data will also need to take into account the time factor and instead of only using a spatial index
like an R-Tree, a time index like an interval tree must be added. By doing so, the data processing
workflow will change since, for example, a specific object must be duplicated if it spans multiple
time segments. Performing the spatial query is also affected since the query should account for
the specified time and include or exclude certain objects. Join, kNN, and filter queries become
spatio-temporal queries whose predicates must examine the time factor. Out of all the systems
mentioned in this survey, only the Spark-based systems in [44, 58] have taken the temporal aspect
into consideration, and only in a very limited scope.
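As a minimal sketch of what such a spatio-temporal predicate involves, the Python example below attaches a timestamp to each point, filters on both the spatial window and the time interval, and lists the time segments an object would have to be assigned to when it spans several of them. The names STPoint, spatio_temporal_range, and time_buckets, and the use of fixed-width time segments, are assumptions made for this illustration and are not taken from [44] or [58].

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class STPoint:
    x: float
    y: float
    timestamp: float  # e.g. seconds since some epoch


def spatio_temporal_range(records: Sequence[STPoint],
                          window: Tuple[float, float, float, float],
                          interval: Tuple[float, float]) -> List[STPoint]:
    """Keep records inside both the rectangle and the closed time interval."""
    min_x, min_y, max_x, max_y = window
    start, end = interval
    return [r for r in records
            if start <= r.timestamp <= end
            and min_x <= r.x <= max_x and min_y <= r.y <= max_y]


def time_buckets(start: float, end: float, width: float) -> List[int]:
    """IDs of the fixed-width time segments that the interval [start, end] touches.

    An object valid over [start, end] would be copied into each of these segments.
    """
    return list(range(int(start // width), int(end // width) + 1))


records = [STPoint(1.0, 1.0, 100.0), STPoint(1.0, 1.0, 900.0), STPoint(9.0, 9.0, 100.0)]
recent_nearby = spatio_temporal_range(records, (0.0, 0.0, 5.0, 5.0), (0.0, 500.0))
```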
2.8 Interactive vs Batch
There are two types of spatial search that a system can offer, interactive and batch. An interactive
search is a live search such that the search query is executed on demand. All systems surveyed
here are designed for batch processing but some can be used for interactive searches [58, 61, 37].
An example of an interactive search is a person looking for shops in a specific neighborhood. The
dataset containing all the shops in all areas is preprocessed and put on standby (RAM or disk).
When the search query arrives, it is executed against the existing dataset and results are filtered
accordingly. This type of search is ideal for one large dataset and one relatively small one. Frameworks
like [61, 37] have described a graphical interactive interface that they developed to demonstrate
their techniques.
Batch processing involves multiple objects (one or more types) across two datasets. Objects from
both sets are examined in order to determine their relationship. An example of this would be
finding out which of a given dataset of tweets originated near a body of water (lake, pond, river,
etc.). This type of search usually calls for the preprocessing and indexing of both datasets in order
to perform the spatial query.
Each one of these search types requires its own optimization. The size of the dataset(s) in question
is key in both cases as the result might call for writing a portion of the dataset(s). The system must
also be smart about the amount of memory available and how much of it is used for computations
and for object caching. If the system is distributed, then distributed memory may call for some
data shuffling between the physical machines. This requires high-speed data connections which
may translate into time and monetary costs.
2.9 Scalability
A system’s scalability is measured by its ability to handle an increasing amount of work without failure.
Additionally, the system must be able to make use of any added resources it gets and put them to
optimal use.
A modern-day analysis system must be scalable and handle terabytes or petabytes of data. Since
RAM is usually not large enough to hold all of the data in memory, physical disks are utilized.
Distributed storage systems like HDFS were invented to store files as large as the physical stor-
age units. Frameworks like MapReduce and Spark provide a common framework for developing
distributed programs without the users worrying about the complex operations or low-level data
communications and concurrency controls.
2.10 Reliability
Hardware reliability has come a long way since the early days of computing. Hardware failures
still occur, but they are rare and their impact is further reduced by techniques like uninterruptible
power supplies, load-balancing, disk redundancy, and server cloning.
While a reliable analytic system requires reliable hardware, it should be able to recover from
hardware and software faults without losing previous computations. An error or a crash should not
require the complete restart of the job. Some of the techniques used include writing intermediate
results to permanent storage media or taking periodic backups. These techniques can be used to
automatically restart the task from the point of failure. Spark provides fault-tolerance through
its RDD technology which internally builds a lineage graph6. Spark-based systems automatically
inherit and leverage this technology for reliability.
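The lineage idea can be illustrated with a minimal sketch. The class below is a toy stand-in and not Spark's API: the names ToyDataset, collect, and drop_cache are hypothetical, and a single in-process list plays the role of a distributed partition. Each derived dataset only remembers its parent and the transformation that produced it, so a lost result can be rebuilt by replaying the chain instead of restarting the whole job.

```python
class ToyDataset:
    """Toy read-only dataset that records its lineage (hypothetical, not Spark's API)."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source   # raw rows, set only on the root dataset
        self.parent = parent   # upstream dataset this one was derived from
        self.fn = fn           # transformation applied to the parent's rows
        self.cache = None      # materialized rows, may be lost at any time

    def map(self, f):
        return ToyDataset(parent=self, fn=lambda rows: [f(r) for r in rows])

    def filter(self, pred):
        return ToyDataset(parent=self, fn=lambda rows: [r for r in rows if pred(r)])

    def collect(self):
        # Recompute from the lineage whenever the cached rows are missing.
        if self.cache is None:
            if self.parent is None:
                self.cache = list(self.source)
            else:
                self.cache = self.fn(self.parent.collect())
        return self.cache

    def drop_cache(self):
        # Simulates a node failure that wipes the in-memory result.
        self.cache = None


root = ToyDataset(source=[1, 2, 3, 4])
derived = root.map(lambda x: x * 10).filter(lambda x: x > 15)
first = derived.collect()
derived.drop_cache()          # "failure"
second = derived.collect()    # rebuilt by replaying map and filter
```

In this sketch no intermediate result is ever written to disk; recovery costs recomputation time instead of I/O, which is the trade-off the lineage approach makes.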
3 Spatial Data Analysis Frameworks
Finding a single machine that is able to process today’s large datasets is challenging; finding one
to process large datasets in a timely manner is near impossible. This is because the processing
performance is directly proportional to the available CPU, memory, and network resources. A
single machine can only be scaled up so much before scaling out becomes necessary. Distributed
computing systems were developed in order to speed up the computation process across independent
machines. The idea is simple: instead of one machine processing the data in sequence, multiple
machines work in parallel, with each processing a small portion of the dataset.
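As a rough illustration of this divide-and-process idea, the sketch below splits a list of points into chunks and counts, per chunk, how many points fall inside a query box. It is only an analogy under stated assumptions: threads on one machine stand in for independent machines (Python threads do not provide true CPU parallelism), and the helper names split, count_in_box, and parallel_count are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split(records, n_parts):
    """Split records into n_parts nearly equal contiguous chunks."""
    size, extra = divmod(len(records), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(records[start:end])
        start = end
    return chunks


def count_in_box(chunk, box):
    """Work done by one worker: count the points of its chunk inside the box."""
    xmin, ymin, xmax, ymax = box
    return sum(1 for x, y in chunk if xmin <= x <= xmax and ymin <= y <= ymax)


def parallel_count(points, box, workers=4):
    """Each worker handles one chunk; the partial counts are summed at the end."""
    chunks = split(points, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = pool.map(count_in_box, chunks, [box] * len(chunks))
        return sum(partial)
```

The final sum plays the role of the master node collecting partial results from the processing nodes.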
Apache Hadoop and Apache Spark are two of the most widely used distributed computing frameworks.
Both commonly run on top of the Hadoop Distributed File System[4] (HDFS) and offer operations that
abstract the complex and error-prone procedures of low-level data communications and concurrency
controls. They are data-neutral and allow users to write customized tasks that automatically
distribute the workload across multiple machines. These machines (called processing nodes) are
managed by one master node which decides how the workload is distributed and keeps track of the
nodes’ progress.
Both Hadoop and Spark are suitable for most types of datasets since they allow users to write
custom code for their datasets and operations. However, this presents a problem for spatial datasets
since the processing framework needs to recognize the spatial object’s shape in order to process it.
To that effect, spatial frameworks were developed to utilize Hadoop or Spark to allow for efficient
spatial processing. From an end user’s viewpoint, all frameworks perform the same task; they
take in as input one or more datasets, perform a specific spatial operation, and finally produce
6 A lineage graph is a Directed Acyclic Graph (DAG) that shows the different phases of RDD transformations from the start RDD to the end RDD.
the results. Users will differentiate the frameworks by speed, accuracy, and supported objects and
operations.
The differences between the spatial frameworks are due to a number of reasons. Each framework
implements its own techniques which may be an improvement of another framework or simply
introduce new ones. Some integrate better with the underlying structures by working directly
with the framework’s core API; others will simply build on top of the framework and avoid the
core. Support of objects and operations is also subjective and may be due to limitations with
the underlying algorithm being implemented or simply because the researchers wanted to target a
specific problem.
3.1 Hadoop-Based Frameworks
In 2003, Google published a paper that details a proprietary distributed file system called the
Google File System (GFS)[41]. It had several advantages, but mainly it was able to store and
retrieve truly large files quickly and safely. Files in GFS are split into segments of 64 megabytes
and replicated across different servers. By doing so, GFS eliminates single points of failure and
provides high availability and scalability. Additionally, GFS does not require specialized hardware
and can run on inexpensive commodity hardware, which makes it extremely attractive. In 2006, Yahoo
engineers implemented their own version of GFS called the Hadoop Distributed File System
(HDFS)[53]. The project was then donated to the Apache Software Foundation, which is now in
charge of maintaining it[4].
A viable large-file storage solution is incomplete without an effective way to process files. For
GFS, Google developed a data processing model called MapReduce[36]. Apache followed in their
footsteps and created their own, but similar, version of MapReduce7. The idea of MapReduce is to
take the program to the data instead of the traditional way of bringing the data to the program.
Such a model sparked the development of many fast and parallel data processing techniques.
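The programming model itself can be sketched in a few lines. The code below is a single-process imitation and not Hadoop's API: map_reduce, cell_mapper, and count_reducer are hypothetical names, and the grouping dictionary stands in for the shuffle that a real cluster performs over the network. The example counts points per grid cell, assuming a cell size of 10 units.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Single-process imitation of the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return {key: reducer(key, values) for key, values in groups.items()}


def cell_mapper(point, cell_size=10.0):
    """Emit (grid cell, 1) for a point; the cell size is an assumed parameter."""
    x, y = point
    yield (int(x // cell_size), int(y // cell_size)), 1


def count_reducer(key, values):
    """Sum the 1s emitted for a cell to get its point count."""
    return sum(values)


counts = map_reduce([(1, 1), (2, 3), (15, 1), (25, 25)], cell_mapper, count_reducer)
```

The user supplies only the mapper and the reducer; distribution, grouping, and scheduling are left to the framework, which is what makes the model attractive for large datasets.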
Ever since its release, Hadoop has proven to be an excellent system for processing big datasets
regardless of the dataset’s type. The range of applications that utilize Hadoop is wide and includes
7 HDFS and MapReduce are usually referred to as just Hadoop.
machine learning[47], sorting of terabyte datasets[47], stock market data analysis and prediction[32],
and big data analysis[48]. Naturally, spatial data is no exception, and Hadoop can be used to
process it. However, Hadoop does not recognize spatial data; therefore, processing spatial data
takes longer than it should. A better approach is to read spatial data from HDFS
and transform it into runtime spatial objects. Afterward, specialized spatial query engines
execute parallel techniques against these objects to produce the desired results. Frameworks like
Esri GIS Tools for Hadoop[9], Hadoop-GIS[30], and SpatialHadoop[37] do just that and empower
Hadoop to become spatially aware, thus improving results and runtimes.
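A minimal sketch of the "text record to runtime spatial object" step follows. It assumes the input lines use a small subset of the Well-Known Text (WKT) format, which is an assumption of this example and not something the surveyed frameworks are stated to require. The Point and Polygon classes and the parse_wkt function are hypothetical; only POINT and single-ring POLYGON records are handled.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Point:
    x: float
    y: float


@dataclass(frozen=True)
class Polygon:
    ring: Tuple[Tuple[float, ...], ...]   # outer ring only, no holes

    def mbr(self):
        """Minimum bounding rectangle as (xmin, ymin, xmax, ymax)."""
        xs = [c[0] for c in self.ring]
        ys = [c[1] for c in self.ring]
        return (min(xs), min(ys), max(xs), max(ys))


_POINT = re.compile(r"^\s*POINT\s*\(\s*(\S+)\s+(\S+)\s*\)\s*$", re.I)
_POLYGON = re.compile(r"^\s*POLYGON\s*\(\(\s*(.+?)\s*\)\)\s*$", re.I)


def parse_wkt(line):
    """Turn one text record into a spatial object, or raise ValueError."""
    m = _POINT.match(line)
    if m:
        return Point(float(m.group(1)), float(m.group(2)))
    m = _POLYGON.match(line)
    if m:
        ring = tuple(tuple(float(v) for v in pair.split())
                     for pair in m.group(1).split(","))
        return Polygon(ring)
    raise ValueError("unsupported record: %r" % line)
```

Once records are objects with methods such as mbr(), a query engine can index and compare them by shape instead of treating each line as an opaque string.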
3.1.1 Esri GIS Tools for Hadoop
Esri GIS Tools for Hadoop is a set of tools published by the Environmental Systems Research
Institute (Esri)8. One of Esri’s most popular products is ArcGIS[5], software used for creating and
working with geographical maps. In order to harness the power of Hadoop, Esri released a set
of tools for performing spatial operations on Hadoop and importing the results into ArcGIS. The
tools are designed to provide OGC-compliant spatial functionality similar to that found in geospatial
database systems like PostGIS and Oracle Spatial.
The Esri GIS Tools framework consists of three layers (Figure 2). The Esri Geometry API for
Java layer allows MapReduce jobs to become spatially aware through defining geometry objects
(e.g. Point, Polygon), spatial operations (e.g. intersect, join), and spatial indexing (e.g. QuadTree,
HashTable). The Spatial Framework for Hadoop layer consists of a set of Hive User Defined
Functions (UDF) that enable users to write spatial queries in HQL9. The Geoprocessing Tools for
Hadoop layer offers a set of tools for connecting Hadoop and ArcGIS, submitting workflow jobs,
and converting data to and from JSON10. Unlike the previous two layers, the Geoprocessing Tools for
Hadoop layer is implemented in Python rather than Java.
A job in Esri GIS Tools for Hadoop consists of writing SQL-like queries using HQL. Queries are
then translated into spatially-aware MapReduce tasks that extract relevant data. For example,
8 Esri is a software company specializing in Geographic Information System software and services. https://www.esri.com
9 Hive Query Language (HQL) is a SQL-like language for Hadoop.
10 JSON: JavaScript Object Notation. https://www.json.org
Figure 2: Architecture of Esri GIS Tools for Hadoop[9]
in the case of a kNN query involving Point and Polygon datasets, a single MapReduce job
locally indexes the entire Polygon dataset in the memory of the processing nodes, and the Points are
then sequentially checked to determine which Polygon they fall within. The reducer can then perform
a job like aggregating the number of Points within each Polygon. The reduce step causes a lot of
data shuffling as Points get routed to the proper processing node. As with any Hadoop task,
the results are finally written back to HDFS. The output is in JSON format, which makes it
easy for ArcGIS to import and process the results.
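The map and reduce roles in this workflow can be sketched as follows. This is an illustrative reimplementation and not code from the Esri tools: the function names are hypothetical, a linear scan over the polygons replaces the in-memory index, and a standard ray-casting test replaces the Esri Geometry API.

```python
from collections import Counter


def point_in_polygon(pt, ring):
    """Ray-casting test; ring is a list of (x, y) vertices of a simple polygon."""
    x, y = pt
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does a horizontal ray from pt to the right cross this edge?
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def map_points(points, polygons):
    """Map side: emit (polygon_id, 1) for the first polygon containing each point."""
    for pt in points:
        for pid, ring in polygons.items():   # linear scan stands in for the index
            if point_in_polygon(pt, ring):
                yield pid, 1
                break


def reduce_counts(pairs):
    """Reduce side: aggregate the number of points per polygon."""
    counts = Counter()
    for pid, one in pairs:
        counts[pid] += one
    return dict(counts)


polygons = {"a": [(0, 0), (4, 0), (4, 4), (0, 4)],
            "b": [(5, 0), (9, 0), (9, 4), (5, 4)]}
points = [(1, 1), (2, 2), (6, 1), (20, 20)]
per_polygon = reduce_counts(map_points(points, polygons))
```

Because every mapper needs the full polygon set in memory, this pattern works only while the polygon dataset is small, which is consistent with the scalability limits noted for this framework.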
The Esri GIS Tools for Hadoop can be imported as a library and included in a user’s MapReduce
task. However, they were developed to extend the capabilities of ArcGIS. Depending on the task, a
global index may or may not be employed. For example, the kNN query involving Points
and Polygons does not utilize a global grid, whereas an aggregate hotspot query uses a global grid
index, which increases the number of map tasks. Esri GIS Tools for Hadoop is intended for geometry
filtering and therefore is unable to support very large datasets. Any dataset that nears a terabyte
in size cannot be processed.
[Figure 3 image: page excerpt from the Hadoop-GIS paper. Recoverable diagram labels: Input Data, Storage, Querying System, QLSP Query Language, Query Translation (QLSP Parser/Query Translator/Query Optimizer), Query Engine, RESQUE Spatial Query Processor, Spatial Index Builder, Boundary Handling, Data Partitioning, Tile Spatial Indexes, Global Spatial Indexes, Hadoop, HDFS, Web Interface, Cmd Line Interface.]
Figure 3: Architecture of Hadoop-GIS[30]
3.1.2 Hadoop-GIS
Hadoop-GIS is a framework for processing spatial datasets on Hadoop. It aims to provide fast
and scalable processing of spatial datasets in a warehousing system that is already
running Hadoop. Its architecture (Figure 3) consists of three major layers built on top of Hadoop:
Query Language, Query Translation, and Query Engine. The Query Language layer extends
the Hive query language to introduce support for spatial objects and operations. Users are able
to write spatial queries directly in Hive, which simplifies the MapReduce writing process. The
Query Translation layer optimizes the Hive code and translates it into proper MapReduce tasks
in order to perform the query. Finally, the Real-time Spatial Query Engine (RESQUE) performs
tasks like spatial indexing, query execution, and spatial boundary handling. The source code for
these layers[10] is a mix of Java, C++, and Python and utilizes the open source
libraries LibSpatialIndex[14] and GEOS[8]. Through the use of these libraries, Hadoop-GIS can
reuse existing code written in languages other than Java (C++ and Python) and allows
users to run programs written in these languages. Running Hadoop-GIS tasks requires some
preliminary setup: all libraries have to be pre-installed and the proper environment variables
set up[11].
Hadoop-GIS works by applying a series of MapReduce jobs, with each job starting by reading a
file from HDFS and ending by writing its results to a new file in HDFS. This is necessary because
Hadoop-GIS streams its input data and relies on HDFS/MapReduce, which must write intermediate
results to disk. While this feature achieves fault tolerance, it is I/O intensive. Moreover, streaming
is less efficient than a direct HDFS read, but Hadoop-GIS requires it since it uses and allows
non-Java tasks.
Hadoop-GIS starts by scanning all records from both datasets and applying any filtering operations.
The filtered records from both datasets are then sampled and indexed based on a grid built from
the sampled data. The indexes of both datasets are used to build a global index, which is then used
to partition both datasets. This step places objects into groups (called buckets or tiles[30]). The
spatial objects’ MBRs are calculated, and objects with overlapping MBRs are placed in the same
bucket. Each bucket is assigned a unique ID for identification, and the final results are written to
disk. This step relies on Hadoop streaming, which is slower than a direct HDFS read. Moreover, a
sample is representative only if the data itself is uniformly distributed, which is hardly the case
with spatial data. The assignment of buckets relies on the objects’ MBRs, which only approximate
the actual shapes and can produce a large number of false positives. Duplicates can also arise
during this step; Hadoop-GIS remedies this by sorting the final results and filtering out duplicates.
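The bucket assignment and the sort-based duplicate removal can be sketched as below. This is a simplified illustration and not Hadoop-GIS code: it assumes a fixed uniform grid with a given origin and cell size instead of a grid derived from sampled data, and the names mbr, tiles_for, assign, and dedupe are hypothetical. An object whose MBR overlaps several tiles is assigned to all of them, which is where duplicates originate.

```python
def mbr(coords):
    """Minimum bounding rectangle (xmin, ymin, xmax, ymax) of a coordinate list."""
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))


def tiles_for(box, origin, cell):
    """Yield the ID of every grid tile that the MBR overlaps."""
    xmin, ymin, xmax, ymax = box
    ox, oy = origin
    for cx in range(int((xmin - ox) // cell), int((xmax - ox) // cell) + 1):
        for cy in range(int((ymin - oy) // cell), int((ymax - oy) // cell) + 1):
            yield (cx, cy)


def assign(objects, origin=(0.0, 0.0), cell=10.0):
    """Place each object ID into every bucket (tile) its MBR touches."""
    buckets = {}
    for oid, coords in objects.items():
        for tile in tiles_for(mbr(coords), origin, cell):
            buckets.setdefault(tile, []).append(oid)
    return buckets


def dedupe(pairs):
    """Sort the result pairs and drop adjacent repeats."""
    out = []
    for p in sorted(pairs):
        if not out or out[-1] != p:
            out.append(p)
    return out


objects = {"a": [(1, 1), (3, 3)], "b": [(8, 8), (12, 12)]}
buckets = assign(objects)
```

Here object "b" straddles a tile corner and lands in four buckets, so any query result involving it may be produced up to four times before dedupe runs.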
After each object is assigned to a grid ID, Hadoop-GIS shuffles the data such that objects with the
same ID are placed on the same partition. This step involves reading the files from the previous
step. Since the datasets are not uniformly distributed, Hadoop-GIS tries to lessen the effect of data
skew by splitting large partitions into two or more smaller sub-partitions. The overhead associated
with this step seriously degrades the performance as it requires reading and writing files from HDFS
as well as data shuffling.
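One way to picture the skew-handling step is a recursive split of any partition that exceeds a capacity. The rule used below (halve at the median along the wider axis) is an assumption chosen for illustration; the text above does not state the exact splitting rule that Hadoop-GIS applies, and split_dense is a hypothetical name.

```python
def split_dense(points, capacity):
    """Recursively halve a partition until each piece holds at most `capacity` points."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if len(points) <= capacity:
        return [points]
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    # Split along the axis with the larger spread (assumed rule).
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return split_dense(ordered[:mid], capacity) + split_dense(ordered[mid:], capacity)


dense = [(i, 0) for i in range(10)]
parts = split_dense(dense, 3)
```

Every extra level of splitting balances the load better but, in a MapReduce setting, also implies another round of reading, shuffling, and writing, which is the overhead described above.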
Once the data is split into the proper partitions, Hadoop-GIS builds a local R-Tree index on each of
the partitions. This index is used to query one dataset against the other, which speeds up the query
processing step performed by the query engine (RESQUE). The engine utilizes the GEOS
library to compute the actual relationship between the objects (e.g. distance). In order to remove
duplicates, a sort process is performed before writing unique results one final time to HDFS.
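The two-stage pattern described here, a cheap index-based filter followed by an exact geometric test, can be sketched on line segments. This is an illustration under stated assumptions and not RESQUE code: a nested loop over MBRs replaces the R-Tree, a hand-written orientation test replaces GEOS, and all function names are hypothetical.

```python
def seg_mbr(seg):
    """MBR (xmin, ymin, xmax, ymax) of a segment given as ((x1, y1), (x2, y2))."""
    (x1, y1), (x2, y2) = seg
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def mbrs_overlap(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def _orient(p, q, r):
    """Sign of the cross product (q - p) x (r - p): 1, -1, or 0 if collinear."""
    v = (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])
    return (v > 0) - (v < 0)


def _within_box(p, q, r):
    """True if r lies inside the bounding box of segment pq (for collinear cases)."""
    return (min(p[0], q[0]) <= r[0] <= max(p[0], q[0])
            and min(p[1], q[1]) <= r[1] <= max(p[1], q[1]))


def segments_intersect(s, t):
    """Exact test: do segments s and t share at least one point?"""
    p1, q1 = s
    p2, q2 = t
    o1, o2 = _orient(p1, q1, p2), _orient(p1, q1, q2)
    o3, o4 = _orient(p2, q2, p1), _orient(p2, q2, q1)
    if o1 != o2 and o3 != o4:
        return True
    return ((o1 == 0 and _within_box(p1, q1, p2)) or (o2 == 0 and _within_box(p1, q1, q2))
            or (o3 == 0 and _within_box(p2, q2, p1)) or (o4 == 0 and _within_box(p2, q2, q1)))


def filter_and_refine(left, right):
    """Filter by MBR overlap, then refine the candidates with the exact test."""
    candidates = [(i, j) for i, s in left.items() for j, t in right.items()
                  if mbrs_overlap(seg_mbr(s), seg_mbr(t))]
    matches = [(i, j) for i, j in candidates if segments_intersect(left[i], right[j])]
    return candidates, matches


left = {"l1": ((0, 0), (4, 4))}
right = {"r1": ((0, 4), (4, 0)), "r2": ((3, 0), (4, 1))}
candidates, matches = filter_and_refine(left, right)
```

Segment r2 passes the MBR filter but fails the exact test, which is the kind of false positive that the refinement step exists to discard.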
Figure 4: Architecture of SpatialHadoop[37]
3.1.3 SpatialHadoop
SpatialHadoop is a spatial data processing framework for Hadoop. It offers a tighter interaction
with Hadoop than Hadoop-GIS and Esri GIS Tools for Hadoop via the use of low-level Hadoop
APIs. Tasks in SpatialHadoop recognize spatial operations directly and pass operations to the
built-in query engine. Its architecture (Figure 4) consists of four layers: language, storage,
MapReduce, and operations. The storage layer provides a mechanism to index input files and write
them back to HDFS. This layer is I/O intensive but necessary in order to persist results. The
MapReduce layer extends Hadoop’s MapReduce by adding two new components (SpatialFileSplitter and
SpatialRecordReader) to allow for distributed spatial query processing. The operations layer
introduces a number of spatial operations (e.g. Range, kNN, Join) and a number of spatial objects
(e.g. Point, Rectangle, Polygon). This is the layer that executes the steps for performing the
specified query. Finally, the language layer (called Pigeon) extends Hadoop’s Pig Latin language, a
SQL-like high-level language intended to simplify MapReduce programming in Hadoop. Pigeon introduces
new constructs through a set of user-defined functions that create the spatial types and operations.
The addition of the Pigeon language will require users to have a good understanding of Hadoop
and Pig Latin programming before learning the new constructs.
SpatialHadoop relies on configuration files and comes pre-configured to run with any spatial dataset
on all versions of Hadoop[13]. While common operations are supported, users may wish to change
the configuration files and fine-tune the framework depending on the task at hand. This, again,
requires good Hadoop experience and knowledge of spatial data programming. It is also tedious
if the configuration needs to change from task to task. For example, the sample ratio
is controlled by the setting spatialHadoop.storage.SampleRatio, which defaults to 0.01. R-Tree
indexing is controlled by the setting spatialHadoop.storage.RTreeBuildMode, which has two
options: fast, which requires more memory but less time, and light, which uses less memory but
more time.
SpatialHadoop starts by building a partitioning scheme that takes into consideration the HDFS
block size (64MB), the proximity of spatial objects, and the number of objects in each partition.
This step ensures that nearby spatial objects are assigned to the same partition. To avoid large
indexes, SpatialHadoop only uses a sample of both datasets. Results are written to HDFS, where
they stay until they are read again for the next phase. After the data is partitioned, a local index
is built for each partition. Because of the previous step, the size of the local index will not
exceed 64MB and hence will be treated by HDFS as a single block when written to HDFS. If the size
is less than 64MB, SpatialHadoop will pad the block with 0s to fill the entire block. After the
local indexes are built and written to HDFS, the global index is built by merging all files into one
single file using HDFS’s concat command. The global index file is then loaded into the Master node’s
main memory, where it will be utilized to index the spatial data blocks using their MBRs. The
partitioning scheme that is followed here relies heavily on HDFS for data persistence. However, this
degrades the performance since disk I/O is expensive, and if the input files change, the indexing
will no longer be valid.
After the data is correctly partitioned, SpatialHadoop follows a similar approach to that of Hadoop-
GIS. A local index is built on each partition and then queried in order to discern an initial
relationship between the objects. Finally, the spatial library JTS[15] is used to compute the actual
relationship between the objects. Duplicates may arise due to objects overlapping multiple grid
cells. To remedy this, SpatialHadoop runs a duplicate avoidance technique which requires the
computation of the intersection between the resulting record and the query area. Records are added to
the final result only if the top-left corner of the intersection is inside the partition boundaries.
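The duplicate avoidance rule can be sketched as follows. This is a minimal illustration and not SpatialHadoop code: boxes are (xmin, ymin, xmax, ymax) tuples with y increasing upward, the names intersection and owns are hypothetical, and the half-open boundary convention is an assumption made here so that a corner lying on a shared edge belongs to exactly one partition.

```python
def intersection(a, b):
    """Intersection of two boxes (xmin, ymin, xmax, ymax), or None if disjoint."""
    xmin, ymin = max(a[0], b[0]), max(a[1], b[1])
    xmax, ymax = min(a[2], b[2]), min(a[3], b[3])
    if xmin > xmax or ymin > ymax:
        return None
    return (xmin, ymin, xmax, ymax)


def owns(partition, record_mbr, query):
    """True if this partition is the one that should report the record."""
    inter = intersection(record_mbr, query)
    if inter is None:
        return False
    ref_x, ref_y = inter[0], inter[3]          # top-left corner: min x, max y
    pxmin, pymin, pxmax, pymax = partition
    # Half-open test (assumed convention): left and top edges are inclusive.
    return pxmin <= ref_x < pxmax and pymin < ref_y <= pymax


left_part = (0, 0, 10, 10)
right_part = (10, 0, 20, 10)
query_area = (0, 0, 20, 10)
record = (8, 2, 12, 4)       # overlaps both partitions, so it is stored in both
reported_by = [name for name, part in (("left", left_part), ("right", right_part))
               if owns(part, record, query_area)]
```

Each partition can apply this test locally, so no sorting or cross-partition comparison is needed; in this sketch, reference points on the outer border of the whole grid would need extra care.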
3.1.4 Features and Performance Summary
The major aim of the previously mentioned frameworks is to provide spatial support on Hadoop.
Their approaches differ in ways like the objects and operations they support, the techniques they
use, the underlying languages, and the required expertise level. Table 1 shows a high-level summary
of these features; however, all of these frameworks suffer from the same drawback of relying on
HDFS for fault tolerance.
A number of experiments were done to compare these frameworks in terms of speed and scalability.
In [60], a number of spatial datasets were used to gauge the performance of Hadoop-GIS and
SpatialHadoop, the largest being 6.9GB. Hadoop-GIS was not able to process this dataset, but
SpatialHadoop succeeded. In the same experiment, the authors reduced the dataset to
1/12 of its size in order to gauge Hadoop-GIS’s performance. In this test, SpatialHadoop
outperformed Hadoop-GIS. The authors concluded that the problem is due to Hadoop-GIS’s
intensive I/O, streaming approach, and use of the GEOS library.
A more detailed experiment was done in [58], which compared a number of non-Hadoop based
frameworks along with Hadoop-GIS and SpatialHadoop. The experiments showed that SpatialHadoop
generally outperforms Hadoop-GIS. The first experiment focused on the index construction time
of the frameworks and showed that SpatialHadoop is faster than Hadoop-GIS using a dataset of
4.4 billion records. A second experiment compared the local index sizes and showed that
SpatialHadoop requires slightly less memory than Hadoop-GIS. However, Hadoop-GIS uses slightly less
memory for its global index. Another two experiments focused on throughput and latency when
performing Range and kNN queries. Both frameworks produced results that are close to one
another in the Range test, but Hadoop-GIS failed the kNN test. The final experiment tested the Join
operation, and the results showed SpatialHadoop to be the better framework, with Hadoop-GIS
failing to complete the operation.
Feature | Esri GIS Tools | Hadoop-GIS | SpatialHadoop
Release Date | 2013 | 2013 | 2014
Last Update | 2018 | 2012 | 2018
Integration Approach | On Top of Hadoop | On Top of Hadoop | Into Hadoop
Language Integration | HiveQL | HiveQL | Pigeon (Pig Latin)
OGC Compliant | Yes | No | Yes
Geometry Library | Esri Geometry API Java | GEOS | JTS
Global Indexing | Grid (Partial) | Grid | Grid, R-Tree
Local Indexing | PMR Quadtree | R-Tree | R-Tree or R+-Tree
Index Persistence | No | Yes | Yes
Data Pruning | No | No | Yes
Carry non-spatial data | No | No | No
Skew Handling Level | Partition | Partition | Partition
Mixed object Query | No | No | No
Base Code | Java, Python (non-spatial tools) | Java, Python, and C++ | Java
Allows user-defined operations | No (base-code modifications required) | No (base-code modifications required) | No (base-code modifications required)
Installation | None | GEOS, libspatialindex | None
Configuration | No | System environment variables and Configuration Files | Configuration Files
Required Expertise Level | Regular Hive | Regular Hadoop | Advanced Hadoop
Spatial Objects | Point, Polygon, Line, Envelope (MBR) | Point, Box, Polygon | Point, Circle, Rectangle, LineString
Spatial Operations | Range, kNN, Join | Range, kNN, Join | Range, kNN, Join
Table 1: Feature comparison of Hadoop-based frameworks
3.2 Spark-Based Frameworks
Apache Hadoop gained considerable attention from users and researchers and became one of the
most popular distributed processing frameworks for large datasets. However, in 2013 this attention
began to shift when Apache released the first version of Apache Spark (Spark). Spark is compatible
with Hadoop but, more importantly, it addresses two major drawbacks of Hadoop: (1) the need to
write intermediate data to HDFS between tasks in order to achieve fault tolerance, and (2) the
limited support for in-memory data processing.
At the core of Spark is a technology called Resilient Distributed Dataset (RDD)[62, 19]. RDDs
are read-only collections of data that are distributed across different computing nodes. RDDs live
in the memory of processing nodes in a parallel computing cluster. Each is processed independently
with the possibility of moving data into and out of the nodes. There are two groups of
operations on RDDs: transformations and actions. Transformations (e.g. map, filter, union) are
lazy operations which are not executed immediately; once executed, they transform the RDD into
a new RDD. Actions (e.g. foreach, count, reduce), on the other hand, are operations that trigger
the execution of the pending transformations. Spark achieves fault-tolerance through building a lineage graph6.
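The distinction between lazy transformations and actions can be illustrated with a minimal Python sketch. This is not Spark code: the LazyDataset class and its methods are hypothetical names chosen for illustration, and the recorded list of steps is only a toy stand-in for a lineage graph.

```python
class LazyDataset:
    """Toy stand-in for an RDD: read-only, lazily evaluated, lineage-tracked."""

    def __init__(self, source, lineage=()):
        self._source = list(source)      # the original input data
        self._lineage = tuple(lineage)   # recorded (kind, function) steps

    # Transformations: record the step and return a NEW dataset; nothing runs yet.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        # Replaying the recorded steps from the source is also how a lost
        # result could be rebuilt after a failure.
        data = self._source
        for kind, fn in self._lineage:
            if kind == "map":
                data = [fn(x) for x in data]
            else:
                data = [x for x in data if fn(x)]
        return data

    # Actions: trigger the recorded transformations.
    def count(self):
        return len(self._compute())

    def collect(self):
        return self._compute()


calls = []

def square(x):
    calls.append(x)      # lets us observe when the function really runs
    return x * x

numbers = LazyDataset([1, 2, 3, 4])
evens_squared = numbers.map(square).filter(lambda v: v % 2 == 0)
```

At this point `calls` is still empty because no action has run; calling `evens_squared.collect()` executes both recorded steps, while `numbers` itself is left unchanged.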
Spark is written using the Scala[49, 20] functional programming language which is ideal for parallel
programming. This, along with the previously mentioned features, made Spark one of the most
popular big data processing frameworks. Naturally, spatial data processing is one of the areas that
became interested in Spark. Similar to Hadoop, Spark is a generic framework that leaves the details
of specific operations to the user. It offers a safe and convenient way to parallelize programs across
processing nodes without having to worry about low-level communication, concurrency control, or
fault tolerance. Spatial operations like join, union, and even kNN can be performed on Spark as-is;
however, results are achieved at considerable resource and time overhead. Therefore, specialized
frameworks were developed to run on top of Spark to make it spatially aware and ultimately achieve
quicker and more accurate results.
3.2.1 SpatialSpark
SpatialSpark[60, 59] is one of the earliest spatial data processing frameworks to take
advantage of Spark’s in-memory processing. It is written in Scala and was released in 2015 with the
aim of providing in-memory spatial operations on Apache Spark.
Its current code release[23] shows that it is able to perform spatial join queries on two datasets.
SpatialSpark has two modes of spatial join operations: broadcast and partitioned spatial join.
Broadcast spatial join is ideal for use with one small dataset (e.g. city or county boundaries) and
one large dataset (e.g. geo-tagged tweets). In a partitioned spatial join, two large datasets are
partitioned and individually processed across available computing nodes.
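The idea behind a broadcast spatial join can be sketched in a few lines of Python. This is an illustrative sketch only, not SpatialSpark's implementation: the Box type, the function names, and the sample data are assumptions, and the partitions are processed in a plain loop rather than on separate nodes.

```python
from collections import namedtuple

# Hypothetical rectangle type standing in for the small dataset's geometries.
Box = namedtuple("Box", "name min_x min_y max_x max_y")

def contains(box, point):
    x, y = point
    return box.min_x <= x <= box.max_x and box.min_y <= y <= box.max_y

def broadcast_join(small_boxes, point_partitions):
    """Join every partition of the large dataset against a full copy of the small one."""
    results = []
    for partition in point_partitions:      # each partition would run on its own node
        local_copy = list(small_boxes)      # stands in for the broadcast copy
        for point in partition:
            for box in local_copy:
                if contains(box, point):
                    results.append((box.name, point))
    return results

counties = [Box("A", 0, 0, 10, 10), Box("B", 10, 0, 20, 10)]
tweets = [[(1, 1), (15, 5)], [(25, 5), (9, 9)]]   # two partitions of points
matches = broadcast_join(counties, tweets)
```

Because every node holds the whole small dataset, the large dataset never has to be shuffled, which is why this mode only suits the case where one side is small.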
SpatialSpark starts by sampling one of its datasets and computing the MBR of each partition. The
MBRs are used to build a global spatial index which assigns each partition an ID and is then
broadcast to all processing nodes. Once the index is broadcast, partitions query the index
for each spatial object in order to determine which partition the object should be sent to. The
global index can be written to HDFS in order to be used in subsequent tasks. SpatialSpark uses
the groupByKey transformation to group objects with the same partition ID on the same partition.
Then the join method is used to join both datasets together. Overall, this process has a number
of drawbacks. First, similar to the Hadoop frameworks, the sampling of the dataset is only useful
in the rare case of uniform data distribution. Second, the broadcast method adds a networking
overhead which may increase processing time if the sampled dataset is large. Third, it is memory
intensive, especially because of the use of broadcast and groupByKey. These operations require
that the data be held in memory, and if the memory is not large enough to hold the indexes and
objects, SpatialSpark will fail.
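The sample-then-partition workflow described above can be approximated with the following Python sketch. It is a simplification under stated assumptions: instead of per-partition MBRs it derives vertical strip boundaries from the sample, and a dictionary of lists stands in for the groupByKey transformation. All function names are hypothetical.

```python
import bisect
import random
from collections import defaultdict

def strip_boundaries(sample, num_partitions):
    """Derive x-boundaries so each strip holds about the same number of sampled points."""
    xs = sorted(x for x, _ in sample)
    step = len(xs) / num_partitions
    return [xs[int(i * step)] for i in range(1, num_partitions)]

def partition_id(boundaries, point):
    # The sorted boundary list plays the role of the broadcast global index.
    return bisect.bisect_right(boundaries, point[0])

def group_by_partition(boundaries, points):
    groups = defaultdict(list)     # stand-in for groupByKey
    for p in points:
        groups[partition_id(boundaries, p)].append(p)
    return dict(groups)

random.seed(7)
points = [(random.uniform(0, 100), random.uniform(0, 100)) for _ in range(1000)]
sample = random.sample(points, 100)
boundaries = strip_boundaries(sample, 4)
groups = group_by_partition(boundaries, points)
```

The partition sizes in `groups` are only as balanced as the sample is representative, which is the first drawback noted above.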
After the datasets are partitioned, the local join process matches objects from both datasets.
Initially, this step relies on the objects’ MBRs but can utilize the JTS library to compute an
accurate relationship between the objects (e.g. Euclidean distance). Depending on the user’s
choice, SpatialSpark can build a local index before performing this computation. Overall, this
step is fairly quick, especially when objects are matched by their MBRs. However, due to the
sampling technique, some partitions might become more heavily loaded than others, which will increase
the processing time.
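The two-stage matching described above, a cheap MBR test followed by an exact geometric test, is commonly called filter and refine. The Python sketch below illustrates it for a point-in-polygon predicate. It does not use JTS; the ray-casting routine is a plain stand-in for the exact test, and all names and data are illustrative assumptions.

```python
def mbr(polygon):
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def in_mbr(box, point):
    min_x, min_y, max_x, max_y = box
    return min_x <= point[0] <= max_x and min_y <= point[1] <= max_y

def in_polygon(polygon, point):
    """Ray casting: count edge crossings of a ray going right from the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        (x1, y1), (x2, y2) = polygon[i], polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):                      # edge straddles the ray's height
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

triangle = [(0, 0), (10, 0), (0, 10)]
points = [(1, 1), (9, 9), (20, 20)]
box = mbr(triangle)
candidates = [p for p in points if in_mbr(box, p)]            # filter step
matches = [p for p in candidates if in_polygon(triangle, p)]  # refine step
```

The point (9, 9) passes the MBR filter but is rejected by the exact test, which shows why MBR-only matching is quick but can over-report matches.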
3.2.2 GeoSpark
GeoSpark[61] is a cluster computing Spark framework written in Java for processing large spatial
datasets. GeoSpark’s architecture (Figure 5) consists of two layers built on top of Spark, Spatial
RDD (SRDD) and Spatial Query Processing. The Spark layer remains unmodified and no instal-
lation is required as tasks use GeoSpark by including it as a library. The SRDD layer extends
Spark’s RDD class to enable RDDs to support spatial objects (Point, Polygon, Circle, Line, Rect-
angle) and spatial operations (Join, kNN, Range). The spatial query processing layer carries the
task of performing spatial queries against data in SRDDs.
Figure 5: GeoSpark Architecture[61]
GeoSpark starts by creating SRDDs for the input datasets automatically or via custom procedures.
For automatic SRDD creation, GeoSpark parses and builds spatial objects from input files if they
are in a recognizable format (e.g. CSV11, WKT12, GeoJSON13). Alternatively, users can parse
their input data, build the spatial objects, and then construct the SRDDs themselves. Automatic SRDD
creation might seem useful at first but is, in fact, much more restrictive. For example, a file
in CSV format must conform to a specific CSV style; namely, each row should contain only the
spatial object’s coordinates, without any non-spatial data. In essence, this feature seems to enforce
structure on spatial data, which is mostly unstructured.
Once the SRDDs are built, GeoSpark partitions them by building a global grid over the
entire dataset. The grid is built by sampling the datasets and computing the MBR of the entire
sample. The MBR is then partitioned into boxes such that each box has a unique ID and contains
about the same number of spatial objects. Next, GeoSpark examines each object in both datasets,
computes its MBR, and assigns it to a specific box in the grid. If an object falls within multiple grid
cells, a copy of that object is made and assigned to each of the overlapping cells. This step is
computationally intensive, requiring an initial pass over the first dataset in order to sample and
build the global grid, followed by another pass to assign each object a grid box ID. In addition, this
step generates duplicate objects to account for an object’s MBR spanning multiple grid cells. This
increases the resources (computation, memory, shuffling) required by the framework and calls for a
filtering process before the final results are produced.
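The duplication caused by grid assignment can be seen in a short Python sketch. As a simplifying assumption it uses a uniform grid of equally sized cells rather than cells balanced by object count, and the function names and sample objects are hypothetical.

```python
def grid_cells(obj_mbr, grid_mbr, rows, cols):
    """Return the IDs of all cells of a uniform rows x cols grid that an MBR overlaps."""
    gx0, gy0, gx1, gy1 = grid_mbr
    cell_w = (gx1 - gx0) / cols
    cell_h = (gy1 - gy0) / rows
    x0, y0, x1, y1 = obj_mbr

    def clamp(v, hi):
        return max(0, min(hi, v))

    c0 = clamp(int((x0 - gx0) // cell_w), cols - 1)
    c1 = clamp(int((x1 - gx0) // cell_w), cols - 1)
    r0 = clamp(int((y0 - gy0) // cell_h), rows - 1)
    r1 = clamp(int((y1 - gy0) // cell_h), rows - 1)
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

def assign(objects, grid_mbr, rows, cols):
    """Map cell ID -> object names; an object spanning several cells is copied into each."""
    cells = {}
    for name, obj_mbr in objects.items():
        for cell in grid_cells(obj_mbr, grid_mbr, rows, cols):
            cells.setdefault(cell, []).append(name)
    return cells

# MBRs given as (min_x, min_y, max_x, max_y); "b" straddles the grid center.
objects = {"a": (1, 1, 2, 2), "b": (4, 4, 6, 6), "c": (8, 1, 9, 2)}
cells = assign(objects, (0, 0, 10, 10), 2, 2)
copies = sum(len(names) for names in cells.values())
```

Here three input objects become six stored copies because "b" lands in all four cells, so any join result involving "b" may be produced more than once and has to be filtered later.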
11 Comma Separated Value
12 Well-Known Text
13 A format for encoding a variety of geographic data structures, http://geojson.org
After the objects in the SRDDs are assigned to their respective grid cells, GeoSpark examines
the objects within these SRDDs in order to decide whether an index is needed. This process is
carried out for each of the SRDDs, and the index is built only if doing so improves the overall query
execution time despite the cost of building it (scan time and memory). While this step is intended to
speed up the query, it may, overall, hurt the performance of the framework. The decision to index
the SRDD requires a partial or full scan of the spatial objects in that SRDD. Due to the nature
of Spark, unless data is cached, it will need to be recomputed the next time it is needed. Therefore,
either the memory requirement or the execution time of the task must increase. It does not seem that
users of GeoSpark have control over this step other than to specify the type of index that should
be used when GeoSpark decides to build the index.
With spatial objects stored in their respective SRDDs (with or without the index), the spatial query
processing layer begins executing the required operation. The steps GeoSpark follows
depend on the type of the query. For range queries, the query MBR is computed and then broadcast
to all SRDDs to check their spatial objects against that MBR. For join queries, the SRDDs
are joined using their grid IDs. Afterward, spatial objects are compared using their own MBRs in
order to decide if they overlap. For kNN queries, the framework computes the distance between
the spatial objects and keeps the best k matches (using a heap-based top-k algorithm). Afterward,
the results from different SRDDs on different nodes are grouped and the k best overall results are
kept. Naturally, the memory and/or read operation requirements will vary depending on the query’s
implementation. Overall, GeoSpark seems memory intensive as it caches data that it will need in future steps.
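The two-phase kNN strategy, a local top-k per partition followed by a global merge, can be sketched in Python with the standard heapq module. This is an illustrative sketch rather than GeoSpark's code; the function names and the sample partitions are assumptions.

```python
import heapq
import math

def local_knn(partition, query, k):
    """Return the k nearest points of one partition as (distance, point) pairs."""
    scored = ((math.dist(p, query), p) for p in partition)
    return heapq.nsmallest(k, scored)

def knn(partitions, query, k):
    """Merge the per-partition candidates and keep the k best overall."""
    candidates = []
    for partition in partitions:            # each partition would run on its own node
        candidates.extend(local_knn(partition, query, k))
    return [p for _, p in heapq.nsmallest(k, candidates)]

partitions = [[(0, 0), (5, 5), (9, 9)], [(1, 1), (2, 2), (8, 8)]]
nearest = knn(partitions, (0, 0), 3)
```

Each partition contributes at most k candidates, so the merge step handles at most k times the number of partitions, regardless of the dataset size.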
As a final step and right before producing the results, GeoSpark filters out duplicates that were
introduced by the partitioning and query execution steps. This is a necessary step since any duplicate
results will affect the accuracy of the results. In order to perform this step properly, the framework incurs
additional computing overhead to group, sort, and filter the data. GeoSpark does not perform a
finer computation step to compute the actual relation between the objects. Instead, its results rely
on the objects’ MBRs, and any further refinement is left to the user.
Figure 6: LocationSpark Architecture[54]
3.2.3 LocationSpark
LocationSpark[54] is a Spark framework for processing large spatial datasets. Its architecture
(Figure 6) consists of two layers built on top of Spark: Query Scheduler and Query Executor.
Both layers are implemented in Scala with the main aim of solving the query skew problem. The
Spark layer remains unmodified and no installation is required as tasks use LocationSpark as a
library. The query scheduler layer distributes the data across the different computing nodes in
a balanced way. The query executor selects the best execution plan based on the type of query
requested and the index that was used to index the data.
Data must be parsed and loaded into LocationSpark by creating RDDs of spatial objects that it
understands. Currently, LocationSpark supports Box and Point spatial objects with the kNN join
query. The Box object can be used as a generic spatial object capable of representing any spatial
object after calculating that object’s MBR. This means that the results of any query are not based
on the object’s actual boundaries, which produces incomplete results.
With data loaded into RDDs, LocationSpark proceeds to collect statistical information from random
samples of each partition using the query type and data points. This information is used to build a
global index over equal-sized groups of points and to identify potentially problematic spots (called
hotspots) in the data partitions. Based on these hotspots, a cost-based model calculates the overhead of
repartitioning the hotspot data by reallocating underutilized nodes. The user can specify either
a grid or a region quadtree as the global index type. While fast, the process of building a global
index from random data samples suffers from two major drawbacks. First, it has the overhead of
having to pass through the dataset (or at least the sample records) or persist the data in RAM for
the subsequent pass. Second, random selection is inherently nondeterministic, with each
run producing different results.
LocationSpark allows the grid index to be written to disk in order to speed up future operations.
With the global index built, LocationSpark partitions the entire dataset equally between the available
processing nodes. This step examines each object in the dataset to determine which processing
node it should be sent to. In order to perform spatial join queries, LocationSpark duplicates
the outer table and sends it to the processing nodes. It does this assuming that the
outer table is smaller and contains the query objects and that the inner table is the queried dataset.
Because this is a memory and communication intensive step, LocationSpark embeds in the global
index a spatial Bloom filter (sFilter). The sFilter allows for testing if a point falls within a given
spatial range; if it falls outside the query boundaries, it is not duplicated.
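The role such a filter plays can be illustrated with a simplified Python stand-in. The sketch below is not the sFilter itself, whose internal structure is not described here; it is a hypothetical grid bitmap (the OccupancyFilter class is an assumed name) that answers "certainly empty" or "possibly non-empty" for a query range, which is the property needed to avoid sending queries to partitions that cannot contribute results.

```python
class OccupancyFilter:
    """Coarse bitmap over a uniform grid: a bit is set if any data point falls in the cell."""

    def __init__(self, bounds, cells_per_side):
        self.min_x, self.min_y, self.max_x, self.max_y = bounds
        self.n = cells_per_side
        self.bits = [False] * (cells_per_side * cells_per_side)

    def _index(self, value, low, high):
        # Map a coordinate to a row or column number, clamped to the grid.
        i = int((value - low) / (high - low) * self.n)
        return max(0, min(self.n - 1, i))

    def add(self, point):
        col = self._index(point[0], self.min_x, self.max_x)
        row = self._index(point[1], self.min_y, self.max_y)
        self.bits[row * self.n + col] = True

    def may_contain(self, query):
        """False: the range is certainly empty. True: it has to be checked."""
        qx0, qy0, qx1, qy1 = query
        c0 = self._index(qx0, self.min_x, self.max_x)
        c1 = self._index(qx1, self.min_x, self.max_x)
        r0 = self._index(qy0, self.min_y, self.max_y)
        r1 = self._index(qy1, self.min_y, self.max_y)
        return any(self.bits[r * self.n + c]
                   for r in range(r0, r1 + 1) for c in range(c0, c1 + 1))

partition_filter = OccupancyFilter((0, 0, 100, 100), 10)
for p in [(5, 5), (12, 7), (55, 60)]:
    partition_filter.add(p)
send_query = partition_filter.may_contain((50, 50, 69, 69))   # overlaps an occupied cell
skip_query = partition_filter.may_contain((80, 80, 99, 99))   # no occupied cell in range
```

Like a Bloom filter, this structure can report false positives (an occupied cell whose points lie outside the exact range) but never false negatives, so skipping a partition on a negative answer is safe.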
Due to these optimizations, LocationSpark requires a large amount of memory to work and to store
its spatial data and indexes. Therefore, in order to reduce the memory requirements, LocationSpark
monitors access frequencies (time and number of hits) for each of the spatial objects.
Objects with low frequencies are serialized from memory to disk. With this step, it is clear that
the framework tries to reduce its memory usage; however, this comes at a cost since it
increases the amount of disk I/O, which is slower than memory access.
As a final step and right before producing the results, LocationSpark filters out duplicates that
were generated due to the global partitioning step. This is a necessary step since any duplicates
will affect the accuracy of the results. Additionally, LocationSpark’s output is currently limited
to only counting the number of points that fall within a specific Box. By doing so it discards the
spatial objects that were loaded into its RDDs and used during the computation.
Figure 7: STARK Architecture[44]
3.2.4 STARK
STARK[44] is a Spark framework for processing large spatial datasets. It differs from other frame-
works in taking into consideration the temporal attribute of spatial data (spatio-temporal frame-
work). It is written using the Scala language and tightly integrates itself with the Spark API
such that RDDs are automatically transformed into spatially-aware RDDs. It does this by taking
advantage of Scala’s implicit conversions – a technique that allows new methods to be added to
existing types.
STARK’s architecture (Figure 7) consists of four layers built on top of Spark: spatial RDD,
predicates, distance functions, and language. The Spatial RDD layer adds spatial functionality to
Spark’s RDDs. At the core of these RDDs is an object called STObject which contains the
spatio-temporal information of the spatial object. The time attribute of the object can be left blank,
in which case it is ignored by STARK’s queries. The predicates layer adds a number of predicates (e.g.
distance, intersects) to the spatial operations join and filter. The distance functions layer provides a
set of pre-programmed distance functions to be used with the predicate operations. The idea behind
this approach is to support data in different coordinate systems (Cartesian and geodetic),
which require different distance metrics for accurate computations. The spatial partitioner layer
decides on the best way to partition the objects across the different computing nodes. Currently,
STARK works with the spatial attribute and ignores the temporal attribute when partitioning or
indexing the datasets. Finally, a new language integration called Piglet extends Pig Latin in order
to add support for spatial data programming. Piglet adds a new geometry data type and new filter,
join, and indexing operators.
A job in STARK starts by accepting an RDD of type (STObject, Object). The first element (STObject)
holds the spatio-temporal information and the second element (Object) can be of any type
and is only carried through the computation steps. The RDD must be in this form in order for
STARK to work; otherwise, Scala’s implicit conversion will not recognize it as a spatial RDD. With
the Spatial RDDs built, STARK builds a global index using the spatial attribute stored in STObject
in order to be able to spatially partition the datasets. STARK offers two types of global indexing:
grid, which divides the dataset into equally-sized boxes and is not optimized for partition skews, and
cost-based binary space partitioning (BSP). BSP divides the dataset into boxes containing an equal
number of objects, thus providing a partition balancing technique that mitigates partition skews. The
user can select which technique to use, but that would require the user to have knowledge of the
data’s density.
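The difference between the two partitioning schemes on skewed data can be demonstrated with a small Python sketch. It is a simplified illustration, not STARK's algorithm: the recursive median split, the alternation between the x and y axes, and all function names are assumptions made for the example.

```python
def bsp(points, max_items, depth=0):
    """Recursively split points into two halves of equal count until each part is small enough."""
    if len(points) <= max(1, max_items):
        return [points]
    axis = depth % 2                       # alternate between x and y (an assumption)
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return (bsp(ordered[:mid], max_items, depth + 1)
            + bsp(ordered[mid:], max_items, depth + 1))

def fixed_grid(points, bounds, cells_per_side):
    """Count points per cell of a uniform grid, for comparison."""
    min_x, min_y, max_x, max_y = bounds
    counts = {}
    for x, y in points:
        col = min(cells_per_side - 1, int((x - min_x) / (max_x - min_x) * cells_per_side))
        row = min(cells_per_side - 1, int((y - min_y) / (max_y - min_y) * cells_per_side))
        counts[(row, col)] = counts.get((row, col), 0) + 1
    return counts

# Skewed data: a dense cluster near the origin plus a few scattered points.
cluster = [(i * 0.1, j * 0.1) for i in range(10) for j in range(10)]
scattered = [(50, 50), (90, 10), (10, 90), (99, 99)]
points = cluster + scattered

grid_counts = fixed_grid(points, (0, 0, 100, 100), 2)
bsp_sizes = [len(part) for part in bsp(points, max_items=30)]
```

On this input the fixed grid places 100 of the 104 points in a single cell, whereas the binary space partitioning yields four parts of 26 points each.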
STARK tries not to duplicate objects that span multiple partitions. If an object such as a polygon
extends into one or more other partitions, it is assigned to the partition where its centroid falls, and
then the object is virtually pruned. To compensate for this, STARK records the extent of each
partition and uses it in the query execution phase. As a result, additional memory and processing
time are required to store and recompute the extent of the partition with every newly added object.
Additionally, this process grows the size of the partition, which, depending on the objects assigned
to it, may grow to the point where the partition must be split.
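A minimal sketch of this bookkeeping is shown below. It assumes that objects are represented by their bounding boxes and that the center of the box approximates the centroid; the Partition class and helper names are hypothetical and do not correspond to STARK's API.

```python
from dataclasses import dataclass, field


def mbr_center(box):
    """Center of a (min_x, min_y, max_x, max_y) box, used in place of the true centroid."""
    return ((box[0] + box[2]) / 2, (box[1] + box[3]) / 2)


def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Partition:
    bounds: tuple                                  # fixed cell chosen by the partitioner
    extent: tuple = None                           # grows to cover every stored object
    objects: list = field(default_factory=list)

    def __post_init__(self):
        if self.extent is None:
            self.extent = self.bounds


def assign(partitions, obj_box):
    """Store the object once, in the partition whose cell contains its center."""
    cx, cy = mbr_center(obj_box)
    for part in partitions:
        min_x, min_y, max_x, max_y = part.bounds
        if min_x <= cx < max_x and min_y <= cy < max_y:
            part.objects.append(obj_box)
            part.extent = union(part.extent, obj_box)  # recomputed on every insert
            return part
    raise ValueError("center falls outside every partition")


def candidate_partitions(partitions, query_box):
    """Select partitions by extent, so objects reaching into a neighbor are not missed."""
    return [p for p in partitions if intersects(p.extent, query_box)]
```

Selecting partitions by extent rather than by cell bounds is what allows a single stored copy of an overlapping object to be found by queries that fall in a neighboring cell.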
After the data is partitioned according to the global index, STARK performs the query operations
on each partition. The user can choose to index the spatial data on each partition using an R-
Tree. STARK recommends running the query without in-memory indexing if the cost of building
and querying the index exceeds that of querying all items. If an R-Tree is used, it
is queried first and an initial set of candidate relationships between the objects is derived. Since these results are based
on the objects’ MBRs, they are refined further. STARK does this automatically on the local
partitions by computing the actual relationship between the objects (e.g., distance).
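The filter-and-refine pattern described above can be sketched as follows for a point-in-polygon query. The sketch assumes a linear scan over MBRs in place of an R-Tree and a ray-casting test for the exact check; the function names are hypothetical.

```python
def mbr(polygon):
    """Minimum bounding rectangle of a polygon given as a list of (x, y) vertices."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return (min(xs), min(ys), max(xs), max(ys))


def mbr_contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def polygon_contains(polygon, point):
    """Exact test by ray casting: count edge crossings to the right of the point."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def filter_and_refine(polygons, point):
    """Filter step on MBRs (cheap), then refine step on the real geometry (exact)."""
    candidates = [p for p in polygons if mbr_contains(mbr(p), point)]
    matches = [p for p in candidates if polygon_contains(p, point)]
    return candidates, matches
```

A point that lies inside a triangle's MBR but outside the triangle itself survives the filter step and is discarded only in the refine step, which is why results based on MBRs alone cannot be returned as final.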
[Figure: page excerpt from [58] showing the Simba architecture, query processing workflow, and data representation diagrams]
Figure 8: Simba Architecture (orange-shaded boxes) [58]
3.2.5 Simba
Simba [58] is a Spark framework for large spatial data analysis. Its aim is to provide a framework
with a simple programming interface, low latency, high throughput, and scalability. Unlike the
previously mentioned frameworks, Simba does not integrate directly with Spark’s RDDs; it is
built to work with Spark DataFrame [22] and Spark SQL [31]. Currently, Simba only supports
spatial operations over point and rectangular objects. Its architecture [58] (Figure 8) consists of
a number of components that provide native spatial operations: SQL Parser, Spatial Operations,
Query Optimizer, and Index Manager.
Simba’s SQL Parser layer allows users to run spatial queries using SQL-like statements by adding
support for spatial keywords and grammar to Spark SQL (e.g., Point, Polygon, Range, kNN, Join). A
similar process adds grammar support to the DataFrame API. The Index Manager layer provides the
necessary utilities for users to build global and local indexes such as R-Tree, HashMap, and TreeMap.
These indexes can be built and dropped at any time using the provided IndexRDD abstraction and can
be written to disk in order to speed up future operations. The Spatial Operations layer implements
a number of spatial operations over point and rectangular objects. The Query Optimizer layer
extends Spark SQL’s Catalyst optimizer in order to provide Cost-Based Optimization (CBO)
techniques for optimizing complex spatial queries.
Simba tasks start with a relation obtained either from an abstract syntax tree returned by the SQL parser or
from the DataFrame API. Attributes of the relation that have not been matched with a type or an input table
are resolved using Catalyst rules and a Catalog object, which tracks tables in all data sources.
Afterward, the logical optimizer produces an optimized logical plan by applying standard rule-based
optimizations, both non-spatial rules (constant folding, predicate pushdown) and spatial rules
(distance pruning). The optimized logical plan is then turned into one or more physical plans
based on criteria like spatial operation support and physical operators inherited from Spark SQL. In
the case of multiple physical plans, CBO is applied, taking into consideration the choice of indexes
and random dataset statistics collected from Spark’s CacheManager and Simba’s Index Manager.
The optimal plan is then selected; however, since this step relies on random data samples, it is
possible that the execution plan could change when the same task is executed again. Finally, the
selected physical plan is transformed into an RDD object, which is treated as a table with the objects
in the RDD as its rows. The RDD can be written to HDFS and reused to skip this process
in subsequent runs.
Simba uses a two-level indexing approach, global and local. Datasets are treated as tables with
records represented as Row objects; a table is then essentially an RDD of type Row. To index a
table, Row objects within an RDD are packed into an array, which also makes sampling quick.
This increases the memory requirements and introduces an overhead, but the Simba
authors state that their experiments show the overhead to be negligible. Initially, Simba partitions tables
such that nearby objects are assigned to the same partition while balancing the load across all
partitions. Afterward, each partition builds a local index (e.g., an R-Tree), loads all rows into an array,
collects statistics, calculates the partition’s MBR, and computes the number of records. Finally,
the global index is built by having each partition report its statistics back to the master node, which
builds the index (R-Tree or grid). The global index is kept in the memory of the master node
and is used to prune irrelevant partitions for an input query. As an added feature, the global index
can be written to disk and loaded directly for future tasks.
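The interplay of the two levels can be sketched for a range query as follows. The sketch is not Simba's code: it assumes point data, a plain list of per-partition summaries as the global index, and a linear scan standing in for the local index. All names are hypothetical.

```python
def compute_mbr(points):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (min(xs), min(ys), max(xs), max(ys))


def boxes_intersect(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def box_contains(box, point):
    return box[0] <= point[0] <= box[2] and box[1] <= point[1] <= box[3]


def build_global_index(partitions):
    """One (partition_id, MBR, record_count) summary per non-empty partition."""
    return [(pid, compute_mbr(pts), len(pts))
            for pid, pts in partitions.items() if pts]


def range_query(partitions, global_index, query_box):
    """Prune partitions with the global index, then search only the survivors."""
    visited, hits = [], []
    for pid, box, _count in global_index:
        if not boxes_intersect(box, query_box):
            continue  # irrelevant partition, never touched
        visited.append(pid)
        # A local index (e.g., an R-Tree) would replace this scan.
        hits.extend(p for p in partitions[pid] if box_contains(query_box, p))
    return visited, hits
```

Because the global index holds only one small summary per partition, it fits in the memory of a single node, while the per-partition work is limited to the partitions whose MBRs intersect the query.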
Spatial query execution in Simba is type dependent and utilizes the global and local indexes.
kNN queries utilize the global index to prune irrelevant partitions and the local index to improve
performance. A circle is drawn around the query point, and the global index is used to select the
partitions whose MBRs cover at least the required k objects. Candidates are selected
on each partition after calculating the actual distance; results from all the partitions are then
combined on the master node and the top k are returned. For distance join queries, Simba
uses the global index to get an initial approximation of how to join the two datasets. The result of
this step is a set of candidate pairs (i, j) that may contribute to the solution, with each pair assigned
a partition ID. Then, pairs with the same partition ID are sent to the same processing node, where the
precise distance between the points is calculated. kNN joins are implemented using several different
approaches. The baseline method is the simplest and the least efficient, as it uses the block nested
loop kNN join in Spark. The Voronoi kNN Join and z-Value kNN Join methods are faster than the
baseline method but produce approximate results. The R-Tree kNN Join method provides faster
and better results. It partitions a dataset into n partitions using a Sort-Tile-Recursive algorithm
for load balancing and preserving locality. Then a distance bound is calculated for each partition in
order to derive a subset of the results. The bound is calculated by finding the furthest point from
the center of each partition’s MBR; the results are then sent back to the master node, which
uses an R-Tree to find a subset of the results on each partition. Finally, Spark’s zipPartitions is
invoked, a new R-Tree is built, a local kNN is executed, and the union of the results produces the
query’s output.
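The Sort-Tile-Recursive partitioning step and the per-partition distance bound can be sketched as follows. This is a simplified two-dimensional illustration under our own assumptions (point data, in-memory lists); it does not reproduce Simba's implementation, and the function names are hypothetical.

```python
import math


def str_partition(points, num_partitions):
    """Tile 2-D points into roughly num_partitions groups of similar size."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if not points:
        return []
    slices = math.ceil(math.sqrt(num_partitions))
    tiles_per_slice = math.ceil(num_partitions / slices)
    by_x = sorted(points)  # sort by x, then cut into vertical slices
    slice_size = math.ceil(len(by_x) / slices)
    partitions = []
    for i in range(0, len(by_x), slice_size):
        # Within each slice, sort by y and cut into tiles.
        vertical = sorted(by_x[i:i + slice_size], key=lambda p: p[1])
        tile_size = math.ceil(len(vertical) / tiles_per_slice)
        for j in range(0, len(vertical), tile_size):
            partitions.append(vertical[j:j + tile_size])
    return partitions


def partition_radius(partition):
    """Distance from the center of the partition's MBR to its furthest point."""
    xs = [x for x, _ in partition]
    ys = [y for _, y in partition]
    cx = (min(xs) + max(xs)) / 2
    cy = (min(ys) + max(ys)) / 2
    return max(math.hypot(x - cx, y - cy) for x, y in partition)
```

Sorting on one axis and then tiling on the other keeps nearby points together while giving every tile a similar number of points, which is the combination of locality and load balance that the R-Tree kNN join relies on.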
3.2.6 Features and Performance Summary
Much like the Hadoop-based frameworks, Spark-based frameworks aim at simplifying and speeding
up the processing of spatial data. Unlike the Hadoop frameworks, Spark frameworks rely on
in-memory data processing (RDDs) first and HDFS second. The techniques these frameworks use, the language
features they offer, and the operations they support are all directly affected by the underlying Spark
system. Table 2 shows a high-level summary of the frameworks’ features.
The authors of each of these frameworks discuss why its techniques are better than those of the ones that
came before it. These discussions are backed by experiments that use one or more large
spatial datasets. In [44], the STARK framework is compared to GeoSpark and SpatialSpark using
a dataset containing 50 million polygons. The experiment put the frameworks under different tests
to examine the different indexing modes. Results showed that STARK performs better when used
with live indexing. SpatialSpark was reported to be limited in its functionality since it only supports
a limited number of operations without an index (contains, within distance). GeoSpark was also
problematic in the sense that it was not able to process the entire dataset. This was attributed to
the excessive caching of data that its algorithm performs.
In [58], a number of experiments were done to compare Simba, GeoSpark, and SpatialSpark. The
experiments used three datasets of varying sizes to compare the time and memory costs of
building the indexes (local, global), throughput, and latency. For the cost of building the indexes,
the experiment used a dataset of 1 billion records. The results showed that GeoSpark’s indexing is
slightly faster than all others, but only because it relied on a sample of the dataset and skipped
the global index. Simba was close behind, followed by SpatialSpark. The experiment also tested
Simba’s cost of multidimensional indexing and found that the time increases linearly as the number
of dimensions increases from 2 to 6.
In the next experiment, the RAM requirements of the frameworks’ indexes were measured for
varying data sizes. The experiment showed that most of the memory is consumed by the local
indexes across the different processing nodes. SpatialSpark’s global and local indexes required slightly
less memory than Simba’s. GeoSpark’s local index consumed the most memory out of all frameworks
tested.
The throughput and latency experiment used 500 million records to test range and kNN queries
on a number of frameworks. Simba finished its operations in far less time than the others. Simba’s
throughput was the highest, followed by SpatialSpark and then GeoSpark. Latency results were
similar, with Simba requiring less time than SpatialSpark and GeoSpark. The experiment also
tested Simba on multidimensional data and found that throughput decreases and latency
increases as the number of dimensions increases from 2 to 6 for both query types.
| Feature | SpatialSpark | GeoSpark | LocationSpark | STARK | Simba |
|---|---|---|---|---|---|
| Release Date | 2015 | 2015 | 2016 | 2017 | 2016 |
| Last Update | 2017 | 2018 | 2017 | 2018 | 2018 |
| Integration Approach | On Top of Spark | On Top of Spark | Into Spark | Into Spark | Into Spark |
| Language Integration | - | - | Scala (SpatialRDD) | Piglet and Scala via RDD integration | DataFrame and Spark SQL |
| OGC Compliant | No | No | No | No | No |
| Geometry Library | JTS | JTSPlus | JTS | JTS | Built-in |
| Global Indexing | Grid, K-D Tree | Grid | Grid, Quad-Tree | Grid, Cost-Based Binary Space, R-Tree | Grid, R-Tree, KD-tree |
| Local Indexing Options | None, R-Tree | None, R-Tree, Quad-Tree | None, Grid, R-tree, Quad-Tree, IR-tree | None, R-Tree | HashMaps, TreeMaps, R-Tree |
| Index Persistence | Yes | No | Yes | Yes | Yes |
| Data Pruning | No | No | No | Yes | Yes |
| Skew Handling Level | - | Partition | Query | Partition | Partition |
| Mixed object Query | No | No | No | Yes | Yes |
| Base Code | Scala | Java | Scala | Scala | Scala |
| Allow for new operations | No | Yes (SRDD and query processing layers) | Yes (Spatial Objects, Query Scheduler, Query Executor) | Yes (SpatialRDDFunctions and Piglet) | Yes (Requires source code modification) |
| Installation | None | None | None | None | None |
| Configuration | In code | In code | In code | In code | In code |
| Required Expertise Level | Moderate Spark | Regular Spark | Regular Spark | Regular Spark | Regular Spark DataFrame |
| Spatial Objects | MBR | Circle, LineString, Point, Polygon, and Rectangle | Box and Point | Inherited from JTS | Circle, MBR, Point, Polygon |
| Spatial Operations | Range, Join, kNN | Range, Join, kNN | Range, kNN, Join, kNN-join | Filtering, Join, kNN, Clustering | Range, Join, kNN |

Carry non-spatial data: No, Yes, Yes, Yes (the source lists only four values for this row, so they are not assigned to columns)

Table 2: Feature comparison of Spark-based frameworks
4 Future Work
The field of spatial data analysis is rapidly changing. The various frameworks discussed here aim
at simplifying the analysis process, with each framework claiming that its approach is better for
spatial data processing. However, it is unclear how these frameworks compare to one another under
similar tests. It would be interesting to test all of the frameworks under the same conditions
on the same cluster.
The various experiments reported in the test sections of the frameworks use different dataset sizes
and environment configurations. For future work, we would like to apply the same dataset to all of
the frameworks and observe their usability, runtimes, and behaviors. While they all claim to
support truly large datasets, an exact definition of large is not given. For instance, in [60] the experiments
use 10 workstation nodes with 15 gigabytes of memory and a number of datasets, with the
largest being 23.8 gigabytes. In [44], experiments are performed using 16 nodes with 16 gigabytes
of memory and a dataset containing 34 million entries.
In addition, we would like to put these frameworks under different scalability tests. First, the
number of nodes is fixed while the size of the input datasets varies. Second, the size of the
datasets is fixed while the number of nodes is linearly increased. Such tests would give an indication
of the frameworks’ scalability. Tests can also focus on the frameworks’ performance when using
different indexes, similar to those reported in [58]. If the framework offers index caching, a number
of tests can gauge the performance variations when the framework skips the indexing step.
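For the second test (fixed dataset, increasing node count), the measured runtimes could be summarized with the usual speedup and parallel efficiency figures. The helper below is a minimal sketch; its name and the runtimes in the usage note are hypothetical and are not measurements of any framework.

```python
def strong_scaling(runtimes_by_nodes):
    """Map node count -> (speedup, efficiency) relative to the smallest node count.

    runtimes_by_nodes maps a node count to the runtime (in seconds) measured
    for the same dataset on that many nodes.
    """
    base_nodes = min(runtimes_by_nodes)
    base_time = runtimes_by_nodes[base_nodes]
    report = {}
    for nodes in sorted(runtimes_by_nodes):
        speedup = base_time / runtimes_by_nodes[nodes]
        efficiency = speedup / (nodes / base_nodes)
        report[nodes] = (speedup, efficiency)
    return report
```

With made-up runtimes of 100, 50, and 40 seconds on 1, 2, and 4 nodes, the efficiency stays at 1.0 for 2 nodes and drops to 0.625 for 4 nodes, which is the kind of falloff such a test is meant to expose.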
Some frameworks offer batch and/or live processing, which would be worth investigating. Since
the two approaches differ, a framework that is better at live queries may not perform as well
for batch queries. In addition, the usability factor is important, as it indicates how easy it is to
launch and perform multiple queries.
Finally, none of the experiments that we have studied showed how accurate their results are. The
framework’s performance is only as reliable as its results. As an accuracy measure, we would like to
compare each of the frameworks’ results to those obtained via a traditional naive approach. Such
a result can be obtained by running non-optimized queries that simply pair objects
together for 100% accuracy. Moreover, the examination should look at how much of the input data
makes it to the output. For example, does the framework drop objects that are unmatched, and
does it maintain the object’s boundaries rather than just computing and working with its MBR?
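Such an accuracy check could be sketched as follows for a distance join over points. The naive join tests every pair and serves as ground truth, and a framework's reported pairs are scored against it. The function names are hypothetical, and the sketch assumes results are reported as index pairs.

```python
import math
from itertools import product


def naive_distance_join(left, right, max_dist):
    """Ground truth: test every pair, with no index and no MBR approximation."""
    return {(i, j)
            for (i, p), (j, q) in product(enumerate(left), enumerate(right))
            if math.dist(p, q) <= max_dist}


def precision_recall(reported, truth):
    """Share of reported pairs that are correct, and share of true pairs found."""
    reported, truth = set(reported), set(truth)
    true_positives = len(reported & truth)
    precision = true_positives / len(reported) if reported else 1.0
    recall = true_positives / len(truth) if truth else 1.0
    return precision, recall
```

A recall below 1.0 would indicate dropped matches, while a precision below 1.0 would indicate pairs that were accepted on the basis of an approximation such as the MBR.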
5 Summary
In this paper, we surveyed a number of frameworks that make Apache Hadoop and Apache Spark
spatially aware. Spatial data analysis is a field that has recently picked up new momentum due
to the recent explosion in the amount of spatial data being recorded. The release of new parallel
execution frameworks has remotivated researchers to produce spatial analysis systems that are fast,
accurate, reliable, and scalable. This task is not trivial due to a number of challenges, such as the need
to recognize the many types of spatial objects, support for a large number of operations, the different
shapes a dataset can take, indexing, multidimensional objects, scalability, and reliability.
Apache Hadoop and Spark are two of the most popular big data processing frameworks; their
underlying structure is open source, easy to use, and scalable, and it shields the user from much of the
complexity of parallel programming. However, they are generic data-processing frameworks
and are not well suited to fast spatial data processing. To that end, specialized frameworks have
been developed to make them spatially aware. Each of these frameworks offers its own features
that vary in usability, objects, operations, indexing, and more (Tables 1 and 2).
Bibliography
[1] Apache hive tm. https://hive.apache.org/.
[2] Apache pig! https://pig.apache.org/.
[3] Apache spark - lightning-fast cluster computing. https://spark.apache.org/.
[4] Apache hadoop! https://hadoop.apache.org/.
[5] Arcgis. https://www.arcgis.com/features/index.html.
[6] Compliance testing - ogc. http://www.opengeospatial.org/compliance.
[7] Designs, lessons and advice from building large distributed systems. http://www.cs.cornell.edu/projects/ladis2009/talks/dean-keynote-ladis2009.pdf. Slide 24.
[8] Geos. https://trac.osgeo.org/geos.
[9] Gis tools for hadoop by esri. http://esri.github.io/gis-tools-for-hadoop/. (Accessed on 04/08/2018).
[10] Github - bunnyg/hadoop-gis: Hadoop-gis. https://github.com/bunnyg/Hadoop-GIS.
[11] Hadoop-gis - [data management and biomedical data analytics lab]. http://bmidb.cs.stonybrook.edu/hadoopgis/index.
[12] Impala. https://impala.apache.org/.
[13] Installing and configuring spatialhadoop. http://spatialhadoop.cs.umn.edu/.
[14] libspatialindex libspatialindex 1.8.0 documentation. https://libspatialindex.github.io/.
[15] Locationtech jts topology suite - projects.eclipse.org. https://projects.eclipse.org/projects/locationtech.jts.
[16] Nosql databases. http://nosql-database.org/.
[17] Oracle spatial and graph. http://www.oracle.com/technetwork/database/options/spatialandgraph/overview/index.html.
[18] Postgis spatial and geographic objects for postgresql. https://postgis.net/.
[19] Rdd programming guide - spark 2.3.0 documentation. https://spark.apache.org/docs/latest/rdd-programming-guide.html.
[20] The scala programming language. https://www.scala-lang.org/.
[21] Spark sql & dataframes - apache spark. https://spark.apache.org/sql/.
[22] Spark sql and dataframes - spark 2.3.0 documentation. https://spark.apache.org/docs/latest/sql-programming-guide.html. (Accessed on 04/20/2018).
[23] Spatialspark: Big spatial data process using spark. http://simin.me/projects/spatialspark/.
[24] Sql server 2017 on windows and linux - microsoft. https://www.microsoft.com/en-us/sql-server/sql-server-2017.
[25] Twitter. it’s what’s happening. https://twitter.com/.
[26] Scaling the facebook data warehouse to 300 pb. https://code.facebook.com/posts/229861827208629/scaling-the-facebook-data-warehouse-to-300-pb/, Mar 2014.
[27] 98 personal data points that facebook uses to target ads to you - the washington post. https://www.washingtonpost.com/news/the-intersect/wp/2016/08/19/98-personal-data-points-that-facebook-uses-to-target-ads-to-you/, Aug 2016.
[28] Data has transformed, and is transforming, everything. http://www.telegraph.co.uk/education/stem-awards/power-systems/data-is-transforming-everything/, Jun 2017.
[29] How much data does google handle?? https://www.heshmore.com/how-much-data-does-google-handle/, Jun 2017.
[30] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz. Hadoop gis: a high performance spatial data warehousing system over mapreduce. Proceedings of the VLDB Endowment, 6(11):1009–1020, 2013.
[31] Michael Armbrust, Reynold S Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K Bradley, Xiangrui Meng, Tomer Kaftan, Michael J Franklin, Ali Ghodsi, et al. Spark sql: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data, pages 1383–1394. ACM, 2015.
[32] Girija V Attigeri, Manohara Pai MM, Radhika M Pai, and Aparna Nayak. Stock market prediction: A big data approach. In TENCON 2015-2015 IEEE Region 10 Conference, pages 1–5. IEEE, 2015.
[33] Norbert Beckmann, Hans-Peter Kriegel, Ralf Schneider, and Bernhard Seeger. The r*-tree: an efficient and robust access method for points and rectangles. In ACM SIGMOD Record, volume 19, pages 322–331. ACM, 1990.
[34] Jon Louis Bentley. Multidimensional binary search trees used for associative searching. Communications of the ACM, 18(9):509–517, 1975.
[35] Johan Bollen, Huina Mao, and Xiaojun Zeng. Twitter mood predicts the stock market. Journal of computational science, 2(1):1–8, 2011.
[36] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.
[37] Ahmed Eldawy. Spatialhadoop: towards flexible and scalable spatial processing using mapreduce. In Proceedings of the 2014 SIGMOD PhD symposium, pages 46–50. ACM, 2014.
[38] Ahmed Eldawy and Mohamed F Mokbel. The era of big spatial data: a survey. Information and Media Technologies, 10(2):305–316, 2015.
[39] Raphael A. Finkel and Jon Louis Bentley. Quad trees a data structure for retrieval on composite keys. Acta informatica, 4(1):1–9, 1974.
[40] Felix Gessert, Wolfram Wingerath, Steffen Friedrich, and Norbert Ritter. Nosql database systems: a survey and decision guidance. Computer Science-Research and Development, 32(3-4):353–365, 2017.
[41] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles, SOSP ’03, pages 29–43, New York, NY, USA, 2003. ACM.
[42] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.
[43] Christoforos Hadjigeorgiou et al. Rdbms vs nosql: Performance and scaling comparison. MSc in High, 2013.
[44] Stefan Hagedorn, Philipp Götze, and Kai-Uwe Sattler. The stark framework for spatio-temporal data analytics on spark. Datenbanksysteme für Business, Technologie und Web (BTW 2017), 2017.
[45] Long Jiang, Mo Yu, Ming Zhou, Xiaohua Liu, and Tiejun Zhao. Target-dependent twitter sentiment classification. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies-Volume 1, pages 151–160. Association for Computational Linguistics, 2011.
[46] Efthymios Kouloumpis, Theresa Wilson, and Johanna D Moore. Twitter sentiment analysis: The good the bad and the omg! Icwsm, 11(538-541):164, 2011.
[47] Sara Landset, Taghi M Khoshgoftaar, Aaron N Richter, and Tawfiq Hasanin. A survey of open source tools for machine learning with big data in the hadoop ecosystem. Journal of Big Data, 2(1):24, 2015.
[48] Jyoti Nandimath, Ekata Banerjee, Ankur Patil, Pratima Kakade, Saumitra Vaidya, and Divyansh Chaturvedi. Big data analysis using apache hadoop. In Information Reuse and Integration (IRI), 2013 IEEE 14th International Conference on, pages 700–703. IEEE, 2013.
[49] Martin Odersky, Philippe Altherr, Vincent Cremet, Burak Emir, Stéphane Micheloud, Nikolay Mihaylov, Michel Schinz, Erik Stenman, and Matthias Zenger. The scala language specification, 2004.
[50] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. Pig latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD international conference on Management of data, pages 1099–1110. ACM, 2008.
[51] Owen O’Malley. Terabyte sort on apache hadoop. Yahoo, available online at: http://sortbenchmark.org/Yahoo-Hadoop.pdf, (May), pages 1–3, 2008.
[52] Philippe Rigaux, Michel Scholl, and Agnes Voisard. Spatial databases: with application to GIS. Elsevier, 2001.
[53] Konstantin Shvachko, Hairong Kuang, Sanjay Radia, and Robert Chansler. The hadoop distributed file system. In Mass storage systems and technologies (MSST), 2010 IEEE 26th symposium on, pages 1–10. IEEE, 2010.
[54] Mingjie Tang, Yongyang Yu, Qutaibah M Malluhi, Mourad Ouzzani, and Walid G Aref. Locationspark: A distributed in-memory data management system for big spatial data. Proceedings of the VLDB Endowment, 9(13):1565–1568, 2016.
[55] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. Hive: a warehousing solution over a map-reduce framework. Proceedings of the VLDB Endowment, 2(2):1626–1629, 2009.
[56] Andranik Tumasjan, Timm Oliver Sprenger, Philipp G Sandner, and Isabell M Welpe. Predicting elections with twitter: What 140 characters reveal about political sentiment. Icwsm, 10(1):178–185, 2010.
[57] Tom White. Hadoop: The definitive guide. O’Reilly Media, Inc., 2012.
[58] Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data, pages 1071–1085. ACM, 2016.
[59] Simin You, Jianting Zhang, and Le Gruenwald. Large-scale spatial join query processing in cloud. In Data Engineering Workshops (ICDEW), 2015 31st IEEE International Conference on, pages 34–41. IEEE, 2015.
[60] Simin You, Jianting Zhang, and Le Gruenwald. Spatial join query processing in cloud: Analyzing design choices and performance comparisons. In Parallel Processing Workshops (ICPPW), 2015 44th International Conference on, pages 90–97. IEEE, 2015.
[61] Jia Yu, Jinxuan Wu, and Mohamed Sarwat. Geospark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems, page 70. ACM, 2015.
[62] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 9th USENIX conference on Networked Systems Design and Implementation, pages 2–2. USENIX Association, 2012.
[63] Matei Zaharia, Mosharaf Chowdhury, Michael J Franklin, Scott Shenker, and Ion Stoica. Spark: Cluster computing with working sets. HotCloud, 10(10-10):95, 2010.