MANAGING MASSIVE AMOUNTS OF SPATIO-TEMPORAL DATA USING
Anita Graser
Center for Mobility Systems, AIT Austrian Institute of Technology
ABOUT
Anita Graser
Scientist @ AIT Austrian Institute of Technology
− QGIS user since 2008
− MSc in Geomatics 2010
− QGIS Project Steering Committee since 2013
− OSGeo Director 2015-17
− Moderator on GIS.StackExchange.com
− Author of “Learning QGIS” (1st ed. 2013), “QGIS Map Design” (2016) & “QGIS 2 Cookbook” (2016)
@underdarkGIS
Austria’s largest non-university research institute
− Energy
− Health & Bioresources
− Digital Safety & Security
− Vision, Automation & Control
− Mobility Systems
− Low-Emission Transport
− Technology Experience
− Innovation Systems & Policy
AIT
EMPLOYEES
1,300
Application areas
− Road traffic → FCD, e.g. Waze, TomTom, Uber
− Air traffic → ADS-B, e.g. Flightradar
− Marine traffic → AIS, e.g. MarineTraffic
− Human movement → CDR, e.g. mobile network providers
→ Data-driven decision making
→ Technologically challenging
CONTEXT & MOTIVATION
4
11/07/2018
SPATIAL DATA
5
Source: Constantin Stanca, “High Performance and Scalable Geospatial Analytics on Cloud with Open Source”
SPATIAL RELATIONSHIPS
6
Source: Constantin Stanca, “High Performance and Scalable Geospatial Analytics on Cloud with Open Source”
SPATIAL FUNCTIONS
7
Source: Constantin Stanca, “High Performance and Scalable Geospatial Analytics on Cloud with Open Source”
“Big geo data is anything which is crash ArcGIS”
“Small data is when is fit in RAM. Big is when is crash because is no fit in RAM”
WHAT’S “MASSIVE” SPATIO-TEMPORAL DATA?
TRADITIONAL TOOLS
9
Scaling PostgreSQL and PostGIS: http://s3.cleverelephant.ca/2017-cdb-postgis.pdf
LOOKING FOR SCALABLE SOLUTIONS
10
− ESRI GIS Tools for Hadoop: https://github.com/Esri/gis-tools-for-hadoop
− LocationSpark: https://github.com/merlintang/SpatialSpark
− STARK - Spatio-Temporal Data Analytics on Spark: https://github.com/dbis-ilm/stark
− SpatialSpark: https://github.com/syoummer/SpatialSpark
− GeoSpark: https://github.com/DataSystemsLab/GeoSpark
− PySpark & Geopandas: https://github.com/sabman/PySparkGeoAnalysis
OPEN SOURCE & MATURE
11
https://projects.eclipse.org/wg/locationtech/projects
WHAT IS GEOMESA?
12
Source: Constantin Stanca, “High Performance and Scalable Geospatial Analytics on Cloud with Open Source”
Features
✓ Store gigabytes to petabytes of spatial data (tens of billions of points or more)
✓ Serve up tens of millions of points in seconds
✓ Ingest data faster than 10,000 records per second per node
✓ Scale horizontally easily (add more servers to add more capacity)
✓ Support Spark analytics
✓ Drive a map through GeoServer or other OGC Clients
GEOMESA
17
http://www.geomesa.org/documentation/user/introduction.html#what-is-geomesa
… making 2D/3D data sortable
→ Space-filling curves
SPATIO-TEMPORAL INDICES
18
http://doi.ieeecomputersociety.org/10.1109/TVCG.2014.2298017
GEOMESA Z-CURVE
19
http://www.geomesa.org/documentation/tutorials/geohash-substrings.html
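To make the idea of "making 2D data sortable" concrete, here is a minimal sketch of a Z-order (Morton) key in plain Scala. It is an illustration only and not GeoMesa's implementation: the object name `ZCurveSketch`, the 16-bit resolution per dimension, and the sample cities are assumptions chosen for this example. GeoMesa's own indices are more involved, and the Z3 index additionally takes time into account.

```scala
object ZCurveSketch {
  // Spread the lower 16 bits of v so that they occupy the even bit positions.
  def spread(v: Int): Long = {
    var x = v.toLong & 0xFFFFL
    x = (x | (x << 8)) & 0x00FF00FFL
    x = (x | (x << 4)) & 0x0F0F0F0FL
    x = (x | (x << 2)) & 0x33333333L
    x = (x | (x << 1)) & 0x55555555L
    x
  }

  // Map a coordinate in [min, max] to an integer cell index in [0, 2^16 - 1].
  def normalize(value: Double, min: Double, max: Double): Int = {
    val cells = 1 << 16
    val idx = ((value - min) / (max - min) * cells).toInt
    math.min(math.max(idx, 0), cells - 1)
  }

  // Interleave longitude bits (even positions) with latitude bits (odd positions).
  def z2(lon: Double, lat: Double): Long =
    spread(normalize(lon, -180.0, 180.0)) | (spread(normalize(lat, -90.0, 90.0)) << 1)

  def main(args: Array[String]): Unit = {
    // Sample points (assumed values, for illustration only).
    val points = Seq(("Vienna", 16.37, 48.21), ("Graz", 15.44, 47.07), ("Melbourne", 144.96, -37.81))
    // Sorting by the one-dimensional key yields an order that roughly preserves spatial proximity.
    points.map { case (name, lon, lat) => (z2(lon, lat), name) }
      .sortBy(_._1)
      .foreach { case (z, name) => println(f"$z%10d $name") }
  }
}
```

Because the key is a single number, it can serve as a sortable row key in a key-value store. A bounding-box query then becomes a set of key-range scans, which is the general motivation behind space-filling-curve indices.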
geomesa describe-schema -c geomesa.gdelt -f gdelt -u user -p password
INFO Describing attributes of feature 'gdelt'
globalEventId | String
eventCode | String
...
dtg | Date (Spatio-temporally indexed)
geom | Point (Spatially indexed)
User data:
geomesa.index.dtg | dtg
geomesa.indices | z3:4:3,z2:3:3,records:2:3
geomesa.table.sharing | false
GEOMESA COMMAND LINE
20
geomesa export -c geomesa.gdelt -f gdelt -u root -p GisPwd \
  -q "globalEventId='671867776'"
Using GEOMESA_ACCUMULO_HOME = /opt/geomesa
id,globalEventId:String,...,dtg:Date,*geom:Point:srid=4326
d9e...,671867776,...,2007-07-13T00:00:00.000Z,POINT (-97 38)
GEOMESA COMMAND LINE
21
geomesa export -c geomesa.gdelt -f gdelt -u root -p GisPwd \
  -q "CONTAINS(POLYGON ((0 0, 0 90, 90 90, 90 0, 0 0)),geom)" -m 3
Using GEOMESA_ACCUMULO_HOME = /opt/geomesa
id,globalEventId:String,...,dtg:Date,*geom:Point:srid=4326
139...,671713129,...,2017-07-10T00:00:00.000Z,POINT (5.43827 5.35886)
9e8...,671928676,...,2017-07-10T00:00:00.000Z,POINT (5.43827 5.35886)
d6c...,671817380,...,2017-07-09T00:00:00.000Z,POINT (5.43827 5.35886)
More complex queries & analyses → Spark(SQL)!
GEOMESA COMMAND LINE
22
GEOMESA
23
Source: Constantin Stanca, “High Performance and Scalable Geospatial Analytics on Cloud with Open Source”
Option #1: DataFrame API
import org.locationtech.geomesa.spark.jts._
import spark.implicits._
gdeltDf.where(st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), $"geom"))
Option #2: SparkSQL (with UDFs)
SELECT * FROM gdelt
WHERE st_contains(st_makeBBOX(0.0, 0.0, 90.0, 90.0), geom)
GEOMESA
24
Save dataframe to GeoMesa table
val df = spark.sql(sqlQuery)
val dsParams = Map(
  "accumulo.instance.id" -> "...",
  "accumulo.zookeepers" -> "...",
  "accumulo.user" -> "...",
  "accumulo.password" -> "...",
  "accumulo.catalog" -> "tablename")
df.write.format("geomesa").options(dsParams)
.option("geomesa.feature", "featurename").save()
GEOMESA
25
Example: Trajectory from points sorted by time
import java.sql.Timestamp
// JTS package name depends on the JTS/GeoMesa version (older releases: com.vividsolutions.jts.geom)
import org.locationtech.jts.geom.{Coordinate, GeometryFactory}
val geomFactory = new GeometryFactory()
val someDF = Seq(
  (1, Timestamp.valueOf("2018-01-01 12:00:00"), 2.5, geomFactory.createPoint(new Coordinate(0, 0))),
  (1, Timestamp.valueOf("2018-01-01 12:05:00"), 3.5, geomFactory.createPoint(new Coordinate(1, 1))),
  (2, Timestamp.valueOf("2018-01-01 12:00:00"), 5.5, geomFactory.createPoint(new Coordinate(10, 10))),
  (2, Timestamp.valueOf("2018-01-01 12:05:00"), 5.5, geomFactory.createPoint(new Coordinate(11, 11)))
).toDF("id", "t", "sog", "pt")
+--+-------------------+---+-------------+
|id|t                  |sog|pt           |
+--+-------------------+---+-------------+
|1 |2018-01-01 12:00:00|2.5|POINT (0 0)  |
|1 |2018-01-01 12:05:00|3.5|POINT (1 1)  |
|2 |2018-01-01 12:00:00|5.5|POINT (10 10)|
|2 |2018-01-01 12:05:00|5.5|POINT (11 11)|
+--+-------------------+---+-------------+
GEOMESA
26
Example: Trajectory from points sorted by time
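import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{collect_list, max}
someDF
  // running list of points per id, ordered by time
  .withColumn("collected", collect_list($"pt").over(Window.partitionBy("id").orderBy("t")))
  .groupBy("id")
  // keep the complete list for each id
  .agg(max($"collected").as("collected"))
  .withColumn("line", st_makeLine($"collected"))
  .show(false)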
+--+------------------------------+-------------------------+
|id|collected                     |line                     |
+--+------------------------------+-------------------------+
|1 |[POINT (0 0), POINT (1 1)]    |LINESTRING (0 0, 1 1)    |
|2 |[POINT (10 10), POINT (11 11)]|LINESTRING (10 10, 11 11)|
+--+------------------------------+-------------------------+
GEOMESA
27
Example: Trajectory from points sorted by time
// assumes "temp" is a temporary view of someDF
someDF.createOrReplaceTempView("temp")
spark.sql("""WITH windowed AS (
    SELECT id, collect_list(first(pt)) OVER (PARTITION BY id ORDER BY t) line
    FROM temp
    GROUP BY id, t)
  SELECT id, max(line), st_makeline(max(line))
  FROM windowed
  GROUP BY id""").show(false)
+--+------------------------------+--------------------------+
|id|max(line)                     |UDF:st_makeLine(max(line))|
+--+------------------------------+--------------------------+
|1 |[POINT (0 0), POINT (1 1)]    |LINESTRING (0 0, 1 1)     |
|2 |[POINT (10 10), POINT (11 11)]|LINESTRING (10 10, 11 11) |
+--+------------------------------+--------------------------+
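The logic behind both trajectory variants (group by id, order by time, connect the points) can also be sketched without Spark. The following plain-Scala version is only an illustration under stated assumptions: the `Fix` case class, the `toLines` helper, and the use of string timestamps are hypothetical and not part of GeoMesa or the original slides.

```scala
object TrajectorySketch {
  // Hypothetical record type mirroring the columns id, t, sog and pt (as x/y).
  final case class Fix(id: Int, t: String, sog: Double, x: Double, y: Double)

  // Format a coordinate without a trailing ".0" so the output resembles the WKT above.
  private def fmt(v: Double): String =
    if (v == v.toLong) v.toLong.toString else v.toString

  // Group fixes by id, order each group by time, and emit one WKT LINESTRING per id.
  // Timestamps in "yyyy-MM-dd HH:mm:ss" form sort correctly as plain strings.
  def toLines(fixes: Seq[Fix]): Map[Int, String] =
    fixes.groupBy(_.id).map { case (id, group) =>
      val coords = group.sortBy(_.t).map(f => s"${fmt(f.x)} ${fmt(f.y)}")
      id -> coords.mkString("LINESTRING (", ", ", ")")
    }

  def main(args: Array[String]): Unit = {
    val fixes = Seq(
      Fix(1, "2018-01-01 12:05:00", 3.5, 1, 1), // deliberately out of order
      Fix(1, "2018-01-01 12:00:00", 2.5, 0, 0),
      Fix(2, "2018-01-01 12:00:00", 5.5, 10, 10),
      Fix(2, "2018-01-01 12:05:00", 5.5, 11, 11)
    )
    toLines(fixes).toSeq.sortBy(_._1).foreach { case (id, line) => println(s"$id $line") }
  }
}
```

The Spark versions above do the same thing, with the sort handled by the window's ORDER BY and the line built by st_makeLine, so the work is distributed across the cluster.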
GEOMESA
28
http://www.geomesa.org/documentation/user/spark/sparksql_functions.html
Geometry Constructors
• st_geometryFromText
• st_makeBBOX
• st_makeLine
• st_makePoint
• st_makePolygon
• …
Geometry Accessors
• st_geometryN
• st_isValid
• st_pointN
• st_x
• …
Geometry Outputs
• st_asGeoJSON
• st_asText
• …
Spatial Relationships
• st_area
• st_centroid
• st_closestPoint
• st_contains
• st_covers
• st_crosses
• st_disjoint
• st_distance
• st_distanceSphere
• st_distanceSpheroid
• st_equals
• st_intersects
• st_length
• st_lengthSphere
• st_lengthSpheroid
• st_overlaps
• st_relate
• st_touches
• st_within
Geometry Processing
• st_bufferPoint
• st_convexHull
• …
GEOMESA-SPARK-SQL MODULE
29
BIG SPATIAL TECHNOLOGY STACK
30
ACCESSING GEOMESA IN GEOSERVER
31
GEOSERVER PREVIEW
32
CONSUMING WFS IN QGIS
33
EXAMPLE
TRAFFIC COUNTS
EXAMPLE
TRAVEL TIME
Based on similar trajectory search
EXAMPLE
TRAJECTORY PREDICTION
5 MIN 10 MIN 15 MIN
Graser, A., Schmidt, J., Widhalm, P. (2018) Predicting trajectories with probabilistic time geography and massive unconstrained movement data. GIScience Workshop on Analysis of Movement Data (AMD’18), 28 August 2018, Melbourne, Australia.