processing geospatial at scale at locationtech

PROCESSING GEOSPATIAL DATA @ SCALE

Rob Emanuele

What we’ll be covering…

Background on geospatial concepts

What is LocationTech?

Background on big data frameworks

Overview of LocationTech projects for processing big geo data.

Geospatial Data

Core of GIS (geographic information systems)

Raster (images, weather data)

Vector (points of interest, country boundaries)

Raster Data

Vector Data (Points)

Vector Data (Lines)

Vector Data (Polygons)

Source: https://ryouready.wordpress.com/2009/11/16/infomaps-using-r-visualizing-german-unemployment-rates-by-color-on-a-map/

Vector Data

Contains
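A "contains" test on vector data asks whether a point lies inside a polygon. A minimal sketch using ray casting, one common way to do it; the function name and sample polygon are illustrative, and points exactly on the boundary are not treated specially:

```python
from typing import List, Tuple

Point = Tuple[float, float]


def contains(polygon: List[Point], point: Point) -> bool:
    """Ray casting: count how many polygon edges a ray going right from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through the point can be crossed.
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge meets that horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside  # an odd number of crossings means inside
    return inside


# Hypothetical sample data: a 4 x 4 square.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(contains(square, (2.0, 2.0)))
print(contains(square, (5.0, 2.0)))
```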

Heatmap (Kernel Density)
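A kernel density heatmap turns points into a raster by letting every point add a smooth bump to nearby cells. A rough sketch with an unnormalized Gaussian kernel; the grid size, bandwidth, and function name are assumptions made for illustration:

```python
import math
from typing import List, Tuple


def kernel_density(points: List[Tuple[float, float]], width: int, height: int,
                   bandwidth: float) -> List[List[float]]:
    """Sum an unnormalized Gaussian kernel from every point into every grid cell."""
    grid = [[0.0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            # Evaluate the kernel at the center of the cell.
            cx, cy = col + 0.5, row + 0.5
            for px, py in points:
                dist_sq = (cx - px) ** 2 + (cy - py) ** 2
                grid[row][col] += math.exp(-dist_sq / (2.0 * bandwidth ** 2))
    return grid


# Hypothetical input: a single point in the middle of a 5 x 5 grid.
heat = kernel_density([(2.5, 2.5)], width=5, height=5, bandwidth=1.0)
for line in heat:
    print(" ".join(f"{v:.2f}" for v in line))
```

This brute-force loop costs points × cells; real implementations limit each point to a window around it.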

Zonal Statistics
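Zonal statistics summarize raster values per zone, for example mean temperature per county. A small sketch that assumes the zones have already been rasterized onto the same grid as the values; the names and sample grids are hypothetical:

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Optional


def zonal_mean(values: List[List[float]],
               zones: List[List[Optional[Hashable]]]) -> Dict[Hashable, float]:
    """Mean of the value raster for each zone id; cells with zone None are skipped."""
    totals: Dict[Hashable, float] = defaultdict(float)
    counts: Dict[Hashable, int] = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if zone is None:
                continue
            totals[zone] += value
            counts[zone] += 1
    return {zone: totals[zone] / counts[zone] for zone in totals}


# Hypothetical 2 x 3 rasters: values, and the zone each cell belongs to.
values = [[1.0, 2.0, 10.0],
          [3.0, 20.0, 30.0]]
zones = [["a", "a", "b"],
         ["a", "b", None]]
means = zonal_mean(values, zones)
print(means)
```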

Feature Extraction (Image Segmentation)

Source: http://www.professeurs.polymtl.ca/christopher.pal/

Large geospatial data

Landsat 8 on AWS: 311,405 scenes @ ~800 MB each. That's 250 TB and counting.

OpenStreetMap: planet.osm is 617 GB.

3 years of geotagged tweets: 3 TB
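The sizes above can be checked with quick arithmetic. This sketch assumes decimal units (1 TB = 1,000,000 MB) and takes the per-scene size at the approximate 800 MB quoted on the slide:

```python
MB_PER_TB = 1_000_000  # assumption: decimal units
GB_PER_TB = 1_000

landsat_scenes = 311_405
landsat_mb_per_scene = 800  # approximate figure from the slide

landsat_tb = landsat_scenes * landsat_mb_per_scene / MB_PER_TB
osm_tb = 617 / GB_PER_TB
tweets_tb = 3.0
total_tb = landsat_tb + osm_tb + tweets_tb

print(f"Landsat 8: {landsat_tb:.1f} TB")
print(f"All three datasets: {total_tb:.1f} TB")
```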

WHAT IS ?

Nutch: a project to build a better search engine, started in the early 2000s.

Worked for small datasets, but was not scalable.

The Google papers

After reading the papers, Nutch developers added a distributed file system and MapReduce model to Nutch.

In 2006, those portions were spun out of Nutch to form…

Hadoop
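The MapReduce model itself is small: a map step emits key/value pairs, a shuffle groups them by key, and a reduce step folds each group. A single-process sketch of the idea, counting points per one-degree grid cell; this is not Hadoop's API, and all function names are illustrative:

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Cell = Tuple[int, int]


def map_phase(point: Tuple[float, float]) -> Iterator[Tuple[Cell, int]]:
    """Map: emit (grid cell, 1) for one input point."""
    lon, lat = point
    yield (math.floor(lon), math.floor(lat)), 1


def shuffle(pairs: Iterable[Tuple[Cell, int]]) -> Dict[Cell, List[int]]:
    """Shuffle: group all emitted values by key."""
    groups: Dict[Cell, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: Cell, values: List[int]) -> Tuple[Cell, int]:
    """Reduce: fold the values of one key into a single result."""
    return key, sum(values)


def count_points_per_cell(points: Iterable[Tuple[float, float]]) -> Dict[Cell, int]:
    pairs = (pair for point in points for pair in map_phase(point))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input points as (lon, lat).
counts = count_points_per_cell([(0.2, 0.7), (0.9, 0.1), (1.5, 0.5)])
print(counts)
```

In a real cluster the map and reduce calls run on different machines and the shuffle moves data over the network; the sketch only shows the data flow.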

Matei Zaharia

Worked with Hadoop at UC Berkeley

Noticed Hadoop was not a good fit for machine learning algorithms and other iterative models.

So in 2009, he created…

Spark

Apache Accumulo

Created by the NSA in 2008

Donated to the Apache Software Foundation in 2011

Graduated to a top-level project in 2012

Almost defunded by the US government the same year.

(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.

[Diagram: HDFS with one Name Node and three Data Nodes; Accumulo with one Master and three Tablet Servers]

Accumulo

BigTable clone (columnar database)

Records stored on HDFS

Lexicographically sorted table index
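A lexicographically sorted index makes range scans cheap: rows whose keys share a prefix sit next to each other. A toy in-memory sketch of that idea; it is not Accumulo's client API, and the class name and key layout are made up for illustration:

```python
import bisect
from typing import Dict, List, Tuple


class SortedTable:
    """Toy key/value table that keeps its row keys in lexicographic order."""

    def __init__(self) -> None:
        self._keys: List[str] = []
        self._values: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        if key not in self._values:
            bisect.insort(self._keys, key)  # insert while keeping the keys sorted
        self._values[key] = value

    def scan(self, start: str, end: str) -> List[Tuple[str, str]]:
        """Return all rows with start <= key < end, located by binary search."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


# Hypothetical row keys of the form tile/<zoom>/<index>.
table = SortedTable()
for row_key in ["tile/10/003", "tile/11/000", "tile/10/001", "tile/09/999"]:
    table.put(row_key, "payload for " + row_key)

zoom10 = table.scan("tile/10/", "tile/10/\xff")
print([key for key, _ in zoom10])
```

Designing the row key so that related records sort together is what the spatial indexes later in the deck build on.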

GEOJINNI (FORMERLY SPATIALHADOOP)

Spatial language

Built-in spatial data types

Spatial indexes

Spatial operations

R-TREE INDEX OF A 400 GB ROAD NETWORK

http://spatialhadoop.cs.umn.edu/demos/road_network_rtree.png

72 frames × 14 billion points per frame ≈ 1 trillion points in total

Generated in three hours on a 10-node cluster

HEAT MAP FROM 2009 TO 2014 MONTH-BY-MONTH

Geo +

accessed through

SELECT tweet.text, user.name FROM tweet, user WHERE bbox(tweet.location, -115, 45, -110, 50) AND tweet.user_id = user.user_id
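What the query above computes can be sketched in plain Python: a bounding-box filter followed by a join on user_id. The argument order of bbox is assumed to be (min lon, min lat, max lon, max lat), and the sample tweets and users are invented:

```python
from typing import Dict, List, Tuple

Location = Tuple[float, float]  # (lon, lat)


def bbox(location: Location, min_lon: float, min_lat: float,
         max_lon: float, max_lat: float) -> bool:
    """True if the location falls inside the bounding box (edges included)."""
    lon, lat = location
    return min_lon <= lon <= max_lon and min_lat <= lat <= max_lat


def select_in_bbox(tweets: List[dict], users: Dict[int, str], min_lon: float,
                   min_lat: float, max_lon: float, max_lat: float) -> List[Tuple[str, str]]:
    """Filter tweets by bounding box, then join each one to its user's name."""
    return [
        (tweet["text"], users[tweet["user_id"]])
        for tweet in tweets
        if bbox(tweet["location"], min_lon, min_lat, max_lon, max_lat)
        and tweet["user_id"] in users
    ]


# Hypothetical sample data.
tweets = [
    {"text": "hello from Montana", "user_id": 1, "location": (-112.0, 47.0)},
    {"text": "hello from Texas", "user_id": 2, "location": (-97.7, 30.3)},
]
users = {1: "alice", 2: "bob"}

result = select_in_bbox(tweets, users, -115, 45, -110, 50)
print(result)
```

A distributed store answers the same question without scanning every tweet by using a spatial index to narrow the candidate rows first.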

+

GeoTrellis

a Scala library for geospatial data types and operations.

gives Spark geospatial capabilities (raster now, vector soon!).

stores and queries rasters on HDFS, Accumulo, and S3

Zonal Summaries

Benchmark Results

439.5 GB of monthly temperature model output data

USA temperature yearly average, 2006 to 2100

40 m3.xlarge instances (estimated $2.00 USD per hour on the spot market)

GEOWAVE

Geo +

accessed through

Three dimensional Z-order curve
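A three-dimensional Z-order (Morton) curve maps three integer coordinates to one sortable number by interleaving their bits, so cells that are close in 3D tend to be close in a sorted key/value index. A minimal sketch of the encoding; it is not GeoWave's implementation, the 10-bit width is an arbitrary choice, and the third axis could be time or elevation:

```python
from typing import Tuple


def interleave3(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Morton code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x takes bit positions 0, 3, 6, ...
        code |= ((y >> i) & 1) << (3 * i + 1)  # y takes bit positions 1, 4, 7, ...
        code |= ((z >> i) & 1) << (3 * i + 2)  # z takes bit positions 2, 5, 8, ...
    return code


def deinterleave3(code: int, bits: int = 10) -> Tuple[int, int, int]:
    """Recover (x, y, z) from a Morton code produced by interleave3."""
    x = y = z = 0
    for i in range(bits):
        x |= ((code >> (3 * i)) & 1) << i
        y |= ((code >> (3 * i + 1)) & 1) << i
        z |= ((code >> (3 * i + 2)) & 1) << i
    return x, y, z


# Hypothetical cell coordinates.
morton = interleave3(5, 9, 2)
print(morton, deinterleave3(morton))
```

Using the code as a row key lets a sorted table answer a 3D box query with a small number of key-range scans.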

THANK YOU

@lossyrob

gitter.im/geotrellis/geotrellis

github.com/geotrellis/geotrellis

[email protected]
