processing geospatial data at scale @locationtech

Posted on 12-Apr-2017

PROCESSING GEOSPATIAL DATA @ SCALE

Rob Emanuele

GEO(MESA/WAVE/TRELLIS/JINNI)

What we’ll be covering…

What does “processing geospatial data at scale” mean?

Background on big data frameworks

What is LocationTech?

Overview of LocationTech projects for processing big geo data.

Large geospatial data

Landsat 8 on AWS: more than 465,000 scenes @ ~800 MB each. That's 355 TB and counting.

OpenStreetMap edit history: 75 GB compressed.

3 years of geotagged tweets: 3 TB

National Elevation Dataset (NED), 1/3 arc-second

Apache Nutch: a project to build a better search engine, back in the early 2000s.

Worked for small datasets, but was not scalable.

The Google papers

After reading the papers, Nutch developers added a distributed file system and MapReduce model to Nutch.

In 2006, those portions were spun out of Nutch to form…

Apache Hadoop

Heavily supported by Yahoo, which moved its large data processing to Hadoop.

By 2007, Twitter, Facebook, LinkedIn and many others were doing serious work with Hadoop.

In 2008, Hadoop graduated to a top-level Apache project.

Hadoop

Source: http://cs.calvin.edu/courses/cs/374/exercises/12/lab/MapReduceWordCount.png
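
The word count figure referenced above is the classic MapReduce example. A minimal single-process sketch of its three phases follows. This is plain Python, not Hadoop's API; the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and in a real cluster the map and reduce calls would run on different machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over an iterable of lines."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


sample = ["deer bear river", "car car river", "deer car bear"]
print(word_count(sample))
```

Each phase only sees its own inputs, which is what lets a framework spread the map calls and the reduce calls across many nodes.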

Matei Zaharia

Worked with Hadoop at UC Berkeley

Noticed Hadoop was not a good fit for machine learning algorithms and other iterative models.

So in 2009, he created…

Open sourced in 2010 under BSD license

Maintained by UC Berkeley’s AMPLab

Donated to the Apache Software Foundation in 2013 and relicensed as Apache 2.0

Graduated to a top level Apache project in 2014

Apache Spark

A distributed computation engine.

An API that lets you work with distributed data as a collection.

Written in Scala, with language bindings for use with Java, Python, and R.
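
To illustrate what "working with distributed data as a collection" means, here is a toy sketch in plain Python. It is not Spark and does not use Spark's API: PartitionedCollection is a hypothetical class that holds its data in several in-memory partitions and exposes collection-style map, filter, and reduce. The caller never touches a partition directly, which is the idea the slide describes.

```python
from functools import reduce


class PartitionedCollection:
    """Toy stand-in for a distributed collection: a list of partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    @classmethod
    def from_items(cls, items, num_partitions):
        """Spread items round-robin over num_partitions partitions."""
        items = list(items)
        return cls(items[i::num_partitions] for i in range(num_partitions))

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return PartitionedCollection([fn(x) for x in p] for p in self.partitions)

    def filter(self, pred):
        """Keep only the elements for which pred is true."""
        return PartitionedCollection(
            [x for x in p if pred(x)] for p in self.partitions
        )

    def reduce(self, fn):
        """Reduce each non-empty partition locally, then merge the partial results.

        fn must be associative and commutative, because partitioning
        changes the order in which elements are combined.
        """
        partials = [reduce(fn, p) for p in self.partitions if p]
        return reduce(fn, partials)


numbers = PartitionedCollection.from_items(range(1, 11), num_partitions=3)
total = (
    numbers.filter(lambda n: n % 2 == 0)
    .map(lambda n: n * n)
    .reduce(lambda a, b: a + b)
)
print(total)
```

In a real engine each partition would live on a different worker and only the per-partition partial results would travel over the network.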

2006

Apache Accumulo

Created by the NSA in 2008

Donated to the Apache Software Foundation in 2011

Graduated to a top-level project in 2012

Almost defunded by the US government the same year.

(Sec. 929) Prohibits any DOD component from utilizing the cloud computing database developed by the National Security Agency (NSA) and known as "Accumulo" after the end of FY2013, unless the DOD CIO certifies that: (1) there are no viable commercial open source databases that have such security features, or (2) Accumulo itself has become a successful open source database project. Requires DOD and intelligence community officials to coordinate the use by DOD components of cloud computing infrastructure and services offered by the intelligence community for purposes other than intelligence analysis.

[Diagram: HDFS Name Node with three Data Nodes; Accumulo Master with three Tablet Servers]

Accumulo

BigTable clone (columnar database)

Records stored on HDFS

Lexicographically sorted table index
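
A lexicographically sorted index matters because it turns a range query into a contiguous scan. The sketch below shows the idea with a small in-memory table; it is not Accumulo's API, and SortedTable, put, and scan are hypothetical names chosen for illustration.

```python
import bisect


class SortedTable:
    """In-memory key-value table whose keys are kept in lexicographic order."""

    def __init__(self):
        self._keys = []    # sorted list of row keys
        self._values = {}  # row key -> value

    def put(self, key, value):
        """Insert or overwrite a row, keeping the key list sorted."""
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Yield (key, value) for every key with start <= key < end, in order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        for key in self._keys[lo:hi]:
            yield key, self._values[key]


table = SortedTable()
for key, value in [("row_003", "c"), ("row_001", "a"), ("row_010", "d"), ("row_002", "b")]:
    table.put(key, value)

# Only the rows inside the requested key range are visited.
print(list(table.scan("row_002", "row_010")))
```

Because keys are compared as strings, "row_10" would sort before "row_2"; that is why the example zero-pads its row keys. Designing keys so that related records sort next to each other is the central problem the geospatial indexes later in this talk address.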

WHAT IS LOCATIONTECH?

GEOJINNI (FORMERLY SPATIALHADOOP)

Spatial language

Built-in spatial data types

Spatial indexes

Spatial operations

72 frames × 14 billion points per frame = 1 trillion points in total

Generated in three hours on a 10-node cluster

HEAT MAP FROM 2009 TO 2014 MONTH-BY-MONTH

Geo +

accessed through

SELECT tweet.text, user.name
FROM tweet, user
WHERE bbox(tweet.location, -115, 45, -110, 50)
  AND tweet.user_id = user.user_id
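
The query above combines a bounding-box filter with a join on user_id. A plain-Python sketch of the same logic follows. It assumes the bbox arguments are ordered (xmin, ymin, xmax, ymax) and that a location is a (longitude, latitude) pair; neither is stated in the slides. The names in_bbox and tweets_in_bbox are hypothetical.

```python
def in_bbox(point, xmin, ymin, xmax, ymax):
    """Return True if the (x, y) point lies inside the box, edges included."""
    x, y = point
    return xmin <= x <= xmax and ymin <= y <= ymax


def tweets_in_bbox(tweets, users, bbox):
    """Return (text, user name) for every tweet located inside bbox.

    tweets: dicts with "text", "location" and "user_id" keys.
    users:  dicts with "user_id" and "name" keys.
    bbox:   (xmin, ymin, xmax, ymax), an assumed argument order.
    """
    # Build a lookup table once so the join is a dictionary access.
    names = {user["user_id"]: user["name"] for user in users}
    return [
        (tweet["text"], names[tweet["user_id"]])
        for tweet in tweets
        if in_bbox(tweet["location"], *bbox) and tweet["user_id"] in names
    ]


tweets = [
    {"text": "hello from the mountains", "location": (-112.0, 47.0), "user_id": 1},
    {"text": "somewhere else", "location": (10.0, 10.0), "user_id": 2},
]
users = [{"user_id": 1, "name": "ann"}, {"user_id": 2, "name": "bob"}]
print(tweets_in_bbox(tweets, users, (-115, 45, -110, 50)))
```

This version tests every tweet. A spatial index exists precisely to avoid that full scan when the table holds billions of rows.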

+

GeoTrellis

A Scala library for geospatial data types and operations.

Gives Spark geospatial capabilities (mainly raster; vector support is in progress).

Stores and queries raster data on HDFS, Accumulo, and S3 (Cassandra support in development).

0.10 is released!

Polygonal Summaries

100 spot instance m3.xlarge workers @ $0.04 / hr = $4.00 / hr

400 CPUs / ≈1.5 TB memory

1 master m3.xlarge on-demand instance @ $0.26 / hr

EMR cluster charge, $0.07 / hr

$4.37 / hr

Rendering elevation with hillshade + NLCD on AWS EMR

Geo +

accessed through

GEOWAVE

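Index type: GeoMesa uses Z-order spatial & spatiotemporal, binned by week; GeoWave uses Hilbert in N dimensions with tiered indexing and binning

Backends supported: GeoMesa supports Accumulo (main), Cassandra, HBase, DynamoDB, Google Cloud Bigtable; GeoWave supports Accumulo (main), HBase

Servers supported: GeoMesa supports GeoServer; GeoWave supports GeoServer, Mapnik

Processing frameworks supported: GeoMesa supports Hadoop, Spark, Storm, Kafka; GeoWave supports Hadoop, Spark

Language: GeoMesa is written in Scala; GeoWave is written in Java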
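
A Z-order (Morton) index, as named in the comparison above, maps a two-dimensional position to one sortable number by interleaving the bits of its coordinates. The sketch below shows the general technique only; it is not taken from GeoMesa's code, and the function names and the 16-bit resolution are assumptions made for illustration.

```python
def interleave_bits(x, y, bits=16):
    """Interleave the low `bits` bits of x and y; x takes the even positions."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def quantize(value, lo, hi, bits=16):
    """Map a value in [lo, hi] to an integer cell number in [0, 2**bits - 1]."""
    cells = 1 << bits
    scaled = int((value - lo) / (hi - lo) * cells)
    return min(max(scaled, 0), cells - 1)


def z_index(lon, lat, bits=16):
    """Return the Z-order value of a longitude/latitude pair."""
    x = quantize(lon, -180.0, 180.0, bits)
    y = quantize(lat, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)


# Nearby points usually share a long common prefix in their Z values.
print(z_index(-75.16, 39.95), z_index(-75.17, 39.96), z_index(139.69, 35.68))
```

Stored as row keys in a lexicographically sorted table, these values place most nearby points near each other, so a bounding-box query becomes a small number of range scans.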

Tiered Indexing

Binning (time dimension)

1997 1998 1999
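
Binning an unbounded dimension such as time means cutting it into fixed periods and indexing each record by its period plus its offset within that period. The sketch below uses yearly bins to match the labels above. It illustrates the general idea only, not GeoWave's or GeoMesa's actual key layout; year_bin and binned_key are hypothetical names, and timestamps are assumed to be timezone-aware.

```python
from datetime import datetime, timezone


def year_bin(ts):
    """Return (bin_id, offset): the UTC year and the seconds since it began."""
    ts = ts.astimezone(timezone.utc)
    start = datetime(ts.year, 1, 1, tzinfo=timezone.utc)
    return ts.year, int((ts - start).total_seconds())


def binned_key(ts):
    """Build a string key that sorts first by bin, then by offset in the bin."""
    bin_id, offset = year_bin(ts)
    # A year has at most 31,622,400 seconds, so 8 digits are enough.
    return f"{bin_id:04d}_{offset:08d}"


events = [
    datetime(1998, 1, 1, 0, 0, 1, tzinfo=timezone.utc),
    datetime(1997, 12, 31, 23, 59, 59, tzinfo=timezone.utc),
    datetime(1999, 6, 15, 12, 0, 0, tzinfo=timezone.utc),
]
for key in sorted(binned_key(ts) for ts in events):
    print(key)
```

Within a bin the offset is bounded, so it can be combined with a space-filling curve over the spatial dimensions, while the bin prefix keeps each period in its own contiguous key range.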

Binning (arbitrary dimensions)

Time

Elevation

Velocity

Collaboration

• Working together to learn to collaborate

• Making the connections necessary that allow collaboration to flourish

• Join the locationtech-iwg mailing list

• Share your big geospatial data challenges

• Propose projects

Get involved!

THANK YOU

@lossyrob

gitter.im/geotrellis/geotrellis

github.com/geotrellis/geotrellis

remanuele@azavea.com
