Page 1: Big Geospatial Data processing

Development of a new framework for distributed processing of Big Geospatial Data

Angéla Olasz, Binh Nguyen Thai

Institute of Geodesy, Cartography and Remote Sensing (FÖMI)

Directorate of Geoinformation

Page 2: Big Geospatial Data processing

Content of this talk

1. Introduction

2. IQmulus project short introduction

3. Defining Geospatial Big Data

4. Comparison of existing solutions (aspects of requirements)

5. IQLib intro & objectives

6. IQLib modules and their status

7. Related papers & future work

A new framework for distributed processing of Big Geospatial Data

• FOSS4G Bonn, 24.-26. August 2016

Page 3: Big Geospatial Data processing

Introduction

Our goal is to find a solution for processing big geospatial data in a distributed ecosystem. It should provide an environment to run algorithms, services and processing modules without any limitations on the implementation programming language, or on the data partitioning strategy and the distribution among computational nodes, so that existing GIS processing scripts can be run.

As a first step we focus on raster data representation:

(i) decomposition and

(ii) distributed processing.

Before building this prototype system, we have:

1. analyzed data decomposition patterns,

2. defined the common GIS user requirements for processing environments of Big Geospatial Data,

3. tried to identify Geospatial Big Data with the help of the 4 “V”s,

4. compared existing solutions on selected aspects.


Page 4: Big Geospatial Data processing

IQmulus

A High-volume Fusion and Analysis Platform for Geospatial Point Clouds, Coverages and Volumetric Data Sets

“IQmulus will leverage the information hidden in large heterogeneous geospatial data sets and make them a practical choice to support reliable decision making.”

4-year FP7 EU research project

November 2012 – November 2016

12 European partners, 7 countries

www.IQmulus.eu


Page 5: Big Geospatial Data processing

Defining Geospatial Big Data

To better understand the main attributes of geospatial big data, note first that it is hard to delineate the point at which data start to “exceed the capability of spatial computing technology”.

Estimating how much data can be processed is use-case specific. There are some good examples (Evans et al., 2014) where the authors tried to identify the differences between Geospatial Data and Geospatial Big Data.

Here we have tried to compare Big Data, Geospatial Big Data and Geospatial Data as a short review.

Digital representations of continuous space can be grouped into 3 main groups: vector, raster and 3D representation. These have been compared to “non-geospatial” text-based data formats.


Page 6: Big Geospatial Data processing

Defining Geospatial Big Data

Page 7: Big Geospatial Data processing

Aspects of requirements and comparison of existing solutions

We have collected the most popular frameworks supporting distributed computing on GIS data and investigated the capabilities of each framework in the following aspects:

• Input/output data types that are supported by, or suitable for, that particular framework

• Supported GIS processing (or executable languages)

• Data management flexibility: supervision of the data distribution, especially for the raster data type

• Scalability potential

• Supported OS/platform dependencies

• GIS case studies, projects …


Page 8: Big Geospatial Data processing

Aspects of requirements and benchmarking of existing solutions

Page 9: Big Geospatial Data processing

Initiative for a new framework

Most current processing frameworks follow the same methodology as Hadoop and use the same data storage concept as HDFS. From a processing point of view, one of the biggest disadvantages is the data partitioning mechanism performed by the HDFS file system, together with the distributed processing programming model.

In most cases we would like to have full control over our data partitioning and distribution mechanism.

Existing GIS algorithms (Python, MATLAB, R, etc.) can't be executed on these frameworks without modification or with only small modifications.

We therefore decided to develop our own distributed processing framework.


Page 10: Big Geospatial Data processing

IQLib - Objectives


Source: https://github.com/posseidon/IQLib/

IQLib specification

IQLib is a framework whose main goal is to allow an actor (human or machine code) to manage huge data sets describing geographical survey areas. It can be used to overcome the scalability limitations of processing algorithms.

IQLib's core functionalities are:

1. Data decomposition (tiling) of a survey area, in which data points are either associated with polygons (regular or irregular), or grouped according to temporal attributes, or grouped into equally sized chunks, or a mixture of the above.

2. Data distribution and distributed processing among nodes.

3. Stitching the output data files into a single large file.
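The tiling and stitching steps listed above can be illustrated with a minimal sketch. It is not IQLib code: the names `Tile`, `tile_raster` and `stitch_tiles` are hypothetical, the raster is a plain list of rows, and only the "equally sized chunks" strategy is shown. Each tile keeps its offset in the source raster so that the outputs can be put back in place later.

```python
from dataclasses import dataclass
from typing import List

Raster = List[List[int]]  # assumption: a raster is a list of equally long rows


@dataclass
class Tile:
    """A rectangular chunk plus its position in the source raster (hypothetical type)."""
    row_off: int
    col_off: int
    data: Raster


def tile_raster(raster: Raster, tile_rows: int, tile_cols: int) -> List[Tile]:
    """Cut the raster into chunks of at most tile_rows x tile_cols cells."""
    n_rows, n_cols = len(raster), len(raster[0])
    tiles = []
    for r in range(0, n_rows, tile_rows):
        for c in range(0, n_cols, tile_cols):
            # Edge tiles are simply smaller when the size does not divide evenly.
            block = [row[c:c + tile_cols] for row in raster[r:r + tile_rows]]
            tiles.append(Tile(r, c, block))
    return tiles


def stitch_tiles(tiles: List[Tile], n_rows: int, n_cols: int, fill: int = 0) -> Raster:
    """Write every tile back at its recorded offset to rebuild one raster."""
    out = [[fill] * n_cols for _ in range(n_rows)]
    for tile in tiles:
        for i, row in enumerate(tile.data):
            out[tile.row_off + i][tile.col_off:tile.col_off + len(row)] = row
    return out
```

Tiling a 5 x 7 raster with 2 x 3 chunks and stitching the unmodified tiles reproduces the input, which is a convenient round-trip check before any processing step is placed between the two calls.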


Page 11: Big Geospatial Data processing

High-level concept of the IQLib processing framework


Page 12: Big Geospatial Data processing

Modules

As a result, IQLib has 4 major modules; each module is responsible for one step in GIS data processing.

Data Catalogue module: responsible for storing metadata corresponding to survey areas, that is, all the available, known and useful information for processing.

Tiling & stitching module: tiling algorithms usually process raw data and create data chunks. Stitching usually runs after the processing services have successfully finished. Metadata of the tiled (and stitched) datasets are registered in the Data Catalogue module, so we always know the parents of tiled data.

Data distribution module (new): distributes data across processing nodes and supervises the data distribution.

Distributed processing module: responsible for running processing services on the distributed datasets.
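The parent tracking described for the Data Catalogue can be sketched as a small in-memory registry. This is only an illustration under stated assumptions: the class and method names (`Catalogue`, `register`, `ancestors`) and the `kind` labels are hypothetical and do not reflect the real IQLib data model or its REST interface.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class Entry:
    """Metadata record for one dataset (hypothetical, heavily simplified)."""
    dataset_id: str
    kind: str  # assumption: "raw", "tile" or "stitched"
    parent_ids: List[str] = field(default_factory=list)


class Catalogue:
    """In-memory stand-in for a metadata catalogue that records dataset parents."""

    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def register(self, dataset_id: str, kind: str, parent_ids: Iterable[str] = ()) -> None:
        parents = list(parent_ids)
        for parent in parents:
            if parent not in self._entries:
                raise KeyError(f"unknown parent dataset: {parent}")
        self._entries[dataset_id] = Entry(dataset_id, kind, parents)

    def ancestors(self, dataset_id: str) -> List[str]:
        """Return every dataset the given one was derived from, nearest first."""
        found: List[str] = []
        queue = list(self._entries[dataset_id].parent_ids)
        while queue:
            current = queue.pop(0)
            if current not in found:
                found.append(current)
                queue.extend(self._entries[current].parent_ids)
        return found
```

Registering a raw survey, its tiles (with the survey as parent) and a stitched result (with the tiles as parents) makes it possible to trace any output back to the raw data it came from.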


Page 13: Big Geospatial Data processing

Modules - status

Data Catalogue module

We have defined and implemented the data model and the data/metadata access procedures. After the final approval phase the module will go open source. The Data Catalogue is a stand-alone service providing a REST interface for users.

Tiling & stitching module *

Pre-defined tiling and stitching methods tailored to the processing algorithms.

* in the planning phase


Page 14: Big Geospatial Data processing

Modules - status


Data distribution module (new)**: We would like to have full control over data distribution across processing nodes. Currently only the SFTP protocol is supported. Data partitioning and distribution algorithms could be extended by third-party developers.
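One way to picture an extensible distribution step is a planner that accepts the assignment strategy as a plain function. The sketch below is an assumption for illustration only: `round_robin` and `plan_distribution` are hypothetical names, tiles and nodes are just strings, and the actual transfer (for example over SFTP) is left out.

```python
from typing import Callable, Dict, List, Sequence

# A strategy maps (tiles, nodes) to a per-node list of tiles.
Strategy = Callable[[Sequence[str], Sequence[str]], Dict[str, List[str]]]


def round_robin(tiles: Sequence[str], nodes: Sequence[str]) -> Dict[str, List[str]]:
    """Default strategy: hand tiles to the nodes in turn."""
    plan: Dict[str, List[str]] = {node: [] for node in nodes}
    for i, tile in enumerate(tiles):
        plan[nodes[i % len(nodes)]].append(tile)
    return plan


def plan_distribution(
    tiles: Sequence[str],
    nodes: Sequence[str],
    strategy: Strategy = round_robin,
) -> Dict[str, List[str]]:
    """Build a transfer plan and check that every tile is assigned exactly once."""
    if not nodes:
        raise ValueError("at least one processing node is required")
    plan = strategy(tiles, nodes)
    assigned = sorted(tile for node_tiles in plan.values() for tile in node_tiles)
    if assigned != sorted(tiles):
        raise ValueError("strategy must assign every tile exactly once")
    return plan
```

A third-party strategy, for instance one that keeps neighbouring tiles on the same node, would only need to match the `Strategy` signature; the planner itself stays unchanged.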

Distributed processing module**: Uses existing processing algorithms/scripts without any modification or with very little adjustment. Processing services can be sent to processing nodes on demand, together with all their dependencies.

**under development
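Running an existing script unchanged on every tile can be sketched by treating the script as an external command. In this hedged example, local worker threads stand in for processing nodes, and `run_on_tile` and `process_tiles` are hypothetical helpers, not IQLib functions; shipping the script and its dependencies to remote nodes is not shown.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Sequence


def run_on_tile(command: Sequence[str], tile_path: str) -> str:
    """Run the unmodified command with the tile path appended as its last argument."""
    completed = subprocess.run(
        [*command, tile_path],
        capture_output=True,
        text=True,
        check=True,  # raise if the script exits with a non-zero status
    )
    return completed.stdout.strip()


def process_tiles(command: Sequence[str], tile_paths: List[str], max_workers: int = 4) -> Dict[str, str]:
    """Process all tiles concurrently and collect each script's standard output."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda path: run_on_tile(command, path), tile_paths))
    return dict(zip(tile_paths, outputs))


if __name__ == "__main__":
    # Stand-in for an existing GIS script: it only echoes the tile name in upper case.
    demo_command = [sys.executable, "-c", "import sys; print(sys.argv[1].upper())"]
    print(process_tiles(demo_command, ["tile_0_0.tif", "tile_0_1.tif"]))
```

Because the command is passed as an argument list, the same wrapper could launch a Python, R or MATLAB script, provided the interpreter is installed on the worker.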

Page 15: Big Geospatial Data processing

Conclusion - dev status

Almost ready: Data Catalogue module

Under development: Distributed processing module, Data distribution module (new)

Theoretical planning phase: Tiling & Stitching module


The IQLib specification is currently available on GitHub at https://github.com/posseidon/iqlib

IQLib is going to have a dedicated IQmulus GitHub repository soon!

Page 16: Big Geospatial Data processing

Related papers

• Olasz, A., Nguyen Thai, B., Kristóf, D. (2016). A new initiative for tiling, stitching and processing geospatial big data in distributed computing environments. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences III-4, pp. 111-118.

• Nguyen Thai, B., Olasz, A. (2015). Raster data partitioning for supporting distributed GIS processing. ISPRS Archives of Photogrammetry and Remote Sensing XL-3/W3, pp. 543-551.

• Olasz, A., Kristóf, D., Belényesi, M., Bakos, K., Kovács, Z., Balázs, B., Szabó, Sz. (2015). IQPC 2015: Water detection and classification on multi-source remote sensing and terrain data. ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences XL-3/W3, pp. 583-588.

• Olasz, A., Nguyen Thai, B. (2014). Decision support on distributed computing environment (IQmulus). Proceedings of the 3rd Open Source Geospatial Research & Education Symposium (OGRS), pp. 107-114.


Page 17: Big Geospatial Data processing

Future work

• Finishing the implementation of all modules

• Testing IQLib in the following aspects:

1. Running existing algorithms on the framework (Python, R, etc.)

2. Experimental execution on big geospatial data (raster, vector, point cloud)

3. Benchmarking (processing time)
