Virtual Clusters for (RDF) Stream Processing

Alejandro Llaves Ontology Engineering Group Universidad Politécnica de Madrid Madrid, Spain [email protected] Oct 21 2015 Virtual Clusters for (RDF) Stream Processing

Upload: alejandro-llaves

Post on 16-Apr-2017


TRANSCRIPT

Page 1: Virtual Clusters for (RDF) Stream Processing

Alejandro Llaves

Ontology Engineering Group, Universidad Politécnica de Madrid

Madrid, Spain

[email protected]

Oct 21 2015

Virtual Clusters for (RDF) Stream Processing

Page 2: Virtual Clusters for (RDF) Stream Processing

Outline

Some context: morph-streams++

Motivation

Use case: Sensor Cloud data integration

Topologies everywhere

Setting up a virtual cluster

Deploying Storm topologies

Conclusion

Page 3: Virtual Clusters for (RDF) Stream Processing

Some context...

Page 4: Virtual Clusters for (RDF) Stream Processing

Motivation

Integrating an unbounded stream of heterogeneous sensor observations

Solution:

– Storm topologies for real-time processing

– Semantic Sensor Network (SSN) ontology for modelling observations

– SWEET ontology for environmental phenomena

Page 5: Virtual Clusters for (RDF) Stream Processing

Use case: Sensor Cloud data integration (1/3)

Sensor Cloud

Viticulture, water management, weather monitoring, oyster farming...

RESTful API – JSON

Network → Platform → Sensor → Phenomenon → Observation

Lack of semantic descriptions, e.g. rain_trace vs Rain.

Multiple HTTP requests to query various streams.

Source: CSIRO
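The naming mismatch noted above (rain_trace vs Rain) can be illustrated with a small lookup that maps source-specific phenomenon names onto one shared concept. This is only a sketch: the `CONCEPT_BASE` namespace, the table contents, and the `annotate` helper are hypothetical and are not taken from the SWEET ontology or from the presented system.

```python
# Hypothetical namespace standing in for an ontology of environmental phenomena.
CONCEPT_BASE = "http://example.org/concepts#"

# Assumed mapping from source-specific phenomenon names to a shared concept name.
PHENOMENON_TO_CONCEPT = {
    "rain_trace": "Rain",
    "rain": "Rain",
}


def annotate(phenomenon):
    """Return a concept URI for a phenomenon name, or None if it is unknown."""
    key = phenomenon.strip().lower()
    concept = PHENOMENON_TO_CONCEPT.get(key)
    return None if concept is None else CONCEPT_BASE + concept


# Two differently named streams end up with the same annotation.
same_concept = annotate("rain_trace") == annotate("Rain")
```

Returning None for unknown names keeps unmapped phenomena visible instead of silently guessing a concept.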

Page 6: Virtual Clusters for (RDF) Stream Processing

Use case: Sensor Cloud data integration (2/3)

Sensor Cloud messages to field-named tuples

SWEET annotations for heterogeneous phenomena descriptions

<sample time="2015-05-28T16:30" value="15" sensor="bom_gov_au.94961.air.air_temp"/>

["2015-05-28T16:32", "2015-05-28T16:30", "15", "bom_gov_au", "94961", "air", "air_temp", "-43.3167", "147.0075"]

Tuple fields, in order: system time, sampling time, value, network, platform, sensor, phenomenon, latitude, longitude

SensorCloudParserBolt

SweetAnnotationsBolt
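A minimal Python sketch of the message-to-tuple step, reproducing the sample and tuple shown on this slide. It is not the code of SensorCloudParserBolt: the `parse_sample` function, the `FIELDS` names, and the `PLATFORM_LOCATIONS` lookup (the slide does not show where the coordinates come from) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: coordinates are looked up per (network, platform); the slide
# shows them in the tuple but not where they originate.
PLATFORM_LOCATIONS = {("bom_gov_au", "94961"): ("-43.3167", "147.0075")}

# Hypothetical field names, in the order used by the tuple on the slide.
FIELDS = ("system_time", "sampling_time", "value", "network", "platform",
          "sensor", "phenomenon", "latitude", "longitude")


def parse_sample(xml_text, system_time):
    """Turn one <sample/> element into a field-ordered tuple of strings."""
    elem = ET.fromstring(xml_text)
    # In the example, the sensor id reads network.platform.sensor.phenomenon.
    network, platform, sensor, phenomenon = elem.attrib["sensor"].split(".", 3)
    latitude, longitude = PLATFORM_LOCATIONS[(network, platform)]
    return (system_time, elem.attrib["time"], elem.attrib["value"],
            network, platform, sensor, phenomenon, latitude, longitude)


sample = ('<sample time="2015-05-28T16:30" value="15" '
          'sensor="bom_gov_au.94961.air.air_temp"/>')
record = parse_sample(sample, system_time="2015-05-28T16:32")
named = dict(zip(FIELDS, record))  # field-named view of the same tuple
```

Keeping the values as strings mirrors the tuple on the slide; type conversion could happen in a later step.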

Page 7: Virtual Clusters for (RDF) Stream Processing

Use case: Sensor Cloud data integration (3/3)

SSN mapping

SSNConverterBolt
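A simplified sketch of how a field-named tuple could be turned into RDF, written as N-Triples strings. It is not the mapping used by SSNConverterBolt, which the transcript does not show. `ssn:Observation`, `ssn:observedBy` and `ssn:observedProperty` come from the SSN ontology; the `EX` namespace, the URI patterns, and the `samplingTime` and `value` properties are hypothetical placeholders.

```python
SSN = "http://purl.oclc.org/NET/ssnx/ssn#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
EX = "http://example.org/sensorcloud/"  # hypothetical base URI


def to_ntriples(record):
    """Map one parsed tuple to a list of N-Triples lines (simplified)."""
    (_system_time, sampling_time, value, network, platform,
     sensor, phenomenon, _latitude, _longitude) = record
    sensor_uri = f"{EX}sensor/{network}.{platform}.{sensor}"
    property_uri = f"{EX}property/{phenomenon}"
    obs_uri = (f"{EX}observation/{network}.{platform}.{sensor}."
               f"{phenomenon}/{sampling_time}")
    return [
        f"<{obs_uri}> <{RDF_TYPE}> <{SSN}Observation> .",
        f"<{obs_uri}> <{SSN}observedBy> <{sensor_uri}> .",
        f"<{obs_uri}> <{SSN}observedProperty> <{property_uri}> .",
        # Placeholder properties: time and value modelling is an assumption.
        f'<{obs_uri}> <{EX}samplingTime> "{sampling_time}" .',
        f'<{obs_uri}> <{EX}value> "{value}" .',
    ]


record = ("2015-05-28T16:32", "2015-05-28T16:30", "15", "bom_gov_au",
          "94961", "air", "air_temp", "-43.3167", "147.0075")
triples = to_ntriples(record)
```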

Page 8: Virtual Clusters for (RDF) Stream Processing

Topologies everywhere

A Storm topology “is a graph of stream transformations where each node is a spout or bolt”. https://storm.apache.org/documentation/Tutorial.html

Example of simple topology
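As a rough, library-free illustration of that definition (not the topology shown on the slide, and not the Storm API), a spout can be modelled as a generator and each bolt as a function that transforms one stream into another. The `id=value` message format and all function names here are invented for the sketch, and the chain is linear, whereas a real topology can branch and merge.

```python
def sample_spout():
    """Spout: emits raw messages (a finite list here; real streams are unbounded)."""
    yield from ["bom_gov_au.94961.air.air_temp=15",
                "bom_gov_au.94961.air.air_temp=16"]


def split_bolt(stream):
    """Bolt: split each raw message into (sensor_id, value)."""
    for message in stream:
        sensor_id, value = message.split("=")
        yield (sensor_id, value)


def phenomenon_bolt(stream):
    """Bolt: keep only the phenomenon part of the sensor id."""
    for sensor_id, value in stream:
        yield (sensor_id.rsplit(".", 1)[-1], value)


def run_topology(spout, bolts):
    """Wire the spout through the bolts in order and drain the result."""
    stream = spout()
    for bolt in bolts:
        stream = bolt(stream)
    return list(stream)


results = run_topology(sample_spout, [split_bolt, phenomenon_bolt])
```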

Page 9: Virtual Clusters for (RDF) Stream Processing
Page 10: Virtual Clusters for (RDF) Stream Processing
Page 11: Virtual Clusters for (RDF) Stream Processing

Setting up a virtual cluster (1/2)

Wirbelsturm - https://github.com/miguno/wirbelsturm/

Allows deploying (local or remote) virtual clusters.

Focus on Big Data technologies: Storm, Kafka, Zookeeper...

Uses Vagrant for “easy to configure, reproducible, and portable work environments” - https://docs.vagrantup.com/v2/why-vagrant/index.html

Uses Puppet for provisioning: installation and configuration of SW packages in the cluster nodes.

Page 12: Virtual Clusters for (RDF) Stream Processing

Setting up a virtual cluster (2/2)

$ ./deploy

Show wirbelsturm.yaml

Check Storm GUI - http://localhost:28080/index.html

Page 13: Virtual Clusters for (RDF) Stream Processing

Deploying Storm topologies

$ ./deploy

Show wirbelsturm.yaml

Check Storm GUI - http://localhost:28080/index.html

Describe simple topology

Compile & deploy

Describe a topology set

Configure Kafka

Compile & deploy

Page 14: Virtual Clusters for (RDF) Stream Processing
Page 15: Virtual Clusters for (RDF) Stream Processing

Conclusion


Wirbelsturm allows easy configuration & deployment of virtual clusters, with focus on Big Data technologies.

SSN and SWEET ontologies to model and integrate environmental sensor observations.

Parallelization of bottleneck tasks reduces the average message processing latency, but only up to a point. More about Storm parallelization: http://bit.ly/1NVyjU2

Delaying RDF conversion does not speed up the processing of Sensor Cloud messages in the tested environment.

Paper submitted to IJSWIS, special issue on Velocity and Variety Dimensions of Big Data (Llaves, Corcho et al.).

What's coming next

Flying faster with Heron - https://blog.twitter.com/2015/flying-faster-with-twitter-heron

Page 16: Virtual Clusters for (RDF) Stream Processing

The presented research has been funded by Ministerio de Economía y Competitividad (Spain) under the project "4V: Volumen, Velocidad, Variedad y Validez en la Gestión Innovadora de Datos" (TIN2013-46238-C4-2-R), by the EU Marie Curie IRSES project SemData (612551), and supported by an AWS in Education Research Grant award.

Alejandro Llaves

[email protected]

Thanks!