Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift

Post on 05-Dec-2014

DESCRIPTION

Lightning talk by Gil Vernik from the OpenStack NYC meetup on October 8, 2014 (http://bit.ly/ibm-os-meetup). It covers the integration between Apache Spark and Swift, and the use of Storlets for smart retrieval via filtering and privacy support. The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.

TRANSCRIPT

© 2014 IBM Corporation

Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift

Gil Vernik
IBM Research - Haifa

Topics Covered in This Talk

§ OpenStack Swift

§ Apache Spark

§ Basic integration between Spark and Swift

§ Advanced integration between Spark and Swift, using the Storlets technology

Digital Universe

More than 1.8 zettabytes (1.8 trillion gigabytes)

Grows rapidly

80% owned by enterprises

75% generated by individuals

According to IDC iView, "Extracting Value from Chaos"

MapReduce, databases, etc.

Data needs to be replicated, at a cost in time, money, etc.

Can we do it better?

OpenStack Swift

§ A massively scalable object store

§ Known to work with thousands of servers and to store petabytes of data

§ Exposes a REST API

§ Features:
– Storage policies
– Erasure codes
– Data replication
– …

[Figure: a PUT request arrives at the proxy nodes, which forward it to the storage nodes]
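
To make the REST interface concrete, here is a minimal sketch of how an object upload could be expressed as an HTTP PUT. It only builds the request and never sends it. The endpoint, account, container, object name and token are hypothetical values chosen for illustration; the `/v1/<account>/<container>/<object>` path layout and the `X-Auth-Token` header follow the Swift object storage API.

```python
from urllib.parse import quote
from urllib.request import Request


def build_put_request(endpoint, account, container, obj_name, data, token):
    """Build (but do not send) a PUT request that would upload one object."""
    # Swift addresses objects as /v1/<account>/<container>/<object>.
    path = "/v1/{}/{}/{}".format(quote(account), quote(container), quote(obj_name))
    return Request(
        url=endpoint.rstrip("/") + path,
        data=data,
        method="PUT",
        headers={"X-Auth-Token": token},
    )


# Hypothetical values, for illustration only.
req = build_put_request(
    endpoint="http://proxy.example.com:8080",
    account="AUTH_demo",
    container="photos",
    obj_name="2014/img 001.jpg",
    data=b"...image bytes...",
    token="example-token",
)
print(req.get_method(), req.full_url)
```

In a real deployment the request would be sent to a proxy node, which decides which storage nodes hold the object.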

Apache Spark

§ Apache Spark™ is a fast and general engine for large-scale data processing
– Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk

§ Combines SQL, streaming, and complex analytics

§ Can read existing Hadoop data

§ The most active Apache project today

Swift enablement for data retrieval in Spark

§ Apache Spark implements Hadoop interfaces and can use HDFS or Amazon S3 as a data source.

§ IBM Research enabled Spark to access data stored in OpenStack Swift.

[Figure: Spark accessing Swift over the network]
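
The talk does not show what a Swift data source looks like from the Spark side. As a rough sketch, assuming the URI convention of Hadoop's OpenStack Swift connector (`swift://<container>.<service>/<path>`), a Swift object can be named much like an HDFS path. The helper `swift_uri` and the container, service and path values below are hypothetical.

```python
def swift_uri(container, service, path):
    """Compose a URI of the form swift://<container>.<service>/<path>."""
    # The dot separates the container from the service name, so this sketch
    # rejects container names that contain one (a precaution, not an API rule
    # taken from the talk).
    if "." in container:
        raise ValueError("container name must not contain '.' in this URI form")
    return "swift://{}.{}/{}".format(container, service, path.lstrip("/"))


uri = swift_uri("logs", "myswift", "/2014/10/app.log")
print(uri)

# With Spark and the connector configured, the URI would be passed wherever an
# HDFS path normally goes, for example: sc.textFile(uri)
```

The point of the design is that Spark code stays the same; only the location string changes.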

What do we analyze?

[Figure: Swift, network]

Stored Data    Input to Analytics
Images         EXIF metadata
PDF            Hidden metadata
Logs           Only ‘ERROR’ records
…              …
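
The log row of the table can be illustrated with a small filter that reads a stream of log lines and writes out only the records of interest. This is a sketch in plain Python, not code from the talk; the function name and the sample log lines are made up for illustration.

```python
import io


def filter_error_records(in_stream, out_stream, level="ERROR"):
    """Copy only the lines containing `level` from in_stream to out_stream."""
    kept = 0
    for line in in_stream:
        if level in line:
            out_stream.write(line)
            kept += 1
    return kept


# Hypothetical log content standing in for a stored log object.
sample_log = (
    "2014-10-08 10:00:01 INFO service started\n"
    "2014-10-08 10:00:05 ERROR disk full on node 3\n"
    "2014-10-08 10:00:09 DEBUG heartbeat ok\n"
    "2014-10-08 10:00:12 ERROR replication failed\n"
)
source = io.StringIO(sample_log)
result = io.StringIO()
kept = filter_error_records(source, result)
filtered = result.getvalue()
print(kept, len(sample_log), len(filtered))
```

Only the filtered text would need to reach the analytics engine, which is why the amount of data sent shrinks as the share of uninteresting records grows.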

Yes! We can do it better.

Storlets: Flexibly Extend Swift
Advanced data processing inside Swift

§ Storlets are a way to ‘extend’ the cloud's computational capabilities.

§ A storlet is compiled code that is deployed to Swift and, when triggered, is executed by the Storlet Engine directly on the storage nodes.

§ The Storlet Engine is responsible for executing every storlet in a secure environment.

§ A storlet is standard Java code.

Storlets extend an object store by moving computation to the data – filtering, transforming, analyzing – instead of bringing the data to the computation.
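
The idea of moving computation to the data can be shown with a toy model. The class below is an in-memory stand-in written for this example; it is not the Storlet Engine or its API, and real storlets are Java code running on storage nodes. All names (`ToyObjectStore`, `deploy`, `count_lines`) are hypothetical.

```python
class ToyObjectStore:
    """In-memory stand-in for an object store that can run code next to the data."""

    def __init__(self):
        self._objects = {}
        self._storlets = {}

    def put(self, name, data):
        self._objects[name] = data

    def deploy(self, storlet_name, func):
        # func takes the stored bytes and returns the bytes to send back.
        self._storlets[storlet_name] = func

    def get(self, name, run_storlet=None):
        data = self._objects[name]
        if run_storlet is not None:
            data = self._storlets[run_storlet](data)
        return data


def count_lines(data):
    """Hypothetical storlet: reply with the number of lines, not the object."""
    return str(len(data.splitlines())).encode("ascii")


store = ToyObjectStore()
store.put("big.txt", b"some line of text\n" * 100000)
store.deploy("linecount", count_lines)

moved_to_client = store.get("big.txt")                       # data goes to the computation
computed_in_store = store.get("big.txt", run_storlet="linecount")  # computation goes to the data
print(len(moved_to_client), len(computed_in_store), computed_in_store)
```

In the first call the whole object leaves the store; in the second only the small result does.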

Swift Storlets: How do they benefit Spark?

[Figure: objects in Swift pass through a storlet (filter + data processing) before being sent over the network]

Storlets Enable Extending the Functionality of Spark

Example: analyzing EXIF metadata from photos

§ An object store is a natural repository for photos

§ Photos contain rich capture metadata

§ Analyzing this metadata for a set of photos can show how the camera is used

Example: Analyzing EXIF metadata

Storlets can extract the metadata and return it as JSON, rather than having Spark process the binary data directly.

[Figure: 10MB image in, 1KB JSON out]
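
A quick back-of-the-envelope calculation shows what the 10MB and 1KB figures mean for data transfer. The per-photo sizes come from the slide; the collection size of 1000 photos is a hypothetical number, and binary units (1KB = 1024 bytes) are assumed.

```python
MB = 1024 * 1024
KB = 1024

photo_bytes = 10 * MB      # per-photo size from the slide
metadata_bytes = 1 * KB    # per-photo JSON size from the slide
num_photos = 1000          # hypothetical collection size

raw_transfer = num_photos * photo_bytes       # Spark reads the images
json_transfer = num_photos * metadata_bytes   # Spark reads storlet output
reduction = photo_bytes // metadata_bytes

print(raw_transfer / MB, "MB transferred without the storlet")
print(json_transfer / MB, "MB transferred with the storlet")
print(reduction, "times less data per photo")
```

Under these assumptions the transfer drops from roughly 10GB to roughly 1MB for the whole collection.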

Example: Analyzing EXIF metadata

•  Spark accesses the images via a storlet
•  No change to Spark itself; only the URI changes
•  The JSON file returned by the storlet defines the schema
•  SQL from Spark processes the metadata
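
The last two steps, a schema taken from JSON and SQL over the metadata, can be sketched without a Spark cluster. The example below uses Python's built-in `sqlite3` as a stand-in for Spark SQL, so it shows the idea rather than the Spark API. The JSON records, field names and table name are hypothetical.

```python
import json
import sqlite3

# Hypothetical JSON records, one per photo, as a storlet might return them.
json_lines = [
    '{"name": "a.jpg", "camera": "Model X", "iso": 100}',
    '{"name": "b.jpg", "camera": "Model Y", "iso": 400}',
    '{"name": "c.jpg", "camera": "Model X", "iso": 800}',
]
records = [json.loads(line) for line in json_lines]

# The keys of the JSON records define the table schema.
columns = sorted(records[0].keys())
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE photos ({})".format(", ".join(columns)))
placeholders = ", ".join("?" for _ in columns)
conn.executemany(
    "INSERT INTO photos VALUES ({})".format(placeholders),
    [tuple(rec[col] for col in columns) for rec in records],
)

# SQL over the metadata: photos per camera and the highest ISO used.
rows = conn.execute(
    "SELECT camera, COUNT(*), MAX(iso) FROM photos GROUP BY camera ORDER BY camera"
).fetchall()
print(rows)
```

A query like this is the kind of analysis the talk has in mind when it says the metadata can show how a camera is used.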

Summary

§ OpenStack Swift is the most popular open source object store

§ Apache Spark is the next big thing in data analytics

§ Spark and Swift can be integrated

§ Storlets in Swift provide clear benefits for analytics use cases

Thank you!

More information

Gil Vernik, IBM Research - Haifa
gilv@il.ibm.com
