Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift

© 2014 IBM Corporation. Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift. Gil Vernik, IBM Research - Haifa

Uploaded by: daniel-krook

Posted on: 05-Dec-2014

DESCRIPTION

Lightning talk from the OpenStack NYC meetup on October 8, 2014 (http://bit.ly/ibm-os-meetup), by Gil Vernik. It covers the integration between Apache Spark and Swift, and the use of Storlets for smart retrieval via filtering and privacy support. The content of this talk is a statement from the IBM Research division, not IBM product divisions, and is not a statement from IBM regarding its plans, directions or product intents. Any activities described by this talk are subject to change.

TRANSCRIPT

Page 1: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

© 2014 IBM Corporation

Advanced Data Retrieval and Analytics with Apache Spark and OpenStack Swift

Gil Vernik, IBM Research - Haifa

Page 2: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Topics Covered in This Talk

§ OpenStack Swift

§ Apache Spark

§ Basic integration between Spark and Swift

§ Advanced integration between Spark and Swift using the Storlets technology

Page 3: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Digital Universe

More than 1.8 zettabytes (1.8 trillion gigabytes)

Grows rapidly

80% owned by enterprises

75% generated by individuals

According to the IDC iView report "Extracting Value from Chaos"

Page 4: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Map-Reduce, databases, etc.

Data needs to be replicated, which costs time, money, etc.

Page 5: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Can we do it better?

Page 6: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

OpenStack Swift

§ A massively scalable object store

§ Known to work with thousands of servers and to store petabytes of data

§ Exposes a REST API

§ Features:
– Storage policies
– Erasure codes
– Data replication
– …

[Diagram labels: PUT, Proxy Nodes, Storage Nodes]
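
To make the REST API bullet concrete, here is a minimal sketch of how an object upload could be addressed. It assumes the usual Swift URL layout of endpoint, API version, account, container and object, with the auth token passed in the X-Auth-Token header. The endpoint, account, container, object name and token below are hypothetical, and the request is only built, not sent.

```python
from urllib.parse import quote
from urllib.request import Request


def swift_object_url(endpoint: str, account: str, container: str, obj: str) -> str:
    """Return <endpoint>/v1/<account>/<container>/<object>, percent-encoded."""
    parts = [quote(account, safe=""), quote(container, safe=""), quote(obj)]
    return endpoint.rstrip("/") + "/v1/" + "/".join(parts)


def build_put_request(url: str, token: str, data: bytes) -> Request:
    """Prepare (but do not send) an authenticated PUT that uploads an object."""
    return Request(url, data=data, method="PUT", headers={"X-Auth-Token": token})


# Hypothetical values, for illustration only.
url = swift_object_url(
    "http://swift.example.com:8080", "AUTH_demo", "photos", "2014/img 1.jpg"
)
request = build_put_request(url, "example-token", b"...image bytes...")
print(request.get_method(), request.full_url)
```

Sending the request (for example with urllib.request.urlopen) would hand it to a proxy node, which forwards the data to the storage nodes.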

Page 7: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Apache Spark

§ Apache Spark™ is a fast and general engine for large-scale data processing
– Up to 100x faster than Hadoop MapReduce in memory, 10x faster on disk

§ Combines SQL, streaming, and complex analytics

§ Can read existing Hadoop data

§ Most active project in Apache today

Page 8: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Swift enablement for data retrieval in Spark

§ Apache Spark implements Hadoop interfaces and can use HDFS or Amazon S3 as a data source.

§ IBM Research enabled Spark to access data stored in OpenStack Swift.

[Diagram labels: Swift, Network]
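
As a rough illustration of what this kind of access can look like, the sketch below builds the Hadoop settings and the swift:// URI used by the hadoop-openstack connector described in the Spark documentation. Whether this matches the exact setup used in the talk is an assumption, and the provider name, credentials, port and paths are hypothetical.

```python
def swift_hadoop_conf(provider: str, auth_url: str, tenant: str,
                      username: str, password: str) -> dict:
    """Hadoop properties for one Swift provider (hadoop-openstack key names)."""
    prefix = f"fs.swift.service.{provider}."
    return {
        "fs.swift.impl": "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem",
        prefix + "auth.url": auth_url,
        prefix + "tenant": tenant,
        prefix + "username": username,
        prefix + "password": password,
        prefix + "http.port": "8080",   # hypothetical proxy port
        prefix + "public": "true",
    }


def swift_uri(container: str, provider: str, path: str) -> str:
    """Return swift://<container>.<provider>/<path>."""
    return f"swift://{container}.{provider}/{path.lstrip('/')}"


# Hypothetical provider, Keystone endpoint and credentials.
conf = swift_hadoop_conf("demo", "http://keystone.example.com:5000/v2.0/tokens",
                         "analytics", "spark-user", "secret")
uri = swift_uri("logs", "demo", "2014/10/app.log")
print(uri)
# With these properties in core-site.xml (or passed to Spark with a
# "spark.hadoop." prefix), a Spark job would read the object like any
# other Hadoop path, e.g. sc.textFile(uri).
```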

Page 9: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

What do we analyze?

[Diagram labels: Swift, Network]

Stored Data    Input to Analytics
Images         EXIF metadata
PDF            Hidden metadata
Logs           Only ‘ERROR’ records
…              …

Page 10: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Yes! We can do it better.

Page 11: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Storlets: Flexibly Extending Swift

Advanced data processing inside Swift

§ Storlets are a way to ‘extend’ cloud computational capabilities

§ A storlet is compiled code that is deployed to Swift and, when triggered, is executed by the Storlet Engine directly on the storage nodes

§ The Storlet Engine is responsible for executing every storlet in a secure environment

§ A storlet is standard Java code

Page 12: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Storlets extend an object store by moving computation to the data – filtering, transforming, analyzing – instead of bringing the data to the computation
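
The idea of filtering where the data lives can be sketched in a few lines. This is an illustration only: real storlets are Java code run by the Storlet Engine, and the function name and sample log records below are hypothetical. The point is that only matching records leave the storage side.

```python
from typing import Iterable, Iterator


def error_records(lines: Iterable[str], marker: str = "ERROR") -> Iterator[str]:
    """Yield only the records containing the marker (the storage-side filter)."""
    return (line for line in lines if marker in line)


# Hypothetical log object as it sits in the object store.
stored_log = [
    "2014-10-08 10:00:01 INFO request served",
    "2014-10-08 10:00:02 ERROR disk timeout",
    "2014-10-08 10:00:03 INFO request served",
    "2014-10-08 10:00:04 ERROR replica missing",
]

# Only the filtered records would cross the network to the analytics engine.
filtered = list(error_records(stored_log))
bytes_stored = sum(len(line.encode()) for line in stored_log)
bytes_sent = sum(len(line.encode()) for line in filtered)
print(f"{len(filtered)} of {len(stored_log)} records sent, "
      f"{bytes_sent} of {bytes_stored} bytes")
```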

Page 13: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Swift Storlets: How do they benefit Spark?

[Diagram labels: Swift, Storlet, Network, Objects, Filter, Data processing]

Page 14: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Storlets Enable Extending the Functionality of Spark

Example: analyzing EXIF metadata from photos

§ An object store is a natural repository for photos

§ Photos contain rich capture metadata

§ Analyzing this metadata for a set of photos can show how the camera is used

Page 15: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Example: Analyzing EXIF metadata

Storlets can extract the metadata and return it as JSON, rather than having Spark process the binary data directly.

[Diagram labels: 10MB, 1KB]
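
A quick back-of-the-envelope calculation shows what the 10MB and 1KB figures imply for network traffic, assuming they stand for the size of one photo and of its extracted JSON metadata. The photo count is a hypothetical number chosen for illustration, and decimal units are assumed.

```python
PHOTO_BYTES = 10 * 1000**2   # 10 MB per photo (decimal megabytes assumed)
METADATA_BYTES = 1 * 1000    # 1 KB of JSON metadata per photo


def transfer_bytes(object_count: int, bytes_per_object: int) -> int:
    """Total bytes moved over the network for a batch of objects."""
    return object_count * bytes_per_object


photo_count = 1_000  # hypothetical batch size
raw = transfer_bytes(photo_count, PHOTO_BYTES)        # shipping the full photos
meta = transfer_bytes(photo_count, METADATA_BYTES)    # shipping only the metadata
reduction = raw // meta

print(f"full photos: {raw / 1000**3:.0f} GB, metadata only: {meta / 1000**2:.0f} MB, "
      f"reduction factor: {reduction}")
```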

Page 16: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Example: Analyzing EXIF metadata.

• Spark accesses images via a storlet
• No change to Spark; only the URI changes
• The JSON file returned by the storlet defines the schema
• SQL from Spark processes the metadata
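
The kind of analysis these bullets describe can be approximated without a cluster. The sketch below parses a hypothetical storlet response (one JSON record per photo, with made-up field names) and computes the aggregates that a SQL query over the same records would produce. In Spark the JSON would be loaded and queried with Spark SQL; plain Python stands in for that here.

```python
import json
from collections import Counter

# Hypothetical storlet output; the field names are illustrative only.
storlet_output = """\
{"object": "img1.jpg", "model": "Camera A", "iso": 100, "flash": false}
{"object": "img2.jpg", "model": "Camera A", "iso": 800, "flash": true}
{"object": "img3.jpg", "model": "Camera B", "iso": 200, "flash": false}
"""

records = [json.loads(line) for line in storlet_output.splitlines() if line.strip()]

# In spirit: SELECT model, COUNT(*) FROM photos GROUP BY model
photos_per_model = Counter(record["model"] for record in records)

# In spirit: SELECT AVG(iso) FROM photos WHERE flash = false
no_flash_iso = [record["iso"] for record in records if not record["flash"]]
avg_iso_no_flash = sum(no_flash_iso) / len(no_flash_iso)

print(dict(photos_per_model), avg_iso_no_flash)
```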

Page 17: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Example: Analyzing EXIF metadata.

Page 18: Advanced Data Retrieval and Analytics with Apache Spark and Openstack Swift

Summary

§ OpenStack Swift is the most popular open source object store

§ Apache Spark is the next big thing in data analytics

§ Spark and Swift can be integrated

§ Storlets in Swift provide clear benefits for analytics use cases.

Thank you!

More information

Gil Vernik, IBM Research - Haifa