jdg 7 & spark integration
TRANSCRIPT
![Page 1: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/1.jpg)
JDG 7 & SparkIntegration
Infinispan Spark connector
tedwon
![Page 2: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/2.jpg)
Agenda● Brief Introduction to JBoss Data Grid 7
● Brief Introduction to Apache Spark
● Features of Infinispan Spark connector
● Demo
![Page 3: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/3.jpg)
What is JBoss Data Grid● Distributed cache● In-memory NoSQL● Key/value data store● Built from the Infinispan community project● In architecture, there is no master node● Peer-to-peer membership clustering● The most strength is extremely highly availability and scalability as in-memory
○ In my personal opinion● Good harmony with real-time data processing framework for data services● As a metaphor, In-memory HDFS for real-time processing
![Page 4: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/4.jpg)
What is JBoss Data Grid 7● JDG 7 was released at July, 2016● JDG 7 is based on Infinispan 8.3.0 and EAP 7.0.0.GA
○ The previous version JDG 6.6.0 is based on Infinispan 6.4.0
● JDG 7 now becomes much more powerful as a competitive software product● There are new features and enhancements
○ New GUI Admin console○ Easy to install and configure clustering and cache○ Easy to monitor and change configuration○ Easy to create a new cache through admin console even in runtime○ Provides API for integration with Apache Spark
![Page 5: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/5.jpg)
JDG 7 - New Features and Enhancements1. DISTRIBUTED STREAMS2. REMOTE TASK EXECUTION3. APACHE SPARK INTEGRATION4. APACHE HADOOP INTEGRATION5. NEW ADMINISTRATION CONSOLE FOR SERVER DEPLOYMENTS6. CONTROLLED SHUTDOWN AND RESTART OF CLUSTER7. NODE.JS (JAVASCRIPT) HOT ROD CLIENT8. CASSANDRA CACHE STORE9. HOT ROD C++ ENHANCEMENTS
10. HOT ROD C# ENHANCEMENTS
![Page 6: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/6.jpg)
What is Apache Spark
![Page 7: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/7.jpg)
● General-purpose Lightning-fast cluster computing framework● General-purpose
○ One common data processing engine○ Batch, SQL, Streaming, MLlib, Graph○ One platform to rule them all
● Lightning-fast○ RDD - Resilient Distributed Dataset○ Read-only multiset of data items distributed over a cluster of machines
● Works with Hadoop HDFS and process that data in parallel○ Distributed file System
● MapReduce interfaces with funtional programming style○ No MR chaining workflow in Hadoop
What is Apache Spark
![Page 8: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/8.jpg)
Apache Spark - One platform to rule them all
![Page 9: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/9.jpg)
What is Apache Spark RDD● RDD consists of mulitle data partitions over a cluster of machines
● RDD is intermediate data during a Spark data processing job
● RDD is distributed in memory of each worker node JVM
● Data processing job is transformations of RDDs
● RDDs are disappeared after finishing a job
● Not able to share RDDs between Spark jobs
![Page 10: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/10.jpg)
What is Apache Spark RDD
![Page 11: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/11.jpg)
Infinispan Spark connectorJDG 7’s new feautre
![Page 12: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/12.jpg)
What is Infinispan Spark connector● https://github.com/infinispan/infinispan-spark
● JDG 7 Document https://goo.gl/9BXp98
● RDD and DStream integration with Apache Spark 1.6○ DStream is a continuous sequence of RDDs for real-time stream processing
● Use JDG as a data source for Spark
● Easy to read & write cache data in a Spark job
● Provides seamless funtional programming style and syntactic sugar
● Good to share RDD with other Spark jobs
![Page 13: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/13.jpg)
Supported Configurations
![Page 14: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/14.jpg)
Features of Infinispan Spark connector1. Create an Spark RDD from a JDG cache data
○ Read cache data from Spark job
2. Write any key/value based RDD to a JDG cache○ Write intermediate or final data to cache
3. Create a Spark DStream from cache-level events ○ Insert, Modify and Delete event in a cache
4. Write any key/value DStream to JDG○ Write any DStream to cache
5. Use JDG server side filters to create a cache based RDD○ Using Infinispan Query DSL
![Page 15: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/15.jpg)
Prerequisite & Version compatibility● JDK 8, Scala 2.10.4, Spark 1.6
● JBoss Data Grid 7.0.0 Server○ Infinispan Server 8.2.4.Final
● Run JDG in Remote Client-Server mode
![Page 16: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/16.jpg)
Demohttps://github.com/tedwon/infinispan-spark-connector-examples
![Page 17: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/17.jpg)
Performance Considerations● The number of Spark workers should be greater than JDG nodes
○ To take advantage of the parallelism
● Support locality with co-located in the same node as JDG and Spark worker○ Spark worker only processes data in the local node with connector
![Page 18: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/18.jpg)
Thank you
![Page 19: JDG 7 & Spark Integration](https://reader030.vdocument.in/reader030/viewer/2022020314/58ceaf931a28abb2218b4b6d/html5/thumbnails/19.jpg)
References1. https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_Data_Grid/7.0/html-single/Developer_Guide/index.
html#Integration_with_Apache_Spark2. http://spark.apache.org/docs/1.6.2/programming-guide.html3. https://github.com/infinispan/infinispan-spark4. https://github.com/infinispan/infinispan5. https://github.com/tedwon/infinispan-spark-connector-examples6. https://access.redhat.com/documentation/en-US/Red_Hat_JBoss_Data_Grid/7.0/html-single/7.0.0_Release_Notes/in
dex.html#chap-New_Features_and_Enhancements7. https://en.wikipedia.org/wiki/Apache_Spark8. https://hub.docker.com/r/gustavonalle/infinispan-spark/