![Page 1: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/1.jpg)
A Publish-Subscribe Distributed Notification System on Hadoop
Jyotiska Nath Khasnabish
IIIT-Bangalore
![Page 2: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/2.jpg)
Hadoop
Open source distributed framework for processing “Big Data”.
Offers a distributed file system (HDFS) for storing massive amounts of data across clusters.
Uses MapReduce as the programming model for processing large amounts of data.
Adopted and used in production by 1000+ companies worldwide.
20+ popular Hadoop-based subprojects and growing.
![Page 3: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/3.jpg)
Distributed Notification System
[HDFS-1742] proposes a system that could notify interested clients about major HDFS events (such as file creation and deletion) and about MapReduce job completion.
[HDFS-2760] proposes adding a PubSub system on HDFS for sending notification messages to clients subscribed to specific services.
[HDFS-7821] proposes an event notification system that would:
Provide periodic updates to subscribed users.
Provide the capability to let users specify 'interesting events'.
Provide a 'customizable' and 'configurable' interface so that users can also subscribe to user-defined parameters.
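The idea of letting users specify 'interesting events' can be sketched as a subscription that carries a user-supplied filter. This is only an in-memory illustration, not the design from any of the tickets above: `HdfsEvent`, `EventNotifier`, the event kinds, and the paths are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HdfsEvent:
    kind: str  # hypothetical labels such as "create" or "delete"
    path: str


class EventNotifier:
    """Delivers each event only to subscribers whose filter accepts it."""

    def __init__(self):
        self._subscriptions = []  # list of (predicate, callback) pairs

    def subscribe(self, predicate, callback):
        self._subscriptions.append((predicate, callback))

    def publish(self, event):
        for predicate, callback in self._subscriptions:
            if predicate(event):
                callback(event)


notifier = EventNotifier()
seen = []

# This subscriber is only interested in files created under /logs/.
notifier.subscribe(
    lambda e: e.kind == "create" and e.path.startswith("/logs/"),
    seen.append,
)

notifier.publish(HdfsEvent("create", "/logs/2013-01-01"))
notifier.publish(HdfsEvent("delete", "/logs/2012-12-31"))
notifier.publish(HdfsEvent("create", "/tmp/scratch"))
print(seen)
```

Only the first event matches the filter, so the subscriber is called once.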
![Page 4: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/4.jpg)
Publish Subscribe Model
![Page 5: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/5.jpg)
Messaging Systems
Apache ActiveMQ
Uses JMS (Java Message Service) for sending and receiving messages.
Three components – Publisher, Broker, Subscriber.
Supports both persistent and non-persistent messaging.
Apache Kafka
Developed by LinkedIn.
Three components – Producer, Broker, Consumer.
Supports both persistent and non-persistent messaging.
Uses ZooKeeper for coordination.
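Both systems share the same three roles: a sender, a broker, and a receiver. The sketch below shows only that shape with an in-memory stand-in; it does not use the ActiveMQ or Kafka client APIs, and `Broker`, the topic name, and the message text are assumptions made for illustration.

```python
from collections import defaultdict


class Broker:
    """In-memory stand-in for a message broker that routes by topic."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver the message to every subscriber of the topic."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)  # number of deliveries made


broker = Broker()
inbox_a, inbox_b = [], []

# Two subscribers (consumers) register interest in the same topic.
broker.subscribe("job-status", inbox_a.append)
broker.subscribe("job-status", inbox_b.append)

# A publisher (producer) sends one message; it never addresses receivers directly.
delivered = broker.publish("job-status", "job_42 finished")
unrouted = broker.publish("unused-topic", "nobody is listening")
print(delivered, unrouted, inbox_a, inbox_b)
```

The publisher only knows the topic name, which is what lets subscribers be added or removed without changing the sender.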
![Page 6: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/6.jpg)
Architecture
![Page 7: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/7.jpg)
Use Cases
![Page 8: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/8.jpg)
1. Message Passing
Sending status flags or progress reports of running jobs among multiple Hadoop services.
Hadoop services can take the role of either a publisher or a subscriber.
Example: TaskTrackers notify the JobTracker of their status only when there is a status change.
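The "notify only on a status change" behaviour can be sketched as a publisher that remembers the last status it sent. This is a toy model, not Hadoop's TaskTracker code: `StatusPublisher`, the tracker name, and the status strings are hypothetical.

```python
class StatusPublisher:
    """Publishes a status only when it differs from the last one sent."""

    def __init__(self, name, publish):
        self.name = name
        self._publish = publish  # callable that hands a message to the broker
        self._last_status = None

    def report(self, status):
        if status == self._last_status:
            return False  # unchanged, so nothing is sent
        self._last_status = status
        self._publish((self.name, status))
        return True


received = []  # stands in for what the subscribing JobTracker receives
tracker = StatusPublisher("tasktracker-1", received.append)

# Four status checks, but only two of them are actual changes.
for status in ["RUNNING", "RUNNING", "RUNNING", "SUCCEEDED"]:
    tracker.report(status)

print(received)
```

Four checks produce two messages here, which is the kind of saving the slide points at.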
![Page 9: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/9.jpg)
2. Notification for Data Availability
Chained jobs get notified about the completion of some other job on which they are dependent.
No need to poll the NameNode for data availability in HDFS.
Multiple subscribed services or jobs can be notified when the data is available.
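A minimal sketch of this use case, assuming a broker that tracks which paths have been announced as available: waiting jobs register a callback instead of polling, and all of them are notified once the producer announces the data. `DataAvailabilityBroker`, the job names, and the path are hypothetical.

```python
from collections import defaultdict


class DataAvailabilityBroker:
    """Notifies waiting jobs when a path is announced as available."""

    def __init__(self):
        self._waiting = defaultdict(list)  # path -> callbacks still waiting
        self._available = set()

    def when_available(self, path, callback):
        if path in self._available:
            callback(path)  # data is already there, notify immediately
        else:
            self._waiting[path].append(callback)

    def mark_available(self, path):
        self._available.add(path)
        for callback in self._waiting.pop(path, []):
            callback(path)


broker = DataAvailabilityBroker()
started = []

# Two downstream jobs wait on the same output without polling for it.
broker.when_available("/data/out/part-1", lambda p: started.append(("job-B", p)))
broker.when_available("/data/out/part-1", lambda p: started.append(("job-C", p)))
before = list(started)

# The upstream job announces its output once; both waiters are notified.
broker.mark_available("/data/out/part-1")
print(before, started)
```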
![Page 10: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/10.jpg)
3. Event Based Job Chaining
Multiple MapReduce jobs can be chained based on events occurring in the Hadoop cluster.
Easier for workflow managers to chain jobs and trigger workflows automatically.
Automatic setting of job dependency for heavily chained MapReduce jobs in order to accomplish a complex computation.
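Event-based chaining can be sketched as a small dependency tracker that starts a job as soon as completion events have arrived for everything it depends on. This is an illustrative model only: `JobChain` and the job names are hypothetical, and each job "finishes" immediately instead of being submitted to a cluster.

```python
class JobChain:
    """Starts each job once completion events exist for all its dependencies."""

    def __init__(self, dependencies):
        self._deps = {job: set(deps) for job, deps in dependencies.items()}
        self._done = set()
        self._started = set()
        self.order = []  # the order in which jobs were started

    def start(self):
        self._trigger_ready()

    def on_completed(self, job):
        """Handle a 'job completed' event and start any newly unblocked jobs."""
        self._done.add(job)
        self._trigger_ready()

    def _trigger_ready(self):
        for job, deps in self._deps.items():
            if job not in self._started and deps <= self._done:
                self._started.add(job)
                self.order.append(job)
                # Stand-in for running a MapReduce job: it completes at once.
                self.on_completed(job)


chain = JobChain({
    "extract": [],
    "clean": ["extract"],
    "aggregate": ["clean"],
    "report": ["clean", "aggregate"],
})
chain.start()
print(chain.order)
```

The dependencies are declared once, and the ordering falls out of the completion events rather than being scripted by hand.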
![Page 11: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/11.jpg)
Cluster Configuration
| | Machine #1 | Machine #2 | Machine #3 |
| --- | --- | --- | --- |
| Processing Speed | 2.3 GHz | 2.3 GHz | 2.3 GHz |
| RAM | 2 GB | 2 GB | 2 GB |
| Disk Space | 8 GB | 8 GB | 8 GB |
| OS | Ubuntu 12.04 | Ubuntu 12.04 | Ubuntu 12.04 |
| Hadoop Version | 1.1.1 | 1.1.1 | 1.1.1 |
| ActiveMQ Version | 5.8.0 | 5.8.0 | 5.8.0 |
| Kafka Version | 0.8 | 0.8 | 0.8 |
![Page 12: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/12.jpg)
Performance Analysis: ActiveMQ vs Kafka
![Page 13: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/13.jpg)
Performance Analysis: Single Node vs Multi Node
![Page 14: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/14.jpg)
Performance Comparison: With and Without Notification System
![Page 15: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/15.jpg)
Hadoop Cluster Load
Before After
![Page 16: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/16.jpg)
Network Bandwidth Consumption
Before After
![Page 17: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/17.jpg)
Mobile Client
![Page 18: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/18.jpg)
Conclusion
Distributed notification system based on Publish Subscribe messaging model.
Can be used to pass messages between services, notify subscribed clients and chain multiple jobs.
Reduces cluster load and network bandwidth consumption significantly, resulting in better use of hardware and resources.
Can be scaled to large Hadoop clusters (> 100 or 1000 nodes) to handle heavily interdependent jobs.
![Page 19: CSI 2013 Presentation Hadoop Notification System - Jyotiska NK](https://reader033.vdocument.in/reader033/viewer/2022052905/55843838d8b42abf1e8b49b3/html5/thumbnails/19.jpg)
Thank you