
Hadoop Cluster on Docker Containers: "What Works and What Doesn't"

By: Pranav Joshi, ME-HPC, GTU PG School

Content

● Introduction to Hadoop and Docker
● Why Hadoop on Docker?
● Job Configuration
● OpenStack Sahara
● Handling Hadoop's Single Points of Failure
● Validating the Prototype
● Performance Test
● Conclusion
● References

Introduction to Hadoop

● Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

● The major components of Apache Hadoop are:

– Hadoop Common: The common utilities that support the other Hadoop modules.

– Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

– Hadoop YARN: A framework for job scheduling and cluster resource management.

– Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (a minimal word-count sketch follows this list).
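To make the MapReduce model concrete, here is a minimal word-count job for Hadoop Streaming. This is an illustrative sketch, not taken from the original slides; the file names and the job-submission command below are assumptions.

#!/usr/bin/env python3
# mapper.py -- reads text from stdin and emits one "<word>\t1" pair per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, _, count = line.rstrip("\n").partition("\t")
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{current_count}")
        current_count = 0
    current_word = word
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this would typically be submitted with the streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the exact jar path depends on the installation).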

Introduction to Docker Container

● Docker allows you to package an application with all of its dependencies into a standardized unit for software development.

● It is an open-source program that enables a Linux application and its dependencies to be packaged as a container.

● Containers include the application and all of its dependencies, but share the kernel with other containers (a short sketch follows below).
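As a concrete illustration of these points, the snippet below uses the Docker SDK for Python (the docker package, our choice of client, not something the slides prescribe) to run a command in an isolated container; printing the kernel version shows that the container shares the host's kernel.

# Minimal sketch; requires a local Docker daemon and "pip install docker".
import docker

client = docker.from_env()  # connect to the local Docker daemon

# The image carries the application and its userspace dependencies;
# the kernel itself is shared with the host and other containers.
output = client.containers.run("ubuntu:14.04", "uname -r", remove=True)
print(output.decode().strip())  # prints the host's kernel version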

Why Docker?

● Lightweight, Portable

● Build once, Run anywhere

● VM-like isolation, without the overhead of a VM

● Isolated containers

● Automated and scripted

● Separating out simple tasks

Containers vs. VMs

Job Configuration

● YARN's ApplicationMaster asks NodeManagers to launch containers through a container executor, such as the LinuxContainerExecutor.

● Docker can be used not only for fine-grained performance isolation, but also for delivering software packages (a configuration sketch follows below).
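The executor that NodeManagers use is selected in yarn-site.xml. Below is a hedged sketch that generates the relevant property with Python's standard library; the property name and executor class are standard Hadoop 2.x, while the output path is illustrative.

# Emit the yarn-site.xml property selecting the NodeManager's
# container executor (LinuxContainerExecutor shown here).
import xml.etree.ElementTree as ET

conf = ET.Element("configuration")
prop = ET.SubElement(conf, "property")
ET.SubElement(prop, "name").text = "yarn.nodemanager.container-executor.class"
ET.SubElement(prop, "value").text = (
    "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
)
ET.ElementTree(conf).write("yarn-site.xml", encoding="utf-8", xml_declaration=True)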

OpenStack Sahara

Design and Implementation

● Implementation:

– Using a Dockerfile, our solution creates an image with Java, ssh, and some basic packages installed, and sets up the image to use the Hadoop build from a folder shared with the host.

– When an instance is created from the image, it starts the ssh daemon by default in order to allow further runtime configuration through this channel.

● Management:

– The cluster-management library offers an even more abstract API, allowing the client to list and create clusters, to start, stop, and inspect containers, and to start a service in a specific container (a docker-py sketch of both ideas follows this list).
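The sketch below approximates both the image build and the management API with the Docker SDK for Python. The base-image contents, names, and shared-folder paths are assumptions; the original prototype's code may differ.

# Build a base image with Java and sshd, then expose a tiny cluster API.
import io
import docker

DOCKERFILE = """
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y openjdk-7-jdk openssh-server
RUN mkdir /var/run/sshd
# The Hadoop build itself is mounted from a host folder at /opt/hadoop.
CMD ["/usr/sbin/sshd", "-D"]
"""

client = docker.from_env()
image, _ = client.images.build(
    fileobj=io.BytesIO(DOCKERFILE.encode()), tag="hadoop-node", rm=True
)

def create_cluster(name, size, hadoop_dir):
    """Start `size` containers, each sharing the host's Hadoop build."""
    return [
        client.containers.run(
            "hadoop-node",
            name=f"{name}-{i}",
            detach=True,
            volumes={hadoop_dir: {"bind": "/opt/hadoop", "mode": "rw"}},
        )
        for i in range(size)
    ]

def list_cluster(name):
    """List all containers (running or not) that belong to the cluster."""
    return [c for c in client.containers.list(all=True)
            if c.name.startswith(name + "-")]

nodes = create_cluster("hadoop", size=3, hadoop_dir="/srv/hadoop")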

Hadoop and Fault Tolerance

● HDFS allows replication of the NameNode (through passive replication), but a failure at the level of the JobTracker forces a job to be restarted.

● On Hadoop 2.x, part of the job-management responsibility is transferred to the ApplicationMaster, which acts as a per-job task manager.

● The loss of the ResourceManager does not block the execution of a job; it only prevents new jobs from being submitted. However, the loss of an ApplicationMaster forces the restart of the job, just as on Hadoop 1.x.

Handling Hadoop's Single Points of Failure

● Fast recovery in the case of a failure
● Small impact on performance
● Adapt to the capacity and context of the nodes

(A ZooKeeper-based leader-election sketch follows this list.)
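One standard way to meet these goals for the JobTracker, consistent with the ZooKeeper-based failover described on the next slide, is leader election. The sketch below uses the kazoo client, which is our assumption; the original prototype does not necessarily use this library.

# ZooKeeper leader election for the JobTracker role ("pip install kazoo").
# Hostnames and the znode path are illustrative.
import socket
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

def act_as_jobtracker():
    # Only the elected leader reaches this point; a real node would start
    # the JobTracker daemon here and return only if it fails.
    print(f"{socket.gethostname()} is now the JobTracker")

# Every candidate node runs this call; when the current leader dies,
# ZooKeeper promotes the next candidate, giving fast recovery.
election = zk.Election("/hadoop/jobtracker", identifier=socket.gethostname())
election.run(act_as_jobtracker)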

Validating the Prototype

● Using the Docker-Hadoop dashboard allowed us to analyze different failure scenarios, including:

– Crash of the JobTracker node: we kill the JobTracker to force a new node to take over the JobTracker role.

– Restart of an old JobTracker: we investigate the impact of an old JobTracker node returning. Two possibilities are considered:

● The returning node was simply disconnected from the network and still thinks it is the JobTracker.

● The returning node has restarted and has lost all of its state, but is still at the top of ZooKeeper's list.

– Heartbeat tuning: too lazy a heartbeat slows down the reaction to failures and may lead to some of the situations in the previous item, while an overly intensive heartbeat may hurt overall performance (a toy failure-detector sketch follows below).
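The heartbeat trade-off above is easy to demonstrate with a toy failure detector; the interval and timeout values below are the tunables and are purely illustrative.

# Toy failure detector: a long INTERVAL delays failure detection (lazy
# heartbeat), while a very short one adds network and CPU overhead.
import time

INTERVAL = 2.0          # seconds between heartbeats sent by each worker
TIMEOUT = 3 * INTERVAL  # declare a node dead after ~3 missed beats

last_seen = {}  # node name -> timestamp of its most recent heartbeat

def on_heartbeat(node):
    last_seen[node] = time.monotonic()

def dead_nodes():
    now = time.monotonic()
    return [n for n, t in last_seen.items() if now - t > TIMEOUT]

on_heartbeat("worker-1")
print(dead_nodes())  # [] -- worker-1 reported recently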

Performance Test

[Figure: execution-time analysis when using different numbers of TaskTrackers]

Conclusion

● This work explores the use of container-based virtualization to build a prototyping environment for MapReduce applications.

● The use of Docker-Hadoop allowed us to improve the development speed of our Hadoop solution, as the developers could test their code directly on their own computers.

References

● IEEE Paper 1

– Title: Efficient Prototyping of Fault Tolerant Map-Reduce Applications with Docker-Hadoop

– Authors: Luiz Angelo Steffenel, Javier Rey, Matias Cogorno, and Sergio Nesmachnow

– Publication: 2015 IEEE International Conference on Cloud Engineering

● IEEE Paper 2

– Title: Finding the Big Data Sweet Spot: Towards Automatically Recommending Configurations for Hadoop Clusters on Docker Containers

– Authors: Rui Zhang, Min Li, and Dean Hildebrand, IBM Research Almaden and IBM T.J. Watson Research Center

– Publication: 2015 IEEE International Conference on Cloud Engineering

Thank You
