
Hadoop Cluster on Docker Containers: "What Works and What Doesn't"

By: Pranav Joshi, ME-HPC, GTU PG School

Content

● Introduction to Hadoop and Docker
● Why Hadoop on Docker?
● Job Configuration
● OpenStack Sahara
● Handling Hadoop's Single Points of Failure
● Validating the Prototype
● Performance Test
● Conclusion
● References

Introduction to Hadoop

● Apache Hadoop is an open-source software framework written in Java for distributed storage and distributed processing of very large data sets on computer clusters built from commodity hardware.

● The major components of Apache Hadoop are:

– Hadoop Common: The common utilities that support the other Hadoop modules.

– Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

– Hadoop YARN: A framework for job scheduling and cluster resource management.

– Hadoop MapReduce: A YARN-based system for parallel processing of large data sets (a minimal word-count sketch follows this list).
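To make the MapReduce model concrete, here is a minimal word-count job for Hadoop Streaming. This is an illustrative sketch, not taken from the original slides; the file names and the job-submission command below are assumptions.

#!/usr/bin/env python3
# mapper.py -- reads text from stdin and emits one "<word>\t1" pair per word
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so counts for a word are adjacent
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, _, count = line.rstrip("\n").partition("\t")
    if word != current_word and current_word is not None:
        print(f"{current_word}\t{current_count}")
        current_count = 0
    current_word = word
    current_count += int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

A job like this would typically be submitted with the streaming jar, e.g. hadoop jar hadoop-streaming.jar -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input /data/in -output /data/out (the exact jar path depends on the installation).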

Introduction to Docker Container

● Docker allows you to package an application with all of its dependencies into a standardized unit for software development.

● It is an open-source program that enables a Linux application and its dependencies to be packaged as a container.

● Containers include the application and all of its dependencies, but share the kernel with other containers (a short sketch follows below).
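As a concrete illustration of these points, the snippet below uses the Docker SDK for Python (the docker package, our choice of client, not something the slides prescribe) to run a command in an isolated container; printing the kernel version shows that the container shares the host's kernel.

# Minimal sketch; requires a local Docker daemon and "pip install docker".
import docker

client = docker.from_env()  # connect to the local Docker daemon

# The image carries the application and its userspace dependencies;
# the kernel itself is shared with the host and other containers.
output = client.containers.run("ubuntu:14.04", "uname -r", remove=True)
print(output.decode().strip())  # prints the host's kernel version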

Why Docker?

● Lightweight, Portable

● Build once, Run anywhere

● VM-like isolation, without the overhead of a VM

● Isolated containers

● Automated and scripted

● Separating out simple tasks

Containers vs. VMs

Job Configuration

● YARN's ApplicationMaster asks NodeManagers to launch containers through a container executor, such as the LinuxContainerExecutor.

● Docker can be used not only for fine-grained performance isolation, but also for delivering software packages (a configuration sketch follows below).
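The executor that NodeManagers use is selected in yarn-site.xml. Below is a hedged sketch that generates the relevant property with Python's standard library; the property name and executor class are standard Hadoop 2.x, while the output path is illustrative.

# Emit the yarn-site.xml property selecting the NodeManager's
# container executor (LinuxContainerExecutor shown here).
import xml.etree.ElementTree as ET

conf = ET.Element("configuration")
prop = ET.SubElement(conf, "property")
ET.SubElement(prop, "name").text = "yarn.nodemanager.container-executor.class"
ET.SubElement(prop, "value").text = (
    "org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor"
)
ET.ElementTree(conf).write("yarn-site.xml", encoding="utf-8", xml_declaration=True)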

OpenStack Sahara

Design and Implementation

● Implementation:

– Using a Dockerfile, our solution creates an image with Java, ssh, and some basic packages installed, and sets up the image to use the Hadoop build from a folder shared with the host.

– When an instance is created from the image, it starts the ssh daemon by default in order to allow further runtime configuration through this channel.

● Management:

– The cluster-management library offers an even more abstract API, allowing the client to list and create clusters, to start, stop, and inspect containers, and to start a service in a specific container (a docker-py sketch of both ideas follows this list).
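The sketch below approximates both the image build and the management API with the Docker SDK for Python. The base-image contents, names, and shared-folder paths are assumptions; the original prototype's code may differ.

# Build a base image with Java and sshd, then expose a tiny cluster API.
import io
import docker

DOCKERFILE = """
FROM ubuntu:14.04
RUN apt-get update && apt-get install -y openjdk-7-jdk openssh-server
RUN mkdir /var/run/sshd
# The Hadoop build itself is mounted from a host folder at /opt/hadoop.
CMD ["/usr/sbin/sshd", "-D"]
"""

client = docker.from_env()
image, _ = client.images.build(
    fileobj=io.BytesIO(DOCKERFILE.encode()), tag="hadoop-node", rm=True
)

def create_cluster(name, size, hadoop_dir):
    """Start `size` containers, each sharing the host's Hadoop build."""
    return [
        client.containers.run(
            "hadoop-node",
            name=f"{name}-{i}",
            detach=True,
            volumes={hadoop_dir: {"bind": "/opt/hadoop", "mode": "rw"}},
        )
        for i in range(size)
    ]

def list_cluster(name):
    """List all containers (running or not) that belong to the cluster."""
    return [c for c in client.containers.list(all=True)
            if c.name.startswith(name + "-")]

nodes = create_cluster("hadoop", size=3, hadoop_dir="/srv/hadoop")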

Hadoop and Fault Tolerance

● HDFS allows replication of the NameNode (through passive replication), but a failure at the level of the JobTracker forces a job to be restarted.

● On Hadoop 2.x, part of the job-management responsibility is transferred to the ApplicationMaster, which acts as a per-job task manager.

● The loss of the ResourceManager does not block the execution of a job; it only prevents new jobs from being submitted. However, the loss of an ApplicationMaster forces the restart of the job, just as on Hadoop 1.x.

Handling Hadoop's Single Points of Failure

● Fast recovery in the case of a failure
● Small impact on performance
● Adapt to the capacity and context of the nodes

(A ZooKeeper-based leader-election sketch follows this list.)
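One standard way to meet these goals for the JobTracker, consistent with the ZooKeeper-based failover described on the next slide, is leader election. The sketch below uses the kazoo client, which is our assumption; the original prototype does not necessarily use this library.

# ZooKeeper leader election for the JobTracker role ("pip install kazoo").
# Hostnames and the znode path are illustrative.
import socket
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

def act_as_jobtracker():
    # Only the elected leader reaches this point; a real node would start
    # the JobTracker daemon here and return only if it fails.
    print(f"{socket.gethostname()} is now the JobTracker")

# Every candidate node runs this call; when the current leader dies,
# ZooKeeper promotes the next candidate, giving fast recovery.
election = zk.Election("/hadoop/jobtracker", identifier=socket.gethostname())
election.run(act_as_jobtracker)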

Validating the Prototype

● Using the Docker-Hadoop dashboard allowed us to analyze different failure scenarios, including:

– Crash of the JobTracker node: we kill the JobTracker to force a new node to take over the JobTracker role.

– Restart of an old JobTracker: we investigate the impact of an old JobTracker node returning. Two possibilities are considered:

● The returning node was simply disconnected from the network and still thinks it is the JobTracker.

● The returning node has restarted and has lost all of its state, but is still at the top of ZooKeeper's list.

– Heartbeat tuning: too lazy a heartbeat slows down the reaction to failures and may lead to some of the situations in the previous item, while an overly intensive heartbeat may hurt overall performance (a toy failure-detector sketch follows below).
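The heartbeat trade-off above is easy to demonstrate with a toy failure detector; the interval and timeout values below are the tunables and are purely illustrative.

# Toy failure detector: a long INTERVAL delays failure detection (lazy
# heartbeat), while a very short one adds network and CPU overhead.
import time

INTERVAL = 2.0          # seconds between heartbeats sent by each worker
TIMEOUT = 3 * INTERVAL  # declare a node dead after ~3 missed beats

last_seen = {}  # node name -> timestamp of its most recent heartbeat

def on_heartbeat(node):
    last_seen[node] = time.monotonic()

def dead_nodes():
    now = time.monotonic()
    return [n for n, t in last_seen.items() if now - t > TIMEOUT]

on_heartbeat("worker-1")
print(dead_nodes())  # [] -- worker-1 reported recently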

Performance Test

[Figure: execution-time analysis when using different numbers of TaskTrackers]

Conclusion

● This work explores the use of container-based virtualization to build a prototyping environment for MapReduce applications.

● The use of Docker-Hadoop allowed us to improve the development speed of our Hadoop solution, as the developers could test their code directly on their own computers.

References

● IEEE Paper 1

– Title: Efficient Prototyping of Fault Tolerant Map-Reduce Applications with Docker-Hadoop

– Authors: Luiz Angelo Steffenel, Javier Rey, Matias Cogorno, and Sergio Nesmachnow

– Publication: 2015 IEEE International Conference on Cloud Engineering

● IEEE Paper 2

– Title: Finding the Big Data Sweet Spot: Towards Automatically Recommending Configurations for Hadoop Clusters on Docker Containers

– Authors: Rui Zhang, Min Li, and Dean Hildebrand, IBM Research Almaden and IBM T.J. Watson Research Center

– Publication: 2015 IEEE International Conference on Cloud Engineering

Thank You
