docker based hadoop provisioning - hadoop summit 2014

Post on 27-Aug-2014

DESCRIPTION

Docker based Hadoop provisioning in the cloud and on-premise/physical hardware

TRANSCRIPT

Janos Matyas / CTO / SequenceIQ Inc.

OVERVIEW

GOAL / MOTIVATION

TECHNOLOGY STACK

PROBLEM RESOLUTION / HOW IT WORKS

RESULTS / ACHIEVEMENTS

GOAL / MOTIVATION

Ease Hadoop provisioning – everywhere

Automate and unify the process

Arbitrary cluster size

Same process through a cluster lifecycle (Dev, QA, UAT, Prod)

(Auto) scaling Hadoop

QoS

OUR APPROACH

Use Docker

Build cloud-specific ‘Dockerized’ images

Provision the cluster

Use Ambari

DOCKER

Lightweight, portable

Build once, run anywhere

VM – without the overhead of a VM

Isolated containers

Automated and scripted

DOCKER – CONTAINERS vs. VMs

Containers are isolated, but share OS and, where appropriate, bins/libraries

APACHE AMBARI – ARCHITECTURE

Easy Hadoop cluster provisioning

Management and monitoring

Key features – blueprints

REST API

APACHE AMBARI – CREATE CLUSTER

Define a blueprint (POST /api/v1/blueprints)

Create cluster (POST /api/v1/clusters/mycluster)
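The two calls on this slide can be sketched in Python as follows. This is a minimal sketch, not the presenters' code: the server address, credentials, blueprint name, host name and component list are all hypothetical, the payload field names reflect our understanding of the Ambari blueprint format, and appending the blueprint name to the /api/v1/blueprints path is an assumption to verify against your Ambari version. The requests are only built, not sent.

```python
import base64
import json
import urllib.request

AMBARI = "http://ambari.example.com:8080"  # hypothetical Ambari server address

# Hypothetical single-node blueprint; stack and component names are assumptions.
BLUEPRINT = {
    "Blueprints": {"stack_name": "HDP", "stack_version": "2.1"},
    "host_groups": [
        {
            "name": "master",
            "cardinality": "1",
            "components": [{"name": "NAMENODE"}, {"name": "DATANODE"}],
        }
    ],
}


def ambari_post(path, payload, user="admin", password="admin"):
    """Build (but do not send) an authenticated POST request for Ambari."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        AMBARI + path,
        data=json.dumps(payload).encode(),
        method="POST",
        headers={
            "Authorization": "Basic " + token,
            "X-Requested-By": "ambari",  # Ambari rejects modifying calls without it
        },
    )


def blueprint_request(blueprint_name, blueprint):
    """Step 1: register the blueprint under a name (path suffix is an assumption)."""
    return ambari_post(f"/api/v1/blueprints/{blueprint_name}", blueprint)


def cluster_request(cluster_name, blueprint_name, hosts_by_group):
    """Step 2: create a cluster by mapping real hosts onto the host groups."""
    payload = {
        "blueprint": blueprint_name,
        "host_groups": [
            {"name": group, "hosts": [{"fqdn": fqdn} for fqdn in fqdns]}
            for group, fqdns in hosts_by_group.items()
        ],
    }
    return ambari_post(f"/api/v1/clusters/{cluster_name}", payload)
```

Passing either request to urllib.request.urlopen would send it; the cluster call only makes sense after the blueprint call has succeeded.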

HADOOP PROVISIONING ISSUES

Each cloud provider has a proprietary API

Create images for each provider

Network configuration

Service discovery

Resize, failover, member join support

OUR APPROACH – DETAILS

Build your Docker image

Install or pre-install Hadoop services with Ambari

Install Serf and dnsmasq

Build your cloud image

Use Ansible to create an image

Provision the cluster

BUILD DOCKER IMAGES

Create the Dockerfile

Have Docker.io build the image

Optionally pre-install services

Use Ambari

Push image to Docker.io

Licensing questions

BUILD CLOUD IMAGES

Use a Docker ready base image

Use Ansible to provision the image template

Pull the Docker images

Apply custom infrastructure

Use cloud provider specific playbooks

AWS EC2

Azure

ANSIBLE

Configuration as data

Simplest way to automate IT

Secure and agentless

Goal oriented

One playbook – multiple modules

We use it to “burn” cloud images/templates

PROVISIONING – ISSUES

FQDN

/etc/hosts is read-only in Docker

Everybody needs to know everybody

DNS

Single point of failure

Dynamic cluster – nodes joining, leaving, failing

Routing

Cloud – containers must be routable across hosts

Collision free private IP range for Docker bridge

We need predefined host names/IP addresses

/etc/hosts is read-only in Docker

Start a DNS server

Use it as a reference: docker run --dns <IP_OF_DNS>

Nodes need to know each other
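The host name and DNS reference mentioned here map onto two docker run flags. The sketch below only assembles the command line; the image name, FQDN and DNS address are hypothetical placeholders, and the flags are written in their current spelling (-h for the host name, --dns for the resolver).

```python
import shlex


def docker_run_args(image, fqdn, dns_ip, name=None):
    """Return the argv list for starting a detached container with a fixed
    host name and an explicit DNS server."""
    args = ["docker", "run", "-d", "-h", fqdn, "--dns", dns_ip]
    if name:
        args += ["--name", name]
    args.append(image)
    return args


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = docker_run_args("example/ambari", "node1.example.internal", "172.17.0.2")
    print(shlex.join(cmd))
```

Returning a list rather than a string keeps the arguments safe to hand to subprocess.run without shell quoting issues.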

PROVISIONING – SOLUTION

FQDN

Use -h and --dns Docker params

DNS

dnsmasq is running on each Docker container

Serf member-xxx events trigger dnsmasq reconfiguration

Routing

Docker bridge configuration – follows a convention
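The slides do not spell out the bridge convention, so the following is one hypothetical convention, not necessarily the one used in the talk: carve a private pool into equal subnets and give each host the subnet matching its index, so that container addresses never collide across hosts. The pool, prefix length and indexing scheme are all assumptions.

```python
import ipaddress


def bridge_subnet(host_index, pool="172.17.0.0/16", prefix=24):
    """Return (subnet, bridge_address) for the host with the given index.

    The bridge itself takes the first usable address of the host's subnet.
    """
    subnets = list(ipaddress.ip_network(pool).subnets(new_prefix=prefix))
    if not 0 <= host_index < len(subnets):
        raise ValueError(f"host index must be between 0 and {len(subnets) - 1}")
    subnet = subnets[host_index]
    bridge_ip = next(subnet.hosts())
    return subnet, f"{bridge_ip}/{subnet.prefixlen}"
```

The second value is in the address/prefix form that the Docker daemon's bridge IP option (--bip) expects; inter-host routes for each subnet would still have to be set up separately.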

SERF

Gossip based membership

Service discovery

Decentralized

Lightweight, fault tolerant

Highly available

DevOps friendly

Keep an eye on Consul, Open vSwitch, pipework

SERF – DECENTRALIZED SERVICE DISCOVERY

Gossip instead of heartbeat

LAN, WAN profiles

Provides membership information

Event handlers: member-join, member-leave, member-failed, member-update, member-reap, user

Query

SERF – GOSSIPING

SERF – MEMBERSHIP, EVENT HANDLERS

DNSMASQ

Network infrastructure for small networks

Lightweight DNS, DHCP server

Comes with most Linux distributions
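How Serf membership events could drive dnsmasq can be sketched as below. This is an illustration under stated assumptions rather than the presenters' handler: it relies on Serf passing member events as tab-separated lines (name, address, role, tags), the domain name is hypothetical, and treating failed members like departed ones is a design choice made only for this sketch.

```python
ADD_EVENTS = {"member-join", "member-update"}
REMOVE_EVENTS = {"member-leave", "member-failed", "member-reap"}


def apply_event(hosts, event, payload):
    """Update the name -> address mapping from one Serf member event.

    `payload` is the handler's standard input: one member per line,
    fields separated by tabs (name, address, role, tags).
    """
    for line in payload.splitlines():
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # ignore blank or malformed lines
        name, address = fields[0], fields[1]
        if event in ADD_EVENTS:
            hosts[name] = address
        elif event in REMOVE_EVENTS:
            hosts.pop(name, None)
    return hosts


def render_hosts(hosts, domain="example.internal"):
    """Render the mapping in hosts-file format, one member per line."""
    return "".join(
        f"{address} {name}.{domain} {name}\n"
        for name, address in sorted(hosts.items())
    )
```

In a real handler the event name would come from the SERF_EVENT environment variable, the rendered text would be written to a file that dnsmasq reads as additional hosts, and dnsmasq would be told to reload it, for example with SIGHUP.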

AWS EC2 – HADOOP CLUSTER

Use EC2 REST API to provision instances (from Dockerized image)

Start Docker containers

One Ambari server

N-1 Ambari agents connecting to server

Connect ambari-shell to

Define blueprint

Provision the cluster

AWS EC2 – NETWORK SECURITY

Create a VPC

Configure subnets

Routing tables

Security gateway

Set ACL

Configure VPN

AWS EC2 - CLOUDFORMATION

Setting up a VPC manually is too complicated

Use CloudFormation

Manage the stack together

Template-based

Environments under version control

Customizable at runtime

No extra charge

"VpcId" : {
  "Type" : "String",
  "Description" : "VpcId of your existing Virtual Private Cloud (VPC)"
},
"SubnetId" : {
  "Type" : "String",
  "Description" : "SubnetId of an existing subnet (for the primary network) in your Virtual Private Cloud (VPC)"
},
"SecondaryIPAddressCount" : {
  "Type" : "Number",
  "Default" : "1",
  "MinValue" : "1",
  "MaxValue" : "5",
  "Description" : "Number of secondary IP addresses to assign to the network interface (1-5)",
  "ConstraintDescription": "must be a number from 1 to 5."
},
"SSHLocation" : {
  "Description" : "The IP address range that can be used to SSH to the EC2 instances",
  "Type": "String",
  "MinLength": "9",
  "MaxLength": "18",
  "Default": "0.0.0.0/0",
  "AllowedPattern": "(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})\\.(\\d{1,3})/(\\d{1,2})",
  "ConstraintDescription": "must be a valid IP CIDR range of the form x.x.x.x/x."
}
},

"Mappings" : {
  "RegionMap" : {
    "us-east-1" : { "AMI" : "ami-7f418316" },
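The parameter constraints in this template fragment can be checked locally before a stack is launched. The sketch below mirrors only the constraints visible in the fragment (the SSHLocation length and pattern, the 1 to 5 range for SecondaryIPAddressCount, and the two parameters without defaults); the function name and the example IDs are hypothetical, and CloudFormation performs its own validation regardless.

```python
import re

# Same pattern as the template's AllowedPattern, with JSON escaping removed.
SSH_LOCATION_PATTERN = re.compile(
    r"(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})/(\d{1,2})"
)


def validate_parameters(params):
    """Return a list of constraint violations (empty if the values pass)."""
    errors = []

    # VpcId and SubnetId have no default in the template, so they are required.
    for key in ("VpcId", "SubnetId"):
        if not params.get(key):
            errors.append(f"{key}: a value is required")

    ssh = params.get("SSHLocation", "0.0.0.0/0")
    if not 9 <= len(ssh) <= 18 or not SSH_LOCATION_PATTERN.fullmatch(ssh):
        errors.append(
            "SSHLocation: must be a valid IP CIDR range of the form x.x.x.x/x."
        )

    try:
        count = int(params.get("SecondaryIPAddressCount", 1))
    except (TypeError, ValueError):
        count = None
    if count is None or not 1 <= count <= 5:
        errors.append("SecondaryIPAddressCount: must be a number from 1 to 5.")

    return errors
```

Note that the pattern, like the original, checks only the shape of the CIDR string and not whether each octet is at most 255.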

CLOUDBREAK

Cloudbreak is a powerful left surf break over a coral reef, a mile southwest of the island of Tavarua, Fiji.

Cloudbreak is a cloud-agnostic Hadoop as a Service API. It abstracts provisioning and eases the management and monitoring of on-demand clusters.

Provisioning Hadoop has never been easier

CLOUDBREAK

Benefits

Elastic

Scalable

Blueprints

Flexible

Main REST resources

/template – specifies a cluster infrastructure

/stack – creates a cloud infrastructure built from a template

/blueprint – describes a Hadoop cluster

/cluster – creates a Hadoop cluster

RESULTS AND ACHIEVEMENTS

Hadoop as a Service API

Available for EC2 and Azure cloud

OpenStack and bare metal support is coming soon

Open source under Apache 2 licence

Same goals as the Apache Ambari Launchpad project

What's next?

HADOOP SERVICES - AS A SERVICE

Leverage YARN

Slider (Hoya) providers

HBase, Accumulo

SequenceIQ providers - Flume, Tomcat

YARN-1964

QoS for YARN – heuristic scheduler

Platform as a Service API

BANZAI PIPELINE

Banzai Pipeline is a surf reef break located in Hawaii, off Ehukai Beach Park in Pupukea on O'ahu's North Shore.

Banzai Pipeline is a RESTful application development platform for building on-demand data and job pipelines running on Hadoop YARN.

Banzai Pipeline is a big data API for the REST

THANK YOU

Get the code: https://github.com/sequenceiq

Read about: http://blog.sequenceiq.com

Facebook: http://facebook.com/sequenceiq

Twitter: http://twitter.com/sequenceiq

LinkedIn: http://linkedin.com/sequenceiq

Contact: janos.matyas@sequenceiq.com

FEEL FREE TO CONTRIBUTE
