Big Data in Containers: Hadoop and Spark on Docker and Mesos
TRANSCRIPT
1
Big Data in Containers
Heiko Loewe, @loeweh
Meetup Big Data Hadoop & Spark NRW, 08/24/2016
2
Why
• Fast Deployment
• Test/Dev Cluster
• Better Utilize Hardware
• Learn to manage Hadoop
• Test new Versions
• An appliance for continuous integration/API testing
3
Design

Master Container
- Name Node
- Secondary Name Node
- Yarn

Slave Containers (×4)
- Node Manager
- Data Node
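The master/slave layout above can be sketched as a small helper that generates the container roles for an N-slave cluster. The container names follow the deck's naming; the daemon names are illustrative assumptions (the slide just says "Yarn" for the master, which on the master side is the ResourceManager).

```python
# Sketch: the container layout from this slide as a role map.
# Daemon names for "Yarn" are an assumption (ResourceManager on the master).

def cluster_layout(num_slaves):
    """Return a mapping: container name -> Hadoop daemons it runs."""
    layout = {
        "hadoop-master": ["NameNode", "SecondaryNameNode", "ResourceManager"],
    }
    for i in range(1, num_slaves + 1):
        layout[f"hadoop-slave{i}"] = ["NodeManager", "DataNode"]
    return layout

layout = cluster_layout(4)
print(len(layout))                  # 5 containers: 1 master + 4 slaves
print(layout["hadoop-slave1"])      # ['NodeManager', 'DataNode']
```

The same function is reused later when the cluster is stretched across two hosts: only `num_slaves` changes, the role assignment stays fixed.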
4
More than 1 Host Needs an Overlay Network

The docker0 interface is not routed between hosts. A single-host config is (almost) no problem, but for 2 hosts and more we need an overlay network.
5
Choice of the Overlay Network Implementation

Docker Multi-Host Network
• Backend: VXLAN.
• Fallback: none.
• Control plane: built-in, uses Zookeeper, Consul or Etcd for shared state.

Weave Net
• Backend: VXLAN via OVS.
• Fallback: custom UDP-based tunneling called "sleeve".
• Control plane: built-in.

CoreOS Flanneld
• Backend: VXLAN, AWS, GCE.
• Fallback: custom UDP-based tunneling.
• Control plane: built-in, uses Etcd for shared state.
6
WEAVE NET

Normal mode of operation is called FDP (fast data path), which works via OVS's datapath kernel module (mainline since 3.12). It's just another VXLAN implementation.
Has a sleeve fallback mode that works in userspace via pcap. Sleeve supports full encryption.
Weaveworks also has Weave DNS, Weave Scope and Weave Flux, providing introspection, service discovery and routing capabilities on top of Weave Net.
7
Docker Adaptation (Fedora/CentOS/RHEL)

/etc/sudoers # at the end:
vuser ALL=(ALL) NOPASSWD: ALL
# secure_path: append /usr/local/bin for weave
Defaults secure_path = /sbin:/bin:/usr/sbin:/usr/bin:/usr/local/bin

sudo groupadd docker
sudo gpasswd -a ${USER} docker
sudo chgrp docker /var/run/docker.sock
alias docker="sudo /usr/bin/docker"
8
Weave Problems on Fedora/CentOS/RHEL

WARNING: existing iptables rule
'-A FORWARD -j REJECT --reject-with icmp-host-prohibited'
will block name resolution via weaveDNS - please reconfigure your firewall.

sudo systemctl stop firewalld
sudo systemctl disable firewalld
/sbin/iptables -D FORWARD -j REJECT --reject-with icmp-host-prohibited
/sbin/iptables -D INPUT -j REJECT --reject-with icmp-host-prohibited
iptables-save
reboot
9
Weave Run

WEAVE Interfaces (before):
[vuser@linux ~]$ ifconfig | grep -v "^ "
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536

WEAVE Container:
[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[vuser@linux ~]$ sudo weave launch
[vuser@linux ~]$ eval $(sudo weave env)
[vuser@linux ~]$ sudo weave --local expose
10.32.0.6
[vuser@linux ~]$ docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
0fd6ab928d96 weaveworks/plugin:1.6.1 "/home/weave/plugin" 11 seconds ago Up 8 seconds weaveplugin
4b24e5802fcc weaveworks/weaveexec:1.6.1 "/home/weave/weavepro" 13 seconds ago Up 10 seconds weaveproxy
c4882326398a weaveworks/weave:1.6.1 "/home/weave/weaver -" 18 seconds ago Up 15 seconds weave

WEAVE Interfaces (after):
[vuser@linux ~]$ ifconfig | grep -v "^ "
datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
docker0: flags=4099<UP,BROADCAST,MULTICAST> mtu 1500
enp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1500
lo: flags=73<UP,LOOPBACK,RUNNING> mtu 65536
vethwe-bridge: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
vethwe-datapath: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
vxlan-6784: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 65485
weave: flags=4163<UP,BROADCAST,RUNNING,MULTICAST> mtu 1410
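The interface listing shows why the weave/datapath interfaces report MTU 1410 while the physical NIC reports 1500: VXLAN encapsulation adds outer headers, so the inner MTU must shrink. A quick back-of-the-envelope check (assuming IPv4 without options; the exact 40-byte extra margin below 1450 is Weave's own safety choice):

```python
# VXLAN encapsulation overhead on a 1500-byte link.
# Header sizes assume IPv4 without options.

OUTER_IP = 20      # outer IPv4 header
OUTER_UDP = 8      # outer UDP header
VXLAN = 8          # VXLAN header
INNER_ETH = 14     # encapsulated Ethernet frame header

overhead = OUTER_IP + OUTER_UDP + VXLAN + INNER_ETH
inner_mtu_max = 1500 - overhead

print(overhead)        # 50
print(inner_mtu_max)   # 1450; Weave picks 1410, leaving extra headroom
```

Anything larger than the inner MTU gets fragmented or dropped, which is a classic source of "Hadoop works on one host but hangs across hosts" problems on overlay networks.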
10
Hadoop Container Dockerfile
https://github.com/kiwenlau/hadoop-cluster-docker/blob/master/Dockerfile

FROM ubuntu:14.04
# install openssh-server, openjdk and wget
# install hadoop 2.7.2
# set environment variables
# ssh without key
# set up Hadoop directories
# copy config files from local
# make Hadoop start files executable
# format namenode
# standard run command
CMD [ "sh", "-c", "service ssh start; bash"]

$ docker build -t loewe/hadoop:latest .
11
Start Hadoop Containers

Host 1
• Master
$ sudo weave run -itd -p 8088:8088 -p 50070:50070 --name hadoop-master
• Slaves 1,2
$ sudo weave run -itd --name hadoop-slave1
$ sudo weave run -itd --name hadoop-slave2

Host 2
• Slaves 3,4
$ sudo weave run -itd --name hadoop-slave3
$ sudo weave run -itd --name hadoop-slave4

root@boot2docker:~# weave status dns
hadoop-master 10.32.0.1 6a4db5f52340 92:64:f5:c5:57:a7
hadoop-slave1 10.32.0.2 34e0a7de1105 92:64:f5:c5:57:a7
hadoop-slave2 10.32.0.3 d879f077cf4e 92:64:f5:c5:57:a7
hadoop-slave3 10.44.0.0 6ca7ddb9daf8 92:56:f4:98:36:b0
hadoop-slave4 10.44.0.1 c1ed48630b1c 92:56:f4:98:36:b0
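The `weave status dns` output above is easy to consume programmatically, e.g. to generate a Hadoop slaves file. A minimal sketch (the parser and its use are illustrative, not part of the original deck; the sample lines are copied from the slide):

```python
# Sketch: parse `weave status dns` output into a name -> IP mapping.
# Each line has the form: <name> <ip> <container-id> <mac>.

sample = """\
hadoop-master 10.32.0.1 6a4db5f52340 92:64:f5:c5:57:a7
hadoop-slave1 10.32.0.2 34e0a7de1105 92:64:f5:c5:57:a7
hadoop-slave2 10.32.0.3 d879f077cf4e 92:64:f5:c5:57:a7
hadoop-slave3 10.44.0.0 6ca7ddb9daf8 92:56:f4:98:36:b0
hadoop-slave4 10.44.0.1 c1ed48630b1c 92:56:f4:98:36:b0
"""

def parse_weave_dns(text):
    """Return {container name: weave IP} from weave status dns output."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 4:
            name, ip, _container_id, _mac = parts
            entries[name] = ip
    return entries

hosts = parse_weave_dns(sample)
print(hosts["hadoop-master"])                        # 10.32.0.1
print(sorted(n for n in hosts if "slave" in n))      # the 4 slave names
```

Note the 10.32.0.x vs 10.44.0.x addresses: weave hands each host its own IP range, so the split also tells you which containers run on which physical host.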
12
Hadoop Cluster / 2 Host / 5 Nodes
13
Persistent Volumes for HDFS
14
The Problem

• Containers (like Docker) are the foundation for agile software development
• The initial container design was stateless (12-factor app)
• Use cases have grown in the last few months (NoSQL, stateful apps)
• Persistence for containers is not easy
15
DOCKER Volume Manager API

• Enables persistence of Docker volumes
• Enables the implementation of
  - Fast bytes (performance)
  - Data services (protection / snapshots)
  - Data mobility
  - Availability
• Operations:
  - Create, Remove, Mount, Path, Unmount
  - Additional options can be passed to the volume driver
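The driver operations listed above (Create, Remove, Mount, Path, Unmount) can be sketched as a minimal in-memory driver. Real volume drivers implement these as HTTP endpoints of the Docker plugin protocol; this toy class only models the call sequence and is entirely illustrative (names and the mount root are assumptions):

```python
# Toy sketch of the Docker volume-driver operations, assuming an
# in-memory backend and a fixed mount root. Not a real plugin.
import os.path

class ToyVolumeDriver:
    def __init__(self, root="/mnt/volumes"):
        self.root = root
        self.volumes = {}     # name -> options passed at create time
        self.mounted = set()

    def create(self, name, opts=None):
        # additional options are passed through to the backend here
        self.volumes[name] = opts or {}

    def remove(self, name):
        self.volumes.pop(name, None)

    def mount(self, name):
        self.mounted.add(name)
        return self.path(name)

    def path(self, name):
        return os.path.join(self.root, name)

    def unmount(self, name):
        self.mounted.discard(name)

drv = ToyVolumeDriver()
drv.create("hdfs-data", {"size": "100GB"})
print(drv.mount("hdfs-data"))   # /mnt/volumes/hdfs-data
```

The point of the API is exactly this indirection: the container only ever sees the mount path, while the driver decides what storage backs it.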
16
Persistent Volumes for Containers

[Diagram: a Docker host (Container OS) running several containers; storage is mounted at /mnt/PersistentData on the host and mapped into each container via -v /mnt/PersistentData:/mnt/ContainerData. Automation??]
17
18
Persistent Volumes for Containers

Storage platforms: AWS EC2 (EBS), OpenStack (Cinder), EMC Isilon, EMC ScaleIO, EMC VMAX, EMC XtremIO, Google Compute Engine (GCE), VirtualBox
Operating systems: Ubuntu, Debian, RedHat, CentOS, CoreOS, OSX, TinyLinux (boot2docker)
Integration: Docker Volume API, Mesos Isolator, ...
19
Hadoop + Persistent Volumes

Host A
Making the Hadoop containers ephemeral
20
Stretch Hadoop w/ Persistent Volumes

Overlay Network
Host A / Host B
Easily stretch and shrink a cluster without losing the data
21
Other Similar Projects

• Bigtop Provisioner / Apache Foundation
  https://github.com/apache/bigtop/tree/master/provisioner/docker
• Building Hortonworks HDP on Docker
  http://henning.kropponline.de/2015/07/19/building-hdp-on-docker/
  https://hub.docker.com/r/hortonworks/ambari-server/
  https://hub.docker.com/r/hortonworks/ambari-agent/
• Building Cloudera CDH on Docker
  http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera/
  https://hub.docker.com/r/cloudera/quickstart/

Watch out for overlay network topics.
22
Apache Myriad
23
Myriad Overview
• Mesos framework for Apache YARN
• Mesos manages the datacenter, YARN manages Hadoop
• Coarse- and fine-grained resource sharing
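The difference between coarse- and fine-grained resource sharing can be sketched with two toy functions. These are illustrative assumptions, not Myriad APIs: coarse-grained pins a fixed resource block to YARN up front, fine-grained only holds resources while tasks actually run.

```python
# Sketch: coarse- vs fine-grained sharing of a Mesos cluster's CPUs.
# Both helpers and all numbers are illustrative, not Myriad code.

def coarse_grained(total_cpus, reserved_cpus):
    """Reserve a fixed block for YARN up front; the rest stays with Mesos."""
    return {"yarn": reserved_cpus, "mesos_free": total_cpus - reserved_cpus}

def fine_grained(total_cpus, running_tasks, cpus_per_task):
    """Claim CPUs per running task; they return to Mesos when tasks finish."""
    used = running_tasks * cpus_per_task
    return {"yarn": used, "mesos_free": total_cpus - used}

print(coarse_grained(32, 16))   # half the cluster pinned to YARN, even if idle
print(fine_grained(32, 3, 2))   # only 6 CPUs held while 3 tasks run
```

Coarse-grained gives YARN predictable capacity at the cost of idle reservation; fine-grained improves overall utilization at the cost of scheduling latency per task.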
24
Situation without Integration
25
Yarn/Mesos Integration
26
How it works (simplified)
Myriad = Control Plane
27
Myriad Container
28
29
30
31
What about the Data?
Myriad only cares for the compute.

Master Container
- Name Node
- Secondary Name Node
- Yarn

Slave Containers (×4)
- Node Manager
- Data Node

Myriad/Mesos cares about the compute side (Yarn / Node Manager); the HDFS side (Name Node / Data Node) has to be provided outside of Myriad/Mesos.
32
What about the Data
• Myriad only cares for compute / MapReduce
• HDFS has to be provided in other ways
Big Data: New Realities

Traditional assumptions -> New realities -> New benefits and value
• Bare-metal -> Containers and VMs -> Big-Data-as-a-Service
• Data locality -> Compute and storage separation -> Agility and cost savings
• Data on local disks -> In-place access on remote data stores -> Faster time-to-insights
33
Options for HDFS Data Layer
• Pure HDFS cluster (only Data Node running)
  - Bare metal
  - Containerized
  - Mesos-based
• Enterprise HDFS array
  - EMC Isilon
34
Myriad, Mesos, EMC Isilon for HDFS
35
EMC Isilon Advantages over Classic Hadoop HDFS

• Multi-tenancy
• Multiple HDFS environments sharing the same storage
• Quotas possible on HDFS environments
• Snapshots of HDFS environments possible
• Remote replication
• WORM option for HDFS
• Highly available HDFS infrastructure (distributed Name and Data Nodes)
• Storage efficient (usable/raw 0.8 compared to 0.33 with Hadoop)
• Shared access via HDFS / CIFS / NFS / SFTP possible
• Maintenance equals enterprise array standard
• All major distributions supported
36
Spark on Mesos
37
Most Common Spark Deployment Environments (Cluster Managers)
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%
Source: Spark Survey Report, 2015 (Databricks)
Common Deployment Patterns
38
Spark Cluster – Standalone Mode

[Diagram: Spark Master and Spark Client; three Spark Slaves, each running several tasks, on bare metal or virtual machines. Data is provided outside.]
39
Spark Cluster – Hadoop YARN

[Diagram: Spark Client talks to the Spark Master / YARN Resource Manager; three Node Managers each host a Spark Executor running several tasks. Data is provided by the Hadoop cluster.]
40
Spark Cluster – Mesos

[Diagram: Spark Client with Spark Scheduler talks to the Mesos Master; three Mesos Slaves each host a Spark Executor running several tasks. Data is provided outside.]
41
Spark + Mesos + EMC Isilon
to solve the HDFS data layer
42
Thank You
Follow me on Twitter: @loeweh