
Hadoop on OpenStack: Scaling Hadoop-SwiftFS for Big Data

October 29th, 2015
Andrew Leamon – Director, Engineering Analysis
Christopher Power – Principal Engineer, Engineering Analysis

About Comcast


Comcast brings together the best in media and technology. We drive innovation to create the world's best entertainment and online experiences.

• High Speed Internet

• Video

• IP Telephony

• Home Security / Automation

• Universal Parks

• Media Properties

About Our Team: Engineering Analysis


[Diagram: OpenStack underpins the Big Data Platform, which supports simulation, ad-hoc analysis, exploratory data analysis, feature engineering / machine learning, and reporting / visualization across the High Speed Data, Video, IP Telephony, and Home Security / Automation businesses]

Hadoop Overview


Hadoop / Cloud Evolution – Why does this make sense?


Courtesy of the Ethernet Alliance: https://www.nanog.org/meetings/nanog56/presentations/Tuesday/tues.general.kipp.23.pdf

Network bandwidth is growing faster than disk I/O:

• Doubling every 18 months vs. 24 months

• Network is faster than disk

• Location of disk is not as important

• IOPS are the key metric

Hadoop / Cloud Evolution – Memory Growth


Courtesy of Centiq: https://centiq.co.uk/sap-sizing

Available main memory per server has increased greatly.

• 2004 – MapReduce paper published

• 2005 – Hadoop born

• 2012 – Apache Spark released: leverage main memory and avoid disk I/O

• 2014 – Apache Tez released to avoid disk I/O

Hadoop / Cloud Evolution – Performance Increasing


Courtesy of Cisco: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/whitepaper-C11-734798.html

• Performance of everything has been increasing, with the exception of HDDs

Hadoop / Cloud Evolution – Disk is the long pole!


Factors that make Hadoop on the cloud possible:

• Disk is the long pole; network is additive but proportional

• Many workloads are CPU-bound anyway

• Compression and columnar formats reduce I/O and leverage CPU

• Servers have more memory

• Keep data in memory whenever possible

• Avoid I/O at all costs: only read once, only write once

• Locality is less important

• MPP frameworks like Spark & Tez make this possible

Hadoop Scaling – Coupled Storage & Compute

• On bare metal, storage and compute are coupled together.

• Scaling one means you have to scale the other proportionally.


Hadoop Scaling – Decoupled Storage & Compute


• On OpenStack, compute and storage can be decoupled by using Swift as the object store. This allows you to:

• Scale compute and storage independently

• Run multiple clusters simultaneously

• Provide greater access to data

Big Data Platform


[Diagram: Big Data Platform storage provided by Swift and Cinder]

OpenStack @ Comcast

• Vanilla distribution of OpenStack

• Multiple data centers

• Multi-tenant, multi-region

• Nova, Neutron (with IPv6 support), Glance

• Cinder block and Swift object storage provided by CEPH

• Ceilometer metrics

• Heat orchestration
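Heat drives the cluster provisioning described later in the deck; a minimal HOT template sketch, where the image and flavor names are illustrative, not Comcast's actual values:

    heat_template_version: 2015-04-30

    resources:
      hadoop_node:
        type: OS::Nova::Server
        properties:
          image: centos-7
          flavor: m1.xlarge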


Anatomy of Hadoop on the Cloud

Design for the cloud

• Assume things will fail

• Distribute load for performance and fault tolerance

• Use persistent storage where appropriate

Think elastically, scale horizontally

• Scale intelligently to meet demand

• Return resources when not in use

Leverage automation

• Automate, automate, automate

• Increase efficiency and repeatability


[Diagram: three pillars – Automation, Horizontal Scaling, Performance and Fault Tolerance]

Affinity and Anti-affinity

• OpenStack allows the user to explicitly specify whether a group of VMs should or should not share the same physical hosts

• Create a ServerGroup with the anti-affinity policy and provide a scheduler hint during nova boot (see the sketch below)

• Improves performance in a multi-tenant environment by spreading CPU and network load across physical hosts

• Provides a mechanism to increase fault tolerance by scheduling critical services on mutually exclusive physical hosts


Courtesy of Cloudwatt dev: https://dev.cloudwatt.com/en/blog/affinity-and-anti-affinity-in-openstack.html
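As a rough sketch of this flow with the nova CLI (the group name, image, and flavor below are illustrative, not from this deck):

    # Create a server group whose members must land on different physical hosts
    nova server-group-create hadoop-masters anti-affinity

    # Boot each critical VM with a scheduler hint pointing at the group's UUID
    nova boot --image centos-7 --flavor m1.xlarge \
        --hint group=<server-group-uuid> hadoop-master-1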

Cluster Node Storage Architecture

Cinder Block Storage (CEPH RBD)

• Persistent storage for all cluster nodes

• DataNode HDFS

• Can act as NodeManager local disk

Ephemeral Block Device(s)

• High-performance direct-attached storage

• Root volume, local disk

• Best for NodeManager local disk

Swift Object Storage (CEPH RadosGW)

• Data lake, unified central storage

• Source and destination for job data


[Diagram: cluster node VM with a root volume, ephemeral local disk, and a Cinder volume for HDFS attached via CEPH/libvirt; OpenStack Swift serves as the data lake]

Local Storage – Cinder vs. Ephemeral

How important is ephemeral storage for big data workloads on the cloud?

• Traditional Hadoop jobs are read/write intensive during their intermediate stages

• Performant local disk is useful for transient data like shuffle/sort, spilling to disk, and logs

• Local disk location is configured using yarn.nodemanager.local-dirs (see the sketch below)
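A minimal yarn-site.xml sketch, assuming the ephemeral disks are mounted at the illustrative paths below:

    <property>
      <name>yarn.nodemanager.local-dirs</name>
      <value>/mnt/ephemeral0/yarn/local,/mnt/ephemeral1/yarn/local</value>
    </property>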

Benchmarks

• TeraSort at 1TB

• DFSIO at 10x1GB, 100x10GB

Configurations

• a) Cinder volume – network attached

• b) Local ephemeral disk – direct attached
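The TeraSort run can be reproduced with the stock Hadoop examples jar; the row count corresponds to 1TB of 100-byte rows, and the jar name and paths are illustrative (DFSIO flags vary by Hadoop version):

    # Generate 1TB of input (10 billion 100-byte rows), then sort it
    hadoop jar hadoop-mapreduce-examples.jar teragen 10000000000 /bench/tera-in
    hadoop jar hadoop-mapreduce-examples.jar terasort /bench/tera-in /bench/tera-out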



Local Storage – Cinder vs. Ephemeral Results

• TeraSort – ephemeral showed a 29% wall-clock improvement over Cinder

• DFSIO – negligible performance difference


[Chart: relative job runtime, ephemeral vs. Cinder, for TeraSort (1TB), DFSIO (10x1GB), and DFSIO (100x1GB)]

Hadoop + SwiftFS 101

How does Hadoop interact with Swift?

• Hadoop-SwiftFS implements the Hadoop FileSystem interface on top of the OpenStack Swift REST API

Hadoop-SwiftFS

• Part of the Sahara project (Sahara-Extra)

• https://github.com/openstack/sahara-extra

Hadoop-OpenStack

• Part of Apache Hadoop, a fork of Sahara-Extra

• https://github.com/apache/hadoop/tree/trunk/hadoop-tools/hadoop-openstack


[Diagram: multiple VMs, each running Hadoop-SwiftFS, connect over the network to OpenStack Swift]
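For context, wiring a Hadoop client to Swift through this layer is a core-site.xml exercise; a minimal sketch, where the service name "sahara" and the credentials are placeholders:

    <property>
      <name>fs.swift.impl</name>
      <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
    </property>
    <property>
      <name>fs.swift.service.sahara.auth.url</name>
      <value>http://keystone.example.com:5000/v2.0/tokens</value>
    </property>
    <property>
      <name>fs.swift.service.sahara.username</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>fs.swift.service.sahara.password</name>
      <value>secret</value>
    </property>
    <property>
      <name>fs.swift.service.sahara.tenant</name>
      <value>analytics</value>
    </property>

Jobs then address data as swift://<container>.sahara/<path>.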

Challenges with Hadoop at scale on Swift

When we attempted to run jobs at scale, we noticed a few things:

• Large number of input splits

• Hadoop clients took a long time to launch jobs

• Swift only returned the first 10,000 objects

• Job output is written and then renamed

• The CEPH cluster needed some tuning


Large number of input splits

Challenge

• Noticed a multiple of the typical number of input splits

Issues

• Hadoop uses the blocksize to compute input splits

• The default Swift "blocksize" is set to 32MB

Solution

• Set a blocksize appropriate to your environment

• Example: fs.swift.blocksize=131072 (the value is in KB, i.e. 128MB blocks)
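In core-site.xml this looks like the following (the property value is interpreted in KB):

    <property>
      <name>fs.swift.blocksize</name>
      <value>131072</value>  <!-- 131072 KB = 128MB input splits -->
    </property>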


Slow launching jobs

Challenge

• Hadoop clients took a long time to launch jobs

Issues

• Hadoop does not know it is talking to an object store

• It asks for metadata and block locations of every object

• This results in O(n) performance in the number of objects

Possible Approaches

• Multi-threading – only works at the directory/partition level

• Override getSplits – tool-specific implementations


[Diagram: the Hadoop client's FileInputFormat.getSplits call goes through Hadoop-SwiftFS to list the objects in a container]

Slow launching jobs – solution

Solution

• Extend support for the location-awareness flag to the get-block-locations methods

• Reduce unnecessary calls to get object metadata

• Localize changes to the Hadoop-SwiftFS layer

Benefits

• Jobs launch faster

• Reduces load on the object store

• Works across the tool ecosystem

• Improves the interactive query experience


[Chart: job launch time in seconds (0–1400) vs. number of objects in the container (100 to 25,000), comparing hadoop-swift-latest.jar with and without the optimizations]

Swift only returns the first 10,000 objects

Challenge

• Swift only returns the first 10,000 objects in a container or partition

• http://developer.openstack.org/api-ref-objectstorage-v1.html#showContainerDetails

Solution

• Page through the list of objects using the marker and limit query string parameters (sketched below)

• Continue until the number of items returned is less than the requested limit value

• The default is set to 10,000

• Configurable by setting fs.swift.container.list.limit
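Against the raw Swift API the paging loop looks roughly like this; the endpoint, token, and container name are illustrative:

    # Page through a container 10,000 object names at a time
    MARKER=""
    while true; do
      PAGE=$(curl -s -H "X-Auth-Token: $TOKEN" \
        "$SWIFT_URL/v1/AUTH_account/mycontainer?limit=10000&marker=$MARKER")
      # ... hand this page of object names to the listing logic ...
      COUNT=$(printf '%s\n' "$PAGE" | grep -c .)
      [ "$COUNT" -lt 10000 ] && break            # short page: listing complete
      MARKER=$(printf '%s\n' "$PAGE" | tail -n 1) # resume after the last name
    done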



Job output write and rename

Hadoop's OutputCommitter writes task output to a temporary directory

When the job completes, the temporary directory is renamed to the final output directory

Object stores are not file systems

• A rename results in a copy and a delete, which is expensive

• This is a consequence of using the object's path as the hash to its storage location

Object-store-compatible OutputCommitter

• Basic approach skips the temporary write and outputs directly to the final destination

• Enhanced approach uses local ephemeral storage for temporary writes
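This is not the deck's committer, but stock Hadoop 2.7+ offers a related knob: the v2 commit algorithm (MAPREDUCE-4815), which reduces, though does not eliminate, rename work on commit:

    <property>
      <name>mapreduce.fileoutputcommitter.algorithm.version</name>
      <value>2</value>
    </property>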


CEPH Architecture and Tuning

Tuning for Hadoop Workloads

• Scale RadosGWs and CEPH OSD nodes

• Enable container index sharding

• Increase placement groups for the RadosGW index pool

• Increase the filestore merge threshold and filestore split multiple settings (see the sketch below)

• Turn off RadosGW logs
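A ceph.conf sketch of these settings; the section names and values are illustrative, not the deck's production numbers:

    [osd]
    # Raise the thresholds at which filestore splits/merges collection dirs
    filestore merge threshold = 40
    filestore split multiple = 8

    [client.radosgw.gateway]
    # Shard each container (bucket) index across multiple RADOS objects
    rgw override bucket index max shards = 16
    # Disable per-request ops logging on the gateway
    rgw enable ops log = false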


[Diagram: a load balancer in front of multiple RadosGW instances, backed by CEPH OSD nodes]

Lessons Learned

• Get to know your OpenStack architecture

• Understand the impacts of your cluster design

• Use ephemeral local disk for the NodeManager if possible

• Ensure consistent pseudo-directory representation

• Think about your container data organization

• Choose file formats that reduce I/O (ORC/Parquet)


Next Steps and Future Enhancements

Next Steps

• Upstream enhancements back to the community

Future Enhancements

• Keystone authentication token optimization

• Handle large numbers of partitions

• Streamline map task object retrieval
