Hadoop on OpenStack: Scaling Hadoop-SwiftFS for Big Data
October 29th, 2015
Andrew Leamon, Director, Engineering Analysis
Christopher Power, Principal Engineer, Engineering Analysis
About Comcast
Comcast brings together the best in media and technology. We drive innovation to create the world’s best entertainment and online experiences.
High Speed Internet
Video
IP Telephony
Home Security / Automation
Universal Parks
Media Properties
About Our Team: Engineering Analysis
[Diagram: the Engineering Analysis big data platform runs on OpenStack and supports simulation, ad-hoc analysis, exploratory data analysis, feature engineering / machine learning, and reporting / visualization, over data from High Speed Data, Video, IP Telephony, Home Security / Automation, and the business]
Hadoop / Cloud Evolution – Why does this make sense?
Courtesy of the Ethernet Alliance: https://www.nanog.org/meetings/nanog56/presentations/Tuesday/tues.general.kipp.23.pdf
Network bandwidth is growing faster than disk I/O:
• Doubling every 18 vs. 24 months
• Network is faster than disk
• Location of disk is not as important
• IOPS are the key metric
Hadoop / Cloud Evolution – Memory Growth
Courtesy of Centiq: https://centiq.co.uk/sap-sizing
• 2004 – MapReduce paper was published
• 2005 – Hadoop born
• 2012 – Apache Spark was released: leverage main memory and avoid disk I/O
• 2014 – Apache Tez released to avoid disk I/O
Available main memory per server has increased greatly.
Hadoop / Cloud Evolution – Performance Increasing
Courtesy of Cisco: http://www.cisco.com/c/en/us/products/collateral/servers-unified-computing/ucs-c-series-rack-servers/whitepaper-C11-734798.html
• Performance of everything has been increasing, with the exception of HDDs
Hadoop / Cloud Evolution – Disk is the long pole!
Factors that make Hadoop on the cloud possible:
• Disk is the long pole; network is additive but proportional
• Many workloads are CPU bound anyway
• Compression and columnar formats reduce I/O and leverage CPU
• Servers have more memory
• Keep data in memory whenever possible
• Avoid I/O at all costs: only read once, only write once
• Locality is less important
• MPP frameworks like Spark and Tez make this possible
Hadoop Scaling – Coupled Storage & Compute
• On bare metal, storage and compute are coupled together.
• Scaling one means that you have to scale the other proportionally.
Hadoop Scaling – Decoupled Storage & Compute
• On OpenStack, compute and storage can be decoupled using Swift as the object store. This allows you to:
• Scale compute and storage independently
• Run multiple clusters simultaneously
• Provide greater access to data
OpenStack @ Comcast
• Vanilla distribution of OpenStack
• Multiple data centers
• Multi-tenant, multi-region
• Nova, Neutron (with IPV6 support), Glance
• Cinder block, Swift object provided by CEPH
• Ceilometer metrics
• Heat orchestration
Anatomy of Hadoop on the Cloud
Design for the cloud
• Assume things will fail
• Distribute load for performance and fault tolerance
• Use persistent storage where appropriate
Think elastically, scale horizontally
• Scale intelligently to meet demand
• Return resources when not in use
Leverage automation
• Automate, automate, automate
• Increase efficiency and repeatability
[Diagram: Automation, Horizontal Scaling, Performance and Fault Tolerance]
Affinity and Anti-affinity
• OpenStack allows the user to explicitly specify whether a group of VMs should or should not share the same physical hosts
• Create ServerGroup with Anti-Affinity and provide scheduler hint during nova boot
• Improves performance in a multi-tenant environment by spreading CPU and Network load across physical hosts
• Provides a mechanism to increase fault tolerance by scheduling critical services on mutually-exclusive physical hosts
Courtesy of Cloudwatt dev: https://dev.cloudwatt.com/en/blog/affinity-and-anti-affinity-in-openstack.html
Cluster Node Storage Architecture
Cinder Block Storage (CEPH RBD)
• Persistent storage for all cluster nodes
• DataNode HDFS
• Can act as NodeManager local disk
Ephemeral Block Device(s)
• High performance direct attached storage
• Root volume, local disk
• Best for NodeManager local disk
Swift Object Storage (CEPH RadosGW)
• Data lake, unified central storage
• Source, destination for job data
[Diagram: a cluster node VM with a root volume and ephemeral local disk (CEPH / libvirt), a Cinder volume for HDFS, and OpenStack Swift as the data lake]
Local Storage – Cinder vs Ephemeral
How important is ephemeral storage for big data workloads on the cloud?
• Traditional Hadoop jobs are read/write intensive during their intermediate stages
• Performant local disk is useful for transient data like shuffle/sort, spilling to disk, and logs
• Local disk locations are configured using yarn.nodemanager.local-dirs (see the sketch below)
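As a rough illustration of that last point, the sketch below points the NodeManager's local directories at ephemeral mounts. The mount paths are assumptions, and in a real deployment the property would normally be set in yarn-site.xml rather than in code.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class LocalDirsExample {
    public static void main(String[] args) {
        // Minimal sketch: point NodeManager local dirs at direct-attached
        // ephemeral mounts instead of network-attached Cinder volumes.
        // The mount points below are assumptions for illustration only.
        Configuration conf = new YarnConfiguration();
        conf.setStrings(YarnConfiguration.NM_LOCAL_DIRS,
                "/mnt/ephemeral0/yarn/local",
                "/mnt/ephemeral1/yarn/local");
        System.out.println(conf.get(YarnConfiguration.NM_LOCAL_DIRS));
    }
}
```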
Benchmarks
• TeraSort at 1TB
• DFSIO at 10x1GB, 100x10GB
Configurations
• a) Cinder volume – network attached
• b) Local ephemeral disk – direct attached
Local Storage – Cinder vs Ephemeral Results
• TeraSort – ephemeral showed a 29% wall-clock improvement over Cinder
• DFSIO – negligible performance difference
[Chart: Local Storage Comparison – Cinder vs Ephemeral; relative job runtime for TeraSort (1TB), DFSIO (10x1GB), and DFSIO (100x1GB), ephemeral vs. Cinder]
Hadoop + SwiftFS 101
How does Hadoop interact with Swift?
• Hadoop-SwiftFS implements the Hadoop FileSystem interface on top of the OpenStack Swift REST API (a usage sketch follows below)
Hadoop-SwiftFS
• Part of the Sahara project (Sahara-Extra)
• https://github.com/openstack/sahara-extra
Hadoop-OpenStack
• Part of Apache Hadoop, fork of Sahara-Extra
• https://github.com/apache/hadoop/tree/trunk/hadoop-tools/hadoop-openstack
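To make the FileSystem-over-REST idea concrete, here is a minimal usage sketch. It assumes the hadoop-openstack (or Sahara-Extra) jar is on the classpath; the provider name, credentials, container, and object path are placeholders, and in practice the fs.swift.service.* settings live in core-site.xml.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SwiftFsRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Register the swift:// scheme and point it at Keystone.
        // The provider name "sahara" and all values are placeholders.
        conf.set("fs.swift.impl",
                "org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem");
        conf.set("fs.swift.service.sahara.auth.url",
                "https://keystone.example.com:5000/v2.0/tokens");
        conf.set("fs.swift.service.sahara.tenant", "analytics");
        conf.set("fs.swift.service.sahara.username", "hadoop");
        conf.set("fs.swift.service.sahara.password", "secret");

        // Paths look like swift://<container>.<provider>/<object>
        Path path = new Path("swift://logs.sahara/2015/10/29/events.json");
        FileSystem fs = FileSystem.get(URI.create(path.toString()), conf);
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path)))) {
            System.out.println(in.readLine());
        }
    }
}
```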
[Diagram: multiple VMs, each running Hadoop-SwiftFS, accessing OpenStack Swift over the network]
Challenges with Hadoop at scale on Swift
When we attempted to run jobs at scale, we noticed a few things:
• Large number of input splits
• Hadoop clients took a long time to launch jobs
• Swift only returned the first 10,000 objects
• Job output is written and then renamed
• CEPH cluster needed some tuning
Large number of input splits
Challenge
• Noticed many times the typical number of input splits
Issues
• Hadoop uses blocksize to compute input splits
• Default Swift “blocksize” set to 32MB
Solution
• Set blocksize appropriate to your environment
• Example: fs.swift.blocksize=131072 (for 128MB blocks)
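For illustration, the same setting expressed through the Hadoop Configuration API (normally it goes in core-site.xml). Note that fs.swift.blocksize is expressed in KB.

```java
import org.apache.hadoop.conf.Configuration;

public class SwiftBlocksizeExample {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // fs.swift.blocksize is in KB; the 32768 KB (32MB) default yields
        // roughly four times as many input splits as a 128MB block would.
        conf.setLong("fs.swift.blocksize", 131072); // 131072 KB = 128MB
        System.out.println(conf.getLong("fs.swift.blocksize", 32768));
    }
}
```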
Slow launching jobs
Challenge
• Hadoop clients took a long time to launch jobs
Issues
• Hadoop does not know it is talking to an object store
• Asks for metadata and block locations of every object
• Results in O(n) performance for number of objects
Possible Approaches
• Multi-threading – only works at directory/partition level
• Override getSplits – tool specific implementations
[Diagram: the Hadoop client's FileInputFormat.getSplits drives Hadoop-SwiftFS to list objects and fetch metadata from the Swift container]
Slow launching jobs solution
Solution
• Extend the location-awareness flag to cover the get-block-locations methods (sketched below)
• Reduce unnecessary calls to get object metadata
• Localize changes to Hadoop SwiftFS layer
Benefits
• Jobs launch faster
• Reduces load on object store
• Works across tool ecosystem
• Improves interactive query experience
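A rough sketch of the idea behind the change: when location awareness is disabled, hand back a single synthetic block location instead of asking Swift for per-object metadata. The class and property names below are illustrative, not the actual patch.

```java
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;

// Illustrative sketch only: a FileSystem subclass that short-circuits
// block-location lookups when location awareness is turned off.
public abstract class LocationAwareSwiftFileSystem extends FileSystem {

    // Hypothetical property name for the location-awareness flag.
    static final String LOCATION_AWARE_KEY = "fs.swift.locationaware";

    @Override
    public BlockLocation[] getFileBlockLocations(FileStatus file, long start, long len)
            throws java.io.IOException {
        if (!getConf().getBoolean(LOCATION_AWARE_KEY, false)) {
            // Object stores have no meaningful locality: return one synthetic
            // location covering the requested range and skip the metadata call.
            return new BlockLocation[] {
                new BlockLocation(new String[] {"localhost:50010"},
                                  new String[] {"localhost"}, start, len)
            };
        }
        // Otherwise fall back to asking Swift for real object metadata.
        return super.getFileBlockLocations(file, start, len);
    }
}
```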
[Chart: Performance Improvement – job launch time in seconds vs. number of objects in the container (100 to 25,000), hadoop-swift-latest.jar with optimizations]
Swift only returns first 10,000 objects
Challenge
• Swift only returns the first 10,000 objects in a container or partition
• http://developer.openstack.org/api-ref-objectstorage-v1.html#showContainerDetails
Solution
• Page through list of objects using marker and limit query string parameters
• Continue until the number of items returned is less than the requested limit value
• Default set to 10000
• Configurable by setting fs.swift.container.list.limit
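The paging loop looks roughly like this. The listPage helper is a hypothetical stand-in for the Swift container GET with marker and limit query parameters; it is not an actual Hadoop-SwiftFS method.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ContainerPager {

    // Hypothetical stand-in for a GET on the container with
    // ?marker=<lastName>&limit=<limit> query parameters.
    interface SwiftContainerClient {
        List<String> listPage(String container, String marker, int limit) throws IOException;
    }

    // Page through a container until a page comes back smaller than the limit.
    static List<String> listAll(SwiftContainerClient client, String container, int limit)
            throws IOException {
        List<String> all = new ArrayList<>();
        String marker = null; // no marker on the first request
        while (true) {
            List<String> page = client.listPage(container, marker, limit);
            all.addAll(page);
            if (page.size() < limit) {
                break; // a short page means we have seen every object
            }
            marker = page.get(page.size() - 1); // continue after the last name returned
        }
        return all;
    }
}
```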
Job output write and rename
Hadoop’s OutputCommitter writes task output to a temporary directory
When the job completes, the temporary directory is renamed to the final output directory
Object stores are not file systems
• Rename results in a copy and delete which is expensive
• Consequence of using the object's path as the hash to its storage location
Object store compatible OutputCommitter
• Basic approach skips temporary write, outputs directly to final destination
• Enhanced approach uses local ephemeral storage for temporary writes
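A bare-bones sketch of the basic approach: a committer whose tasks write straight to the final destination, so there is nothing to rename at commit time. It is illustrative only and omits handling for task retries and speculative execution.

```java
import java.io.IOException;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Illustrative "direct" committer: tasks write to the final output path,
// so commit and abort become no-ops and no copy-and-delete rename happens.
public class DirectOutputCommitter extends OutputCommitter {

    @Override
    public void setupJob(JobContext context) throws IOException {
        // Nothing to stage: output goes directly to the destination.
    }

    @Override
    public void setupTask(TaskAttemptContext context) throws IOException {
        // No temporary task directory is created.
    }

    @Override
    public boolean needsTaskCommit(TaskAttemptContext context) throws IOException {
        return false; // nothing to promote from a temporary location
    }

    @Override
    public void commitTask(TaskAttemptContext context) throws IOException {
        // No-op: the task already wrote its final objects.
    }

    @Override
    public void abortTask(TaskAttemptContext context) throws IOException {
        // No-op in this sketch; a real committer would clean up partial output.
    }
}
```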
CEPH Architecture and Tuning
Tuning for Hadoop Workloads
• Scale RadosGWs and CEPH OSD nodes
• Enable container index sharding
• Increase placement groups for the RadosGW index pool
• Increase the filestore merge threshold and filestore split multiple settings
• Turn off RadosGW logs
[Diagram: load balancing across multiple RadosGW instances backed by CEPH OSD nodes]
Lessons Learned
• Get to know your OpenStack architecture
• Understand the impacts of your cluster design
• Use ephemeral local disk for NodeManager if possible
• Ensure consistent pseudo-directory representation
• Think about your container data organization
• Choose file formats that reduce I/O (ORC/Parquet)
Next Steps and Future Enhancements
Next Steps
• Upstream enhancements back to the community
Future Enhancements
• Keystone authentication token optimization
• Handle large number of partitions
• Streamline map task object retrieval