hopsfs 10x hdfs performance
TRANSCRIPT
HopsFS: 10X your HDFS with NDB
Jim Dowling Associate Prof @ KTH
Senior Researcher @ SICS, CEO @ Logical Clocks AB
Oracle, Stockholm, 6th September 2016
www.hops.io @hopshadoop
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Ermias Gebremeskel, Antonios Kouzoupis.
Alumni: Vasileios Giannokostas, Misganu Dessalegn, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Marketing 101: Celebrity Endorsements
*Turing Award Winner 2014, Father of Distributed Systems
Hi! I’m Leslie Lamport* and even though you’re not using Paxos, I approve this product.
Bill Gates’ biggest product regret?*
Windows Future Storage (WinFS*)
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
Hadoop in Context
Data Processing: Spark, MapReduce, Flink, Presto, TensorFlow
Storage: HDFS, MapR, S3, Colossus, WAS
Resource Management: YARN, Mesos, Borg
Metadata: Hive, Parquet, Authorization, Search
HDFS v2
DataNodes (up to ~5K)
HDFS Client
Journal Nodes Zookeeper
ActiveNameNode
StandbyNameNode
• Asynchronous replication of the EditLog
• Agreement on the Active NameNode
• Snapshots (fsimage) - cut the EditLog
(ls, rm, mv, cp, stat, chown, chmod, copyFromLocal, copyFromRemote, etc.)
The NameNode is the Bottleneck for Hadoop
Max Pause Times for NameNode Heap Sizes*
[Chart: max pause times (ms, log scale 10–10,000) vs. JVM heap size (50, 75, 100, 150 GB), unoptimized vs. optimized]
*OpenJDK or Oracle JVM
NameNode and Decreasing Memory Costs
[Chart, 2016–2020: projected max NameNode JVM heap size vs. size of RAM in a COTS $7,000 rack server, 0–1000 GB]
Externalizing the NameNode State
• Problem: the NameNode heap is not scaling up, even as RAM prices fall
• Solution: move the metadata off the JVM heap
• Move it where? An in-memory storage system that can be efficiently queried and managed. Preferably open source.
• MySQL Cluster (NDB)
HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
Pluggable DBs: Data Abstraction Layer (DAL)
NameNode (Apache v2)
DAL API (Apache v2)
NDB-DAL-Impl (GPL v2) | Other DB (Other License)
hops-2.5.0.jar, dal-ndb-2.5.0-7.5.3.jar
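As a sketch of the layering above: the NameNode codes against a storage-neutral data-access interface, and a driver jar (such as the NDB implementation) supplies the concrete backend. All interface and class names below are illustrative stand-ins, not the actual hops DAL API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DalSketch {

    /** Storage-neutral data-access interface the NameNode codes against. */
    interface InodeDataAccess {
        void add(long parentId, String name, long inodeId);
        Optional<Long> find(long parentId, String name);
    }

    /** Toy in-memory stand-in for the NDB-backed driver implementation. */
    static class InMemoryInodeDataAccess implements InodeDataAccess {
        private final Map<String, Long> table = new HashMap<>();
        public void add(long parentId, String name, long inodeId) {
            table.put(parentId + "/" + name, inodeId);
        }
        public Optional<Long> find(long parentId, String name) {
            return Optional.ofNullable(table.get(parentId + "/" + name));
        }
    }

    public static void main(String[] args) {
        InodeDataAccess dal = new InMemoryInodeDataAccess(); // swap per backend
        dal.add(0, "user", 1);
        System.out.println(dal.find(0, "user").get()); // 1
    }
}
```

Swapping the backend then means shipping a different implementation jar, with no change to the (Apache-licensed) NameNode code.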
The Global Lock in the NameNode
HDFS NameNode Internals
Client: mkdir, getBlockLocations, createFile, …
NameNode
Journal Nodes
Client
Reader1 ReaderN…
Handler1 HandlerM
ConnectionList
Call Queue
Namespace & In-Memory EditLog (FSNamesystem lock)
EditLog Buffer
EditLog1 EditLog2 EditLog3
Listener(Nio Thread)
Responder(Nio Thread)
dfs.namenode.service.handler.count (default 10)
ipc.server.read.threadpool.size (default 1)
…
Handler1 HandlerM… Done RPCs
ackIds / flush
HopsFS NameNode Internals
Client: mkdir, getBlockLocations, createFile, …
NameNode
NDB
Client
Reader1 ReaderN…
Handler1 HandlerM
ConnectionList
Call Queue
inodes block_infos replicas
Listener(Nio Thread)
Responder(Nio Thread)
dfs.namenode.service.handler.count (default 10)
ipc.server.read.threadpool.size (default 1)
…
Handler1 HandlerM…
Done RPCs
ackIds
leases…
DAL-ImplDAL API
HARD PART
Concurrency Model: Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
Preventing Deadlock and Starvation
• Acquire FS locks in an agreed order using the FS hierarchy.
• Block-level operations follow the same agreed order.
• No cycles => freedom from deadlock
• Pessimistic concurrency control ensures progress
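One way to picture the agreed order: impose a total order on lock targets so ancestors are always locked before descendants and siblings are taken left-to-right. Lexicographic path order has exactly this property. A toy sketch, not the actual HopsFS locking code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LockOrderSketch {

    /** Total order for FS lock acquisition: lexicographic path order
     *  puts every ancestor before its descendants. */
    static List<String> lockOrder(List<String> paths) {
        return paths.stream().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> order = lockOrder(
                Arrays.asList("/user/jim/myFile", "/user", "/user/jim"));
        // Every transaction locks in this same global order, so no
        // lock-wait cycle (and hence no deadlock) can form.
        System.out.println(order); // [/user, /user/jim, /user/jim/myFile]
    }
}
```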
[Diagram: a Client issues mv and read on /user/jim/myFile while a DataNode sends a block_report to the NameNode]
Per-Transaction Cache
• Reusing the HDFS codebase resulted in too many round trips to the database per transaction.
• We cache intermediate transaction results at NameNodes (i.e., a snapshot).
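The idea can be sketched as a per-transaction snapshot that memoizes rows already read, so repeated reads of the same row within one transaction cost no extra round trips. Names and structure below are illustrative, with a plain map standing in for NDB:

```java
import java.util.HashMap;
import java.util.Map;

public class TxCacheSketch {
    // Simulated database: every miss is one network round trip to NDB.
    static final Map<String, String> db = new HashMap<>();
    static int roundTrips = 0;

    /** Per-transaction snapshot: rows read once are served from memory. */
    static class Transaction {
        private final Map<String, String> snapshot = new HashMap<>();
        String read(String key) {
            return snapshot.computeIfAbsent(key, k -> {
                roundTrips++;          // only a snapshot miss costs a round trip
                return db.get(k);
            });
        }
    }

    public static void main(String[] args) {
        db.put("inode:/user/jim", "id=2");
        Transaction tx = new Transaction();
        tx.read("inode:/user/jim");  // first read hits the database
        tx.read("inode:/user/jim");  // repeated reads served from the snapshot
        tx.read("inode:/user/jim");
        System.out.println(roundTrips); // 1
    }
}
```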
Sometimes, Transactions Just Ain’t Enough
• Large subtree operations (delete, mv, set-quota) can’t always be executed in a single transaction.
• 4-phase protocol
- Isolation and consistency
- Aggressive batching
- Transparent failure handling
- Failed ops are retried on a new NameNode.
- Lease timeout for failed clients.
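A very rough sketch of the shape such a protocol can take, assuming a flag-the-root / drain / batched-execute / cleanup structure; the phase boundaries and names here are illustrative, not the actual HopsFS implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class SubtreeOpSketch {
    static int transactions = 0;
    static final int BATCH = 2;  // illustrative batch size

    static void runTx(Runnable body) { transactions++; body.run(); }

    static void deleteSubtree(List<String> inodesBottomUp) {
        runTx(() -> { /* phase 1: set a subtree-lock flag on the root */ });
        // phase 2: wait for in-flight transactions in the subtree to drain
        Deque<String> pending = new ArrayDeque<>(inodesBottomUp);
        while (!pending.isEmpty()) {
            runTx(() -> {   // phase 3: delete in small batched transactions
                for (int i = 0; i < BATCH && !pending.isEmpty(); i++)
                    pending.pollFirst();
            });
        }
        runTx(() -> { /* phase 4: clear the flag (or a new NN retries) */ });
    }

    public static void main(String[] args) {
        deleteSubtree(List.of("/a/b/f1", "/a/b/f2", "/a/b", "/a"));
        System.out.println(transactions); // 1 + 2 + 1 = 4
    }
}
```

Because each batch is its own transaction, a crashed NameNode leaves the flag set and a new NameNode can retry the remaining batches transparently.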
Leader Election using NDB
• A Leader NameNode coordinates replication/lease management
• NDB serves as shared memory for Leader Election among the NameNodes.
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
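One way a database can serve as shared memory for election, sketched with an in-memory map standing in for the shared NDB table: each NameNode periodically bumps its own counter, nodes whose counters stop advancing are evicted, and the smallest surviving id wins, so every node computes the same leader. This is an illustrative simulation, not the DAIS 2015 protocol verbatim:

```java
import java.util.Map;
import java.util.TreeMap;

public class LeaderElectionSketch {
    // Shared NDB table simulated as a sorted map: NameNode id -> counter.
    static final TreeMap<Long, Long> descriptors = new TreeMap<>();

    /** Each NameNode periodically bumps its own counter (a heartbeat row). */
    static void heartbeat(long nnId) {
        descriptors.merge(nnId, 1L, Long::sum);
    }

    /** Evict nodes whose counter has not advanced since the last round,
     *  then pick the smallest surviving id: deterministic on every node. */
    static long electLeader(Map<Long, Long> previousRound) {
        descriptors.entrySet().removeIf(e ->
                e.getValue().equals(previousRound.get(e.getKey())));
        return descriptors.firstKey();
    }

    public static void main(String[] args) {
        heartbeat(1); heartbeat(2); heartbeat(3);
        Map<Long, Long> seen = new TreeMap<>(descriptors);
        heartbeat(2); heartbeat(3);  // NameNode 1 misses its heartbeat
        System.out.println(electLeader(seen)); // 2
    }
}
```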
Path Component Caching
• The most common operation in HDFS is resolving pathnames to inodes
- 67% of operations in Spotify’s Hadoop workload
• We cache recently resolved inodes at NameNodes so that we can resolve them using a single batched primary-key lookup.
- We validate cache entries as part of transactions
- The cache converts O(N) round trips to the database into O(1) for a hit on all inodes in a path.
Path Component Caching
• Resolving a path of length N costs O(N) round trips
• With our cache, O(1) round trips for a cache hit
[Diagram, without the cache: resolving /user/jim/myFile issues getInode(0, “user”), getInode(1, “jim”), getInode(2, “myFile”) against NDB. With the cache: getInodes(“/user/jim/myFile”) is answered from the NameNode cache, and a single validateInodes([(0, “user”), (1, ”jim”), (2, ”myFile”)]) batch goes to NDB.]
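The round-trip saving can be sketched as follows, with plain maps standing in for NDB and the NameNode-local cache; a miss costs one primary-key read per path component, a hit costs a single batched validation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathCacheSketch {
    static int roundTrips = 0;

    // Simulated NDB: "parentId/name" -> inodeId (primary-key lookups).
    static final Map<String, Long> db = new HashMap<>();
    // NameNode-local cache: full path -> resolved "parentId/name" keys.
    static final Map<String, List<String>> cache = new HashMap<>();

    /** Cache miss: O(N) round trips, one per path component. */
    static void resolveMiss(String path, List<String> keys) {
        for (String key : keys) {
            roundTrips++;
            db.get(key);
        }
        cache.put(path, keys);
    }

    /** Cache hit: one batched primary-key read validates every entry. */
    static boolean resolveHit(String path) {
        roundTrips++;  // single batch round trip
        return cache.get(path).stream().allMatch(db::containsKey);
    }

    public static void main(String[] args) {
        db.put("0/user", 1L); db.put("1/jim", 2L); db.put("2/myFile", 3L);
        List<String> keys = Arrays.asList("0/user", "1/jim", "2/myFile");
        resolveMiss("/user/jim/myFile", keys);   // 3 round trips
        resolveHit("/user/jim/myFile");          // 1 round trip
        System.out.println(roundTrips); // 4
    }
}
```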
Hotspots
• Mikael saw 1-2 maxed-out LDM threads
• Partitioning by parent inodeId meant fantastic performance for ‘ls’
- Partition-pruned index scans
- At high load, hotspots appeared at the top of the directory hierarchy
• Current solution:
- Cache the root inode at NameNodes
- Pseudo-random partition key for top-level directories, but keep partitioning by parent inodeId at lower levels
- At least a 4x throughput increase!
[Example directory tree: / at the root, /Users and /Projects below it, then /NSA, /MyProj, and datasets /Dataset1, /Dataset2]
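The mixed partitioning rule might be sketched like this; the depth threshold, hash constant, and partition count are made-up illustration values, not HopsFS's actual parameters:

```java
public class PartitionSketch {
    static final int NUM_PARTITIONS = 8;  // illustrative

    /**
     * Directories near the root get a pseudo-random partition, spreading
     * hot top-level lookups across all LDM threads; deeper inodes keep
     * parentId as the partition key so that 'ls' on a directory is a
     * partition-pruned index scan on a single node.
     */
    static int partition(int depth, long inodeId, long parentId) {
        if (depth <= 1) {
            return (Long.hashCode(inodeId * 0x9E3779B9L)
                    & Integer.MAX_VALUE) % NUM_PARTITIONS;
        }
        return (int) (parentId % NUM_PARTITIONS);
    }

    public static void main(String[] args) {
        // Siblings deep in the tree land on the same partition:
        System.out.println(partition(3, 101, 42) == partition(3, 102, 42)); // true
    }
}
```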
Scalable Block Reporting
• On 100PB+ clusters, internal maintenance protocol traffic makes up much of the network traffic
• Block reporting
- The Leader load-balances block reports across NameNodes
- Work-stealing when exiting safe mode
[Diagram: DataNodes send Blocks/SafeBlocks to NameNodes backed by NDB; the Leader coordinates, and NameNodes work-steal]
HopsFS Performance
HopsFS Metadata Scaleout
Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop
Spotify Workload
HopsFS Throughput (Spotify Workload - PM)
Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances
HopsFS Throughput (Spotify Workload - PM)
Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances
HopsFS Throughput (Spotify Workload - AM)
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: machines with Xeon E5-2620 2.40GHz processors and 10GbE.
Per Operation HopsFS Throughput
NDB Performance Lessons
• NDB is quite stable!
• ClusterJ is (nearly) good enough
- sun.misc.Cleaner has trouble keeping up at high throughput – OOM for ByteBuffers
- Transaction hint behavior not respected
- DTO creation time affected by Java reflection
- Nice features would be:
  • Projections
  • Batched scan operations support
  • Event API
• Event API and Asynchronous API needed for performance in Hops-YARN
Heterogeneous Storage in HopsFS
• Storage types in HopsFS: Default, EC-RAID5, SSD
- Default: 3X overhead - triple replication on spinning disks
- SSD: 3X overhead - triple replication on SSDs
- EC-RAID5: 1.4X overhead with low reconstruction overhead!
Erasure Coding
[Diagram: a sealed HDFS file is striped into data blocks plus Reed-Solomon parity blocks]
RS(6,3): d0..d5 + p0 p1 p2, overhead (6+3)/6 = 1.5X
RS(12,4): d0..d11 + p0..p3, overhead (12+4)/12 ≈ 1.33X

Global/Local Reconstruction with EC-RAID5
[Diagram: Block0..Block13 striped across hosts (host0..host10); each host stores a local stripe, e.g. d0 d1 d2 d3 d4 p0, on ZFS RAID-Z]
LR(5,1).RS(10,2): overhead (10+2+2)/10 = 1.4X
LR(5,1).RS(10,4): overhead (10+2+4)/10 = 1.6X
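All the overhead figures above follow from one formula, (data + parity) / data; the locally-repairable variants simply add their local parities to the parity count. A quick check:

```java
public class EcOverheadSketch {
    /** Storage overhead of an erasure code: (data + parity) / data. */
    static double overhead(int dataBlocks, int parityBlocks) {
        return (dataBlocks + parityBlocks) / (double) dataBlocks;
    }

    public static void main(String[] args) {
        System.out.println(overhead(6, 3));       // RS(6,3)  -> 1.5
        System.out.println(overhead(12, 4));      // RS(12,4) -> 1.333...
        // EC-RAID5: RS(10,2) globally plus one local parity per group
        // of five data blocks (LR(5,1)) adds 2 more parity blocks:
        System.out.println(overhead(10, 2 + 2));  // -> 1.4
        System.out.println(overhead(10, 4 + 2));  // -> 1.6
    }
}
```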
ePipe: Indexing HopsFS’ Namespace
Free-Text Search: metadata changes flow from NDB to ElasticSearch via the NDB Event API
Polyglot Persistence: the distributed database is the single source of truth.
Foreign keys ensure the integrity of Extended Metadata (MetaDataDesigner, MetaDataEntry).
Hops-YARN
YARN Architecture
[Diagram: YARN Client, NodeManagers, Zookeeper Nodes, ResourceMgr and Standby ResourceMgr]
1. Master-slave replication of RM state
2. Agreement on the active ResourceMgr
NDB
ResourceManager – Monolithic but Modular
ApplicationMasterService
ResourceTrackerService
Scheduler
ClientService
YARN Client
AdminService
Security
Cluster State
HopsResourceTracker
Cluster State
HopsScheduler
NodeManagers, YARN Clients, App Masters
ResourceManager
~2k ops/s ~10k ops/s
ClusterJ Event API
Hops-YARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers; Leader Election for Failed Scheduler
Hopsworks
Hopsworks – Project-Based Multi-Tenancy
• A project is a collection of
- Users with roles
- HDFS DataSets
- Kafka topics
- Notebooks, jobs
• Per-project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style pricing
• Sharing across projects
- Datasets/topics
[Diagram: a project owns dataset 1..N in HDFS and Topic 1..N in Kafka]
Hopsworks – Dynamic Roles
NSA__Alice
Authenticate
Users__Alice
Glassfish
HopsFS
HopsYARN
Projects
Secure Impersonation
Kafka
X.509 Certificates
SICS ICE - www.hops.site
A 2 MW datacenter research and test environment
Purpose: increase knowledge, strengthen universities, companies and researchers
R&D institute, 5 lab modules, 3-4000 servers, 2-3000 square meters
Karamel/Chef for Automated Installation
Google Compute Engine, Bare Metal
Summary
• HopsFS is the world’s fastest, most scalable HDFS implementation
• Powered by NDB, the world’s fastest database
• Thanks to Mikael, Craig, Frazer, Bernt and others
• Still room for improvement…
www.hops.io
Hops [Hadoop For Humans]
Join us! http://github.com/hopshadoop