hopsfs 10x hdfs performance
TRANSCRIPT
HopsFS: 10X your HDFS with NDB
Jim Dowling Associate Prof @ KTH
Senior Researcher @ SICS, CEO @ Logical Clocks AB
Oracle, Stockholm, 6th September 2016
www.hops.io @hopshadoop
Hops Team
Active: Jim Dowling, Seif Haridi, Tor Björn Minde, Gautier Berthou, Salman Niazi, Mahmoud Ismail, Theofilos Kakantousis, Johan Svedlund Nordström, Ermias Gebremeskel, Antonios Kouzoupis.
Alumni: Vasileios Giannokostas, Misganu Dessalegn, Rizvi Hasan, Paul Mälzer, Bram Leenders, Juan Roca, K “Sri” Srijeyanthan, Steffen Grohsschmiedt, Alberto Lorente, Andre Moré, Ali Gholami, Davis Jaunzems, Stig Viaene, Hooman Peiro, Evangelos Savvidis, Jude D’Souza, Qi Qi, Gayana Chandrasekara, Nikolaos Stanogias, Daniel Bali, Ioannis Kerkinos, Peter Buechler, Pushparaj Motamari, Hamid Afzali, Wasif Malik, Lalith Suresh, Mariano Valles, Ying Lieu.
Marketing 101: Celebrity Endorsements
*Turing Award Winner 2014, Father of Distributed Systems
Hi! I’m Leslie Lamport* and even though you’re not using Paxos, I approve this product.
Bill Gates’ biggest product regret?*
Windows Future Storage (WinFS*)
*http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/
Hadoop in Context
Data Processing: Spark, MapReduce, Flink, Presto, TensorFlow
Storage: HDFS, MapR, S3, Colossus, WAS
Resource Management: YARN, Mesos, Borg
Metadata: Hive, Parquet, Authorization, Search
HDFS v2
DataNodes (up to ~5K)
HDFS Client
Journal Nodes Zookeeper
ActiveNameNode
StandbyNameNode
• Asynchronous replication of the EditLog
• Agreement on the Active NameNode
• Snapshots (fsimage) - cut the EditLog
(ls, rm, mv, cp, stat, chown, chmod, copyFromLocal, copyFromRemote, etc.)
The NameNode is the Bottleneck for Hadoop
Max Pause Times for NameNode Heap Sizes*
[Chart: max pause times (ms, log scale 10–10,000) vs. JVM heap size (50, 75, 100, 150 GB), unoptimized vs. optimized]
*OpenJDK or Oracle JVM
NameNode and Decreasing Memory Costs
[Chart, 2016–2020: projected max NameNode JVM heap size vs. size of RAM in a COTS $7,000 rack server, 0–1000 GB]
Externalizing the NameNode State
• Problem: the NameNode heap is not scaling up, even as RAM prices fall
• Solution: move the metadata off the JVM heap
• Move it where? An in-memory storage system that can be efficiently queried and managed. Preferably open source.
• MySQL Cluster (NDB)
HopsFS Architecture
NameNodes
NDB
Leader
HDFS Client
DataNodes
Pluggable DBs: Data Abstraction Layer (DAL)
NameNode (Apache v2)
DAL API (Apache v2)
NDB-DAL-Impl (GPL v2) | Other DB (Other License)
hops-2.5.0.jar, dal-ndb-2.5.0-7.5.3.jar
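As a sketch of the layering above: the NameNode codes against a storage-neutral data-access interface, and a driver jar (such as the NDB implementation) supplies the concrete backend. All interface and class names below are illustrative stand-ins, not the actual hops DAL API:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

public class DalSketch {

    /** Storage-neutral data-access interface the NameNode codes against. */
    interface InodeDataAccess {
        void add(long parentId, String name, long inodeId);
        Optional<Long> find(long parentId, String name);
    }

    /** Toy in-memory stand-in for the NDB-backed driver implementation. */
    static class InMemoryInodeDataAccess implements InodeDataAccess {
        private final Map<String, Long> table = new HashMap<>();
        public void add(long parentId, String name, long inodeId) {
            table.put(parentId + "/" + name, inodeId);
        }
        public Optional<Long> find(long parentId, String name) {
            return Optional.ofNullable(table.get(parentId + "/" + name));
        }
    }

    public static void main(String[] args) {
        InodeDataAccess dal = new InMemoryInodeDataAccess(); // swap per backend
        dal.add(0, "user", 1);
        System.out.println(dal.find(0, "user").get()); // 1
    }
}
```

Swapping the backend then means shipping a different implementation jar, with no change to the (Apache-licensed) NameNode code.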
The Global Lock in the NameNode
HDFS NameNode Internals
Client: mkdir, getBlockLocations, createFile, …
NameNode
Journal Nodes
Client
Reader1 ReaderN…
Handler1 HandlerM
ConnectionList
Call Queue
Namespace & In-Memory EditLog (FSNamesystem lock)
EditLog Buffer
EditLog1 EditLog2 EditLog3
Listener(Nio Thread)
Responder(Nio Thread)
dfs.namenode.service.handler.count (default 10)
ipc.server.read.threadpool.size (default 1)
…
Handler1 HandlerM… Done RPCs
ackIds / flush
HopsFS NameNode Internals
Client: mkdir, getBlockLocations, createFile, …
NameNode
NDB
Client
Reader1 ReaderN…
Handler1 HandlerM
ConnectionList
Call Queue
inodes block_infos replicas
Listener(Nio Thread)
Responder(Nio Thread)
dfs.namenode.service.handler.count (default 10)
ipc.server.read.threadpool.size (default 1)
…
Handler1 HandlerM…
Done RPCs
ackIds
leases…
DAL-ImplDAL API
HARD PART
Concurrency Model: Implicit Locking
• Serializable FS ops using implicit locking of subtrees.
[Hakimzadeh, Peiro, Dowling, ”Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]
Preventing Deadlock and Starvation
• Acquire FS locks in an agreed order using the FS hierarchy.
• Block-level operations follow the same agreed order.
• No cycles => freedom from deadlock
• Pessimistic concurrency control ensures progress
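One way to picture the agreed order: impose a total order on lock targets so ancestors are always locked before descendants and siblings are taken left-to-right. Lexicographic path order has exactly this property. A toy sketch, not the actual HopsFS locking code:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LockOrderSketch {

    /** Total order for FS lock acquisition: lexicographic path order
     *  puts every ancestor before its descendants. */
    static List<String> lockOrder(List<String> paths) {
        return paths.stream().sorted().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> order = lockOrder(
                Arrays.asList("/user/jim/myFile", "/user", "/user/jim"));
        // Every transaction locks in this same global order, so no
        // lock-wait cycle (and hence no deadlock) can form.
        System.out.println(order); // [/user, /user/jim, /user/jim/myFile]
    }
}
```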
[Diagram: a Client issues mv and read on /user/jim/myFile while a DataNode sends a block_report to the NameNode]
Per-Transaction Cache
• Reusing the HDFS codebase resulted in too many round trips to the database per transaction.
• We cache intermediate transaction results at NameNodes (i.e., a snapshot).
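The idea can be sketched as a per-transaction snapshot that memoizes rows already read, so repeated reads of the same row within one transaction cost no extra round trips. Names and structure below are illustrative, with a plain map standing in for NDB:

```java
import java.util.HashMap;
import java.util.Map;

public class TxCacheSketch {
    // Simulated database: every miss is one network round trip to NDB.
    static final Map<String, String> db = new HashMap<>();
    static int roundTrips = 0;

    /** Per-transaction snapshot: rows read once are served from memory. */
    static class Transaction {
        private final Map<String, String> snapshot = new HashMap<>();
        String read(String key) {
            return snapshot.computeIfAbsent(key, k -> {
                roundTrips++;          // only a snapshot miss costs a round trip
                return db.get(k);
            });
        }
    }

    public static void main(String[] args) {
        db.put("inode:/user/jim", "id=2");
        Transaction tx = new Transaction();
        tx.read("inode:/user/jim");  // first read hits the database
        tx.read("inode:/user/jim");  // repeated reads served from the snapshot
        tx.read("inode:/user/jim");
        System.out.println(roundTrips); // 1
    }
}
```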
Sometimes, Transactions Just Ain’t Enough
• Large subtree operations (delete, mv, set-quota) can’t always be executed in a single transaction.
• 4-phase protocol
- Isolation and consistency
- Aggressive batching
- Transparent failure handling
- Failed ops are retried on a new NameNode.
- Lease timeout for failed clients.
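A very rough sketch of the shape such a protocol can take, assuming a flag-the-root / drain / batched-execute / cleanup structure; the phase boundaries and names here are illustrative, not the actual HopsFS implementation:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class SubtreeOpSketch {
    static int transactions = 0;
    static final int BATCH = 2;  // illustrative batch size

    static void runTx(Runnable body) { transactions++; body.run(); }

    static void deleteSubtree(List<String> inodesBottomUp) {
        runTx(() -> { /* phase 1: set a subtree-lock flag on the root */ });
        // phase 2: wait for in-flight transactions in the subtree to drain
        Deque<String> pending = new ArrayDeque<>(inodesBottomUp);
        while (!pending.isEmpty()) {
            runTx(() -> {   // phase 3: delete in small batched transactions
                for (int i = 0; i < BATCH && !pending.isEmpty(); i++)
                    pending.pollFirst();
            });
        }
        runTx(() -> { /* phase 4: clear the flag (or a new NN retries) */ });
    }

    public static void main(String[] args) {
        deleteSubtree(List.of("/a/b/f1", "/a/b/f2", "/a/b", "/a"));
        System.out.println(transactions); // 1 + 2 + 1 = 4
    }
}
```

Because each batch is its own transaction, a crashed NameNode leaves the flag set and a new NameNode can retry the remaining batches transparently.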
Leader Election using NDB
• A Leader NameNode coordinates replication/lease management
• NDB serves as shared memory for Leader Election among the NameNodes.
[Niazi, Berthou, Ismail, Dowling, ”Leader Election in a NewSQL Database”, DAIS 2015]
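One way a database can serve as shared memory for election, sketched with an in-memory map standing in for the shared NDB table: each NameNode periodically bumps its own counter, nodes whose counters stop advancing are evicted, and the smallest surviving id wins, so every node computes the same leader. This is an illustrative simulation, not the DAIS 2015 protocol verbatim:

```java
import java.util.Map;
import java.util.TreeMap;

public class LeaderElectionSketch {
    // Shared NDB table simulated as a sorted map: NameNode id -> counter.
    static final TreeMap<Long, Long> descriptors = new TreeMap<>();

    /** Each NameNode periodically bumps its own counter (a heartbeat row). */
    static void heartbeat(long nnId) {
        descriptors.merge(nnId, 1L, Long::sum);
    }

    /** Evict nodes whose counter has not advanced since the last round,
     *  then pick the smallest surviving id: deterministic on every node. */
    static long electLeader(Map<Long, Long> previousRound) {
        descriptors.entrySet().removeIf(e ->
                e.getValue().equals(previousRound.get(e.getKey())));
        return descriptors.firstKey();
    }

    public static void main(String[] args) {
        heartbeat(1); heartbeat(2); heartbeat(3);
        Map<Long, Long> seen = new TreeMap<>(descriptors);
        heartbeat(2); heartbeat(3);  // NameNode 1 misses its heartbeat
        System.out.println(electLeader(seen)); // 2
    }
}
```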
Path Component Caching
• The most common operation in HDFS is resolving pathnames to inodes
- 67% of operations in Spotify’s Hadoop workload
• We cache recently resolved inodes at NameNodes so that we can resolve them using a single batched primary-key lookup.
- We validate cache entries as part of transactions
- The cache converts O(N) round trips to the database into O(1) for a hit on all inodes in a path.
Path Component Caching
• Resolving a path of length N costs O(N) round trips
• With our cache, O(1) round trips for a cache hit
[Diagram, without the cache: resolving /user/jim/myFile issues getInode(0, “user”), getInode(1, “jim”), getInode(2, “myFile”) against NDB. With the cache: getInodes(“/user/jim/myFile”) is answered from the NameNode cache, and a single validateInodes([(0, “user”), (1, ”jim”), (2, ”myFile”)]) batch goes to NDB.]
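The round-trip saving can be sketched as follows, with plain maps standing in for NDB and the NameNode-local cache; a miss costs one primary-key read per path component, a hit costs a single batched validation:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PathCacheSketch {
    static int roundTrips = 0;

    // Simulated NDB: "parentId/name" -> inodeId (primary-key lookups).
    static final Map<String, Long> db = new HashMap<>();
    // NameNode-local cache: full path -> resolved "parentId/name" keys.
    static final Map<String, List<String>> cache = new HashMap<>();

    /** Cache miss: O(N) round trips, one per path component. */
    static void resolveMiss(String path, List<String> keys) {
        for (String key : keys) {
            roundTrips++;
            db.get(key);
        }
        cache.put(path, keys);
    }

    /** Cache hit: one batched primary-key read validates every entry. */
    static boolean resolveHit(String path) {
        roundTrips++;  // single batch round trip
        return cache.get(path).stream().allMatch(db::containsKey);
    }

    public static void main(String[] args) {
        db.put("0/user", 1L); db.put("1/jim", 2L); db.put("2/myFile", 3L);
        List<String> keys = Arrays.asList("0/user", "1/jim", "2/myFile");
        resolveMiss("/user/jim/myFile", keys);   // 3 round trips
        resolveHit("/user/jim/myFile");          // 1 round trip
        System.out.println(roundTrips); // 4
    }
}
```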
Hotspots
• Mikael saw 1-2 maxed-out LDM threads
• Partitioning by parent inodeId meant fantastic performance for ‘ls’
- Partition-pruned index scans
- At high load, hotspots appeared at the top of the directory hierarchy
• Current solution:
- Cache the root inode at NameNodes
- Pseudo-random partition key for top-level directories, but keep partitioning by parent inodeId at lower levels
- At least a 4x throughput increase!
[Example directory tree: / at the root, /Users and /Projects below it, then /NSA, /MyProj, and datasets /Dataset1, /Dataset2]
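The mixed partitioning rule might be sketched like this; the depth threshold, hash constant, and partition count are made-up illustration values, not HopsFS's actual parameters:

```java
public class PartitionSketch {
    static final int NUM_PARTITIONS = 8;  // illustrative

    /**
     * Directories near the root get a pseudo-random partition, spreading
     * hot top-level lookups across all LDM threads; deeper inodes keep
     * parentId as the partition key so that 'ls' on a directory is a
     * partition-pruned index scan on a single node.
     */
    static int partition(int depth, long inodeId, long parentId) {
        if (depth <= 1) {
            return (Long.hashCode(inodeId * 0x9E3779B9L)
                    & Integer.MAX_VALUE) % NUM_PARTITIONS;
        }
        return (int) (parentId % NUM_PARTITIONS);
    }

    public static void main(String[] args) {
        // Siblings deep in the tree land on the same partition:
        System.out.println(partition(3, 101, 42) == partition(3, 102, 42)); // true
    }
}
```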
Scalable Block Reporting
• On 100PB+ clusters, internal maintenance protocol traffic makes up much of the network traffic
• Block reporting
- The Leader load-balances block reports across NameNodes
- Work-stealing when exiting safe mode
[Diagram: DataNodes send Blocks/SafeBlocks to NameNodes backed by NDB; the Leader coordinates, and NameNodes work-steal]
HopsFS Performance
HopsFS Metadata Scaleout
Assuming 256MB Block Size, 100 GB JVM Heap for Apache Hadoop
Spotify Workload
HopsFS Throughput (Spotify Workload - PM)
Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances
HopsFS Throughput (Spotify Workload - PM)
Experiments performed on AWS EC2 with enhanced networking and C3.8xLarge instances
HopsFS Throughput (Spotify Workload - AM)
NDB setup: 8 nodes with Xeon E5-2620 2.40GHz processors and 10GbE. NameNodes: machines with Xeon E5-2620 2.40GHz processors and 10GbE.
Per Operation HopsFS Throughput
NDB Performance Lessons
• NDB is quite stable!
• ClusterJ is (nearly) good enough
- sun.misc.Cleaner has trouble keeping up at high throughput – OOM for ByteBuffers
- Transaction hint behavior not respected
- DTO creation time affected by Java reflection
- Nice features would be:
  • Projections
  • Batched scan operations support
  • Event API
• Event API and Asynchronous API needed for performance in Hops-YARN
Heterogeneous Storage in HopsFS
• Storage types in HopsFS: Default, EC-RAID5, SSD
- Default: 3X overhead - triple replication on spinning disks
- SSD: 3X overhead - triple replication on SSDs
- EC-RAID5: 1.4X overhead with low reconstruction overhead!
Erasure Coding
[Diagram: a sealed HDFS file is striped into data blocks plus Reed-Solomon parity blocks]
RS(6,3): d0..d5 + p0 p1 p2, overhead (6+3)/6 = 1.5X
RS(12,4): d0..d11 + p0..p3, overhead (12+4)/12 ≈ 1.33X

Global/Local Reconstruction with EC-RAID5
[Diagram: Block0..Block13 striped across hosts (host0..host10); each host stores a local stripe, e.g. d0 d1 d2 d3 d4 p0, on ZFS RAID-Z]
LR(5,1).RS(10,2): overhead (10+2+2)/10 = 1.4X
LR(5,1).RS(10,4): overhead (10+2+4)/10 = 1.6X
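All the overhead figures above follow from one formula, (data + parity) / data; the locally-repairable variants simply add their local parities to the parity count. A quick check:

```java
public class EcOverheadSketch {
    /** Storage overhead of an erasure code: (data + parity) / data. */
    static double overhead(int dataBlocks, int parityBlocks) {
        return (dataBlocks + parityBlocks) / (double) dataBlocks;
    }

    public static void main(String[] args) {
        System.out.println(overhead(6, 3));       // RS(6,3)  -> 1.5
        System.out.println(overhead(12, 4));      // RS(12,4) -> 1.333...
        // EC-RAID5: RS(10,2) globally plus one local parity per group
        // of five data blocks (LR(5,1)) adds 2 more parity blocks:
        System.out.println(overhead(10, 2 + 2));  // -> 1.4
        System.out.println(overhead(10, 4 + 2));  // -> 1.6
    }
}
```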
ePipe: Indexing HopsFS’ Namespace
Free-Text Search: metadata changes flow from NDB to ElasticSearch via the NDB Event API
Polyglot Persistence: the distributed database is the single source of truth.
Foreign keys ensure the integrity of Extended Metadata (MetaDataDesigner, MetaDataEntry).
Hops-YARN
YARN Architecture
[Diagram: YARN Client, NodeManagers, Zookeeper Nodes, ResourceMgr and Standby ResourceMgr]
1. Master-slave replication of RM state
2. Agreement on the active ResourceMgr
NDB
ResourceManager – Monolithic but Modular
ApplicationMasterService
ResourceTrackerService
Scheduler
ClientService
YARN Client
AdminService
Security
Cluster State
HopsResourceTracker
Cluster State
HopsScheduler
NodeManagers, YARN Clients, App Masters
ResourceManager
~2k ops/s ~10k ops/s
ClusterJ Event API
Hops-YARN Architecture
ResourceMgrs
NDB
Scheduler
YARN Client
NodeManagers
Resource Trackers; Leader Election for Failed Scheduler
Hopsworks
Hopsworks – Project-Based Multi-Tenancy
• A project is a collection of
- Users with roles
- HDFS DataSets
- Kafka topics
- Notebooks, jobs
• Per-project quotas
- Storage in HDFS
- CPU in YARN
• Uber-style pricing
• Sharing across projects
- Datasets/topics
[Diagram: a project owns dataset 1..N in HDFS and Topic 1..N in Kafka]
Hopsworks – Dynamic Roles
NSA__Alice
Authenticate
Users__Alice
Glassfish
HopsFS
HopsYARN
Projects
Secure Impersonation
Kafka
X.509 Certificates
SICS ICE - www.hops.site
A 2 MW datacenter research and test environment
Purpose: increase knowledge, strengthen universities, companies and researchers
R&D institute, 5 lab modules, 3-4000 servers, 2-3000 square meters
Karamel/Chef for Automated Installation
Google Compute Engine, Bare Metal
Summary
• HopsFS is the world’s fastest, most scalable HDFS implementation
• Powered by NDB, the world’s fastest database
• Thanks to Mikael, Craig, Frazer, Bernt and others
• Still room for improvement…
www.hops.io
Hops [Hadoop For Humans]
Join us! http://github.com/hopshadoop