
Page 1: Hadoop Open Platform-as-a-Service (Hops)/J.Dowling.pdfApache Hadoop Yarn HA/Scaleout Limitations NM NM NM NM NM Standby RM Primary RM Clients Zookeeper • The Resource Manager (RM)

Hadoop Open Platform-as-a-Service (Hops)

Academics: Jim Dowling, Seif Haridi

PostDocs: Gautier Berthou (SICS)

PhDs: Salman Niazi, Mahmoud Ismail, Kamal Hakimzadeh, Ali Gholami

R/Engineers: Stig Viaene (SICS), Steffen Grohschmeidt

MSc Students: Theofilos Kakantousis, Nikolaos Stangios, “Sri” Srijeyanthan, Vangelos Savvidis, Seçkin Savaşçı.

Page 2

What is Systems Research?*

•Systems research is the scientific study, analysis, modeling and engineering of effective software platforms.

•Its challenge is to provide dependable, powerful, performant, secure and scalable solutions within an increasingly complex IT environment.

*Druschel et al., “Fostering Systems Research in Europe”, A White Paper by EuroSys, 2006

Page 3

Why is Big Data Important?

•In a wide array of academic fields, the ability to effectively process data is superseding other more classical modes of research.

“More data trumps better algorithms”*

*“The Unreasonable Effectiveness of Data” [Halevy, Norvig et al. 09]

Page 4

Bill Gates’ biggest product regret*

http://www.zdnet.com/article/bill-gates-biggest-microsoft-product-regret-winfs/

Page 5

Windows Future Storage (WinFS*)

•“WinFS was an attempt to bring the benefits of schema and relational databases to the Windows file system. …The WinFS effort was started around 1999 as the successor to the planned storage layer of Cairo and died in 2006 after consuming many thousands of hours of efforts from really smart engineers.”

- [Brian Welcker]*

*http://blogs.msdn.com/b/bwelcker/archive/2013/02/11/the-vision-thing.aspx

Page 6

Background: Hadoop Filesystem and MapReduce

Page 7

HDFS: Hadoop Filesystem

[Figure: a client issues write “/crawler/bot/jd.io/1” to the name node; numbered blocks are replicated across the data nodes. The data nodes send heartbeats, and under-replicated blocks are rebalanced and re-replicated.]

Page 8

Big Data Processing with No Data Locality

[Figure: Job(“/genomes/jim.bam”) is submitted to a Workflow Manager, which runs it on compute grid nodes that are separate from the nodes storing the numbered data blocks.]

This doesn’t scale. Bandwidth is the bottleneck.

Page 9

MapReduce – Data Locality

[Figure: Job(“/genomes/jim.bam”) is submitted to the Job Tracker, which runs the job on six Task Trackers, each paired with a data node (DN) holding the numbered blocks. R = resultFile(s).]
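The idea of moving the job to the data can be illustrated with a small scheduling sketch. This is a hedged illustration only, not the Job Tracker's actual algorithm: the function name `assign_tasks`, the node names, and the greedy strategy are all assumptions made for the example.

```python
from typing import Dict, Set


def assign_tasks(block_locations: Dict[int, Set[str]],
                 free_slots: Dict[str, int]) -> Dict[int, str]:
    """Greedily assign one task per block, preferring a node that stores the block.

    Hypothetical sketch: block_locations maps block id -> nodes holding a replica,
    free_slots maps node -> number of tasks it can still accept.
    """
    slots = dict(free_slots)  # copy so the caller's dict is not mutated
    assignment: Dict[int, str] = {}
    for block in sorted(block_locations):
        # Data-local candidates: nodes that hold a replica and have a free slot.
        local = [n for n in sorted(block_locations[block]) if slots.get(n, 0) > 0]
        # Otherwise fall back to any node with a free slot (remote read).
        candidates = local or [n for n in sorted(slots) if slots[n] > 0]
        if not candidates:
            break  # no capacity left; remaining blocks would have to wait
        node = candidates[0]
        assignment[block] = node
        slots[node] -= 1
    return assignment
```

A task only reads over the network when no node holding its block has a free slot, which is what keeps bandwidth from becoming the bottleneck.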

Page 10

MapReduce Programming Model

[Figure: batch sequential processing (scan, sort) with fault tolerance; the operations shown include filter and join.]
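As a rough illustration of the scan-then-sort style of processing, the sketch below runs a map phase, groups values by key, and reduces each group in memory. It is an assumption-laden toy: the names `map_reduce`, `word_mapper`, and `count_reducer` are invented for the example, and it has none of the fault tolerance or distribution that Hadoop MapReduce provides.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def map_reduce(records: Iterable[str],
               mapper: Callable[[str], List[Tuple[str, int]]],
               reducer: Callable[[str, List[int]], int]) -> Dict[str, int]:
    """Single-process sketch of the map -> shuffle -> reduce pipeline."""
    # Map (scan): apply the mapper to every input record.
    pairs = [kv for record in records for kv in mapper(record)]
    # Shuffle: group values by key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce: visit keys in sorted order and fold each group to one value.
    return {key: reducer(key, groups[key]) for key in sorted(groups)}


def word_mapper(line: str) -> List[Tuple[str, int]]:
    """Emit (word, 1) for every whitespace-separated word."""
    return [(word, 1) for word in line.split()]


def count_reducer(key: str, values: List[int]) -> int:
    """Sum the counts emitted for one word."""
    return sum(values)
```

Calling `map_reduce(["a b a", "b c"], word_mapper, count_reducer)` counts the words across both lines.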

Page 11

The NameNode

Page 12

HDFS NameNode

•Stores Mappings:

- path_component -> inode

- inode -> {block}

- block -> {replica1, replica2, replica3}

•External API to HDFS Clients

- Internal API to DataNodes

•Monitors Datanodes for failures, corrupted data

•Manages Leases, Quotas, (re-)replication

•Must do all this in a single JVM

- Spotify have a 90GB Heap storing references to 300m files
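The three mappings listed above can be pictured as three in-memory dictionaries. The sketch below is illustrative only: the class name `Namespace`, the integer inode ids, and the `(parent inode, component)` key are assumptions for the example, not the NameNode's real data structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

ROOT_INODE = 0  # hypothetical id for "/"


@dataclass
class Namespace:
    # (parent inode id, path component) -> inode id
    children: Dict[Tuple[int, str], int] = field(default_factory=dict)
    # inode id -> ordered list of block ids
    blocks: Dict[int, List[int]] = field(default_factory=dict)
    # block id -> data nodes holding a replica
    replicas: Dict[int, List[str]] = field(default_factory=dict)

    def resolve(self, path: str) -> int:
        """Walk the path one component at a time, starting at the root."""
        inode = ROOT_INODE
        for component in filter(None, path.split("/")):
            inode = self.children[(inode, component)]
        return inode

    def locate(self, path: str) -> List[Tuple[int, List[str]]]:
        """Return each block of the file with the data nodes that store it."""
        inode = self.resolve(path)
        return [(b, self.replicas.get(b, [])) for b in self.blocks.get(inode, [])]
```

Every file adds entries to all three dictionaries, which is why the whole namespace has to fit in one JVM heap.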

Page 13

High Availability for the NameNode (HDFS 2.x)

[Figure: four DataNodes (DN), an active and a standby NameNode (NN), three journal nodes (JN), a checkpoint NN, and three ZooKeeper (ZK) nodes.]

• Master-slave replication of NN state.

• Shared NN log stored in a quorum of journal nodes.

• ZooKeeper: agreement on the active master.

• Checkpoint NN: faster recovery, cut journal log.

DOESN’T SCALE OUT!

Page 14

The Evolution of the NameNode

•HDFS (2006)

- In-memory metadata

•HDFS 0.07 (2006)

- WAL (EditLog)

- FSImage

•HDFS 0.21 (2009)

- Weaken Global Lock

•HDFS 2.0 (2011)

- Eventually Consistent Replication: HA-NameNode

They reinvented the Database for the NameNode!

Page 15

Databases had these features long ago

•Oracle v6 (1988)

- Redo and Undo Logs

- Rollback Segments

•Oracle V7.1 (1994)

- Symmetric Replication

•Oracle 9i RAC (2001)

- Shared State Replication

…and have continued to evolve.

Page 16

The end of the One-size-fits-All Database

•Columnar Databases

- Vertica, Hana

•NewSQL Databases

- MySQL Cluster, VoltDB, MemSQL, AtlasDB, FoundationDB

•Graph Databases

- Neo4J

•RDBMSes

- MySQL, Postgres, DB2, Oracle, SQL Server

•In-Memory Stores

- Memcached, Redis

•Key-Value Stores

- Dynamo, Cassandra, MongoDB, Riak

•Petabyte Databases

- BigQuery (Google), RedShift (Amazon), Impala (Cloudera)

Stonebraker et al., “One Size Fits All: An Idea Whose Time Has Come and Gone”, 2005

Page 17

MySQL Cluster (NDB) – Shared Nothing DB

•Distributed, In-memory

•2-Phase Commit

- Replicate DB, not the Log!

•Real-time

- Low TransactionInactive timeouts

•Commodity Hardware

•Scales out

- Millions of transactions/sec

- TB-sized datasets (48 nodes)

•Split-Brain solved with Arbitrator Pattern

•SQL and Native Blocking/Non-Blocking APIs

[Figure: clients reach the cluster through the SQL API and the NDB API.]

30+ million update transactions/second on a 30-node cluster
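For readers unfamiliar with two-phase commit, here is a textbook-style sketch of the protocol shape. It is not NDB's implementation, which is considerably more involved; the `Participant` class and `two_phase_commit` function are hypothetical names, and failures are simulated with a flag.

```python
from typing import Dict, List, Optional


class Participant:
    """Hypothetical replica that votes on, then applies, a set of updates."""

    def __init__(self, name: str, can_commit: bool = True) -> None:
        self.name = name
        self.can_commit = can_commit            # simulated vote
        self.committed: Dict[str, int] = {}     # durable state
        self.pending: Optional[Dict[str, int]] = None

    def prepare(self, updates: Dict[str, int]) -> bool:
        """Phase 1: stage the updates and vote yes or no."""
        if not self.can_commit:
            return False
        self.pending = dict(updates)
        return True

    def commit(self) -> None:
        """Phase 2: make the staged updates visible."""
        self.committed.update(self.pending or {})
        self.pending = None

    def abort(self) -> None:
        """Phase 2 (failure path): discard the staged updates."""
        self.pending = None


def two_phase_commit(participants: List[Participant], updates: Dict[str, int]) -> bool:
    """Commit on every participant, or on none of them."""
    prepared: List[Participant] = []
    for p in participants:
        if p.prepare(updates):
            prepared.append(p)
        else:
            for q in prepared:   # one "no" vote aborts everyone
                q.abort()
            return False
    for p in participants:
        p.commit()
    return True
```

Because every replica applies the update itself, the database state is replicated directly rather than being rebuilt from a shipped log.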

Page 18

HopsFS

Page 19

HopsFS

• Customizable and Scalable Metadata

• High throughput for read and write operations

• NameNode failover time ≈ 5 seconds (vs ~1 minute for HDFS)

Page 20

Request Handling (Apache HDFS vs HopsFS)

Apache HDFS NameNode Request Handling

HopsFS NameNode Request Handling

Page 21

Fine-Grained Locking, Transactional Updates

• NDB gives us only the READ_COMMITTED isolation level, which is not strong enough.

• We implemented serializability for FS operations using implicit locking in the DAG and row-level locking in NDB.

[Hakimzadeh, Peiro, Dowling, “Scaling HDFS with a Strongly Consistent Relational Model for Metadata”, DAIS 2014]

Page 22

Preventing Deadlocks and Starvation

[Figure: concurrent mv, read, and block_report operations on /user/jdowling/dna.bam.]

• Solution: all request threads for inode operations traverse the FS hierarchy in the same order, acquiring locks in the same order.

• Block-level operations have to follow the same order.
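The ordering rule can be sketched with ordinary in-process locks. This is a simplified illustration under stated assumptions: HopsFS takes row-level locks in NDB, whereas the example uses `threading.Lock`, and the names `lock_order` and `locked_paths` as well as the depth-then-name ordering are invented here.

```python
import threading
from contextlib import contextmanager
from typing import Dict, Iterator, List

_locks: Dict[str, threading.Lock] = {}
_registry_guard = threading.Lock()


def _lock_for(path: str) -> threading.Lock:
    """Return the single lock object associated with a path."""
    with _registry_guard:
        return _locks.setdefault(path, threading.Lock())


def lock_order(paths: List[str]) -> List[str]:
    """All path prefixes an operation touches, in one global order.

    The order is root first, then by depth, then by name, so it does not
    depend on the order in which the caller listed its paths.
    """
    prefixes = set()
    for path in paths:
        parts = [p for p in path.split("/") if p]
        for i in range(len(parts) + 1):
            prefixes.add("/" + "/".join(parts[:i]))
    return sorted(prefixes, key=lambda p: (0 if p == "/" else p.count("/"), p))


@contextmanager
def locked_paths(*paths: str) -> Iterator[None]:
    """Hold locks on every prefix of the given paths, acquired in global order."""
    held: List[threading.Lock] = []
    try:
        for prefix in lock_order(list(paths)):
            lock = _lock_for(prefix)
            lock.acquire()
            held.append(lock)
        yield
    finally:
        for lock in reversed(held):
            lock.release()
```

Two threads that name the same pair of paths in opposite order still acquire the locks in the same sequence, so neither can hold one lock while waiting for a lock the other already holds.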

Page 23

Per Transaction Cache

•Experimentation revealed many roundtrips to the database per transaction.

•Cache intermediate transaction results at NameNodes.

•We also use Memcached at each NameNode to cache mappings of: path->{inode/blocks/replicas}
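A per-transaction cache along these lines could look like the sketch below. It is a minimal illustration, not the HopsFS code: the class `TransactionCache`, the string keys, and the `fetch` and `write` callbacks standing in for database round trips are all assumptions.

```python
from typing import Callable, Dict, Set


class TransactionCache:
    """Remember rows read or written during one transaction.

    Hypothetical sketch: `fetch(key)` stands in for a database round trip.
    """

    def __init__(self, fetch: Callable[[str], str]) -> None:
        self._fetch = fetch
        self._rows: Dict[str, str] = {}
        self._dirty: Set[str] = set()
        self.round_trips = 0  # how many times the database was actually read

    def get(self, key: str) -> str:
        """Read a row, going to the database only on the first access."""
        if key not in self._rows:
            self.round_trips += 1
            self._rows[key] = self._fetch(key)
        return self._rows[key]

    def put(self, key: str, row: str) -> None:
        """Buffer a write locally; nothing is sent until flush()."""
        self._rows[key] = row
        self._dirty.add(key)

    def flush(self, write: Callable[[str, str], None]) -> None:
        """Send all buffered writes, for example just before commit."""
        for key in sorted(self._dirty):
            write(key, self._rows[key])
        self._dirty.clear()
```

Repeated reads of the same inode within one transaction then cost a single round trip, and writes can be sent together at commit time.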

Page 24

Sometimes, Transactions Just ain’t Enough

Subtree Operations: 4-phase Protocol

• Sacrifices Atomicity, but keeps Isolation and Consistency.

• Batch operations and multithreading for performance.

• Failed NameNodes handled transparently.

• Leases used to handle failed clients.

• Large subtree operations with millions of inodes can’t be executed in a single transaction, due to the low timeouts for transactions (real-time).
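The slide does not spell out the four phases, so the sketch below illustrates only the batching idea: a large subtree delete is split into many short transactions. The function name `delete_subtree`, the `children` map, and the leaves-first order are assumptions made for the example; deleting leaves first is one way to avoid orphaned inodes if the operation stops partway.

```python
from typing import Callable, Dict, List


def delete_subtree(children: Dict[int, List[int]],
                   root: int,
                   batch_size: int,
                   run_transaction: Callable[[List[int]], None]) -> List[int]:
    """Delete a subtree in small batches, children before parents.

    Hypothetical sketch: `children` maps inode id -> child inode ids and
    `run_transaction(batch)` deletes one batch in its own short transaction.
    """
    # Iterative post-order traversal, so every inode appears after its children.
    order: List[int] = []
    stack = [(root, False)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            for child in reversed(children.get(node, [])):
                stack.append((child, False))
    # Each batch is small enough to finish within the transaction timeout.
    for start in range(0, len(order), batch_size):
        run_transaction(order[start:start + batch_size])
    return order
```

Because the batches commit separately, the operation as a whole is no longer atomic, which matches the trade-off stated on the slide.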

Page 25

Leader Election using the Database (NDB)

•We need a leader NameNode to coordinate replication and lease management

•Use NDB as shared memory for Leader Election.

•No more Zookeeper, yay!
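One way to use a database as shared memory for leader election is for each NameNode to bump a heartbeat counter in a shared table and to treat the lowest-numbered live node as leader. The sketch below assumes exactly that scheme; the `SharedTable` class is an in-memory stand-in for an NDB table, and the names and the lowest-id rule are illustrative assumptions rather than the protocol described in the talk.

```python
from typing import Dict, Optional, Tuple


class SharedTable:
    """In-memory stand-in for a transactional table: node id -> heartbeat counter."""

    def __init__(self) -> None:
        self.rows: Dict[int, int] = {}


def heartbeat(table: SharedTable, node_id: int) -> None:
    """Each NameNode periodically increments its own counter."""
    table.rows[node_id] = table.rows.get(node_id, 0) + 1


def elect_leader(table: SharedTable,
                 last_seen: Dict[int, int]) -> Tuple[Optional[int], Dict[int, int]]:
    """Pick the lowest node id whose counter advanced since the previous round.

    Returns the leader (or None) and a snapshot to pass in as `last_seen`
    on the next round.
    """
    alive = [n for n, count in table.rows.items() if count > last_seen.get(n, 0)]
    leader = min(alive) if alive else None
    return leader, dict(table.rows)
```

A node that stops heartbeating drops out of the `alive` list on the next round, and leadership moves to the next lowest id without any external coordination service.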

Page 26

HopsFS Internal Protocol Scalability

•On 100PB+ clusters, internal protocols make up most of the network traffic for HDFS

•Block Reporting and Exiting Safe Mode

- Batching and work stealing.

Page 27

HopsFS Write Performance

1 Gbit Network, Nodes: 12-core Xeon X560 @ 2.8 GHz. 2-Node NDB Cluster.

Page 28

HopsFS Read Performance

1 Gbit Network, Nodes: 12-core Xeon X560 @ 2.8 GHz. 2-Node NDB Cluster.

Page 29

HopsFS Erasure Coding

HDFS 2.x Triple Replication (300%)

2x Replication + XOR (220%)

Reed-Solomon (140%)
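The three percentages can be reproduced with a short calculation. The parameters below are assumptions chosen to match the slide, not values stated in it: a stripe of 10 data blocks, one XOR parity block per stripe stored twice, and four Reed-Solomon parity blocks per stripe stored once.

```python
def storage_overhead(data_replicas: int, parity_blocks: int,
                     stripe_length: int, parity_replicas: int) -> int:
    """Stored bytes per logical byte, as a whole percentage.

    data_replicas:   copies kept of every data block
    parity_blocks:   parity blocks computed per stripe
    stripe_length:   data blocks per stripe
    parity_replicas: copies kept of every parity block
    """
    ratio = data_replicas + parity_replicas * parity_blocks / stripe_length
    return round(100 * ratio)


# Assumed parameters (stripe length 10), chosen to reproduce the slide's figures.
TRIPLE_REPLICATION = storage_overhead(3, 0, 10, 0)    # three full copies
XOR_TWO_REPLICAS = storage_overhead(2, 1, 10, 2)      # 2 copies + 1 parity block, stored twice
REED_SOLOMON = storage_overhead(1, 4, 10, 1)          # 1 copy + 4 parity blocks
```

Under these assumptions the three constants evaluate to 300, 220, and 140 respectively.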

Page 30

HopsFS Erasure Coding

[Figures: data durability with Triple Replication; data durability with Reed-Solomon.]

Page 31

Comparison with HDFS-RAID

Page 32

HopsFS Snapshots

•Read-Only Root-Level Single Snapshot

- Support rollback on unsuccessful software upgrades

- Prototype developed, ongoing work on integration

- Snapshot rollback order-of-growth is O(N)

Page 33

We did the same for YARN…

Page 34

Apache Hadoop Yarn HA/Scaleout Limitations

[Figure: YARN HA architecture with clients, a primary RM, a standby RM, ZooKeeper, and five NodeManagers (NM).]

• The Resource Manager (RM) is a bottleneck.

• Zookeeper throughput not high enough to persist all RM state

• Standby resource manager can only recover partial state

• All running jobs must be restarted.

• RM state not queryable.

Page 35

Hops YARN

[Figure: a client, two RMs, five NodeManagers (NM), and three NDB nodes.]

• The RM is a State-Machine. Almost no session state to manage.

• Transparent failover working.

Page 36

Hops YARN

•FIFO Scheduler

•Capacity Scheduler

•Fair Scheduler

•Distributed Resource Tracker Service (ongoing)

•Make YARN more interactive (ongoing)

- Reduce NodeManager Heartbeat Time

Page 37

Hops-Hadoop

[Figure: Hops-Hadoop architecture. Three NameNodes (NN) and three Resource Managers (RM) sit above twelve worker nodes, each running a DataNode (DN) and a NodeManager (NM); four NDB nodes are labeled HDFS, HDFS, YARN, YARN.]

Exabyte-Scale Hadoop

Page 38

The Hops Stack Continued

Page 39

Bringing Data People Together

•Data Owners

- Metadata, Ingestion

- Non-programmers

•Data Scientists

- Data analysts

- Programmers

[Figure: the Hops stack, showing Hops-HDFS, Hops-YARN, HopsHub, Karamel/PaaS, Spark, Flink, Adam, and Cuneiform.]

Page 40

Perimeter Security and Multi-Tenancy

•HopsHub

- Project-level RBAC

• Hadoop trusted proxy

- Analytics Plugin Framework

• Adam, Cuneiform, Spark, Flink, MR

- REST APIs

Related Hadoop Security Projects: Knox, Sentry, Rhino

[Figure: network isolation, with LIMS, LDAP, and Kerberos.]

Page 41

HopsHub Two-Factor Authentication

Page 42

Projects for Multi-Tenancy; Activity Trails

Global Activity Trail

Project

Page 43

Project Membership

Page 44

File Browser (Iceberg)

HDFS Files

Page 45

Upload Data

Apache Flume

Overcome 3 GB browser upload limit

Automated Ingestion of Data

Page 46

Run Cuneiform Workflows on YARN

[Figure: genomics workflow with Align, Variant Calling, and Annotate steps over FastQ files, a BAM file, and a VCF file, producing results; the data sizes shown are ~250 GB, ~10 GB, and ~5 MB.]

Page 47

Ongoing MSc Projects

•Realizing the metadata dream of WinFS

- Vangelis

•Optimizing YARN’s Resource Tracker Service (interactive YARN)

- Sri

•Interactive Data Analytics (Zeppelin-EE)

- Seckin

Page 48

PaaS support with Chef/Karamel

Support for EC2, Vagrant, Bare Metal.

Page 49

Conclusions

•Hops will be the first European distribution of Hadoop when released.

- First beta release coming in Q1 2015

•Lots of ideas for future work

- Tighter Spark, Flink integration

- BiobankCloud support

•NGS Hadoop Workshop Feb 19-20, Stockholm

- Signup at www.biobankcloud.com
