High Availability HDFS - MSST Conference. Matt Foley, Hortonworks, Inc. ...
TRANSCRIPT
![Page 2: High Availability HDFS - MSST ConferenceHigh Availability HDFS Matt Foley Hortonworks, Inc. mfoley@hortonworks.com . Matt Foley - Background Architecting the Future of Big Data](https://reader036.vdocument.in/reader036/viewer/2022071010/5fc7a32d5e32ab565242840d/html5/thumbnails/2.jpg)
Matt Foley - Background
Page 2 Architecting the Future of Big Data
• MTS at Hortonworks Inc.
  – HDFS contributor, part of original ~25 in Yahoo! spin-out of Hortonworks
  – Currently managing engineering infrastructure for Hortonworks
  – My team also provides Build Engineering infrastructure services to ASF, for Hadoop core and several related projects within Apache
  – Formerly, led software development for back end of Yahoo Mail for three years
    – 20,000 servers with 30 PB of data under management, 400M active users
  – Did startups in Storage Management and Log Management
• Apache Hadoop, ASF
  – Committer and PMC member, Hadoop core
  – Release Manager, Hadoop-1.0
Company Background
• In 2006, Yahoo! was a very early adopter of Hadoop, and became the principal contributor to it.
• Over time, invested 40K+ servers and 170 PB of storage in Hadoop
• Over 1000 active users run 5M+ Map/Reduce jobs per month
• In 2011, Yahoo! spun off ~25 engineers into Hortonworks, a company focused on advancing open source Apache Hadoop for the broader market (http://www.wired.com/wiredenterprise/2011/10/how-yahoo-spawned-hadoop)
Agenda
• Overview of HDFS architecture
• Hadoop “ecosystem”
• Hadoop 2.0
• High Availability
• What has been the HDFS record?
  – reliability
  – availability
• HDFS-HA
What is Hadoop?
• Hadoop - Open Source Apache Project
  – Framework for reliably storing & processing petabytes of data using commodity hardware and storage
• Scalable solution
  – Computation capacity
  – Storage capacity
  – I/O bandwidth
• Core components
  – HDFS: Hadoop Distributed File System - distributes data
  – Map/Reduce - distributes application processing and control
• Move computation to data and not the other way
• Written in Java
• Runs on
  – Linux, Windows, Solaris, and Mac OS/X
Commodity Hardware Cluster
• Typically in 2- or 3-level architecture
  – Nodes are commodity Linux servers
  – 20 - 40 nodes/rack
  – Uplink from rack is 10 or 2x10 gigabit
  – Rack-internal is 1 or 2x1 gigabit all-to-all
• “Flat fabric” 10Gbit network architectures being planned at growing number of sites
Hadoop Distributed File System (HDFS)
• One PB-scale file system for the entire cluster
  – Managed by a single Namenode
  – Files are written, read, renamed, deleted, but append-only
  – Optimized for streaming reads of large files
• Files are broken into uniform sized blocks
  – Blocks are typically 128 MB (nominal – no wasted space)
  – Replicated to several Datanodes, for reliability
  – Exposes block placement so that computation can be migrated to data
• Client library directly reads data from Datanodes
  – Bandwidth scales linearly with the number of nodes
  – System is topology-aware
  – Array of block locations is available to clients
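To make the "nominal, no wasted space" point concrete, here is a minimal Python sketch of how a file might be cut into blocks. The function name `split_into_blocks` is hypothetical and is not part of any Hadoop API; the only figure taken from the slide is the 128 MB nominal block size.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # nominal block size from the slide (128 MB)


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return the list of block lengths for a file of file_size bytes.

    Hypothetical helper for illustration only: every block is full-sized
    except the last one, which holds just the leftover bytes.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    full_blocks, remainder = divmod(file_size, block_size)
    lengths = [block_size] * full_blocks
    if remainder:
        lengths.append(remainder)  # short final block, no padding
    return lengths


if __name__ == "__main__":
    # A 300 MB file needs two full blocks plus one short block.
    for index, length in enumerate(split_into_blocks(300 * 1024 * 1024)):
        print(index, length)
```

Because the final block only occupies as many bytes as it holds, the block size acts as an upper bound rather than an allocation unit.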
HDFS Diagram
[Diagram: Horizontally scale I/O and storage. The Namenode holds the namespace state, a hierarchical namespace (file name → block IDs), and the block map (block ID → block locations), with namespace metadata & journal also kept on a Backup Namenode. Datanodes store replicated blocks b1 to b6 (block ID → data) and send heartbeats & block reports to the Namenode.]
Block Placement
• Default is 3 replicas, but settable
• Blocks are placed (writes are pipelined):
  – First replica on the local node or a random node on local rack
  – Second replica on a remote rack
  – Third replica on a node on same remote rack
  – Other replicas randomly placed
• Clients read from closest replica
  – System is topology-aware
• Block placement policy is pluggable
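The placement rules above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the Hadoop placement policy itself: `place_replicas`, the `topology` dictionary and the `(rack, node)` tuples are all hypothetical, and the sketch assumes the writer runs on a cluster node and that a remote rack with enough nodes exists.

```python
import random


def place_replicas(writer, topology, replication=3, rng=random):
    """Pick (rack, node) targets following the rules on the slide.

    writer:   (rack, node) of the client, assumed to be a cluster node.
    topology: dict mapping rack name -> list of node names (hypothetical model).
    Raises IndexError if the topology is too small for the requested rules.
    """
    local_rack, local_node = writer
    chosen = [(local_rack, local_node)]          # first replica: local node
    if replication >= 2:
        remote_racks = [r for r in topology if r != local_rack and topology[r]]
        remote_rack = rng.choice(remote_racks)   # second replica: a remote rack
        second = rng.choice(topology[remote_rack])
        chosen.append((remote_rack, second))
    if replication >= 3:
        # third replica: a different node on the same remote rack
        others = [n for n in topology[remote_rack] if n != second]
        chosen.append((remote_rack, rng.choice(others)))
    while len(chosen) < replication:
        # any further replicas: random nodes not used yet
        unused = [(r, n) for r, nodes in topology.items()
                  for n in nodes if (r, n) not in chosen]
        chosen.append(rng.choice(unused))
    return chosen


if __name__ == "__main__":
    cluster = {"rack1": ["a", "b", "c"], "rack2": ["d", "e", "f"]}
    print(place_replicas(("rack1", "a"), cluster))
```

Putting two of the three replicas on one remote rack means the write pipeline crosses the rack uplink only once, while a whole-rack failure still leaves at least one copy.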
Block Correctness
• Data is checked with CRC32
• File creation
  – Client computes block checksums
  – Datanode stores the checksums
• File access
  – Client retrieves the data and checksum from Datanode
  – If validation fails, client tries other replicas
• Periodic validation by Datanode
  – Background DataBlockScanner task
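The read-side check can be illustrated with Python's standard `zlib.crc32`. This is a simplified sketch: it assumes one checksum per block and models replicas as in-memory `(data, stored_checksum)` pairs, and the names `checksum` and `read_block` are hypothetical rather than HDFS client calls.

```python
import zlib


def checksum(data: bytes) -> int:
    """CRC32 of the data as an unsigned 32-bit integer."""
    return zlib.crc32(data) & 0xFFFFFFFF


def read_block(replicas):
    """Return the data of the first replica whose checksum validates.

    replicas: iterable of (data, stored_checksum) pairs, ordered from the
    closest replica to the farthest (an assumption of this sketch).
    """
    for data, stored in replicas:
        if checksum(data) == stored:
            return data          # validation passed, use this replica
        # validation failed: fall through and try the next replica
    raise IOError("all replicas failed checksum validation")


if __name__ == "__main__":
    good = b"hello hdfs"
    crc = checksum(good)         # computed by the client at file creation
    corrupted = b"hellO hdfs"    # a replica damaged on disk
    print(read_block([(corrupted, crc), (good, crc)]))
```

The same comparison is what a background scanner would run periodically, so that corruption is found before a client happens to read the block.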
HDFS Data Reliability
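[Diagram: handling a bad/lost block replica. The Namenode (namespace state, block map) tracks blocks b1 to b6 across the Datanodes. Steps shown: 1. replicate (Namenode to Datanode), 2. copy (Datanode to Datanode), 3. blockReceived (Datanode to Namenode).]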
Active Data Management
•Continuous replica maintenance
•End-to-end checksums
•Periodic checksum verification
•Decommissioning nodes for service
•Balancing storage utilization
Other Hadoop Ecosystem Components
• Core Apache Hadoop
  – HDFS (Hadoop Distributed File System)
  – MapReduce (Distributed Programming Framework)
• Related Hadoop Projects
  – Hive (SQL)
  – Pig (Data Flow)
  – HBase (Columnar NoSQL Store)
  – Zookeeper (Coordination)
  – HCatalog (Table & Schema Management)
Hadoop 2.0
• Developed on Hadoop branch 0.23
• Highlights:
  – HDFS Namenode HA
  – HDFS Namenode Federation
  – Next-Generation MapReduce architecture (aka YARN)
  – Performance
HDFS Federation in v2.0
• Improved scalability and isolation
• Clear separation of Namespace and Block Storage
MapReduce2 - YARN
Current HDFS Reliability & Availability
• Block store – extremely high
  – Block replicas stored in native FS on multiple nodes
  – Transparently ensure that blocks stay replicated
  – Serve from closest available replica
  – A lost node with 12 TB can be re-replicated in 7 minutes
  – A single lost disk of 1 TB can be re-replicated in 30 seconds
  – With standard 3x replication, probability of data loss due to normal rates of server and disk failure is infinitesimally small
    – even assuming very casual approach to parts replacement
  – In study of 2009 data, lost 19 blocks out of 329M on 20,000 nodes, due to software bugs that have since been fixed.
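The figures on this slide imply some useful orders of magnitude. The short Python calculation below derives them; it assumes decimal units (1 TB = 10^12 bytes), and the helper name `aggregate_rate_gb_per_s` is hypothetical.

```python
TB = 10**12  # decimal terabyte, an assumption of this sketch
GB = 10**9


def aggregate_rate_gb_per_s(bytes_to_copy, seconds):
    """Cluster-wide copy rate needed to re-replicate the data in the given time."""
    return bytes_to_copy / seconds / GB


# Lost node: 12 TB re-replicated in 7 minutes -> roughly 28.6 GB/s in aggregate.
node_rate = aggregate_rate_gb_per_s(12 * TB, 7 * 60)

# Lost disk: 1 TB re-replicated in 30 seconds -> roughly 33.3 GB/s in aggregate.
disk_rate = aggregate_rate_gb_per_s(1 * TB, 30)

# 2009 study: 19 blocks lost out of 329 million -> roughly 5.8 in 100 million.
loss_fraction = 19 / 329e6

if __name__ == "__main__":
    print(f"{node_rate:.1f} GB/s, {disk_rate:.1f} GB/s, {loss_fraction:.2e}")
```

Rates of this size are far beyond what a single server can deliver, which is consistent with re-replication being spread across many Datanodes that each copy a small share of the lost blocks.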
Current HDFS Reliability & Availability
• Meta-data store
  – Single Namenode stores state
  – Journaling and snapshot management to assure data persistence, to multiple local and NFS (HA) stores
  – But SPOF with manual switch-over on failure
• How well did it work?
  – 18-month study of 25 clusters had 22 NN failures
  – Only 8 of them would have been helped with HA
  – Impacted availability, but never durability.
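A quick calculation puts the study numbers in per-cluster terms. It uses only the figures quoted on the slide and assumes the failures were spread evenly over clusters and time, which the slide does not state.

```python
clusters = 25
study_years = 18 / 12          # 18-month study
nn_failures = 22
helped_by_ha = 8

# Failures per cluster per year, assuming an even spread (an assumption).
failures_per_cluster_year = nn_failures / (clusters * study_years)

# Share of the observed failures that HA would have helped with.
ha_share = helped_by_ha / nn_failures

if __name__ == "__main__":
    print(f"{failures_per_cluster_year:.2f} failures per cluster-year")
    print(f"{ha_share:.0%} of failures addressable by HA")
```

Under that assumption a cluster saw a Namenode failure roughly once every 20 months, and about a third of those failures were of a kind that a standby could have absorbed.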
HA: Approach and Terminology
• Initial goal is Active-Standby
  – Active Namenode: actively serves read/write operations from clients
  – Standby Namenode: waits, becomes active when Active Namenode fails
    – Could serve read operations
• Standby’s state may be cold, warm or hot
  – Cold: Standby has zero state (e.g., started after the Active is declared dead)
  – Warm: Standby has partial state, either:
    – has loaded fsImage & editLogs but has not received any block reports
    – has loaded fsImage and rolled logs and all block reports
  – Hot: Standby has almost all of the Active’s state and can start immediately
High Level Use Cases
• Planned downtime
  – Upgrades
  – Config changes
  – Main reason for downtime
• Unplanned downtime
  – Hardware failure
  – Server unresponsive
  – Software failures
  – Occurs infrequently
• Supported failures
  – Single hardware failure
    – Double hardware failure not supported
  – Some software failures
    – Same software failure affects both active and standby
Deployment Models
• Single Namenode configuration; no failover
• Active and Standby with manual failover
  – Standby could be cold/warm/hot
• Active and Standby with automatic failover
  – Hot standby
• See HDFS-1623 for detailed use cases
Design
• Failover control outside Namenode
• Parallel block reports to Active and Standby (hot failover)
• Shared or non-shared Namenode state
• Fencing of shared resources/data
  – Datanodes
  – Shared Namenode state (if any)
• Client failover
  – IP failover
  – Smart clients (e.g., Zookeeper for coordination)
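As an illustration of the "smart client" option, the sketch below retries an operation against each known Namenode in turn. Everything here is hypothetical: `NamenodeUnavailable`, `call_with_failover` and the list of Namenode names stand in for whatever RPC stub and discovery mechanism (for example a lookup in Zookeeper) a real client would use.

```python
class NamenodeUnavailable(Exception):
    """Raised by a (hypothetical) RPC stub when a Namenode cannot serve."""


def call_with_failover(namenodes, operation, max_rounds=2):
    """Run operation(namenode) against each Namenode until one succeeds.

    namenodes: ordered list of candidate Namenode addresses (assumed known
    to the client, e.g. from configuration or a coordination service).
    """
    last_error = None
    for _ in range(max_rounds):
        for nn in namenodes:
            try:
                return operation(nn)
            except NamenodeUnavailable as err:
                last_error = err      # remember the failure, try the next one
    raise NamenodeUnavailable(
        f"no Namenode answered after {max_rounds} rounds") from last_error


if __name__ == "__main__":
    def list_root(nn):
        # Pretend nn1 has just failed and nn2 has taken over as Active.
        if nn == "nn1":
            raise NamenodeUnavailable("nn1 is down")
        return f"listing served by {nn}"

    print(call_with_failover(["nn1", "nn2"], list_root))
```

With IP failover, by contrast, the client keeps a single address and the retry logic lives in the network layer rather than in the client library.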
Failover Control Outside Namenode
• Failover Controller – outside Namenode
• Daemon manages resources
  – All resources modeled uniformly
  – Resources – OS, HW, Network etc.
  – Namenode is just another resource
• Heartbeat with other nodes
• Quorum based leader election
  – Zookeeper for co-ordination and quorum
• Fencing during split brain
  – Prevents data corruption
[Diagram: the Failover Controller applies actions (start, stop, failover, monitor, …) to resources, coordinates through a Quorum Service, and manages both per-node resources and shared resources.]
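The interplay of health monitoring, leader election and fencing can be sketched with a small in-memory model. This is a toy under stated assumptions, not the HDFS implementation: `LockService` stands in for a quorum service such as Zookeeper, `SharedEdits` for shared Namenode state with a single writer, and all class and method names are hypothetical.

```python
class LockService:
    """In-memory stand-in for a quorum-based lock (assumption: one lock)."""

    def __init__(self):
        self.holder = None

    def try_acquire(self, name):
        if self.holder is None:
            self.holder = name
        return self.holder == name

    def release(self, name):
        if self.holder == name:
            self.holder = None


class SharedEdits:
    """Shared state that accepts writes from exactly one Namenode."""

    def __init__(self):
        self.writer = None
        self.entries = []

    def fence_and_claim(self, name):
        self.writer = name            # any previous writer is now fenced off

    def append(self, name, entry):
        if name != self.writer:
            raise PermissionError(f"{name} is fenced")
        self.entries.append(entry)


class Namenode:
    def __init__(self, name):
        self.name, self.state, self.healthy = name, "standby", True


class FailoverController:
    """Runs outside the Namenode and treats it as a monitored resource."""

    def __init__(self, namenode, lock, shared):
        self.nn, self.lock, self.shared = namenode, lock, shared

    def tick(self):
        """One monitoring round: check health, then contend for leadership."""
        if not self.nn.healthy:
            self.nn.state = "standby"
            self.lock.release(self.nn.name)
            return
        if self.lock.try_acquire(self.nn.name) and self.nn.state != "active":
            self.shared.fence_and_claim(self.nn.name)  # fence before serving
            self.nn.state = "active"


if __name__ == "__main__":
    lock, shared = LockService(), SharedEdits()
    nn1, nn2 = Namenode("nn1"), Namenode("nn2")
    fc1 = FailoverController(nn1, lock, shared)
    fc2 = FailoverController(nn2, lock, shared)
    fc1.tick(); fc2.tick()            # nn1 wins the election
    nn1.healthy = False
    fc1.tick(); fc2.tick()            # nn1 steps down, nn2 takes over
    print(nn1.state, nn2.state, shared.writer)
```

One simplification worth noting: here the failed node's controller releases the lock itself, whereas with Zookeeper an ephemeral node disappears when the holder's session expires, which also covers the case where the controller dies together with its Namenode.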
HA Namenode with ZooKeeper
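[Diagram: an Active NN and a Standby NN, each with its own FailoverController (Active and Standby). Each FailoverController monitors the health of its NN, OS and HW and sends it commands (Cmds), and exchanges heartbeats with a set of three ZK nodes. Three DNs are shown alongside the Namenodes.]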
Sharing the Namenode’s Persistent State medium term – 6 month timeframe
[Diagram: two options, each with an Active NN and a Standby NN.
  – Shared storage approach: shared NN state with single writer (fencing)
  – Edit logs: direct stream to Standby NN]
Sharing the Namenode’s Persistent State long term
[Diagram: Active NN and Standby NN above three DNs. Store NN journal and checkpointed image on Datanodes.]
Hadoop 2.0 “Availability” (in the field)
• Requires LOTS of testing
• In small-scale test (500-800 nodes) 2Q2012
• Ramping up over rest of year, with full range of application testing
• Expected to be in production at multiple sites by end/2012
Credits
For major contributions to Hadoop technology, and help with this presentation:
• Sanjay Radia and Suresh Srinivas, Hortonworks
  – Architect and Team Lead, HDFS
  – HA and Federation
• Owen O’Malley, Hortonworks
  – Hadoop lead Architect
  – Security, Map/Reduce
• Arun Murthy, Hortonworks
  – Architect and Team Lead, Map/Reduce
  – M/R2, YARN, etc.
• Rob Chansler, Yahoo!
  – Team Lead, HDFS
  – Analysis of Data Availability and Durability
Help getting started
• Apache Hadoop Projects
  – http://hadoop.apache.org/
  – http://wiki.apache.org/hadoop/
• Apache Hadoop Email lists:
  – [email protected]
  – [email protected]
  – [email protected]
• O’Reilly Books
  – Hadoop, The Definitive Guide
  – HBase, The Definitive Guide
• Hortonworks, Inc.
  – Installable Data Platform distribution (100% OSS, conforming to Apache releases)
    – http://hortonworks.com/technology/techpreview/
  – Training and Certification programs
    – http://hortonworks.com/training/
• Hadoop Summit 2012 (June 13-14, San Jose)
  – http://hadoopsummit.org/
Thanks for Listening!
Questions?