Extending Hadoop for Fun & Profit
Post on 13-Sep-2014
DESCRIPTION
The Apache Hadoop project and the Hadoop ecosystem have been designed to be extremely flexible and extensible. HDFS, YARN, and MapReduce combined have more than 1000 configuration parameters that allow users to tune the performance of Hadoop applications and, more importantly, to extend Hadoop with application-specific functionality without modifying any of the core Hadoop code. In this talk, I will start with simple extensions, such as writing a new InputFormat to efficiently process video files. I will present some extensions that boost application performance, such as optimized compression codecs and pluggable shuffle implementations. With the refactoring of the MapReduce framework and the emergence of YARN as a generic resource manager for Hadoop, one can extend Hadoop further by implementing new computation paradigms. I will discuss one such computation framework, which allows message-passing applications to run in the Hadoop cluster alongside MapReduce. I will conclude by outlining some of our ongoing work that extends HDFS by removing namespace limitations of the current Namenode implementation.
TRANSCRIPT
![Page 1: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/1.jpg)
Extending Hadoop for Fun & Profit
Milind Bhandarkar, Chief Scientist, Pivotal Software
(Twitter: @techmilind)
![Page 2: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/2.jpg)
About Me
• http://www.linkedin.com/in/milindb
• Founding member of Hadoop team at Yahoo! [2005-2010]
• Contributor to Apache Hadoop since v0.1
• Built and led Grid Solutions Team at Yahoo! [2007-2010]
• Parallel Programming Paradigms [1989-today] (PhD cs.illinois.edu)
• Center for Development of Advanced Computing (C-DAC), National Center for Supercomputing Applications (NCSA), Center for Simulation of Advanced Rockets, Siebel Systems (acquired by Oracle), Pathscale Inc. (acquired by QLogic), Yahoo!, LinkedIn, and Pivotal (formerly Greenplum)
![Page 3: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/3.jpg)
Agenda
• Extending MapReduce
• Functionality
• Performance
• Beyond MapReduce with YARN
• Hamster & GraphLab
• Extending HDFS
• Q & A
![Page 4: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/4.jpg)
Extending MapReduce
![Page 5: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/5.jpg)
MapReduce Overview
• Record = (Key, Value)
• Key : Comparable, Serializable
• Value: Serializable
• Logical Phases: Input, Map, Shuffle, Reduce, Output
![Page 6: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/6.jpg)
Map
• Input: (Key1, Value1)
• Output: List(Key2, Value2)
• Projections, Filtering, Transformation
![Page 7: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/7.jpg)
Shuffle
• Input: List(Key2, Value2)
• Output
• Sort(Partition(List(Key2, List(Value2))))
• Provided by Hadoop : Several Customizations Possible
![Page 8: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/8.jpg)
Reduce
• Input: List(Key2, List(Value2))
• Output: List(Key3, Value3)
• Aggregations
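The three phases described on the Map, Shuffle, and Reduce slides can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not how Hadoop implements it; `run_mapreduce`, the word-count functions, and the hash-based partitioner are assumptions made for the example.

```python
from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn, num_partitions=2):
    # Map: each (key1, value1) record yields a list of (key2, value2) pairs.
    mapped = [kv for k1, v1 in records for kv in map_fn(k1, v1)]

    # Shuffle: partition by key, then group values per key.
    partitions = [defaultdict(list) for _ in range(num_partitions)]
    for k2, v2 in mapped:
        partitions[hash(k2) % num_partitions][k2].append(v2)

    # Reduce: keys are visited in sorted order within each partition, and
    # each (key2, [value2, ...]) group yields (key3, value3) pairs.
    output = []
    for part in partitions:
        for k2 in sorted(part):
            output.extend(reduce_fn(k2, part[k2]))
    return output

def word_count_map(offset, line):
    # Transformation: one (word, 1) pair per word in the line.
    return [(word, 1) for word in line.split()]

def word_count_reduce(word, counts):
    # Aggregation: sum the counts for one word.
    return [(word, sum(counts))]

lines = [(0, "to be or not"), (13, "to be")]
result = dict(run_mapreduce(lines, word_count_map, word_count_reduce))
```

The input keys mimic byte offsets into a text file, which is the kind of key a line-oriented input format would hand to the mapper.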
![Page 9: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/9.jpg)
MapReduce DataFlow
![Page 10: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/10.jpg)
Configuration
• Unified Mechanism for
• Configuring Daemons
• Runtime environment for Jobs/Tasks
• Defaults: *-default.xml
• Site-Specific: *-site.xml
• final parameters
![Page 11: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/11.jpg)
Example
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>head.server.node.com:9001</value>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://head.server.node.com:9000</value>
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
    <final>true</final>
  </property>
  ....
</configuration>
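The layering of defaults, site-specific values, and final parameters can be modeled with a small toy merge. This is a sketch of the override rule only, not Hadoop's `Configuration` class; `load_layers`, the job-level layer, and the default heap value are assumptions made for illustration.

```python
def load_layers(*layers):
    """Merge config layers in order. Later layers override earlier ones,
    except for parameters that an earlier layer marked as final."""
    merged, finals = {}, set()
    for layer in layers:
        for name, (value, is_final) in layer.items():
            if name in finals:
                continue  # a final parameter cannot be overridden later
            merged[name] = value
            if is_final:
                finals.add(name)
    return merged

# Each entry maps a parameter name to (value, is_final).
defaults = {"mapred.child.java.opts": ("-Xmx200m", False)}   # illustrative default
site = {"mapred.child.java.opts": ("-Xmx512m", True)}        # marked final, as in the example
job = {"mapred.child.java.opts": ("-Xmx4g", False),          # hypothetical job override
       "mapred.reduce.tasks": ("10", False)}

conf = load_layers(defaults, site, job)
# conf["mapred.child.java.opts"] stays "-Xmx512m" because the site layer marked it final.
```

The job can still set parameters the site did not lock down, which is why `mapred.reduce.tasks` survives the merge.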
![Page 12: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/12.jpg)
Extending Input Phase
• Convert ByteStream to List(Key, Value)
• Several Formats pre-packaged
• TextInputFormat<LongWritable, Text>
• SequenceFileInputFormat<K,V>
• KeyValueTextInputFormat<Text,Text>
• Specify InputFormat for each job
• JobConf.setInputFormat()
![Page 13: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/13.jpg)
InputFormat
• getSplits() : From Input descriptors, get Input Splits, such that each Split can be processed independently
• <FileName, startOffset, length>
• getRecordReader() : From an InputSplit, get list of Records
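A minimal sketch of what a file-based `getSplits()` computes: a list of `<FileName, startOffset, length>` triples derived from file metadata alone. The names `FileSplit` and `get_splits` are hypothetical, and the real Hadoop implementation has additional rules (for example around the size of the final split) that are ignored here.

```python
from typing import List, NamedTuple

class FileSplit(NamedTuple):
    file_name: str
    start_offset: int
    length: int

def get_splits(file_name: str, file_length: int, split_size: int) -> List[FileSplit]:
    """Cut a file of file_length bytes into consecutive splits of at most
    split_size bytes. Only metadata is used; the file is never opened."""
    splits = []
    offset = 0
    while offset < file_length:
        length = min(split_size, file_length - offset)
        splits.append(FileSplit(file_name, offset, length))
        offset += length
    return splits

# A 300-byte file with 128-byte splits gives two full splits and a 44-byte tail.
splits = get_splits("video.mpg", 300, 128)
```

Each split can then be handed to an independent record reader, which is what lets map tasks run in parallel.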
![Page 14: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/14.jpg)
Industry Use Case
Surveillance Video Anomaly Detection
![Page 15: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/15.jpg)
Acknowledgements
• Victor Fang
• Regu Radhakrishnan
• Derek Lin
• Sameer Tiwari
![Page 16: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/16.jpg)
Anomaly Detection in Surveillance Video
• Detect anomalous objects in a restricted perimeter
• A typical large enterprise collects terabytes of video per day
• Hadoop MapReduce runs computer vision algorithms in parallel and captures violation events
• Post-Incident monitoring enabled by Interactive Query
![Page 17: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/17.jpg)
Video DataFlow
• Timestamped Video Files as input
• Distributed Video Transcoding : ETL in Hadoop
• Distributed Video Analytics in Hadoop/HAWQ
• Insights in relational DB
![Page 18: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/18.jpg)
Real World Video Data
• Benchmark surveillance videos from the UK Home Office (i-LIDS)
• CCTV video footage depicting scenarios central to government requirements
![Page 19: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/19.jpg)
Common Video Standards
• MPEG & ITU responsible for most video standards
• MPEG-2 (1995) Widely adopted in DVDs, TV, Set Top boxes
![Page 20: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/20.jpg)
MPEG Standard Format
• Sequence of encoded video frames
• Compression by eliminating:
• Redundancy in Time: Inter-Frame Encoding
• Redundancy in Space: Intra-Frame Encoding
![Page 21: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/21.jpg)
Motion Compensation
• I-Frame: Intra-Frame encoding
• P-Frame: Predicted frame from previous frame
• B-Frame: Predicted frame from both previous & next frame
![Page 22: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/22.jpg)
Distributed MPEG Decoding
• HDFS splits large files into 64 MB/128 MB blocks
• Each HDFS block can be processed independently by a Map task
• Can we decode individual video frames from an arbitrary HDFS block in an MPEG file?
![Page 23: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/23.jpg)
Splitting MPEG-2
• Header Information available only once per file
• Group of Pictures (GOP) header repeats
• Each GOP starts with an I-Frame and ends with an I-Frame
• Each GOP can be decoded independently
• First and last GOP may straddle HDFS blocks
![Page 24: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/24.jpg)
MPEG2InputFormat
• Derived from FileInputFormat
• getSplits() : Identical to FileInputFormat
• InputSplit = HDFS Block
• getRecordReader()
• MPEG2RecordReader
![Page 25: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/25.jpg)
MPEG2RecordReader
• Start from beginning of block
• Search for the first GOP Header
• Locate an I-Frame, decode, keep in memory
• If P-Frame, decode using last frame
• If B-Frame, keep current frame in memory, read next frame, decode current frame
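The first two steps above (start at the beginning of the block, search for the first GOP header) can be sketched as a byte scan. In an MPEG-2 video elementary stream the GOP header begins with the start code `0x000001B8`. Everything else here is an assumption for illustration: the function names, the rule that a block owns every GOP whose header starts inside it, and the synthetic test stream. Frame decoding itself is not shown.

```python
GOP_START_CODE = b"\x00\x00\x01\xb8"  # MPEG-2 group_start_code

def gop_offsets(data: bytes, block_start: int, block_end: int):
    """Offsets of GOP headers that begin inside [block_start, block_end)."""
    offsets = []
    pos = data.find(GOP_START_CODE, block_start)
    while pos != -1 and pos < block_end:
        offsets.append(pos)
        pos = data.find(GOP_START_CODE, pos + len(GOP_START_CODE))
    return offsets

def gop_ranges(data: bytes, block_start: int, block_end: int):
    """Byte ranges of the GOPs owned by this block. The last GOP may run
    past block_end, so reading continues to the next GOP header or EOF."""
    ranges = []
    for start in gop_offsets(data, block_start, block_end):
        nxt = data.find(GOP_START_CODE, start + len(GOP_START_CODE))
        ranges.append((start, nxt if nxt != -1 else len(data)))
    return ranges

# Synthetic stream: 3 header bytes, then three GOPs with made-up payloads.
stream = (b"HDR" + GOP_START_CODE + b"aaaa"
          + GOP_START_CODE + b"bbbbbb"
          + GOP_START_CODE + b"cc")

first_block = gop_ranges(stream, 0, 16)    # [(3, 11), (11, 21)]
second_block = gop_ranges(stream, 16, 32)  # [(21, 27)]
```

With 16-byte blocks, the second GOP straddles the block boundary: it is read in full by the first block's reader, and the second block's reader skips ahead to the next GOP header, so no GOP is decoded twice.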
![Page 26: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/26.jpg)
Considerations for Input Format
• Use as little metadata as possible
• Number of Splits = Number of Map Tasks
• Combine small files
• Split determination happens in a single process, so should be metadata-based
• Affects scalability of MapReduce
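One way to read "combine small files" together with "metadata-based": pack files into splits using only their sizes, without opening them. The greedy packer below is a hypothetical sketch; Hadoop ships `CombineFileInputFormat` for this purpose, and this code does not reproduce its algorithm.

```python
def combine_small_files(file_sizes, target_split_size):
    """Greedily group (name, size) pairs into splits whose total size stays
    at or below target_split_size where possible. Uses sizes only."""
    splits, current, current_size = [], [], 0
    for name, size in file_sizes:
        # Close the current split if adding this file would overshoot.
        if current and current_size + size > target_split_size:
            splits.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        splits.append(current)
    return splits

files = [("a", 40), ("b", 50), ("c", 30), ("d", 100), ("e", 10)]
splits = combine_small_files(files, 128)
# Five files become two splits, so two map tasks instead of five.
```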
![Page 27: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/27.jpg)
Scalability
• If one node processes k MB/s, then N nodes should process (k*N) MB/s
• If some fixed amount of data is processed in T minutes on one node, the N nodes should process same data in (T/N) minutes
• Linear Scalability
![Page 28: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/28.jpg)
Reduce Latency
Minimize Job Execution time
![Page 29: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/29.jpg)
Increase Throughput
Maximize amount of data processed per unit time
![Page 30: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/30.jpg)
Amdahl’s Law
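$$S = \frac{N}{1 + \alpha (N - 1)}$$
where S is the speedup on N nodes and α is the fraction of the computation that remains sequential.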
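A quick numeric check, reading the slide's formula as S = N / (1 + α(N − 1)) with α as the sequential fraction (that reading of the garbled symbols is an assumption). The function name is hypothetical.

```python
def amdahl_speedup(n: int, alpha: float) -> float:
    """Speedup on n nodes when a fraction alpha of the work is sequential."""
    return n / (1 + alpha * (n - 1))

perfect = amdahl_speedup(100, 0.0)      # no sequential part: linear speedup
five_pct = amdahl_speedup(100, 0.05)    # 5% sequential: about 16.8x on 100 nodes
huge = amdahl_speedup(1_000_000, 0.05)  # approaches the 1/alpha = 20x ceiling
```

Even a small sequential fraction caps the achievable speedup at 1/α, no matter how many nodes are added.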
![Page 31: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/31.jpg)
Multi-Phase Computations
• If computation C is split into N different parts, C1..CN
• If partial computation Ci can be sped up by a factor of Si
![Page 32: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/32.jpg)
Amdahl’s Law, Restated
$$S = \frac{\sum_{i=1}^{N} C_i}{\sum_{i=1}^{N} \frac{C_i}{S_i}}$$
![Page 33: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/33.jpg)
Amdahl’s Law
• Suppose Job has 5 phases: P0 is 10 seconds, P1, P2, P3 are 200 seconds each, and P4 is 10 seconds
• Sequential runtime = 620 seconds
• P1, P2, P3 parallelized on 100 machines with speedup of 80 (Each executes in 2.5 seconds)
• After parallelization, runtime = 27.5 seconds
• Effective Speedup: (620s/27.5s) = 22.5
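The arithmetic in this example can be reproduced with the multi-phase form of Amdahl's Law: total sequential time divided by the sum of each phase's time over its own speedup. The function name is hypothetical; the numbers are the ones from the slide.

```python
def multiphase_speedup(phase_times, phase_speedups):
    """Return (sequential runtime, parallel runtime, overall speedup)."""
    sequential = sum(phase_times)
    parallel = sum(c / s for c, s in zip(phase_times, phase_speedups))
    return sequential, parallel, sequential / parallel

# P0 and P4 stay sequential (speedup 1); P1..P3 each get a speedup of 80.
seq, par, speedup = multiphase_speedup([10, 200, 200, 200, 10],
                                       [1, 80, 80, 80, 1])
# seq = 620 s, par = 27.5 s, speedup is about 22.5
```

The two 10-second sequential phases account for 20 of the 27.5 seconds after parallelization, which is why 100 machines yield well under 80x overall.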
![Page 34: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/34.jpg)
MapReduce Workflow
![Page 35: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/35.jpg)
Extending Shuffle
![Page 36: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/36.jpg)
Why Shuffle?
• Often the most expensive phase in MapReduce, because it involves slow disks and the network
• Map tasks partition, sort and serialize outputs, and write to local disk
• Reduce tasks pull individual Map outputs over network, merge, and may spill to disk
![Page 37: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/37.jpg)
Message Cost Model
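$$T = \alpha + N\beta$$
where T is the time to send one message of size N, α is the fixed per-message latency, and β is the transfer time per unit of data.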
![Page 38: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/38.jpg)
Message Granularity
• For Gigabit Ethernet
• α = 300 μs
• β = 10 ns per byte (i.e. 100 MB/s of bandwidth)
• 100 Messages of 10KB each = 40 ms
• 10 Messages of 100 KB each = 13 ms
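The two totals above follow from the cost model when β is read as the per-byte time implied by 100 MB/s and KB and MB are taken as decimal units (both are assumptions about the slide's conventions). The names below are hypothetical.

```python
ALPHA_S = 300e-6             # per-message latency: 300 microseconds
BANDWIDTH_BYTES_PER_S = 100e6  # 100 MB/s, so beta = 1 / bandwidth = 10 ns per byte

def transfer_time(num_messages: int, message_bytes: int) -> float:
    """Total time in seconds to send num_messages messages of message_bytes each."""
    per_message = ALPHA_S + message_bytes / BANDWIDTH_BYTES_PER_S
    return num_messages * per_message

many_small = transfer_time(100, 10_000)  # 100 x 10 KB  -> 0.040 s
few_large = transfer_time(10, 100_000)   # 10 x 100 KB  -> 0.013 s
```

The same 1 MB of payload costs roughly three times as much when sent as 100 small messages, because the fixed latency α is paid once per message.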
![Page 39: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/39.jpg)
Alpha-Beta
• Common Mistake: Assuming that α is constant
• Scheduling latency for responder
• MR daemons' time slice is inversely proportional to the number of concurrent tasks
• Common Mistake: Assuming that β is constant
• Network congestion
• TCP incast
![Page 40: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/40.jpg)
Efficient Hardware Platforms
• Mellanox - Hadoop Acceleration through Network-assisted Merge
• RoCE - Brocade, Cisco, Extreme, Arista...
• SSD - Velobit, Violin, FusionIO, Samsung..
• Niche - Compression, Encryption...
![Page 41: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/41.jpg)
Pluggable Shuffle & Sort
• Replace HTTP-based pull with RDMA
• Avoid spilling altogether
• Replace default Sort implementation with Job-optimized sorting algorithm
• Experimental APIs
• Documentation: search for PluggableShuffleAndPluggableSort.html
![Page 42: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/42.jpg)
Mellanox UDA
• Developed jointly with Auburn University
• 2x Performance on TeraSort
• Reduces disk writes by 45%, disk reads by 15%
![Page 43: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/43.jpg)
Syncsort DMX-h
![Page 44: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/44.jpg)
Beyond MapReduce with YARN
![Page 45: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/45.jpg)
[Figure: separate single-application clusters labeled BATCH, INTERACTIVE, and ONLINE, with HDFS storage]
Hadoop 1.0 (Image Courtesy Arun Murthy, Hortonworks)
![Page 46: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/46.jpg)
MapReduce 1.0 (Image Courtesy Arun Murthy, Hortonworks)
![Page 47: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/47.jpg)
Hadoop 2.0 (Image Courtesy Arun Murthy, Hortonworks)
HADOOP 1.0
• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management & data processing)
• Pig (data flow), Hive (sql), Others (cascading)
HADOOP 2.0
• HDFS2 (redundant, reliable storage)
• YARN (cluster resource management)
• Tez (execution engine), MR (batch), RT Stream, Graph (Storm, Giraph), Services (HBase)
• Pig (data flow), Hive (sql), Others (cascading)
![Page 48: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/48.jpg)
Applications Run Natively IN Hadoop
• BATCH (MapReduce)
• INTERACTIVE (Tez)
• STREAMING (Storm, S4, …)
• GRAPH (Giraph)
• IN-MEMORY (Spark)
• HPC MPI (OpenMPI)
• ONLINE (HBase)
• OTHER (Search) (Weave…)
• YARN (Cluster Resource Management)
• HDFS2 (Redundant, Reliable Storage)
YARN Platform (Image Courtesy Arun Murthy, Hortonworks)
![Page 49: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/49.jpg)
[Figure labels: ResourceManager, Scheduler, Client2, a grid of twelve NodeManagers, AM 1 with Containers 1.1, 1.2, 1.3, and AM2 with Containers 2.1, 2.2, 2.3, 2.4]
YARN Architecture (Image Courtesy Arun Murthy, Hortonworks)
![Page 50: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/50.jpg)
YARN
• Yet Another Resource Negotiator
• Resource Manager
• Node Managers
• Application Masters
• Specific to paradigm, e.g. MR Application master (aka JobTracker)
![Page 51: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/51.jpg)
Beyond MapReduce
• Apache Giraph - BSP & Graph Processing
• Storm on Yarn - Streaming Computation
• HOYA - HBase on Yarn
• Hamster - MPI on Hadoop
• More to come ...
![Page 52: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/52.jpg)
Hamster
• Hadoop and MPI on the same cluster
• OpenMPI Runtime on Hadoop YARN
• Hadoop Provides: Resource Scheduling, Process monitoring, Distributed File System
• Open MPI Provides: Process launching, Communication, I/O forwarding
![Page 53: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/53.jpg)
Hamster Components
• Hamster Application Master
• Gang Scheduler, YARN Application Preemption
• Resource Isolation (lxc Containers)
• ORTE: Hamster Runtime
• Process launching, Wireup, Interconnect
![Page 54: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/54.jpg)
[Figure labels: Client, Resource Manager (Scheduler, AMService), Node Managers with Aux Srvcs, MPI AM (Scheduler, HNP), Proc/Containers running Framework Daemon, NS, and MPI; protocols Client-RM, RM-AM, AM-NM, RM-NodeManager]
Hamster Architecture
![Page 55: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/55.jpg)
Hamster Scalability
• Sufficient for small to medium HPC workloads
• Job launch time gated by YARN resource scheduler

| | Launch | WireUp | Collectives | Monitor |
|---|---|---|---|---|
| OpenMPI | O(logN) | O(logN) | O(logN) | O(logN) |
| Hamster | O(N) | O(logN) | O(logN) | O(logN) |
![Page 56: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/56.jpg)
GraphLab + Hamster on Hadoop
![Page 57: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/57.jpg)
About GraphLab
• Graph-based, High-Performance distributed computation framework
• Started by Prof. Carlos Guestrin at CMU in 2009
• GraphLab Inc. was recently founded to commercialize GraphLab.org
![Page 58: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/58.jpg)
GraphLab Features
• Topic Modeling (e.g. LDA)
• Graph Analytics (Pagerank, Triangle counting)
• Clustering (K-Means)
• Collaborative Filtering
• Linear Solvers
• etc...
![Page 59: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/59.jpg)
Only Graphs are not Enough
• A full data-processing workflow requires ETL/postprocessing, visualization, data wrangling, and serving
• MapReduce excels at data wrangling
• OLTP/NoSQL Row-Based stores excel at Serving
• GraphLab should co-exist with other Hadoop frameworks
![Page 60: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/60.jpg)
Coming Soon…
![Page 61: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/61.jpg)
Extending HDFS
![Page 62: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/62.jpg)
HCFS
• Hadoop Compatible File Systems
• FileSystem, FileContext
• S3, Local FS, webhdfs
• Azure Blob Storage, CassandraFS, Ceph, CleverSafe, Google Cloud Storage, Gluster, Lustre, QFS, EMC ViPR (more to come)
![Page 63: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/63.jpg)
New Dataset
• Reuse Namenode and Datanode implementations
• Substitute a different DataSet implementation: FsDatasetSpi, FsVolumeSpi
• Jira: HDFS-5194
![Page 64: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/64.jpg)
Extending Namenode
• Pluggable Namespace: HDFS-5324, HDFS-5389
• Pluggable Block Management: HDFS-5477
• Requires fine-grained locking in Namenode: HDFS-5453
![Page 65: Extending Hadoop for Fun & Profit](https://reader033.vdocument.in/reader033/viewer/2022051012/54138c668d7f7299698b465c/html5/thumbnails/65.jpg)
Questions?