StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters
LIU Kai
Email: [email protected]
http://kiwenlau.com/
National Institute of Informatics, Japan
04/15/2023 LIU Kai, National Institute of Informatics
Contents
- Introduction (What?)
- Motivation (Why?)
- Implementation (How?)
- Personal Ideas
Introduction – What is StoreApp?
Background
- Hadoop (version 1): big data storage and computation
  - Hadoop Distributed File System (HDFS): storage
  - Hadoop MapReduce framework: computation
- Master/slave architecture
- Storage (DataNode) and computation (TaskTracker) are co-located on the same node
[Figure: one master (NameNode, JobTracker) and multiple slaves (DataNode, TaskTracker each), running on physical or virtual machines]
Overview
What is StoreApp?
- A Hadoop plugin
- Speeds up Hadoop running in virtual machines
- Separates storage (DataNode) from computation (TaskTracker)
[Figure: two physical machines compared. In one, a single DataNode virtual machine serves several TaskTracker virtual machines; in the other, every virtual machine co-locates a DataNode and a TaskTracker]
Benefit
- Improves HDFS throughput by 78.3%
  - The storage VM has higher scheduling priority than the computation VMs
  - Consolidating storage into one VM reduces I/O contention
- Reduces job completion time by 61%
  - Most Hadoop jobs are data intensive
  - Their performance is bottlenecked by slow disk access
Motivation – Why do we need StoreApp?
Challenge 1
- Nodes cannot be added or removed easily
  - Rebalancing data incurs significant data movement
  - The cluster cannot exploit the elasticity of virtual machines
[Figure: physical machines hosting several virtual machines, each virtual machine co-locating a DataNode and a TaskTracker]
Solution 1
- Separate storage from computation
  - Adding or removing a computation node requires no data movement
  - The optimal number of computation nodes can be found for each Hadoop job
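To make the "no data movement" point concrete, here is a minimal sketch of a back-of-the-envelope model. It is not taken from the StoreApp paper: the function name `blocks_moved_on_scale_out` is hypothetical, replication is ignored, and blocks are assumed to be rebalanced evenly across nodes.

```python
def blocks_moved_on_scale_out(total_blocks, old_nodes, new_nodes, colocated):
    """Estimate how many HDFS blocks must move when nodes are added.

    Simplified model (an assumption, not the paper's analysis): blocks are
    spread evenly, replication is ignored, and rebalancing fills each new
    node up to the cluster-wide average.
    """
    if not colocated:
        # Only computation nodes are added; the storage nodes are unchanged,
        # so no block has to move.
        return 0
    # Each new node should end up holding an equal share of all blocks.
    target_per_node = total_blocks // (old_nodes + new_nodes)
    return target_per_node * new_nodes


# Growing a 3-node co-located cluster holding 1200 blocks by one node:
moved_colocated = blocks_moved_on_scale_out(1200, 3, 1, colocated=True)
moved_separated = blocks_moved_on_scale_out(1200, 3, 1, colocated=False)
```

Under these assumptions, the co-located cluster has to ship a quarter of its blocks to the new node, while the separated design moves none.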
[Figure: each physical machine hosts one DataNode virtual machine and several TaskTracker virtual machines]
Challenge 2
- Co-located virtual machines often access the disk concurrently
  - Their random I/O operations compete with each other
  - This significantly degrades Hadoop job performance
Solution 2
- Separate storage from computation
  - Each physical machine has only one storage virtual machine
  - Only the storage virtual machine is I/O intensive
  - No serious contention from concurrent I/O operations
Challenge 3
- Virtual machines cannot be scheduled efficiently
  - I/O-intensive VMs can be prioritized because they consume less CPU
  - However, every VM is I/O intensive!
Solution 3
- Separate storage from computation
  - Only the storage virtual machine is I/O intensive
  - The storage virtual machine receives a higher priority
Implementation – How to design StoreApp?
Architecture
- A StoreApp manager and multiple storage nodes
  - The StoreApp manager runs on the master node
  - Each physical machine has one storage node
Components
- StoreApp manager: coordinates the operations of all data nodes
- Scheduler: schedules tasks according to data locations
- HDFS Proxy: receives all HDFS requests and forwards them to the DataNode
- Shuffler: receives map output and pushes it to the DataNode
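The Scheduler's role of placing tasks according to data locations can be illustrated with a small sketch. This is an assumed greedy policy, not StoreApp's actual scheduler: the function `schedule_tasks`, its inputs, and the fixed `slots_per_vm` are all hypothetical names chosen for illustration.

```python
def schedule_tasks(block_hosts, vm_hosts, slots_per_vm=2):
    """Assign one task per block, preferring a compute VM on the block's host.

    block_hosts: {block_id: physical host whose storage node holds the block}
    vm_hosts:    {vm_id: physical host running that computation VM}
    Returns {block_id: vm_id, or None when no slot is free}.
    """
    free = {vm: slots_per_vm for vm in vm_hosts}
    assignment = {}
    for block, host in block_hosts.items():
        # Prefer VMs that share a physical machine with the block's storage node.
        local = [vm for vm, h in vm_hosts.items() if h == host and free[vm] > 0]
        # Fall back to any VM that still has a free slot.
        remote = [vm for vm in vm_hosts if free[vm] > 0]
        candidates = local or remote
        if not candidates:
            # No capacity left; a real scheduler would queue the task instead.
            assignment[block] = None
            continue
        vm = candidates[0]
        free[vm] -= 1
        assignment[block] = vm
    return assignment


plan = schedule_tasks(
    {"b1": "pm1", "b2": "pm2", "b3": "pm1"},
    {"vm1": "pm1", "vm2": "pm2"},
    slots_per_vm=1,
)
```

Here `b1` and `b2` land on VMs that share a physical machine with their data, and `b3` is left unassigned because every slot is taken.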
HDFS Prefetching
- Read the whole block b1 instead of only the needed partial records
- The unused data of block b1 is kept in memory
- Read the consecutive block into memory to form input split s1
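The prefetching idea above can be sketched as a whole-block cache. This is only an illustration under stated assumptions: the class `BlockPrefetcher` and the `read_block` callable are hypothetical, and the cache is unbounded, which a real implementation would not be.

```python
class BlockPrefetcher:
    """Read whole blocks once and serve later partial reads from memory."""

    def __init__(self, read_block):
        # read_block is a hypothetical callable: block_id -> bytes of the block.
        self._read_block = read_block
        self._cache = {}
        self.disk_reads = 0

    def read(self, block_id, offset, length):
        if block_id not in self._cache:
            # Fetch the entire block, not just the requested records.
            self._cache[block_id] = self._read_block(block_id)
            self.disk_reads += 1
        data = self._cache[block_id]
        # Serve the requested range; the rest of the block stays cached.
        return data[offset:offset + length]


blocks = {"b1": b"abcdefgh"}
prefetcher = BlockPrefetcher(blocks.__getitem__)
first = prefetcher.read("b1", 0, 3)
second = prefetcher.read("b1", 3, 3)
```

Both reads touch block `b1`, but only the first one goes to the underlying store; the second is served from memory.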
Automated Cluster Resizing
- Dynamically change the cluster size during job execution
- An iterative algorithm searches for the optimal cluster size
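The slide does not spell out the iterative algorithm, so the following is only a sketch of one plausible approach: grow the cluster step by step while the measured throughput keeps improving. The function `find_cluster_size`, the `measure_rate` callable, and the `min_gain` threshold are assumptions for illustration, not the paper's method.

```python
def find_cluster_size(measure_rate, start=1, max_size=64, min_gain=0.05):
    """Grow the cluster one node at a time while throughput keeps improving.

    measure_rate is a hypothetical callable: cluster size -> observed job
    progress rate. The search stops when one more node improves the rate by
    less than min_gain (5% by default) or when max_size is reached.
    """
    size = start
    rate = measure_rate(size)
    while size < max_size:
        next_rate = measure_rate(size + 1)
        if next_rate < rate * (1 + min_gain):
            # The extra node is not worth it; keep the current size.
            break
        size, rate = size + 1, next_rate
    return size


# Toy workload whose throughput stops scaling after 4 computation nodes.
best = find_cluster_size(lambda n: min(n, 4) * 10.0)
```

Because computation nodes hold no data in the separated design, each trial size in such a search can be tried without any data rebalancing.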
Personal Ideas
Pros and cons
Pros
- A simple idea that shows good results
- Clear logic in locating and solving problems
Cons
- Restricted to Hadoop 1
- Not open source
Future direction
- From Hadoop 1 to Hadoop 2
  - Hadoop 2 is quite different from Hadoop 1
  - Hadoop 2 supports more application frameworks, such as Spark
- From virtual machines to containers
  - Containers are a more lightweight virtualization technology
  - Containers are more resource efficient than virtual machines
  - Containers are easier to scale than virtual machines
References
Yanfei Guo, et al. "StoreApp: A Shared Storage Appliance for Efficient and Scalable Virtualized Hadoop Clusters", INFOCOM, 4, 2015
Thank you!