Accelerating MapReduce with Distributed Memory Cache
Authors: Shubin Zhang et al.
Institute of Computing Technology, Beijing, China
Reported by: Tzu-Li Tai
National Cheng Kung University, Taiwan
High Performance Parallel and Distributed Systems Lab
2009 IEEE 15th International Conference on Parallel and Distributed Systems
HPDS Lab, Institute of Computer and Communication Engineering, Electrical Engineering - NCKU
A. Background and Motivation
B. Goals and Design Decisions
C. System Overview
D. System Details
E. Experimental Results and Analysis
F. Conclusion and Future Works
G. Future Studies on the Topic
H. Discussion: Our Chances
Background and Motivation
Pre-Notes:
- Published in 2009 (first paper on the topic)
- Outdated hardware/software and data sizes
- Focus on the methodology and reasoning behind using a distributed cache in Hadoop
- Learn possible tackle points and what to avoid
Background and Motivation
• Shuffle time becomes the bottleneck
[Diagram: three map tasks (M) each write intermediate data with (1) a local write, and a reduce task (R) fetches it with (2) a remote read; HDFS pipeline replication is shown alongside. Shortening this shuffle path is the GOAL.]
Goals and Design Decisions
• Target clusters are small-scale
- Bandwidth is not scarce
- Node failures are uncommon
- Commodity machines
- Heterogeneous
- Gigabit Ethernet
Goals and Design Decisions
• Stay close to the original
• Retain fault-tolerance (!)
• Local decision-making
Goals and Design Decisions
• Low-latency, high-throughput access to map outputs: a global storage system
- No central coordinator
- Uniform global namespace
- Low-latency, high-throughput data access
- Concurrent access
- Large capacity
- Scalable
⇒ Distributed Memory Cache
System Overview
• Use Memcached: http://memcached.org/
- Open-source distributed memory caching system
- Daemon processes on servers
- Global K-V store
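As a hedged sketch (illustrative only; plain dicts stand in for the per-server memcached daemons), the global key-value view described above looks like this:

```python
import hashlib

# Illustrative sketch, not the paper's implementation: real memcached daemons
# run on each server and a client library routes keys to them by hashing, so
# no central coordinator decides placement.
class MemcachedClientSketch:
    def __init__(self, servers):
        self.servers = servers                  # e.g. ["node1:11211", ...]
        self.stores = {s: {} for s in servers}  # stand-ins for daemon memory

    def _route(self, key):
        # deterministic hash-based routing of a key to one server
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def set(self, key, value):
        self.stores[self._route(key)][key] = value

    def get(self, key):
        return self.stores[self._route(key)].get(key)

client = MemcachedClientSketch(["node1:11211", "node2:11211"])
client.set("job_1/map_3/part_0", b"intermediate bytes")
assert client.get("job_1/map_3/part_0") == b"intermediate bytes"
```

Because every node sees the same flat namespace, any reduce task can fetch any map output by key, which is what makes memcached usable as the shuffle medium.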
System Overview
• Map side
! Buffer details
System Overview
[Extra] from O’Reilly
System Overview
• Reduce side
System Overview
[Extra] from O’Reilly
System Details
1. Memory Cache Capacity
[Diagram: many map tasks (M) feeding cached intermediate outputs to two reduce tasks (R).]
System Details
1. Memory Cache Capacity
Size_memcached = mc × s × (r − ra)
- mc: no. of completed map tasks
- s: avg. map output size
- r: total no. of reduce tasks
- ra: no. of early-scheduled reduce tasks

Size_memcachedMin = m × s × (r − ra)
- m: total no. of map tasks
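In code form, a direct transcription of the two formulas (the example numbers are hypothetical):

```python
def size_memcached(mc, s, r, ra):
    """Current cache demand: mc completed map tasks, average output size s,
    r total reduce tasks, ra of them scheduled early (already fetching)."""
    return mc * s * (r - ra)

def size_memcached_min(m, s, r, ra):
    """Capacity needed to hold every map output at once: m total map tasks."""
    return m * s * (r - ra)

# hypothetical example: 10 of 40 maps done, 4 MiB outputs, 8 reduces, 2 early
assert size_memcached(10, 4 * 1024 * 1024, 8, 2) == 251658240
assert size_memcached_min(40, 4 * 1024 * 1024, 8, 2) == 1006632960
```

Note that scheduling more reduce tasks early (larger ra) directly shrinks the cache footprint, which is why the conclusion proposes changing the scheduler.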
System Details
2. Network Traffic Demand
Shuffled Data = 2 × Size_memcached
System Details
2. Network Traffic Demand
- Double the amount of data is shuffled (!)
- A compression algorithm is used on map outputs to lessen network traffic
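A minimal sketch of that trade-off, using zlib as a stand-in codec (the deck does not name the actual compression algorithm):

```python
import zlib

# Each map output crosses the network twice: map -> memcached, then
# memcached -> reduce. Compressing before the first hop shrinks both hops.
def put_map_output(cache, key, raw):
    cache[key] = zlib.compress(raw)

def get_map_output(cache, key):
    return zlib.decompress(cache[key])

cache = {}
raw = b"the quick brown fox jumps over the lazy dog " * 500
put_map_output(cache, "job_1/map_0/part_0", raw)
assert get_map_output(cache, "job_1/map_0/part_0") == raw
# the compressed size is what gets doubled: 2 * len(cache[key]) bytes shuffled
assert 2 * len(cache["job_1/map_0/part_0"]) < len(raw)
```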
System Overview
• Map side
! Hashing function?
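The slide leaves the hashing function open. One plausible scheme (hypothetical, not from the paper) addresses each output by job id, map task id, and target reduce partition, with the partition computed Hadoop-HashPartitioner-style:

```python
import zlib

def partition(record_key, num_reduces):
    # mirrors Hadoop's HashPartitioner idea: non-negative hash mod #reduces
    # (crc32 is used here so the sketch is deterministic across runs)
    return (zlib.crc32(str(record_key).encode()) & 0x7FFFFFFF) % num_reduces

def cache_key(job_id, map_id, record_key, num_reduces):
    # hypothetical key layout: a reduce task enumerates its inputs by
    # fetching job_id/<each map>/its own partition number
    return f"{job_id}/{map_id}/part_{partition(record_key, num_reduces)}"

p = partition("hello", 8)
assert 0 <= p < 8
assert cache_key("job_1", "map_3", "hello", 8) == f"job_1/map_3/part_{p}"
```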
System Details
3. Fault Tolerance
• Map task failure
- rerun the task to regenerate outputs not yet in memcache or on disk
• Reduce task failure (!)
- For inputs that are not yet deleted from memcache, copy and execute
- For inputs that are already deleted from memcache, rerun the map task
• Memcached Server failure (!)
- Reinitialize all related map tasks
• Tasktracker failure
- All currently running map and reduce tasks need to be reinitialized
- Memcache data is still valid, so reduce tasks can still access it
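Restated as branch logic (a hedged paraphrase of the four cases above; the function and argument names are illustrative, not from the paper's code):

```python
def recovery_action(failure, inputs_still_cached=True):
    """Maps each failure case to its recovery action."""
    if failure == "map_task":
        return "rerun the map for outputs not yet in memcache or on disk"
    if failure == "reduce_task":
        if inputs_still_cached:
            return "copy inputs from memcache and re-execute the reduce"
        return "rerun the producing map tasks, then re-execute the reduce"
    if failure == "memcached_server":
        return "reinitialize all map tasks whose outputs were on that server"
    if failure == "tasktracker":
        return "reinitialize its running tasks; memcache data stays valid"
    raise ValueError(f"unknown failure: {failure}")

assert "copy inputs" in recovery_action("reduce_task", True)
```

The marked (!) cases are the expensive ones: losing a memcached server or an already-evicted reduce input forces map re-execution, since the cache holds the only in-flight copy.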
System Details
3. Fault Tolerance
Reduce task failure
System Details
3. Fault Tolerance
Memcached Server Failure
Experimental Results and Analysis
Environment
• Hardware
- Intel Pentium 4, 2.8GHz processor
- 2GB RAM
- 80GB 7200RPM SATA disk
• Software
- RedHat AS4.4, kernel 2.6.9 OS
- Hadoop 0.19.1
- Memcached 1.2.8
- Memcached client for Java 2.5.1
Experimental Results and Analysis
Hadoop+Memcached Setup
- 1 node: NameNode + JobTracker + Memcached server (1 GB RAM)
- 1~6 nodes: DataNode + TaskTracker
- 2 map slots + 2 reduce slots per TaskTracker
- 4 MB HDFS file block
- 5 shuffle threads in reduce tasks
Experimental Results and Analysis
Benchmark Applications
• Wordcount
- 491.4 MB English text
• Spatial Join Algorithm
- 2 data sets from TIGER/Line files
Experimental Results and Analysis
1. Impact of different node numbers
• No. of reduce tasks: 2 × n (n = no. of worker nodes)
• Wordcount improvement: 43.1%
• Spatial Join improvement: 32.9%
Experimental Results and Analysis
2. Impact on job progress
*Note: Hadoop job progress calculation
- For Map tasks: % of input processed
- For Reduce tasks:
1/3 (copy) + 1/3 (sort) + 1/3 (actual processing)
Experimental Results and Analysis
2. Impact on job progress - WordCount
Experimental Results and Analysis
2. Impact on job progress – Spatial Join
Experimental Results and Analysis
2. Impact on job progress - Extra
[Figure: reduce-task progress bands — copy phase ends at 33%, sort at 66%, reduce at 100%.]
Conclusion and Future Works
• Enhanced Hadoop to accelerate data shuffling by using a distributed memory cache (memcached)
• The prototype performs much better than original Hadoop under moderate load
• Future work: modify the task-scheduling algorithm to launch reduce tasks earlier
Future Studies for Topic
• Dache: A Data Aware Caching for Big Data Applications Using the MapReduce Framework, IEEE INFOCOM 2013
• A Distributed Cache for Hadoop Distributed File System in Real-Time Cloud Services, ACM/IEEE GRID 2012
Discussion: Our Chances
1. Necessity of using Memcached?
Discussion: Our Chances
1. Necessity of using Memcached?
Properties for the map-side buffer:
- io.sort.mb: buffer size
- io.sort.spill.percent: spill-to-disk threshold
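The spill point those two properties imply can be computed directly (the classic-Hadoop defaults of 100 MB and 0.80 are shown; exact defaults vary by version):

```python
def spill_threshold_bytes(io_sort_mb=100, spill_percent=0.80):
    """A map task starts spilling its in-memory buffer to disk once the
    buffer is io.sort.spill.percent full of its io.sort.mb capacity."""
    return int(io_sort_mb * 1024 * 1024 * spill_percent)

# with the defaults, spills start at 80 MiB of the 100 MiB buffer
assert spill_threshold_bytes() == 80 * 1024 * 1024
```

Raising io.sort.mb so outputs never spill is one way to get a local in-memory cache of map output without memcached, which is the crux of this discussion point.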
Hypothesis:
• Cache map intermediate outputs locally
• Modify the reduce shuffle threads + the TaskTracker
• RDD-style lineage for fault tolerance?
Discussion: Our Chances
2. Moving the idea to YARN
Discussion: Our Chances
2. Moving the idea to YARN
[Diagram: NodeManagers host the MR ApplicationMaster, map tasks, and reduce tasks; the "Shuffle and Sort" step runs as a NodeManager auxiliary service.]
Discussion: Our Chances
2. Moving the idea to YARN
yarn-site.xml
• The entire shuffle-and-sort phase is implemented as a pluggable auxiliary service in YARN
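For reference, the stock configuration that wires MapReduce's ShuffleHandler in as that auxiliary service looks like the fragment below (Hadoop 2.x property names; a cache-aware replacement would register its own service name and class the same way, and any such class name would be hypothetical):

```xml
<!-- yarn-site.xml -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce_shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
```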
Discussion: Our Chances
3. Iterative applications
[Diagram: successive MR jobs, each with its own ApplicationMaster on a NodeManager, could share results ("result caching + reuse") through a NodeManager auxiliary service spanning their map (M) and reduce (R) tasks.]