Quartet: Harmonizing task scheduling and caching for cluster computing
Francis Deslauriers, Peter McCormick, George Amvrosiadis, Ashvin Goel &
Angela Demke Brown
June 23, 2016
Analyses for the masses
• Data collection is cheap, so datasets are growing exponentially
• Cluster computing makes it easy to analyze these datasets, enabling:
◦ Queries on entire datasets
◦ Analysts running queries on the same corpus
◦ Tuning queries
Data is often re-accessed
• The result is many queries running on the same large datasets
• Leads to significant data reuse
[Figure, left: file access frequency vs. input file rank (by descending access frequency) for the CC-b, CC-c, CC-d, CC-e, and FB-2010 traces from Facebook and Cloudera customers [VLDB'12]; right: CDF of path accesses vs. cumulative distribution of input paths for the OpenCloud, M45, and WebMining CMU academic clusters [VLDB'13].]
Data reuse does not help
• We would expect data reuse to improve performance due to caching
• We find that jobs do not see the benefits of reuse
Example: Repeating a Hadoop job
Missed Opportunities
• Working sets don’t fit in the page cache
• Jobs consume data independently of one another
Solution
• Key idea
◦ Reorder work to first consume cached data
◦ Jobs are made of small tasks with no ordering requirements
• Challenges
◦ Cache visibility: Jobs need to know what data is cached on the different nodes
◦ Task reordering: Jobs need to reorder their tasks based on this knowledge
• Our Quartet system addresses both these challenges
Challenge 1: Cache Visibility
• Datanodes collect information about HDFS blocks that reside in memory
◦ Requires the Duet kernel module that informs applications when pages are cached or evicted
• Nodes send this information periodically to the Quartet Manager
◦ Changes to the number of resident pages of each block (a simplified sketch of this reporting path follows)
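To make this concrete, here is a minimal sketch of what a per-node watcher could look like. The class and method names are ours, not the actual Quartet or Duet API; the real system receives page events from the Duet kernel module and ships updates over the network to the Quartet Manager.

```java
// Hypothetical per-datanode watcher (illustrative names, not the real API).
// Page-cache add/evict events are folded into per-HDFS-block resident-page
// counts, and only blocks whose residency changed since the last report are
// sent to the Quartet Manager.
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

class ResidencyWatcher {
    // blockId -> resident pages, as of the last report to the manager
    private final Map<String, Integer> reported = new HashMap<>();
    // blockId -> current resident pages, updated on every page event
    private final Map<String, Integer> current = new HashMap<>();

    /** Called for every page cached (+1) or evicted (-1) in a block's file. */
    void onPageEvent(String blockId, int delta) {
        current.merge(blockId, delta, Integer::sum);
    }

    /** Invoked periodically: returns only the blocks whose residency changed. */
    Map<String, Integer> buildUpdate() {
        Map<String, Integer> update = new HashMap<>();
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            if (!Objects.equals(reported.get(e.getKey()), e.getValue())) {
                update.put(e.getKey(), e.getValue());
                reported.put(e.getKey(), e.getValue());
            }
        }
        return update; // serialized and sent to the Quartet Manager
    }

    public static void main(String[] args) {
        ResidencyWatcher w = new ResidencyWatcher();
        w.onPageEvent("blk_1073741825", +32); // 32 pages of this block cached
        w.onPageEvent("blk_1073741825", -4);  // 4 of them evicted
        System.out.println(w.buildUpdate());  // {blk_1073741825=28}
        System.out.println(w.buildUpdate());  // {} -- nothing changed since last report
    }
}
```

Aggregating at block granularity is what keeps the reports small: a node only reports per-block counts, not individual page events.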
Challenge 2: Task Reordering
1. Application Master registers blocks of interest with the Quartet manager
2. Quartet manager informs the Application Master about cached blocks
3. Application Master prioritizes and places tasks based on block residency information (a sketch of the Application Master's side of this exchange follows)
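A minimal sketch of the Application Master's side, under the three steps above; the class and method names are hypothetical (this is not the actual Quartet or YARN scheduler API), and only the prioritization logic is shown, not task launch.

```java
// Illustrative Application Master-side logic (hypothetical names, not the
// actual Quartet or YARN API). Step 1 amounts to sending the job's input
// block IDs to the manager; the code below covers steps 2 and 3.
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class CacheAwareScheduler {
    private static final class Residency {
        final String node;  // node that reported the cached pages
        final int pages;    // number of resident pages of the block
        Residency(String node, int pages) { this.node = node; this.pages = pages; }
    }

    // blockId -> best known residency (the node with the most cached pages)
    private final Map<String, Residency> residency = new HashMap<>();

    /** Step 2: absorb a residency update pushed by the Quartet manager. */
    void onResidencyUpdate(String blockId, String node, int residentPages) {
        residency.merge(blockId, new Residency(node, residentPages),
                (a, b) -> a.pages >= b.pages ? a : b);
    }

    /** Step 3a: order pending input blocks so the most-cached ones run first. */
    List<String> schedulingOrder(List<String> pendingBlocks) {
        List<String> ordered = new ArrayList<>(pendingBlocks);
        ordered.sort(Comparator.comparingInt(
                (String b) -> residency.containsKey(b) ? -residency.get(b).pages : 0));
        return ordered;
    }

    /** Step 3b: preferred placement is the node holding the cached data, if any. */
    String preferredNode(String blockId) {
        Residency r = residency.get(blockId);
        return r == null ? null : r.node; // otherwise fall back to normal HDFS locality
    }
}
```

On a repeated job, blocks left in the page cache by the first run sort to the front, so the second job's early tasks hit memory while the remaining tasks read from disk as usual.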
Example: Repeating a Hadoop job
Experiments
• Spark and Hadoop implementations
• 24 nodes with a total of 384 GB of memory
• Different input sizes:
◦ Smaller than physical memory (256 GB)
◦ Slightly larger (512 GB)
◦ Approximately 3 times physical memory (1024 GB)
• 3 replicas per block
Results - Cache hit rate of the second job
[Figure: cache hit rate (%) of the second job, Baseline vs. Quartet, for Hadoop and Spark with 256 GB, 512 GB, and 1024 GB inputs; bars are annotated with absolute data volumes (Hadoop: 112 GB / 240 GB, 22 GB / 262 GB, 1 GB / 249 GB; Spark: 107 GB / 251 GB, 3 GB / 287 GB, 0 GB / 284 GB).]
Results - Runtime reduction of the second job
[Figure: normalized runtime of the second job (%), Baseline vs. Quartet, for Hadoop and Spark with 256 GB, 512 GB, and 1024 GB inputs.]
Conclusions
• Observation: Workloads show a significant amount of reuse
• Problem: Jobs are unable to take advantage of this reuse
• Solution:
◦ Add visibility into what is cached on each of the cluster nodes
◦ Reorder tasks to take advantage of this cached data
• Future work: More realistic workloads and scalability
Conclusion
Thank you!
Conclusion
Questions?
Network Overhead
• Watcher updates are aggregated per HDFS block (128-256 MB)
• The upper bound is the storage bandwidth (a back-of-the-envelope version of this bound is sketched below)
• Manager notification traffic is proportional to the input size and the hardware
◦ 10-100 KB/s in our experiments
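A back-of-the-envelope way to see why storage bandwidth bounds this traffic (our own reasoning, not from the talk; B_io, S_block, and s_entry are symbols we introduce): with updates aggregated per block, a sequential scan can only bring blocks into (or push them out of) the page cache as fast as the storage can move data.

```latex
% Assumed symbols: B_io = per-node I/O bandwidth, S_block = HDFS block size
% (128-256 MB), s_entry = size of one (block id, resident-page count) entry.
\frac{\text{update entries}}{\text{second}\cdot\text{node}} \;\lesssim\; \frac{B_{io}}{S_{block}}
\qquad\Longrightarrow\qquad
\text{notification traffic per node} \;\lesssim\; \frac{B_{io}}{S_{block}}\cdot s_{entry}
```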
Related Work
HDFS Cache Manager
• Requires manual changes when data popularity changes
• Cannot be used when the input is larger than memory
PACMan
• Avoids stragglers in a single wave of mappers
• Modifies the cache eviction policy to ensure that entire computation stages have memory locality