experience report: a characteristic study on out of memory errors in distributed data-parallel...
TRANSCRIPT
![Page 1: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/1.jpg)
1
Experience Report:A Characteristic Study on Out of Memory Errors
in Distributed Data-Parallel Applications
2015-11-05
Lijie Xu, Wensheng Dou, Feng Zhu, Chushu Gao, Jie Liu, Hua Zhong, and Jun Wei
Institute of SoftwareChinese Academy of Sciences
![Page 2: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/2.jpg)
2
Background
Data-intensive applications are common Web indexing Graph analysis Text processing Machine learning
Run atop distributed data-parallel frameworks MapReduce (e.g., Apache Hadoop) MapReduce-like (e.g., Apache Spark, Microsoft Dryad)
![Page 3: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/3.jpg)
3
Background
Distributed data-parallel applications can be written by Raw code
• Hadoop: Java• Spark: Scala, Java, Python
High-level languages/libraries• SQL-like: Pig Latin, HiveQL, SparkSQL• ML: Apache Mahout, Spark MLlib• Graph: Spark GraphX
![Page 4: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/4.jpg)
4
Background
Runtime of the distributed data-parallel application An application has multiple jobs, and each job has multiple tasks
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Map task (mapper)
Reduce task (reducer)
On-disk
A MapReduce job
Data partition
![Page 5: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/5.jpg)
5
Motivation – Memory problems
High memory consumption is common Data-parallel applications process large data in memory
A MapReduce job
Mapper’s memory usage
Reducer’s memory usage
![Page 6: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/6.jpg)
6
Motivation – Memory problems
Out of memory error is common When memory consumption exceeds the memory limit
Mapper’s memory usage
Reducer’s memory usage
OOM
FATAL org.apache.hadoop.mapred.Child: Error running child : java.lang.OutOfMemoryError: Java heap spaceat java.util.Arrays.copyOf(Arrays.java:2882)...at cloud9.ComputeCooccurrenceMatrixStripes$MyReducer. reduce(ComputeCooccurrenceMatrixStripes.java:136)at org.apache.hadoop.mapred.Child.main(Child.java:404)
![Page 7: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/7.jpg)
7
Motivation – Memory problems
OOM errors directly lead to the job failure Re-executing the failed map/reduce tasks cannot help
Mapper’s memory usage
Reducer’s memory usage
OOM occurs again
![Page 8: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/8.jpg)
8
Motivation – Memory problem examples
Some memory issues in StackOverflow.com and Hadoop/Spark mailing list Why the mapper is consuming so much memory?
I got very surprised to see the job failing in the map phase with “Out of memory error”. What can be the cause of such error?
My question is how to deal with out of memory problem, I added some property configuration into xml file, but it didn’t work. Increasing number of reducers doesn’t work for me either.
![Page 9: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/9.jpg)
9
Overview
Empirical study : Understanding the causes and fixes of OOM errors Research questions Methodology Study results Conclusions
![Page 10: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/10.jpg)
10
Research questions
RQ 1 What are the OOM root causes?
RQ 2 Are there any common fix patterns?
RQ 3 Are there any new fault-tolerant mechanisms?
![Page 11: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/11.jpg)
11
Methodology – Subject collection
Application source Hadoop and Spark applications (open-source and widely used)
Bug source No special bug repository for OOM errors Open forums, such as
Professional discussion Hadoop/Spark committers, experienced developers, and users
![Page 12: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/12.jpg)
12
Methodology – Subject collection
aGoogle keywordsin open forums
1,151 memory issues
Select OOM errors
276 OOM errors
Select OOM errors that have identified
causes
123 OOM errors
Keyword criteria:
Keywords: “Hadoop out of memory”, “Spark OOM”
Selection criteria:
The error occurs in the applications not the service components
Selection criteria:
Experts identified the causesUsers identified the causesWe identified the causes
![Page 13: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/13.jpg)
13
Methodology – Subjects
Distribution of the studied 123 OOM errors Causes are identified by experts (66), users (45), us (12)
Including diverse applications raw code, high-level languages/libraries
Framework Sources Raw code Pig Hive Mahout Cloud9 GraphX MLlib Total Reproduced
Hadoop StackOverflow.com 20 4 2 4 0 0 0 30 16
Hadoop mailing list 5 5 1 0 1 0 0 12 6
Developers’ blogs 2 1 0 0 0 0 0 3 2
MapReduce books 8 3 0 0 0 0 0 11 2
Spark Spark mailing list 16 0 0 0 0 1 2 19 3
StackOverflow.com 42 0 0 0 0 1 5 48 14
Total 93 13 3 4 1 2 7 123 43
![Page 14: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/14.jpg)
14
Methodology – Subjects
Distribution of the studied 123 OOM errors Causes are identified by experts (66), users (45), us (12)
Including diverse applications raw code, high-level languages/libraries
Framework Sources Raw code Pig Hive Mahout Cloud9 GraphX MLlib Total Reproduced
Hadoop StackOverflow.com 20 4 2 4 0 0 0 30 16
Hadoop mailing list 5 5 1 0 1 0 0 12 6
Developers’ blogs 2 1 0 0 0 0 0 3 2
MapReduce books 8 3 0 0 0 0 0 11 2
Spark Spark mailing list 16 0 0 0 0 1 2 19 3
StackOverflow.com 42 0 0 0 0 1 5 48 14
Total 93 13 3 4 1 2 7 123 43
![Page 15: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/15.jpg)
15
Methodology – Subjects
Distribution of the studied 123 OOM errors Causes are identified by experts (66), users (45), us (12)
Including diverse applications raw code, high-level languages/libraries
Framework Sources Raw code Pig Hive Mahout Cloud9 GraphX MLlib Total Reproduced
Hadoop StackOverflow.com 20 4 2 4 0 0 0 30 16
Hadoop mailing list 5 5 1 0 1 0 0 12 6
Developers’ blogs 2 1 0 0 0 0 0 3 2
MapReduce books 8 3 0 0 0 0 0 11 2
Spark Spark mailing list 16 0 0 0 0 1 2 19 3
StackOverflow.com 42 0 0 0 0 1 5 48 14
Total 93 13 3 4 1 2 7 123 43
![Page 16: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/16.jpg)
RQ1: OOM cause patterns – Data storage
Category: Large data stored in the framework ( 12% ) Pattern 1: Large data buffered by the framework (8 errors, 6%)
• Large map buffer (500MB), large shuffle buffer (70%)
Pattern 1: Large data buffered by the framework (8 errors, 6%)
16
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Map task
Reduce task
Pattern 1
![Page 17: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/17.jpg)
RQ1: OOM cause patterns – Data storage
Category: Large data stored in the framework ( 12% ) Pattern 2: Large data cached in the framework (7 errors, 6%)
• Users explicitly cache large data for reuse (e.g., for next job)
Pattern 2: Large data cached in the framework (7 errors, 6%)
17
Pattern 2Cached data
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Map taskReduce task
map(k,v)
map(k,v)
First job Second job
![Page 18: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/18.jpg)
18
RQ1: OOM cause patterns – Dataflow
Category: Abnormal dataflow (large runtime data) 用户配置的 partition number 过小, partition functio
• Small partition number, unbalanced partition function
Pattern 1: Improper data partition (16 errors, 13%)
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Reduce taskMap task
Pattern 1
![Page 19: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/19.jpg)
19
RQ1: OOM cause patterns – Dataflow
Category: Abnormal dataflow (large runtime data) Pattern 2: Hotspot key (23 errors, 18%)
• Some <k, list(v)> groups are extremely large• Case: Frequent words (e.g., “the”) occur in much more pages
Pattern 2: Hotspot key (23 errors, 18%)
3 2
4 1
5 6
3 7
2 1
3 6
1 3
map(k,v)
map(k,v)
3 2
5 6
3 7
3 6
1 3
1 3
3 2
3 7
3 6
3 9
3 1
4 1
2 1
reduce(k, list(v))
reduce(k, list(v))
1 3
3 2
<k3, list(v)>
<k1, list(v)>
![Page 20: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/20.jpg)
20
k3 v2
RQ1: OOM cause patterns – Dataflow
Category: Abnormal dataflow (large runtime data) Pattern 3: Large single key/value record (7 errors, 6%)
• Symptom: Some <k, v> records are extremely large• Case: a 350MB record (a single line full of character a)
Pattern 3: Large single key/value record (7 errors, 6%)
k1 v1
k2 v2
2 1
3 6
1 3
map(k,v)
map(k,v)
k3 v3
k5 v5
k3 v3
k1 v1
k3 v1
k4 v4
reduce(k, list(v))
reduce(k, list(v))
k1 v1
k3 v1
k3 v2<k3, v1> is very large
![Page 21: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/21.jpg)
21
RQ1: OOM cause patterns – User code
Category: Memory-consuming user code Pattern 1: Improper data partition (16 errors, 13%)
• Loading large data before processing the recordsPattern 1: Large external data loaded in user code (8 errors, 6%)
public class Mapper { private Object buffer;
public void setup() { buffer.load(dictionary1GB); } public void map(K key, V value) { Object iResults1GB = process(key, value); buffer.add(iResults1GB); //Optional emit(newKey, newValue); } }
![Page 22: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/22.jpg)
22
RQ1: OOM cause patterns – User code
Category: Memory-consuming user code Pattern 2: Hotspot key (23 errors, 18%)
• During processing a single <k, v> record• Cases: large input record, Cartesian product
Pattern 2: Large intermediate results (6 errors, 5%)
Pattern 2
public class Mapper { private Object buffer;
public void setup() { buffer.load(dictionary1GB); } public void map(K key, V value) { Object iResults1GB = process(key, value); buffer.add(iResults1GB); //Optional emit(newKey, newValue); } }
![Page 23: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/23.jpg)
23
RQ1: OOM cause patterns – User code
Category: Memory-consuming user code Pattern 3: Large single key/value record (7 errors, 6%)
• Accumulating large intermediate computing results • Related to: large input split, hotspot key, large data partition
Pattern 3: Large accumulated results (40 errors, 33%)
Pattern 3
public class Mapper { private Object buffer;
public void setup() { buffer.load(dictionary1GB); } public void map(K key, V value) { Object iResults1MB = process(key, value); buffer.add(iResults1MB); //Optional emit(newKey, newValue); } }
![Page 24: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/24.jpg)
24
RQ2: Fix patterns – Data storage related fixes
Category: Data storage related fix patterns Lower framework buffer size
• e.g., change io.sort.mb from 300MB to 100MBPattern 1: Lower framework buffer size
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
![Page 25: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/25.jpg)
25
RQ2: Fix patterns – Data storage related fixes
Category: Data storage related fix patterns Lower framework buffer size
• e.g., change io.sort.mb from 300MB to 100MBPattern 1: Lower framework buffer size
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
![Page 26: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/26.jpg)
26
RQ2: Fix patterns – Data storage related fixes
Category: Data storage related fix patterns Lower framework buffer size
• e.g., change spark.storage.memoryFraction from 0.66 to 0.1• e.g., change MEMORY_ONlY cache to DISK_ONLY persistence
Pattern 2: Lower the cache threshold
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Map taskReduce task
map(k,v)
map(k,v)
![Page 27: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/27.jpg)
27
RQ2: Fix patterns – Data storage related fixes
Category: Data storage related fix patterns Lower framework buffer size
• e.g., change spark.storage.memoryFraction from 0.66 to 0.1• e.g., change MEMORY_ONlY cache to DISK_ONLY persistence
Pattern 2: Lower the cache threshold
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Map taskReduce task
map(k,v)
map(k,v)
![Page 28: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/28.jpg)
28
RQ2: Fix patterns – Dataflow related fixes
Category: Dataflow-related fix patterns Lower framework buffer size
• e.g., enlarge partition number, use range partition function
Pattern 1: Change partition number/function
map(k,v)
map(k,v)
map(k,v)
reduce(k,list(v))
Reduce taskMap task
reduce(k,list(v))
![Page 29: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/29.jpg)
29
RQ2: Fix patterns – Dataflow related fixes
Category: Dataflow-related fix patterns Lower framework buffer size
• e.g., enlarge partition number, use range partition function
Pattern 1: Change partition number/function
map(k,v)
map(k,v)
map(k,v)
Reduce taskMap task
reduce(k,list(v))
reduce(k,list(v))
reduce(k,list(v))
![Page 30: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/30.jpg)
31
RQ2: Fix patterns – Dataflow related fixes
Category: Dataflow-related fix patterns Lower framework buffer size
• e.g., using composite key: <k, v> => <(k, k’), v>Pattern 2: Redesign the key
3 2
4 1
5 6
3 7
2 1
3 6
1 3
map(k,v)
map(k,v)
3 2
5 6
3 7
3 6
1 3
1 3
3 2
3 7
3 6
3 9
3 1
4 1
2 1
reduce(k, list(v))
reduce(k, list(v))
1 3
3 2
(K, V)
![Page 31: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/31.jpg)
32
RQ2: Fix patterns – Dataflow related fixes
Category: Dataflow-related fix patterns Lower framework buffer size
• e.g., using composite key: <k, v> => <(k, k’), v>Pattern 2: Redesign the key
3 2
4 1
5 6
3 7
2 1
3 6
1 3
map(k,v)
map(k,v)
3 2
5 6
3 7
3 6
1 3
1,1 3
3,1 2
3,1 7
3,1 6
3,2 9
3,2 1
4 1
2 1
reduce(k, list(v))
reduce(k, list(v))
1 3
3 2
(K,1), V(K,2), V
![Page 32: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/32.jpg)
33
RQ2: Fix patterns – Dataflow related fixes
Category: Dataflow-related fix patterns Lower framework buffer size
• e.g., a super giant string => split it to multiple stringsPattern 3: Split the large single record
k5 v5
k1 v1
k2 v2
2 1
3 6
1 3
map(k,v)
map(k,v)
k3 v3
k5 v5
k3 v3
k1 v1
k3 v3
k4 v4
reduce(k, list(v))
reduce(k, list(v))
k1 v1
k3 v3
k5 v5
![Page 33: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/33.jpg)
34
RQ2: Fix patterns – Dataflow related fixes
Category: Dataflow-related fix patterns Lower framework buffer size
• e.g., a super giant string => split it to multiple stringsPattern 3: Split the large single record
k5 v5
k1 v1
k2 v2
k2 v2
k2 v2
2 1
3 6
1 3
map(k,v)
map(k,v)
k3 v3
k5 v5
k3 v3
k1 v1
k3 v3
k3 v3
k3 v3k4 v4
reduce(k, list(v))
reduce(k, list(v))
k1 v1
k3 v3
k5 v5
![Page 34: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/34.jpg)
35
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer sizePattern 1: Change accumulation to streaming operators
A B
1 3
2 1
3 2
3 6
3 7
5 6
6 1
A B
3 2
6 1
5 6
3 7
2 1
3 6
1 3
groupBy(A), then Sort(B)
![Page 35: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/35.jpg)
36
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer sizePattern 1: Change accumulation to streaming operators
3 2
4 1
5 6
3 7
2 1
3 6
1 3
map(k,v)
map(k,v)
3 2
5 6
3 7
3 6
1 3
1 3
3 2
3 7
3 6
2 1
4 1
4 1
2 1
reduce(k, list(v))
ArrayList.add(2, 7, 6, …)ArrayList.sort()
reduce(k, list(v))
2 1
4 1
1 3
3 2
3 6
3 7
A B OOM
![Page 36: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/36.jpg)
37
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer sizePattern 1: Change accumulation to streaming operators
3 2
4 1
5 6
3 7
2 1
3 6
1 3
map(k,v)
map(k,v)
3,2 2
5,6 6
3,7 7
3,6 6
1,3 3
1,3 3
3,2 2
3,6 6
3,7 7
2,1 1
4,1 1
4,1 1
2,1 1
2 1
4 1
1 3
3 2
3 6
3 7
A B
reduce(k, list(v))
read( (k1, v1), v )output (k2, v)
reduce(k, list(v))
streaming operator
![Page 37: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/37.jpg)
38
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer size
• e.g., add combine() to perform partial aggregationPattern 2: Do the accumulation in several passes
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Reduce task
Map task
combine(k,v)
combine(k,v)
combine(k,v)
combine(k,list(v))
combine(k,list(v))
![Page 38: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/38.jpg)
39
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer size
• e.g., add combine() to perform partial aggregationPattern 2: Do the accumulation in several passes
map(k,v)
reduce(k,list(v))
map(k,v)
map(k,v)
reduce(k,list(v))
Reduce task
Map task
combine(k,v)
combine(k,v)
combine(k,v)
combine(k,list(v))
combine(k,list(v))
![Page 39: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/39.jpg)
40
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer size
• e.g., map() emits partial accumulated results every 1000 recordsPattern 3: Spill the accumulated results into disk + On-disk merge
<K1, V1><K2, V2>
<K1002, V1002>
map()
<K1’, V1’><K2’, V2’>
……
<K1001, V1001>
<K1000, V1000> accumulated results
<K1000’, V1000’>
Disk
![Page 40: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/40.jpg)
41
RQ2: Fix patterns – User code related fixes
Category: User code related fix patterns Lower framework buffer size
• e.g., skip the extremely large single <k, v> record• Useful while invoking a third-party library without code• Useful while we do not need precise results
Pattern 4: Skip the abnormal data
<K1, V1><K2, V2>
<K3, V3>
<K4, V4>
Map()
<K1’, V1’><K2’, V2’>
<K4’, V4’>
![Page 41: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/41.jpg)
42
RQ3: Potential fault-tolerant mechanisms
Enable dynamic memory management Automatically balance the runtime memory usage of the
framework and user code
or
Buffered/cached data
processed records
Pause the task Resume the task
Spill into disk Read backWhen memory usage is lower
Running task
accumulated results
![Page 42: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/42.jpg)
43
RQ3: Potential fault-tolerant mechanisms
Provide memory+disk data structures 18 OOM errors occur in ArrayList, Map, Set, Queue, etc. New data structures
• For aggregating <k, v> records • For storing accumulated results• Provide common APIs as C++ STL and Java Collections• Automatically swap between memory and disk
Disk
Mem Element1
Element2
Element3
K1 V1
K2 V2
K3 V3
NewMapNewArrayList
![Page 43: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/43.jpg)
44
Related work
Failure study on big data applications Li et al. [ICSE ’13]
• studied 250 failures in SCOPE jobs • did not target OOM errors
Kavulya et al. [CCGrid ’10] • analyzed the performance problems and failures in Hadoop jobs • studied Array indexing errors and IOException errors
![Page 44: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/44.jpg)
45
Conclusions
Studied 123 OOM errors in Hadoop/Spark applications Summarized the common OOM root causes
large buffered/cached data abnormal dataflow memory-consuming user code
Summarized the common fix patterns Proposed two fault-tolerant mechanisms
![Page 45: Experience Report: A Characteristic Study on Out of Memory Errors in Distributed Data-Parallel Applications 2015-11-05 Lijie Xu, Wensheng Dou, Feng Zhu,](https://reader035.vdocument.in/reader035/viewer/2022062805/5697c02a1a28abf838cd82ef/html5/thumbnails/45.jpg)
46
Thanks! Q&A