7: Shortcomings in the MapReduce Paradigm
Zubair Nabi
April 19, 2013
Zubair Nabi 7: Shortcomings in the MapReduce Paradigm April 19, 2013 1 / 31
Outline
1. Hadoop everywhere!
2. Skew
3. Heterogeneous Environment
4. Low-level Programming Interface
5. Strictly Batch-processing
6. Single-input/single-output and Two-phase
7. Iterative and Recursive Applications
8. Incremental Computation
Users¹

- Adobe: several areas, from social services to unstructured data storage and processing
- eBay: a 532-node cluster storing 5.3 PB of data
- Facebook: used for reporting/analytics; one cluster with 1100 nodes (12 PB) and another with 300 nodes (3 PB)
- LinkedIn: 3 clusters with 4000 nodes in total
- Twitter: to store and process Tweets and log files
- Yahoo!: multiple clusters with 40000 nodes in total; the largest cluster has 4500 nodes!

¹http://wiki.apache.org/hadoop/PoweredBy
But all is not well

- Over the years, Hadoop has become a one-size-fits-all solution to data-intensive computing
- As early as 2008, David DeWitt and Michael Stonebraker asserted that MapReduce was a "major step backwards" for data-intensive computing
- They opined:
  - MapReduce is a major step backwards in database access because it negates schemas and is too low-level
  - It has a sub-optimal implementation, as it uses brute force instead of indexing, does not handle skew, and uses data pull instead of push
  - It is just rehashing old database concepts
  - It is missing most DBMS functionality, such as updates, transactions, etc.
  - It is incompatible with DBMS tools, such as human visualization and data replication from one DBMS to another
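The "brute force instead of indexing" point can be made concrete with a minimal sketch (data and names are hypothetical, not from any real system): a MapReduce job must scan every input record, whereas a DBMS with a secondary index touches only the matching rows.

```python
import bisect

# Hypothetical dataset of (key, value) records.
records = [(f"user{i:04d}", i % 7) for i in range(10000)]

# MapReduce-style access: the map phase reads all 10000 records even
# though only ~1/7 of them satisfy the predicate.
def map_scan(records, wanted):
    hits, scanned = [], 0
    for key, value in records:
        scanned += 1
        if value == wanted:
            hits.append(key)
    return hits, scanned

# Index-style access: a sorted secondary index on `value` jumps straight
# to the matching range via binary search.
index = sorted((value, key) for key, value in records)

def index_lookup(index, wanted):
    lo = bisect.bisect_left(index, (wanted,))
    hi = bisect.bisect_left(index, (wanted + 1,))
    return [key for _, key in index[lo:hi]], hi - lo

hits_scan, touched_scan = map_scan(records, 3)
hits_idx, touched_idx = index_lookup(index, 3)
print(touched_scan, touched_idx)  # 10000 vs 1429 records touched
```

Both approaches return the same result set; the difference the critique targets is purely in how much data is touched to get there.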
Skew
Introduction

- Due to the uneven distribution of intermediate key/value pairs, some reduce workers end up doing more work
- Such reducers become "stragglers"
- A large number of real-world applications follow long-tailed (Zipf-like) distributions
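A small sketch of why long-tailed key distributions translate into straggling reducers, assuming a synthetic exponent-1 Zipf frequency model and MapReduce's default hash partitioning (all key names and counts are illustrative):

```python
import zlib
from collections import Counter

# Synthetic Zipf-like counts: the i-th most frequent key occurs ~N/i times.
N = 100000
frequencies = {f"key{i}": N // i for i in range(1, 1001)}

# MapReduce's default routing: reducer = hash(key) mod R. We use crc32
# as a stand-in hash so the sketch is deterministic across runs.
num_reducers = 8
load = Counter()
for key, count in frequencies.items():
    reducer = zlib.crc32(key.encode()) % num_reducers
    load[reducer] += count

total = sum(load.values())
print("even share per reducer:", total // num_reducers)
print("heaviest reducer load :", max(load.values()))
# Whichever reducer receives "key1" gets at least N = 100000 pairs,
# well above an even 1/8 share, so it finishes last: a straggler.
```

Hash partitioning balances the number of distinct *keys* per reducer, not the number of *values*, which is exactly what skew breaks.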
Wordcount and skew

- Text corpora have a Zipfian skew, i.e. a very small number of words accounts for most occurrences
- For instance, of the 242,758 words in the dataset used to generate the figure, the 10, 100, and 1000 most frequent words account for 22%, 43%, and 64% of the entire set, respectively
- Such skewed intermediate results lead to an uneven distribution of workload across reduce workers
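The quoted percentages are in the ballpark predicted by an ideal Zipf law with exponent 1, under which the share of the k most frequent words out of a V-word vocabulary is H(k)/H(V) (harmonic numbers). This is a model sketch, not a recomputation of the slide's actual dataset, so the numbers agree only approximately:

```python
# Share of occurrences captured by the top-k words under an ideal Zipf
# distribution (frequency of rank r proportional to 1/r).
def harmonic(n):
    return sum(1.0 / r for r in range(1, n + 1))

V = 242758  # vocabulary size quoted on the slide
H_V = harmonic(V)
for k in (10, 100, 1000):
    share = harmonic(k) / H_V
    print(f"top {k:4d} words: {share:.0%} of all occurrences")
```

The ideal model yields roughly 23%, 40%, and 58%; the real corpus is slightly more top-heavy, but the shape is the same.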
Page rank and skew

- Even Google's implementation of its core PageRank algorithm is plagued by the skew problem
- Google uses PageRank to calculate a webpage's relevance for a given search query
  - Map: emit the outlinks for each page
  - Reduce: calculate the rank of each page
- The skew in intermediate data exists due to the huge disparity in the number of incoming links across pages on the Internet
- The scale of the problem is evident when we consider that Google currently indexes more than 25 billion webpages with skewed links
- For instance, Facebook has 49,376,609 incoming links (at the time of writing) while the personal webpage of the presenter has only 4
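A minimal single-iteration sketch of the map/reduce split above, on a toy graph (the graph, initial ranks, and omission of damping are all simplifying assumptions). The point is that a reducer receives one contribution per in-link, so a heavily linked page concentrates work on one reducer:

```python
from collections import defaultdict

# Toy link graph: page -> outlinks. "hub" is linked from every page,
# mirroring the Facebook-vs-personal-homepage disparity on the slide.
graph = {f"page{i}": ["hub"] for i in range(1000)}
graph["hub"] = ["page0"]
ranks = {page: 1.0 for page in graph}

# Map: each page emits a rank contribution along every outlink.
intermediate = defaultdict(list)
for page, outlinks in graph.items():
    for target in outlinks:
        intermediate[target].append(ranks[page] / len(outlinks))

# Reduce: sum the contributions per page (damping omitted for brevity).
new_ranks = {page: sum(vals) for page, vals in intermediate.items()}

# Reducer workload = number of values per key, which tracks in-links.
print(len(intermediate["hub"]), len(intermediate["page0"]))  # 1000 vs 1
```

With hash partitioning, "hub" lands on a single reducer, which then processes a thousand times more values than its peers.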
Zipf distributions are everywhere

- They are followed by inverted indexing, publish/subscribe systems, fraud detection, and various clustering algorithms
- P2P systems exhibit Zipf distributions too, both in terms of users and content
- So do web caching schemes, as well as email and social networks
Heterogeneous Environment
Introduction

- In the MapReduce model, tasks which take exceptionally long are labelled "stragglers"
- The framework launches a speculative copy of each straggler on another machine, expecting it to finish sooner
- Without this, the overall job completion time is dictated by the slowest straggler
- On Google clusters, speculative execution can reduce job completion time by 44%
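A back-of-the-envelope sketch of why speculation pays off (task times, the single-wave assumption, and the zero detection delay are all hypothetical simplifications): the job ends when its last task does, so re-running one straggler on a healthy node can shave off most of the tail.

```python
import random

random.seed(7)  # arbitrary seed; the numbers are purely illustrative

# 99 ordinary tasks plus one straggler (run times in seconds).
task_times = [random.uniform(50, 60) for _ in range(99)] + [300.0]

# With one wave of perfectly parallel tasks, the job finishes when the
# slowest task does.
no_speculation = max(task_times)

# Speculative execution: launch a backup of the straggler on a healthy
# node and keep whichever copy finishes first.
backup = random.uniform(50, 60)
with_speculation = max(sorted(task_times)[-2], min(task_times[-1], backup))

print(f"without speculation: {no_speculation:.0f}s, "
      f"with speculation: {with_speculation:.0f}s")
```

In this toy setup the speculative copy cuts completion time from 300 s to about a minute; real savings depend on when the straggler is detected and where the backup runs.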
Hadoop's assumptions regarding speculation

1. All nodes are equal, i.e. they can perform work at more or less the same rate
2. Tasks make progress at a constant rate throughout their lifetime
3. There is no cost to launching a speculative copy on an otherwise idle slot/node
4. The progress score of a task captures the fraction of its total work that it has done; specifically, the shuffle, merge, and reduce phases each take roughly 1/3 of the total time
5. As tasks finish in waves, a task with a low progress score is most likely a straggler
6. Tasks within the same phase require roughly the same amount of work
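Stock Hadoop turns these assumptions into a simple rule (as described in the LATE paper): a task is speculated when its progress score falls more than a fixed threshold, 0.2, below the average of its category. A sketch of that heuristic, with hypothetical task names and scores:

```python
# Hadoop's default straggler heuristic: speculate any task whose
# progress score is more than THRESHOLD below the category average.
THRESHOLD = 0.2

def find_stragglers(progress_scores):
    avg = sum(progress_scores.values()) / len(progress_scores)
    return [task for task, score in progress_scores.items()
            if score < avg - THRESHOLD]

# On homogeneous nodes this picks out a genuinely stuck task...
print(find_stragglers({"t1": 0.9, "t2": 0.85, "t3": 0.3}))   # ['t3']

# ...but on heterogeneous hardware a healthy task on a slower (yet
# steadily progressing) node is also flagged and needlessly re-executed.
print(find_stragglers({"t1": 0.9, "t2": 0.85, "t3": 0.45}))  # ['t3']
```

The rule cannot distinguish "stuck" from "slow but steady", which is exactly the failure mode the next slides examine.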
![Page 45: Topic 7: Shortcomings in the MapReduce Paradigm](https://reader034.vdocument.in/reader034/viewer/2022052505/555066b3b4c905ae3f8b569c/html5/thumbnails/45.jpg)
Assumptions 1 and 2
1 All nodes are equal, i.e. they can perform work at more or less the same rate
2 Tasks make progress at a constant rate throughout their lifetime
Both break down in heterogeneous environments, which consist of multiple generations of hardware
Assumption 3
3 There is no cost of launching a speculative copy on an otherwise idle slot/node
Breaks down because node resources such as disk and network are shared, so a speculative copy is never free
Assumption 4
4 The progress score of a task captures the fraction of its total work that it has done. Specifically, the shuffle, merge, and reduce logic phases each take roughly 1/3 of the total time
Breaks down due to the fact that in reduce tasks the shuffle phase takes the longest, not 1/3 like the other two
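The 1/3-per-phase scoring can be sketched as follows. This is an illustrative toy, not Hadoop's actual scheduler code; the phase names and weighting follow the assumption stated above:

```python
def reduce_progress(phase, fraction_done):
    """Hadoop-style progress score for a reduce task: the shuffle,
    merge, and reduce phases are each assumed to be 1/3 of the work."""
    phases = ["shuffle", "merge", "reduce"]
    completed = phases.index(phase)          # phases finished before this one
    return (completed + fraction_done) / 3.0

# A reduce task that has shuffled 80% of its input is scored at ~0.27,
# even though in practice the shuffle often dominates total runtime,
# so the task may be much closer to done than the score suggests.
print(reduce_progress("shuffle", 0.8))   # ~0.267
print(reduce_progress("reduce", 0.5))    # ~0.833
```

Because the score understates tasks stuck in a long shuffle, speculation based on it can pick the wrong candidates.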
Assumption 5
5 As tasks finish in waves, a task with a low progress score is most likely a straggler
Breaks down because uneven workloads spread task completion across time, so tasks do not finish in distinct waves
Assumption 6
6 Tasks within the same phase require roughly the same amount of work
Breaks down due to data skew
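A tiny illustration of why skew breaks this assumption: under hash partitioning (the behaviour of Hadoop's default HashPartitioner, sketched here with a synthetic word stream), every copy of a hot key lands on the same reducer:

```python
import collections
import zlib

NUM_REDUCERS = 4

# Synthetic word stream with one hot key: "the" dominates, as word
# frequencies in real text roughly do (Zipfian distribution).
words = ["the"] * 5000 + ["hadoop", "skew", "reduce", "task"] * 100

def partition(key):
    # Deterministic hash partitioning over NUM_REDUCERS buckets
    return zlib.crc32(key.encode()) % NUM_REDUCERS

load = collections.Counter(partition(w) for w in words)
# All 5000 "the" records hash to one bucket, so one reducer does the
# overwhelming majority of the work while the others sit nearly idle.
print(load)
```

Even with perfectly balanced map inputs, reduce-side work is only as balanced as the key distribution.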
Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
Introduction
The one-input, two-stage data flow is extremely rigid for ad-hoc analysis of large datasets
Hacks need to be put into place for different data flows, such as joins or multiple stages
Custom code has to be written for common DB operations, such as projection and filtering
The opaque nature of map and reduce functions makes it impossible to perform optimizations, such as operator reordering
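As a concrete example of the "custom code" point: what would be a one-line SELECT/WHERE in SQL becomes hand-written map logic. The log schema and field names below are hypothetical:

```python
# SQL equivalent: SELECT user, bytes FROM logs WHERE status = 500
# In MapReduce, the same projection + filter is hand-written map code.
# Because the function is an opaque black box to the framework, it
# cannot be inspected, optimized, or reordered with other operators.
def map_fn(line):
    user, url, status, nbytes = line.split("\t")
    if status == "500":                 # filter (WHERE)
        yield (user, int(nbytes))       # projection (SELECT)

logs = [
    "alice\t/index\t200\t512",
    "bob\t/cart\t500\t128",
]
out = [kv for line in logs for kv in map_fn(line)]
print(out)  # [('bob', 128)]
```

Layers like Pig Latin exist precisely to express such operations declaratively above this interface.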
Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
Introduction
In the case of MapReduce, the entire output of a map or a reduce task needs to be materialized to local storage before the next stage can commence
This simplifies fault-tolerance
Reducers have to pull their input instead of the mappers pushing it
Negates pipelining, result estimation, and continuous queries (stream processing)
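The contrast between the two execution styles can be sketched in miniature. This is a conceptual illustration, not Hadoop code: the "batch" stage imposes a full barrier like MapReduce's materialized shuffle, while the generator version shows the record-at-a-time pipelining that the pull model rules out:

```python
# Batch style: each stage fully materializes its output before the
# next stage may start, as MapReduce does between map and reduce.
def batch_stage(inputs, fn):
    return [fn(x) for x in inputs]        # barrier: whole list built first

# Pipelined style: downstream consumers receive each record as soon as
# it is produced, enabling early results and continuous processing.
def pipelined_stage(inputs, fn):
    for x in inputs:
        yield fn(x)                       # record-at-a-time handoff

data = range(5)
batch = batch_stage(data, lambda x: x * x)
first = next(pipelined_stage(data, lambda x: x * x))  # available immediately
print(batch, first)  # [0, 1, 4, 9, 16] 0
```

Systems such as MapReduce Online relax the barrier to recover exactly this kind of pipelining.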
Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
Introduction
1 Not all applications can be broken down into just two phases, such as complex SQL-like queries
2 Tasks take in just one input and produce one output
Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
Introduction
1 Hadoop is widely employed for iterative computations
2 For machine learning applications, the Apache Mahout library is used atop Hadoop
3 Mahout uses an external driver program to submit multiple jobs to Hadoop and perform a convergence test
4 There is no fault-tolerance across iterations, and each iteration pays the overhead of job submission
5 Loop-invariant data is materialized to storage on every iteration
Outline
1 Hadoop everywhere!
2 Skew
3 Heterogeneous Environment
4 Low-level Programming Interface
5 Strictly Batch-processing
6 Single-input/single output and Two-phase
7 Iterative and Recursive Applications
8 Incremental Computation
Introduction
1 Most workloads processed by MapReduce are incremental by nature, i.e. MapReduce jobs often run repeatedly with small changes in their input
2 For instance, successive PageRank runs typically see only small modifications to the input graph
3 Unfortunately, even with a small change in the input, MapReduce re-performs the entire computation
References
1 MapReduce: A major step backwards: http://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html
2 Matei Zaharia, Andy Konwinski, Anthony D. Joseph, Randy Katz, and Ion Stoica. 2008. Improving MapReduce performance in heterogeneous environments. In Proceedings of the 8th USENIX Conference on Operating Systems Design and Implementation (OSDI '08). USENIX Association, Berkeley, CA, USA, 29-42.
3 Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, and Andrew Tomkins. 2008. Pig Latin: a not-so-foreign language for data processing. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD '08). ACM, New York, NY, USA, 1099-1110.
References (2)
4 Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Conference on Networked Systems Design and Implementation (NSDI '10). USENIX Association, Berkeley, CA, USA.
5 Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. 2007. Dryad: distributed data-parallel programs from sequential building blocks. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems (EuroSys '07). ACM, New York, NY, USA, 59-72.
6 Derek G. Murray, Malte Schwarzkopf, Christopher Smowton, Steven Smith, Anil Madhavapeddy, and Steven Hand. 2011. CIEL: a universal execution engine for distributed data-flow computing. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation (NSDI '11). USENIX Association, Berkeley, CA, USA.
References (3)
7 Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for incremental computations. In Proceedings of the 2nd ACM Symposium on Cloud Computing (SOCC '11). ACM, New York, NY, USA.