hadoop: beyond mapreduce
DESCRIPTION
Overview of the above and beyond MapReduce, for the HPC/science community. Key point: move up the stack, reuse what is there. But: some of these people are capable of writing their own YARN apps, so they should be encouraged to do so if they see a need.TRANSCRIPT
![Page 1: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/1.jpg)
© Hortonworks Inc. 2013
Hadoop: Beyond MapReduce
Steve Loughran, [email protected]@steveloughran
Big Data workshop, June 2013
![Page 2: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/2.jpg)
© Hortonworks Inc.
Hadoop MapReduce
1. Map: events <k,v>* pairs2. Reduce: <k,[v1, v2,.. vn]> <k,v'>
•Map trivially parallelisable on blocks in a file•Reduce parallelise on keys•MapReduce engine can execute Map and Reduce sequences against data
•HDFS provides data location for work placement
Page 2
![Page 3: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/3.jpg)
© Hortonworks Inc.
MapReduce democratised big data
•Conceptual model easy to grasp•Can write and test locally, superlinear scaleup
•Tools and stack
Page 3
You don't need to understand parallel coding to run appsacross 1000 machines
![Page 4: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/4.jpg)
© Hortonworks Inc. 2012
The stack is key to use
Page 4
Kafka
![Page 5: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/5.jpg)
© Hortonworks Inc. 2012
Example: Pig
Page 5
generated = LOAD '$src/$srcfile' USING PigStorage(',' , '-noschema') AS (line: int, gaussian: double, b: boolean, c:chararray );sorted = ORDER generated BY c ASC;result = FILTER sorted BY gaussian >= 0;
![Page 6: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/6.jpg)
© Hortonworks Inc.
Example: Apache Giraph
•Graph nodes in RAM•exchange data with peers at barriers•use cases: PageRank, Friend-of-Friend
•But also: modelling cells in a heart
Bulk-Synchronous-Parallel -read Pregel paper
Page 6
![Page 7: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/7.jpg)
© Hortonworks Inc.
But there is a lot more we can do
Page 7
![Page 8: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/8.jpg)
© Hortonworks Inc.
New Algorithms and runtimes
•Giraph for graph work•Stream processing: Storm• Iterative and chained processing: Dryad-style•Long-lived processes
Page 8
![Page 9: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/9.jpg)
© Hortonworks Inc.
Production-side issues
•Scale to 10K nodes•Eliminate SPOFs & Bottlenecks• Improve versioning by moving MR engine user-side
•Avoid having dedicated servers for other roles
Page 9
![Page 10: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/10.jpg)
© Hortonworks Inc. 2012
YARN: Yet Another Resource Negotiator
App Master manages the
app
AM can request
containers and run code in
them
![Page 11: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/11.jpg)
© Hortonworks Inc.
YARN vs Other Resource Negotiators
•MapReduce #1 initial use case•Failures:AM handles worker failures, YARN handles AM failures
•Scheduling Locality: sources of data, destinations. AM gets provides location requests along with (CPU, RAM
Page 11
![Page 12: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/12.jpg)
© Hortonworks Inc.
Pig/Hive-MR versus Pig/Hive-Tez
Page 12
I/O Synchronization
BarrierI/O Pipelining
Pig/Hive - Tez Pig/Hive - Tez
SELECT a.state, COUNT(*)
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.state
![Page 13: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/13.jpg)
© Hortonworks Inc.
FastQuery: Beyond Batch with YARN
Page 13
Tez Generalizes Map-Reduce
Simplified execution plans processdata more efficiently
Always-On Tez Service
Low latency processing forall Hadoop data processing
![Page 14: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/14.jpg)
© Hortonworks Inc.
You too can write a distributed execution framework -if you need to
Page 14
![Page 15: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/15.jpg)
© Hortonworks Inc.
Start with the work in progress
•Hamster: MPI•Storm-YARN from Yahoo!•Hoya: HBase on YARN me
And start with other people's code•Continuuity Weave -looks best place to start
Page 15
![Page 16: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/16.jpg)
© Hortonworks Inc.
What are the services and algorithms we are going to need?
Page 16
![Page 17: Hadoop: Beyond MapReduce](https://reader030.vdocument.in/reader030/viewer/2022012910/54c6c5004a795966208b45ad/html5/thumbnails/17.jpg)
© Hortonworks Inc
http://hortonworks.com/careers/Page 17
P.S: we are hiring