Hadoop: Beyond MapReduce

© Hortonworks Inc. 2013 Hadoop: Beyond MapReduce Steve Loughran, Hortonworks [email protected] @steveloughran Big Data workshop, June 2013

Upload: steve-loughran

Post on 27-Jan-2015






Overview of what lies above and beyond MapReduce, for the HPC/science community. Key point: move up the stack and reuse what is there. However, some of these people are capable of writing their own YARN apps, and they should be encouraged to do so if they see a need.


Page 1: Hadoop: Beyond MapReduce


Hadoop: Beyond MapReduce

Steve Loughran, [email protected], @steveloughran

Big Data workshop, June 2013

Page 2: Hadoop: Beyond MapReduce


Hadoop MapReduce

1. Map: events → <k,v>* pairs
2. Reduce: <k,[v1, v2, .. vn]> → <k,v'>

•Map trivially parallelisable on blocks in a file
•Reduce parallelisable on keys
•MapReduce engine can execute Map and Reduce sequences against data

•HDFS provides data location for work placement
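The two phases above can be sketched in a few lines of single-process Python. This is an illustration of the model only, not Hadoop's API: the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical, and the shuffle that the engine performs between the phases is written out by hand.

```python
from collections import defaultdict
from itertools import chain


def map_phase(records, mapper):
    """Apply the mapper to every record; each call may emit zero or more (k, v) pairs."""
    return chain.from_iterable(mapper(record) for record in records)


def shuffle(pairs):
    """Group values by key: <k,v>* becomes <k,[v1, v2, .. vn]>."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Reduce each key independently: <k,[v1, .. vn]> becomes <k,v'>."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    # One event (a line of text) becomes many <word, 1> pairs.
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(key, values):
    return sum(values)


def word_count(lines):
    return reduce_phase(shuffle(map_phase(lines, word_mapper)), sum_reducer)
```

Each mapper call touches only its own record and each reducer call only its own key, which is why both phases parallelise: the first across file blocks, the second across keys.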


Page 3: Hadoop: Beyond MapReduce


MapReduce democratised big data

•Conceptual model easy to grasp
•Can write and test locally, superlinear scaleup

•Tools and stack

You don't need to understand parallel coding to run apps across 1000 machines

Page 4: Hadoop: Beyond MapReduce


The stack is key to use



Page 5: Hadoop: Beyond MapReduce


Example: Pig

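-- load comma-separated rows with an explicit schema
generated = LOAD '$src/$srcfile' USING PigStorage(',', '-noschema')
    AS (line: int, gaussian: double, b: boolean, c: chararray);
-- sort on the string column
sorted = ORDER generated BY c ASC;
-- keep only the rows with a non-negative gaussian value
result = FILTER sorted BY gaussian >= 0;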
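For readers who have not used Pig, the same load, order and filter steps can be sketched in single-machine Python. This only shows what the script computes, not how Pig compiles it into cluster jobs. The helper names `parse_row` and `run_pipeline`, and the assumption that booleans are stored as the text `true`/`false`, are illustrative.

```python
import csv
import io


def parse_row(fields):
    """Apply the schema (line: int, gaussian: double, b: boolean, c: chararray)."""
    line, gaussian, b, c = fields
    return {
        "line": int(line),
        "gaussian": float(gaussian),
        "b": b.strip().lower() == "true",  # assumed text encoding of booleans
        "c": c,
    }


def run_pipeline(csv_text):
    # LOAD ... USING PigStorage(',')
    generated = [parse_row(f) for f in csv.reader(io.StringIO(csv_text)) if f]
    # ORDER generated BY c ASC
    ordered = sorted(generated, key=lambda row: row["c"])
    # FILTER sorted BY gaussian >= 0
    return [row for row in ordered if row["gaussian"] >= 0]
```

The point of the stack is that the Pig version runs unchanged against files far larger than one machine's memory, which this sketch cannot do.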

Page 6: Hadoop: Beyond MapReduce


Example: Apache Giraph

•Graph nodes in RAM
•Exchange data with peers at barriers
•Use cases: PageRank, Friend-of-Friend

•But also: modelling cells in a heart

Bulk-Synchronous-Parallel: read the Pregel paper
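A minimal single-process sketch of the bulk-synchronous pattern, using PageRank as the example. It is not Giraph's API: the function `pagerank_bsp` and its parameters are hypothetical, and the "barrier" is simply the point where every message of a superstep has been collected before any vertex is updated.

```python
def pagerank_bsp(edges, supersteps=20, damping=0.85):
    """edges maps each vertex to its list of out-neighbours; every vertex must be a key."""
    n = len(edges)
    rank = {v: 1.0 / n for v in edges}
    for _ in range(supersteps):
        # Compute phase: each vertex sends an equal share of its rank to its peers.
        inbox = {v: [] for v in edges}
        dangling = 0.0
        for v, outs in edges.items():
            if outs:
                share = rank[v] / len(outs)
                for dst in outs:
                    inbox[dst].append(share)
            else:
                # No out-edges: spread this vertex's rank evenly over the graph.
                dangling += rank[v]
        # Barrier: all messages are delivered; only now do vertices update state.
        rank = {
            v: (1 - damping) / n + damping * (sum(inbox[v]) + dangling / n)
            for v in edges
        }
    return rank
```

In a real deployment the vertices are partitioned across workers and the inbox is network traffic, but the superstep structure is the same.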


Page 7: Hadoop: Beyond MapReduce


But there is a lot more we can do


Page 8: Hadoop: Beyond MapReduce


New Algorithms and runtimes

•Giraph for graph work
•Stream processing: Storm
•Iterative and chained processing: Dryad-style
•Long-lived processes

Page 9: Hadoop: Beyond MapReduce


Production-side issues

•Scale to 10K nodes
•Eliminate SPOFs & bottlenecks
•Improve versioning by moving MR engine user-side

•Avoid having dedicated servers for other roles


Page 10: Hadoop: Beyond MapReduce


YARN: Yet Another Resource Negotiator

App Master manages the


AM can request containers and run code in them


Page 11: Hadoop: Beyond MapReduce


YARN vs Other Resource Negotiators

•MapReduce #1 initial use case
•Failures: AM handles worker failures, YARN handles AM failures

•Scheduling Locality: sources of data, destinations. AM provides location requests along with (CPU, RAM)
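To make the locality point concrete, here is a toy model of a container request that carries both a resource size and preferred locations. It is not the YARN client API: `Node`, `ContainerRequest` and `allocate` are hypothetical names, and the node-then-rack-then-anywhere fallback is a deliberately simplified stand-in for what a real scheduler does.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    rack: str
    free_mb: int
    free_vcores: int


@dataclass
class ContainerRequest:
    memory_mb: int
    vcores: int
    preferred_nodes: list = field(default_factory=list)
    preferred_racks: list = field(default_factory=list)


def allocate(request, nodes):
    """Grant a container on the best-placed node with capacity; return its name or None."""

    def fits(node):
        return node.free_mb >= request.memory_mb and node.free_vcores >= request.vcores

    # Try node-local first, then rack-local, then anywhere in the cluster.
    tiers = [
        [n for n in nodes if n.name in request.preferred_nodes],
        [n for n in nodes if n.rack in request.preferred_racks],
        list(nodes),
    ]
    for tier in tiers:
        for node in tier:
            if fits(node):
                node.free_mb -= request.memory_mb
                node.free_vcores -= request.vcores
                return node.name
    return None
```

An application master would typically name the nodes that hold the HDFS blocks it wants to read, so that the work is placed next to the data where capacity allows.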


Page 12: Hadoop: Beyond MapReduce


Pig/Hive-MR versus Pig/Hive-Tez

[Figure: two execution plans for the same query. Pig/Hive-MR: I/O synchronization barrier between stages. Pig/Hive-Tez: I/O pipelining.]

SELECT a.state, COUNT(*)
FROM a JOIN b ON (a.id = b.id)
GROUP BY a.state
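The difference between a barrier and pipelining can be sketched for this query in plain Python. This is an analogy under stated assumptions, not Hive, MapReduce or Tez code: the function names are hypothetical, and the fully materialised list stands in for intermediate output written out between two chained jobs.

```python
from collections import Counter


def join_rows(a_rows, b_rows):
    """Hash join on id: yield one a-row for every matching b-row."""
    b_counts = Counter(b["id"] for b in b_rows)
    for a in a_rows:
        for _ in range(b_counts[a["id"]]):
            yield a


def count_by_state_with_barrier(a_rows, b_rows):
    # Stage 1 finishes and materialises all joined rows before stage 2 starts.
    joined = list(join_rows(a_rows, b_rows))
    return dict(Counter(row["state"] for row in joined))


def count_by_state_pipelined(a_rows, b_rows):
    # Joined rows flow straight into the aggregation as they are produced.
    counts = Counter()
    for row in join_rows(a_rows, b_rows):
        counts[row["state"]] += 1
    return dict(counts)
```

Both functions return the same counts; the pipelined form just never holds the full join result at once.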

Page 13: Hadoop: Beyond MapReduce


FastQuery: Beyond Batch with YARN

Tez Generalizes Map-Reduce

Simplified execution plans process data more efficiently

Always-On Tez Service

Low latency processing for all Hadoop data processing

Page 14: Hadoop: Beyond MapReduce


You too can write a distributed execution framework, if you need to

Page 15: Hadoop: Beyond MapReduce


Start with the work in progress

•Hamster: MPI
•Storm-YARN from Yahoo!
•Hoya: HBase on YARN, from me

And start with other people's code
•Continuuity Weave: looks like the best place to start

Page 16: Hadoop: Beyond MapReduce


What are the services and algorithms we are going to need?


Page 17: Hadoop: Beyond MapReduce


http://hortonworks.com/careers/

P.S: we are hiring