Presenters: Abhishek Verma, Nicolas Zea
MapReduce
Clean abstraction, but extremely rigid: a 2-stage group-by aggregation
Code reuse and maintenance are difficult

Google → MapReduce, Sawzall
Yahoo → Hadoop, Pig Latin
Microsoft → Dryad, DryadLINQ
Improving MapReduce in heterogeneous environments
[Figure: MapReduce dataflow — input records are divided into splits; each split is processed by a map task whose output is locally sorted (QSort) by key; the shuffle groups records by key (e.g., k1 → v1, v3, v5 and k2 → v2, v4) and delivers each group to a reduce task, which produces the output records.]
Extremely rigid data flow; other flows must be hacked in (stages, joins, splits)
Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct
Semantics are hidden inside the map and reduce functions
Difficult to maintain, extend, and optimize
(a minimal sketch of a hand-coded reduce-side join appears after the figure below)
[Figure: workflows expressed as chains of MapReduce stages: M → R, and M → R → M → R]
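To make the "coded by hand" point concrete, here is a minimal sketch (not from the slides) of a reduce-side equi-join of Visits(user, url, time) with UrlInfo(url, category, pagerank) written directly in MapReduce style; the shuffle is simulated in plain Python so the snippet is self-contained, and the tagging scheme and function names are illustrative assumptions. The point is that the join semantics live entirely inside the map and reduce functions, invisible to the framework.

from collections import defaultdict

def map_visits(record):
    user, url, time = record
    yield url, ("V", (user, time))      # tag each record with its source relation

def map_urlinfo(record):
    url, category, pagerank = record
    yield url, ("U", (category, pagerank))

def reduce_join(url, tagged_values):
    visits = [v for tag, v in tagged_values if tag == "V"]
    infos = [v for tag, v in tagged_values if tag == "U"]
    for user, time in visits:           # the join semantics are buried here, opaque to the framework
        for category, pagerank in infos:
            yield (user, url, time, category, pagerank)

def run(visits, urlinfo):
    groups = defaultdict(list)          # stand-in for the shuffle: group map output by key
    for rec in visits:
        for k, v in map_visits(rec):
            groups[k].append(v)
    for rec in urlinfo:
        for k, v in map_urlinfo(rec):
            groups[k].append(v)
    return [out for k, vs in groups.items() for out in reduce_join(k, vs)]

print(run([("Amy", "cnn.com", "8:00")], [("cnn.com", "News", 0.9)]))
# [('Amy', 'cnn.com', '8:00', 'News', 0.9)]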
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
Pigs Eat Anything
Can operate on data without metadata: relational, nested, or unstructured.
Pigs Live Anywhere
Not tied to one particular parallel framework.
Pigs Are Domestic Animals
Designed to be easily controlled and modified by its users. UDFs: transformation functions, aggregates, grouping functions, and conditionals.
Pigs Fly
Processes data quickly(?)
Dataflow language, procedural: different from SQL
Quick start and interoperability
Nested data model
UDFs as first-class citizens
Parallelism required
Debugging environment
Data Model
Atom: 'cs'
Tuple: ('cs', 'ece', 'ee')
Bag: { ('cs', 'ece'), ('cs') }
Map: [ 'courses' → { ('523', '525', '599') } ]
Expressions
Fields by position: $0
Fields by name: f1
Map lookup: #, e.g. $0#'courses'
Find the top 10 most visited pages in each category
Visits:
  User  URL         Time
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

URL Info:
  URL         Category  PageRank
  cnn.com     News      0.9
  bbc.com     News      0.8
  flickr.com  Photos    0.7
  espn.com    Sports    0.9
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10(urls)
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

Operates directly over files
Schemas optional, can be assigned dynamically
UDFs can be used in every construct
LOAD: specifying input data
FOREACH: per-tuple processing
FLATTEN: eliminate nesting
FILTER: discarding unwanted data
COGROUP: getting related data together
GROUP, JOIN
STORE: asking for output
Other: UNION, CROSS, ORDER, DISTINCT
Every group or join operation forms a map-reduce boundary
Other operations are pipelined into the map and reduce phases
(see the sketch below for how the group-and-count step maps onto map and reduce)
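As an illustration (not the Pig compiler's actual code), the first map-reduce boundary in the example — group visits by url followed by the count — corresponds roughly to a map that emits the grouping key and a reduce that aggregates the resulting bag, with the foreach pipelined into the reduce phase. The function names and record layout below are assumptions for the sketch.

from collections import defaultdict

def map_phase(visit):
    user, url, time = visit
    yield url, visit                     # the group key (url) becomes the map output key

def reduce_phase(url, visit_bag):
    yield url, len(visit_bag)            # foreach ... generate count(visits), pipelined into reduce

def run(visits):
    groups = defaultdict(list)           # stand-in for the shuffle
    for v in visits:
        for k, val in map_phase(v):
            groups[k].append(val)
    return [out for k, bag in groups.items() for out in reduce_phase(k, bag)]

print(run([("Amy", "cnn.com", "8:00"), ("Fred", "cnn.com", "12:00")]))
# [('cnn.com', 2)]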
[Figure: the logical plan compiled into a chain of three map-reduce jobs, with a boundary at each group/join:
Map1/Reduce1 — Load Visits; Group by url; Foreach url generate count
Map2/Reduce2 — Load Url Info; Join on url
Map3/Reduce3 — Group by category; Foreach category generate top10(urls)]
Write-run-debug cycle
Sandbox dataset
Objectives: realism, conciseness, completeness
Problems: UDFs
Optional "safe" query optimizer: performs only high-confidence rewrites
User interface: boxes-and-arrows UI
Promote collaboration, sharing of code fragments and UDFs
Tight integration with a scripting language: use loops and conditionals of the host language
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu,
Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
[Figure: Dryad system architecture — a job manager (JM) holds the job schedule and uses the control plane to talk to the name server (NS) and per-node process daemons (PD); the vertices (V) it schedules on cluster machines exchange data over the data plane via files, TCP pipes, or shared-memory FIFOs.]
Data model: partitioned collections of C# objects
Partitioning: Hash, Range, RoundRobin
Language extensions: Apply, Fork; hints

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);
var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };
[Figure: DryadLINQ execution flow — on the client machine, a C# LINQ query expression over input tables is handed to DryadLINQ (ToDryadTable / query invocation), which compiles it into a distributed query plan (a Dryad job) plus C# vertex code; the Dryad job manager (JM) runs the job in the data center over the partitioned input tables and writes output tables; the results return to the client as a DryadTable of C# objects that can be consumed with foreach.]
LINQ expressions are converted to an execution plan graph (EPG)
Similar to a database query plan
A DAG annotated with metadata properties
The EPG is the skeleton of the Dryad dataflow graph
As long as native operations are used, properties can propagate, helping optimization
Static optimizations:
Pipelining: multiple operations in a single process
Removing redundancy
Eager aggregation: move aggregations in front of partitionings
I/O reduction: try to use TCP and in-memory FIFOs instead of disk

Dynamic optimizations:
As information from the job becomes available, mutate the execution graph
Dataset-size-based decisions, e.g., intelligent partitioning of data
Aggregation can turn into a tree to improve I/O based on locality: for example, part of the computation is done locally and aggregated before being sent across the network (a sketch of the tree-aggregation idea follows below)
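A minimal sketch (mine, not DryadLINQ code) of that tree-aggregation idea: each machine pre-aggregates its local partition, a per-rack stage combines those partial results, and a final combine produces the answer, so only one value per rack crosses the network. The rack grouping and function names are assumptions.

def local_aggregate(partition):
    return sum(partition)                      # partial aggregate on the machine holding the data

def combine(partials):
    return sum(partials)                       # the same associative combine at every tree level

def tree_aggregate(partitions, racks):
    # partitions: list of lists of numbers; racks: parallel list of rack ids
    locals_ = [local_aggregate(p) for p in partitions]
    per_rack = {}
    for rack, value in zip(racks, locals_):
        per_rack.setdefault(rack, []).append(value)
    rack_totals = [combine(vs) for vs in per_rack.values()]   # one value per rack crosses the network
    return combine(rack_totals)

print(tree_aggregate([[1, 2], [3, 4], [5]], racks=["r1", "r1", "r2"]))  # 15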
TeraSort - scalability
240-computer cluster of 2.6 GHz dual-core AMD Opterons
Sort 10 billion 100-byte records on a 10-byte key
Each computer stores 3.87 GB
DryadLINQ vs Dryad - SkyServer
Dryad is hand optimized
No dynamic optimization overhead
DryadLINQ is 10% native code
High level and data type transparent
Automatic optimization friendly
Manual optimizations using Apply operator
Leverage any system running LINQ framework
Support for interacting with SQL databases
Single computer debugging made easy
Strong typing, narrow interface
Deterministic replay execution
Dynamic optimizations appear data intensive: what kind of overhead?
EPG analysis overhead -> high latency
No real comparison with other systems
Progress tracking is difficult; no speculation
Will solid state drives diminish the advantages of MapReduce?
Why not use parallel databases?
MapReduce vs. Dryad?
How different from Sawzall and Pig?
Language           | Sawzall                 | Pig Latin                    | DryadLINQ
Built by           | Google                  | Yahoo                        | Microsoft
Programming        | Imperative              | Imperative                   | Imperative & declarative hybrid
Resemblance to SQL | Least                   | Moderate                     | Most
Execution engine   | Google MapReduce        | Hadoop                       | Dryad
Performance *      | Very efficient          | 5-10 times slower            | 1.3-2 times slower
Implementation     | Internal, inside Google | Open source, Apache license  | Internal, inside Microsoft
Model              | Operate per record      | Sequence of MR               | DAGs
Usage              | Log analysis            | + Machine learning           | + Iterative computations
Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica
University of California at Berkeley
Speculative tasks are executed only if no failed or waiting tasks are available
Notion of progress: 3 phases of execution
1. Copy phase
2. Sort phase
3. Reduce phase
Each phase is weighted by the % of data processed
Determines whether a task has failed or is a straggler and is available for speculation (a sketch of the progress-score computation follows below)
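A rough sketch of Hadoop's default progress score and straggler test as described in the LATE paper (function and variable names are my assumptions): maps are scored by the fraction of input read, reduce tasks give each of the three phases 1/3 of the score, and a task becomes a candidate for speculation when its score falls more than 0.2 below the average for its category.

def map_progress(fraction_input_read):
    return fraction_input_read                      # maps: score = fraction of input data read

def reduce_progress(phases_completed, fraction_of_current_phase):
    # phases_completed in {0, 1, 2}: copy, sort, reduce each contribute 1/3
    return (phases_completed + fraction_of_current_phase) / 3.0

def is_straggler(progress_score, category_scores, threshold=0.2):
    avg = sum(category_scores) / len(category_scores)
    return progress_score < avg - threshold         # fixed 20% gap below the category average

# e.g., a reduce task halfway through its copy phase:
print(reduce_progress(0, 0.5))                      # 0.1666...
print(is_straggler(0.17, [0.9, 0.85, 0.8, 0.17]))   # True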
1. Nodes can perform work at exactly the same rate
2. Tasks progress at a constant rate throughout time
3. There is no cost to launching a speculative task on an idle node
4. The three phases of execution take approximately the same time
5. Tasks with a low progress score are stragglers
6. Maps and Reduces require roughly the same amount of work
Virtualization breaks down homogeneity
Amazon EC2: multiple VMs on the same physical host compete for memory and network bandwidth
Ex: two map tasks can compete for disk bandwidth, causing one to be a straggler
The progress threshold in Hadoop is fixed and assumes low progress = faulty node
Too many speculative tasks executed
Speculative execution can harm running tasks
Task’s phases are not equal
Copy phase typically the most expensive due to network communication cost
Causes many tasks' progress to jump rapidly from 1/3 to 1, creating fake stragglers
Real stragglers get usurped
Unnecessary copying due to fake stragglers
The fixed 20% progress-difference threshold means anything with >80% progress is never speculatively executed
Longest Approximate Time to End (LATE)
Primary assumption: the best task to speculatively execute is the one that will finish furthest into the future
Secondary: tasks make progress at an approximately constant rate
ProgressRate = ProgressScore / T, where T = time the task has been running
Estimated time to completion = (1 - ProgressScore) / ProgressRate
Launch speculative tasks on fast nodes: best chance to overcome the straggler, vs. using the first available node
Cap on the total number of speculative tasks
'Slowness' minimum threshold
Does not take data locality into account
(a sketch of the heuristic follows below)
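A condensed sketch of the LATE heuristic as described above (function names and threshold values are illustrative assumptions, not the paper's code): when a fast node asks for work and the speculative-task cap is not exceeded, estimate each running task's time left from its progress rate and speculate on the not-yet-speculated task with the longest estimated time to end, provided its progress rate is below the slow-task threshold.

SPECULATIVE_CAP = 0.1        # max fraction of slots used for speculation (assumed value)
SLOW_TASK_THRESHOLD = 0.25   # only speculate on tasks in the slowest 25% by progress rate (assumed)

def progress_rate(task):
    return task["progress_score"] / task["runtime"]

def time_left(task):
    return (1.0 - task["progress_score"]) / progress_rate(task)

def choose_speculative_task(running_tasks, node_is_fast, speculative_fraction):
    if not node_is_fast or speculative_fraction >= SPECULATIVE_CAP:
        return None
    candidates = [t for t in running_tasks if not t["already_speculated"]]
    if not candidates:
        return None
    # only tasks whose progress rate falls below the slow-task percentile qualify
    rates = sorted(progress_rate(t) for t in candidates)
    cutoff = rates[int(SLOW_TASK_THRESHOLD * (len(rates) - 1))]
    slow = [t for t in candidates if progress_rate(t) <= cutoff]
    # speculate on the slow task with the longest approximate time to end
    return max(slow, key=time_left) if slow else None

tasks = [
    {"progress_score": 0.9, "runtime": 90, "already_speculated": False},
    {"progress_score": 0.2, "runtime": 100, "already_speculated": False},
]
print(choose_speculative_task(tasks, node_is_fast=True, speculative_fraction=0.0))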
Sort: EC2 test cluster, 1.0-1.2 GHz Opteron/Xeon with 1.7 GB memory
Sort with stragglers: manually slowed down 8 VMs with background processes
Grep and WordCount
1. Make decisions early
2. Use finishing times
3. Nodes are not equal
4. Resources are precious
Is focusing the work on small VMs fair? Would it be better to pay for a large VM and implement a system with more customized control?
Could this be used in other systems? Progress tracking is key
Is this a fundamental contribution, or just an optimization? "Good" research?