Presenters: Abhishek Verma, Nicolas Zea
MapReduce
Clean abstraction, but extremely rigid: a 2-stage group-by aggregation
Code reuse and maintenance are difficult

Google → MapReduce, Sawzall
Yahoo → Hadoop, Pig Latin
Microsoft → Dryad, DryadLINQ
Improving MapReduce in heterogeneous environments
[Figure: MapReduce dataflow — input records are divided into splits; each split is processed by a map task whose output is locally sorted (QSort) by key; the shuffle groups records by key (e.g., k1 → v1, v3, v5 and k2 → v2, v4) and delivers each group to a reduce task, which produces the output records.]
Extremely rigid data flow; other flows must be hacked in (stages, joins, splits)
Common operations must be coded by hand: join, filter, projection, aggregates, sorting, distinct
Semantics are hidden inside the map and reduce functions
Difficult to maintain, extend, and optimize
(a minimal sketch of a hand-coded reduce-side join appears after the figure below)
[Figure: workflows expressed as chains of MapReduce stages: M → R, and M → R → M → R]
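To make the "coded by hand" point concrete, here is a minimal sketch (not from the slides) of a reduce-side equi-join of Visits(user, url, time) with UrlInfo(url, category, pagerank) written directly in MapReduce style; the shuffle is simulated in plain Python so the snippet is self-contained, and the tagging scheme and function names are illustrative assumptions. The point is that the join semantics live entirely inside the map and reduce functions, invisible to the framework.

from collections import defaultdict

def map_visits(record):
    user, url, time = record
    yield url, ("V", (user, time))      # tag each record with its source relation

def map_urlinfo(record):
    url, category, pagerank = record
    yield url, ("U", (category, pagerank))

def reduce_join(url, tagged_values):
    visits = [v for tag, v in tagged_values if tag == "V"]
    infos = [v for tag, v in tagged_values if tag == "U"]
    for user, time in visits:           # the join semantics are buried here, opaque to the framework
        for category, pagerank in infos:
            yield (user, url, time, category, pagerank)

def run(visits, urlinfo):
    groups = defaultdict(list)          # stand-in for the shuffle: group map output by key
    for rec in visits:
        for k, v in map_visits(rec):
            groups[k].append(v)
    for rec in urlinfo:
        for k, v in map_urlinfo(rec):
            groups[k].append(v)
    return [out for k, vs in groups.items() for out in reduce_join(k, vs)]

print(run([("Amy", "cnn.com", "8:00")], [("cnn.com", "News", 0.9)]))
# [('Amy', 'cnn.com', '8:00', 'News', 0.9)]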
Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins
Yahoo! Research
Pigs Eat Anything
Can operate on data without metadata: relational, nested, or unstructured.
Pigs Live Anywhere
Not tied to one particular parallel framework.
Pigs Are Domestic Animals
Designed to be easily controlled and modified by its users. UDFs: transformation functions, aggregates, grouping functions, and conditionals.
Pigs Fly
Processes data quickly(?)
Dataflow language, procedural: different from SQL
Quick start and interoperability
Nested data model
UDFs as first-class citizens
Parallelism required
Debugging environment
Data Model
Atom: 'cs'
Tuple: ('cs', 'ece', 'ee')
Bag: { ('cs', 'ece'), ('cs') }
Map: [ 'courses' → { ('523', '525', '599') } ]
Expressions
Fields by position: $0
Fields by name: f1
Map lookup: #, e.g. $0#'courses'
Find the top 10 most visited pages in each category
Visits:
  User  URL         Time
  Amy   cnn.com     8:00
  Amy   bbc.com     10:00
  Amy   flickr.com  10:05
  Fred  cnn.com     12:00

URL Info:
  URL         Category  PageRank
  cnn.com     News      0.9
  bbc.com     News      0.8
  flickr.com  Photos    0.7
  espn.com    Sports    0.9
Load Visits
Group by url
Foreach url generate count
Load Url Info
Join on url
Group by category
Foreach category generate top10(urls)
visits = load '/data/visits' as (user, url, time);
gVisits = group visits by url;
visitCounts = foreach gVisits generate url, count(visits);

urlInfo = load '/data/urlInfo' as (url, category, pRank);
visitCounts = join visitCounts by url, urlInfo by url;

gCategories = group visitCounts by category;
topUrls = foreach gCategories generate top(visitCounts, 10);

store topUrls into '/data/topUrls';

Operates directly over files
Schemas optional, can be assigned dynamically
UDFs can be used in every construct
LOAD: specifying input data
FOREACH: per-tuple processing
FLATTEN: eliminate nesting
FILTER: discarding unwanted data
COGROUP: getting related data together
GROUP, JOIN
STORE: asking for output
Other: UNION, CROSS, ORDER, DISTINCT
Every group or join operation forms a map-reduce boundary
Other operations are pipelined into the map and reduce phases
(see the sketch below for how the group-and-count step maps onto map and reduce)
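As an illustration (not the Pig compiler's actual code), the first map-reduce boundary in the example — group visits by url followed by the count — corresponds roughly to a map that emits the grouping key and a reduce that aggregates the resulting bag, with the foreach pipelined into the reduce phase. The function names and record layout below are assumptions for the sketch.

from collections import defaultdict

def map_phase(visit):
    user, url, time = visit
    yield url, visit                     # the group key (url) becomes the map output key

def reduce_phase(url, visit_bag):
    yield url, len(visit_bag)            # foreach ... generate count(visits), pipelined into reduce

def run(visits):
    groups = defaultdict(list)           # stand-in for the shuffle
    for v in visits:
        for k, val in map_phase(v):
            groups[k].append(val)
    return [out for k, bag in groups.items() for out in reduce_phase(k, bag)]

print(run([("Amy", "cnn.com", "8:00"), ("Fred", "cnn.com", "12:00")]))
# [('cnn.com', 2)]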
[Figure: the logical plan compiled into a chain of three map-reduce jobs, with a boundary at each group/join:
Map1/Reduce1 — Load Visits; Group by url; Foreach url generate count
Map2/Reduce2 — Load Url Info; Join on url
Map3/Reduce3 — Group by category; Foreach category generate top10(urls)]
Write-run-debug cycle
Sandbox dataset
Objectives: realism, conciseness, completeness
Problems: UDFs
Optional "safe" query optimizer: performs only high-confidence rewrites
User interface: boxes-and-arrows UI
Promote collaboration, sharing of code fragments and UDFs
Tight integration with a scripting language: use loops and conditionals of the host language
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu,
Ulfar Erlingsson, Pradeep Kumar Gunda, Jon Currey
[Figure: Dryad system architecture — a job manager (JM) holds the job schedule and uses the control plane to talk to the name server (NS) and per-node process daemons (PD); the vertices (V) it schedules on cluster machines exchange data over the data plane via files, TCP pipes, or shared-memory FIFOs.]
Data model: partitioned collections of C# objects
Partitioning: Hash, Range, RoundRobin
Language extensions: Apply, Fork; hints

Collection<T> collection;
bool IsLegal(Key k);
string Hash(Key k);
var results = from c in collection
              where IsLegal(c.key)
              select new { Hash(c.key), c.value };
[Figure: DryadLINQ execution flow — on the client machine, a C# LINQ query expression over input tables is handed to DryadLINQ (ToDryadTable / query invocation), which compiles it into a distributed query plan (a Dryad job) plus C# vertex code; the Dryad job manager (JM) runs the job in the data center over the partitioned input tables and writes output tables; the results return to the client as a DryadTable of C# objects that can be consumed with foreach.]
LINQ expressions are converted to an execution plan graph (EPG)
Similar to a database query plan
A DAG annotated with metadata properties
The EPG is the skeleton of the Dryad dataflow graph
As long as native operations are used, properties can propagate, helping optimization
Static optimizations:
Pipelining: multiple operations in a single process
Removing redundancy
Eager aggregation: move aggregations in front of partitionings
I/O reduction: try to use TCP and in-memory FIFOs instead of disk

Dynamic optimizations:
As information from the job becomes available, mutate the execution graph
Dataset-size-based decisions, e.g., intelligent partitioning of data
Aggregation can turn into a tree to improve I/O based on locality: for example, part of the computation is done locally and aggregated before being sent across the network (a sketch of the tree-aggregation idea follows below)
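A minimal sketch (mine, not DryadLINQ code) of that tree-aggregation idea: each machine pre-aggregates its local partition, a per-rack stage combines those partial results, and a final combine produces the answer, so only one value per rack crosses the network. The rack grouping and function names are assumptions.

def local_aggregate(partition):
    return sum(partition)                      # partial aggregate on the machine holding the data

def combine(partials):
    return sum(partials)                       # the same associative combine at every tree level

def tree_aggregate(partitions, racks):
    # partitions: list of lists of numbers; racks: parallel list of rack ids
    locals_ = [local_aggregate(p) for p in partitions]
    per_rack = {}
    for rack, value in zip(racks, locals_):
        per_rack.setdefault(rack, []).append(value)
    rack_totals = [combine(vs) for vs in per_rack.values()]   # one value per rack crosses the network
    return combine(rack_totals)

print(tree_aggregate([[1, 2], [3, 4], [5]], racks=["r1", "r1", "r2"]))  # 15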
TeraSort - scalability
240-computer cluster of 2.6 GHz dual-core AMD Opterons
Sort 10 billion 100-byte records on a 10-byte key
Each computer stores 3.87 GB
DryadLINQ vs Dryad - SkyServer
Dryad is hand optimized
No dynamic optimization overhead
DryadLINQ is 10% native code
High level and data type transparent
Automatic optimization friendly
Manual optimizations using Apply operator
Leverage any system running LINQ framework
Support for interacting with SQL databases
Single computer debugging made easy
Strong typing, narrow interface
Deterministic replay execution
Dynamic optimizations appear data intensive: what kind of overhead?
EPG analysis overhead -> high latency
No real comparison with other systems
Progress tracking is difficult; no speculation
Will solid state drives diminish the advantages of MapReduce?
Why not use parallel databases?
MapReduce vs. Dryad?
How different from Sawzall and Pig?
Language           | Sawzall                 | Pig Latin                    | DryadLINQ
Built by           | Google                  | Yahoo                        | Microsoft
Programming        | Imperative              | Imperative                   | Imperative & declarative hybrid
Resemblance to SQL | Least                   | Moderate                     | Most
Execution engine   | Google MapReduce        | Hadoop                       | Dryad
Performance *      | Very efficient          | 5-10 times slower            | 1.3-2 times slower
Implementation     | Internal, inside Google | Open source, Apache license  | Internal, inside Microsoft
Model              | Operate per record      | Sequence of MR               | DAGs
Usage              | Log analysis            | + Machine learning           | + Iterative computations
Matei Zaharia, Andy Konwinski, Anthony Joseph, Randy Katz, Ion Stoica
University of California at Berkeley
Speculative tasks are executed only if no failed or waiting tasks are available
Notion of progress: 3 phases of execution
1. Copy phase
2. Sort phase
3. Reduce phase
Each phase is weighted by the % of data processed
Determines whether a task has failed or is a straggler and is available for speculation (a sketch of the progress-score computation follows below)
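A rough sketch of Hadoop's default progress score and straggler test as described in the LATE paper (function and variable names are my assumptions): maps are scored by the fraction of input read, reduce tasks give each of the three phases 1/3 of the score, and a task becomes a candidate for speculation when its score falls more than 0.2 below the average for its category.

def map_progress(fraction_input_read):
    return fraction_input_read                      # maps: score = fraction of input data read

def reduce_progress(phases_completed, fraction_of_current_phase):
    # phases_completed in {0, 1, 2}: copy, sort, reduce each contribute 1/3
    return (phases_completed + fraction_of_current_phase) / 3.0

def is_straggler(progress_score, category_scores, threshold=0.2):
    avg = sum(category_scores) / len(category_scores)
    return progress_score < avg - threshold         # fixed 20% gap below the category average

# e.g., a reduce task halfway through its copy phase:
print(reduce_progress(0, 0.5))                      # 0.1666...
print(is_straggler(0.17, [0.9, 0.85, 0.8, 0.17]))   # True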
1. Nodes can perform work at exactly the same rate
2. Tasks progress at a constant rate throughout time
3. There is no cost to launching a speculative task on an idle node
4. The three phases of execution take approximately the same time
5. Tasks with a low progress score are stragglers
6. Maps and Reduces require roughly the same amount of work
Virtualization breaks down homogeneity
Amazon EC2: multiple VMs on the same physical host compete for memory and network bandwidth
Ex: two map tasks can compete for disk bandwidth, causing one to be a straggler
The progress threshold in Hadoop is fixed and assumes low progress = faulty node
Too many speculative tasks executed
Speculative execution can harm running tasks
Task’s phases are not equal
Copy phase typically the most expensive due to network communication cost
Causes many tasks' progress to jump rapidly from 1/3 to 1, creating fake stragglers
Real stragglers get usurped
Unnecessary copying due to fake stragglers
The fixed 20% progress-difference threshold means anything with >80% progress is never speculatively executed
Longest Approximate Time to End (LATE)
Primary assumption: the best task to speculatively execute is the one that will finish furthest into the future
Secondary: tasks make progress at an approximately constant rate
ProgressRate = ProgressScore / T, where T = time the task has been running
Estimated time to completion = (1 - ProgressScore) / ProgressRate
Launch speculative tasks on fast nodes: best chance to overcome the straggler, vs. using the first available node
Cap on the total number of speculative tasks
'Slowness' minimum threshold
Does not take data locality into account
(a sketch of the heuristic follows below)
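A condensed sketch of the LATE heuristic as described above (function names and threshold values are illustrative assumptions, not the paper's code): when a fast node asks for work and the speculative-task cap is not exceeded, estimate each running task's time left from its progress rate and speculate on the not-yet-speculated task with the longest estimated time to end, provided its progress rate is below the slow-task threshold.

SPECULATIVE_CAP = 0.1        # max fraction of slots used for speculation (assumed value)
SLOW_TASK_THRESHOLD = 0.25   # only speculate on tasks in the slowest 25% by progress rate (assumed)

def progress_rate(task):
    return task["progress_score"] / task["runtime"]

def time_left(task):
    return (1.0 - task["progress_score"]) / progress_rate(task)

def choose_speculative_task(running_tasks, node_is_fast, speculative_fraction):
    if not node_is_fast or speculative_fraction >= SPECULATIVE_CAP:
        return None
    candidates = [t for t in running_tasks if not t["already_speculated"]]
    if not candidates:
        return None
    # only tasks whose progress rate falls below the slow-task percentile qualify
    rates = sorted(progress_rate(t) for t in candidates)
    cutoff = rates[int(SLOW_TASK_THRESHOLD * (len(rates) - 1))]
    slow = [t for t in candidates if progress_rate(t) <= cutoff]
    # speculate on the slow task with the longest approximate time to end
    return max(slow, key=time_left) if slow else None

tasks = [
    {"progress_score": 0.9, "runtime": 90, "already_speculated": False},
    {"progress_score": 0.2, "runtime": 100, "already_speculated": False},
]
print(choose_speculative_task(tasks, node_is_fast=True, speculative_fraction=0.0))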
Sort: EC2 test cluster, 1.0-1.2 GHz Opteron/Xeon with 1.7 GB memory
Sort with stragglers: manually slowed down 8 VMs with background processes
Grep and WordCount
1. Make decisions early
2. Use finishing times
3. Nodes are not equal
4. Resources are precious
Is focusing the work on small VMs fair? Would it be better to pay for a large VM and implement a system with more customized control?
Could this be used in other systems? Progress tracking is key
Is this a fundamental contribution, or just an optimization? "Good" research?