sailfish: a framework for large scale data processing sriram rao cisl@microsoft oct. 15, 2012
![Page 1: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/1.jpg)
Sailfish: A Framework For Large Scale Data Processing
Sriram Rao, CISL@Microsoft
Oct. 15, 2012
![Page 2: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/2.jpg)
Joint Work With Colleagues…
• Raghu Ramakrishnan (at Yahoo and now at Microsoft)
• Adam Silberstein (at Yahoo and now at LinkedIn)
• Mike Ovsiannikov and Damian Reeves (at Quantcast)
![Page 3: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/3.jpg)
Motivation
• “Big data” is a booming industry:
  – Collect massive amounts of data (10s of TB/day)
  – Use data-intensive compute frameworks (Hadoop, Cosmos, Map-Reduce) to extract value from the collected data
  – Volume of data processed confers bragging rights
• How do frameworks handle data at scale?
  – Not well studied in the literature
![Page 4: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/4.jpg)
M/R Dataflow
[Figure: M/R dataflow. Labels: DFS input, map tasks M0 and M1, reduce tasks R0, R1, and R2, DFS output; annotations: Seeks, Skew]
![Page 5: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/5.jpg)
Disk Overheads
• Intermediate data transfer is seek-intensive => I/Os are small and random
  – # of disk seeks for transferring intermediate data is proportional to M * R
[Figure: labels: DFS, map tasks M0 and M1, reduce tasks R0, R1, and R2]
![Page 6: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/6.jpg)
Why Is Scale Important?
• Yahoo cluster workload characteristics:
  – Vast majority of jobs (about 95%) are small
  – A minority of jobs (about 5%) are big
    • Involve 1000s of tasks that are run on many machines in the cluster
    • Run for several hours processing TBs of data
    • Size of intermediate data (i.e., map output) is at least as big as the input
![Page 7: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/7.jpg)
Can We Minimize Seeks?
• Problem space: Size of intermediate data exceeds amount of RAM in cluster
• Reducer reads data from disk => one seek per reducer
• Lower bound is proportional to R
![Page 8: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/8.jpg)
Solving A Seek Problem…
• Minimizing disk seeks via “group commit” is well known
• Why isn’t this idea implemented?
• Difficult to implement in the past
  – Datacenter bandwidth is a contended resource
  – Any solution that mentioned “network” was the beginning of a (futile) negotiation
![Page 9: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/9.jpg)
What Is New…
• Network b/w in a datacenter is going up…
  – Lower “over-subscription”
    • 1/5/10-Gbps between any pair
• Can we leverage this trend to do distributed aggregation and improve disk performance?
• Systems that build on this trend are being explored:
  – Flat Datacenter Storage (OSDI’12): Blob store
  – ThemisMR (SOCC’12): M/R at scale
![Page 10: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/10.jpg)
Key Ideas
1. I-files, a network-wide data aggregation mechanism
2. Observe intermediate data in I-files during map phase to plan reduce phase
[Figure: labels: map tasks M0, M1, M2; three RAM buffers; three I-files; reduce tasks R0, R1, R2]
![Page 11: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/11.jpg)
Our Work
• Build I-files by extending a DFS
• Build Sailfish (by modifying Hadoop) in which I-files are used to transport intermediate data
  – Leverage I-files to gather statistics on intermediate data to plan reduce phase:
    • (1) # of reducers depends on data, (2) handle skew
  – Eliminate tuning parameters
    • No more map-side tuning, choosing # of reducers
• Results show 20% to 5x speedup on a representative mix of (large) real jobs/datasets at Yahoo
![Page 12: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/12.jpg)
Talk Outline
• Motivation
• I-files: A data aggregation mechanism
• Sailfish: Map-Reduce using I-files
• Experimental Evaluation
• Summary and On-going work
![Page 13: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/13.jpg)
Using I-files for Intermediate Data
• I-files are a container for data aggregation in general
• Per I-file aggregator:
  – Buffers data from writers in RAM
  – “Group commits” data to disk
• # of disk seeks is proportional to R
[Figure: labels: Aggregator; map tasks M0, M1, M2; three RAM buffers; three I-files; reduce tasks R0, R1, R2]
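The buffering and group-commit behaviour of a per-I-file aggregator can be illustrated with a small in-memory sketch. Everything here is hypothetical and not taken from the Sailfish or KFS code: the `Aggregator` class, the `flush_threshold_bytes` knob, and the `disk` list that stands in for large sequential writes are assumptions made for illustration only.

```python
class Aggregator:
    """Toy per-I-file aggregator: buffer records in RAM, write them in batches."""

    def __init__(self, flush_threshold_bytes):
        # Hypothetical tuning knob: how much data to buffer before committing.
        self.flush_threshold_bytes = flush_threshold_bytes
        self.buffer = []        # records from all writers, waiting in RAM
        self.buffered_bytes = 0
        self.disk = []          # each entry stands for one large sequential write

    def append(self, writer_id, record):
        """Accept a record from any writer; commit once enough data has piled up."""
        self.buffer.append((writer_id, record))
        self.buffered_bytes += len(record)
        if self.buffered_bytes >= self.flush_threshold_bytes:
            self.group_commit()

    def group_commit(self):
        """Write everything buffered so far as a single batch."""
        if not self.buffer:
            return
        self.disk.append(list(self.buffer))
        self.buffer.clear()
        self.buffered_bytes = 0


# Three writers send twelve 16-byte records; the threshold is 64 bytes.
agg = Aggregator(flush_threshold_bytes=64)
for i in range(12):
    agg.append(writer_id=i % 3, record=b"x" * 16)
agg.group_commit()  # final flush; a no-op here because the buffer is already empty

writes = len(agg.disk)
records = sum(len(batch) for batch in agg.disk)
```

In this toy run, twelve small appends from three writers turn into three batched writes, which is the effect group commit is after: fewer, larger disk operations.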
![Page 14: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/14.jpg)
Issues
• Fault-tolerance
• Scale
• Skew: Suppose there is skew in data written to I-files
• Hot-spots:
  – Suppose a partition becomes hot
  – All map tasks generate data for that partition
[Figure: labels: map tasks M0, M1, M2; three RAM buffers; three I-files; reduce tasks R0, R1, R2]
![Page 15: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/15.jpg)
Big data => Scale out!
![Page 16: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/16.jpg)
“Scale out” Aggregation
• Build using distributed aggregation (scale out)
  – Rather than 1 aggregator per I-file, use multiple
• Bind subset of mappers to each aggregator
[Figure: one I-file with two aggregators; labels: map tasks M0 … M10 and M11 … M20]
![Page 17: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/17.jpg)
What Does This Get?
• Fault-tolerance
  – When an aggregator fails, need to re-run the subset of maps that wrote to that aggregator
    • Re-run in parallel…
• Mitigate skew, hot-spots, better scale
• Seeks: Go up (!)
  – Reducer input is now stored at multiple aggregators
    • Read from multiple places
  – Will come back to this issue…
![Page 18: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/18.jpg)
I-files Design -> Implementation
• Extend DFS to support data aggregation
  – Use KFS in our work (KFS ≅ HDFS)
    • Single metaserver (≅ HDFS NameNode)
    • Multiple chunkservers (≅ HDFS DataNodes)
    • Files are striped across nodes in chunks
      – Chunks can be variable in size with a fixed maximal value
      – Currently, max size of a chunk: 128MB
• Adapt KFS to support I-files
  – Leverage multi-writer capabilities of KFS
![Page 19: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/19.jpg)
KFS I-files Characteristics
• I-files provide a record-oriented interface
• Append-only
  – Clients append records to an I-file: record_append(fd, <key, value>)
    • Append on a file translates to append on a chunk
    • Records do not span chunk boundaries
  – Chunkserver is the aggregator
• Supports data retrieval by-key:
  – scan(fd, <key range>)
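A toy in-memory model may help make the record-oriented interface concrete. This is only a sketch under stated assumptions: the `IFile` class is hypothetical, it drops the `fd` argument shown on the slide, it treats the key range as half-open (`lo <= key < hi`), and it is not the KFS client API.

```python
class IFile:
    """In-memory stand-in for an I-file: append-only records grouped into chunks."""

    def __init__(self, max_chunk_bytes):
        self.max_chunk_bytes = max_chunk_bytes
        self.chunks = [[]]       # each chunk is a list of (key, value) records
        self.chunk_bytes = [0]   # bytes used in each chunk

    def record_append(self, key, value):
        """Append one record to the current chunk, opening a new chunk if needed."""
        size = len(key) + len(value)
        if size > self.max_chunk_bytes:
            raise ValueError("record is larger than a chunk")
        # A record never spans chunks: if it does not fit, start a new chunk.
        if self.chunk_bytes[-1] + size > self.max_chunk_bytes:
            self.chunks.append([])
            self.chunk_bytes.append(0)
        self.chunks[-1].append((key, value))
        self.chunk_bytes[-1] += size

    def scan(self, lo, hi):
        """Return records with lo <= key < hi, gathered from every chunk."""
        hits = [(k, v) for chunk in self.chunks for (k, v) in chunk if lo <= k < hi]
        return sorted(hits)


f = IFile(max_chunk_bytes=16)
for key, value in [(b"d", b"4444444"), (b"a", b"1111111"),
                   (b"c", b"3333333"), (b"b", b"2222222")]:
    f.record_append(key, value)   # each record is 8 bytes, so two fit per chunk

result = f.scan(b"a", b"c")
```

The scan has to visit both chunks because records land in arrival order, not key order. That is the reason later slides sort and index each chunk.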
![Page 20: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/20.jpg)
Appending To KFS I-files
[Figure: a map task calls record_append(); labels: KFS metaserver, "Alloc chunk", "Bind to Chunk1", Chunk1, Chunk2, and two groups of map tasks]
• Multiple appenders per chunk
• Multiple chunks appended to concurrently
![Page 21: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/21.jpg)
KFS I-files
• I-file constructed via sequential I/O in a distributed manner
  – Network-wide batching
  – Multiple appenders per chunk
  – Multiple chunks appended to concurrently
• On a per-chunk basis, the chunkserver responsible for that chunk is the aggregator
  – Chunkserver aggregates records and commits to disk
  – Append is atomic: Chunkserver serializes concurrent appends
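One way to picture "append is atomic because the chunkserver serializes concurrent appends" is a lock around the append path. The sketch below is an assumption-laden illustration: the `ChunkAppender` class and the use of a `threading.Lock` are hypothetical and say nothing about how KFS actually serializes appends.

```python
import threading


class ChunkAppender:
    """Toy chunk: many writers may call record_append() at the same time."""

    def __init__(self):
        self._lock = threading.Lock()
        self.records = []

    def record_append(self, record):
        # The lock serializes writers, so each record gets a unique slot
        # and is never interleaved with another record.
        with self._lock:
            offset = len(self.records)
            self.records.append(record)
            return offset


chunk = ChunkAppender()
offsets_by_task = {task_id: [] for task_id in range(4)}


def map_task(task_id, count):
    """Hypothetical map task that appends `count` records to the shared chunk."""
    for i in range(count):
        offsets_by_task[task_id].append(chunk.record_append((task_id, i)))


threads = [threading.Thread(target=map_task, args=(t, 100)) for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

all_offsets = sorted(o for offs in offsets_by_task.values() for o in offs)
```

Each writer gets back the slot its record landed in, and no two writers ever receive the same slot, regardless of how the threads interleave.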
![Page 22: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/22.jpg)
Distributed Aggregation With KFS I-files
Design
• Bind subset of mappers to an aggregator
• Use multiple aggregators per I-file
• Minimize # of aggregators per I-file
Implementation
• Multiple writers/chunk
• Multiple chunks appended to concurrently
• # of chunks per I-file scales based on data
• Chunk allocation is key:
  – Need to pack data into as few chunks as possible
![Page 23: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/23.jpg)
Talk Outline
• Motivation
• I-files: A data aggregation mechanism
• Sailfish: Map-Reduce using I-files
• Experimental Evaluation
• Summary and On-going work
![Page 24: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/24.jpg)
Sailfish: MapReduce Using I-files
• Modify Hadoop-0.20.2 to use I-files
  – Mappers append their output to per-partition I-files using record_append()
    • Map output is appended concurrently with task execution
  – During map phase, gather statistics on intermediate data and plan reduce phase
    • # of reducers, task assignment done at run-time
  – Reducer scans its input from a per-partition I-file
    • Merge records from chunks and reduce()
    • For efficient scan(), sort and index an I-file chunk
![Page 25: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/25.jpg)
Sailfish Dataflow
[Figure: Sailfish dataflow overview. Labels: DFS input, map tasks M0 and M1, reduce tasks R0, R1, R2, and R3, DFS output]
![Page 26: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/26.jpg)
Sailfish Dataflow
[Figure: Sailfish dataflow, map side. Labels: DFS, map tasks M0 to M3, two chunkservers, sorter]
![Page 27: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/27.jpg)
Sailfish Dataflow
[Figure: Sailfish dataflow, reduce side. Labels: two chunkservers, reduce tasks R1 and R2, DFS]
![Page 28: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/28.jpg)
Leveraging I-files
• Gather statistics on intermediate data whenever a chunk is sorted
  – Statistics are gathered during map phase as part of execution
• During sorting, augment each chunk with an index
  – Index supports efficient scans
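A sorted chunk plus an index is what lets a key-range scan skip most of a chunk. The sketch below assumes a sparse index (one entry every `stride` records) and a half-open key range; the slides do not describe the index format, so `sort_and_index`, `scan_chunk`, and `stride` are hypothetical names chosen for illustration.

```python
import bisect


def sort_and_index(chunk, stride):
    """Sort a chunk's records by key and index every `stride`-th record."""
    records = sorted(chunk)
    index = [(records[i][0], i) for i in range(0, len(records), stride)]
    return records, index


def scan_chunk(records, index, lo, hi):
    """Return records with lo <= key < hi, starting from the nearest index entry."""
    index_keys = [key for key, _ in index]
    # Start at the last indexed key strictly below lo, so duplicates of lo are not missed.
    slot = bisect.bisect_left(index_keys, lo) - 1
    start = index[slot][1] if slot >= 0 else 0
    out = []
    for key, value in records[start:]:
        if key >= hi:
            break            # records are sorted, so nothing later can match
        if key >= lo:
            out.append((key, value))
    return out


chunk = [(k, k * 10) for k in [7, 3, 9, 1, 5, 3, 8]]
records, index = sort_and_index(chunk, stride=3)
low_range = scan_chunk(records, index, lo=3, hi=8)
high_range = scan_chunk(records, index, lo=6, hi=10)
```

For the second scan, the index lets the reader start at position 3 of the sorted chunk instead of position 0. A denser index trades more metadata for shorter linear scans.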
![Page 29: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/29.jpg)
Reduce Phase Implementation
• Plan reduce phase based on data:
  – # of reduce tasks per I-file = Size of I-file / Work per task
  – # of tasks scales based on data; handles skew
• On a per I-file basis, partition key space by constructing “split points”
• Each reduce task processes a range of keys within an I-file
  – “Hierarchical partitioning” of data in an I-file
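The planning rule above can be sketched in a few lines. This is a minimal illustration, assuming the statistics gathered during the map phase yield a sample of keys per I-file and that split points are taken as evenly spaced quantiles of that sample. The slides give only the task-count formula, so the rounding up, `plan_reduce_tasks`, `task_for_key`, and the quantile choice are all assumptions.

```python
import bisect
import math


def plan_reduce_tasks(ifile_bytes, work_per_task_bytes, sampled_keys):
    """Choose the number of reduce tasks for one I-file and its split points."""
    if not sampled_keys:
        raise ValueError("need at least one sampled key")
    # Size of I-file / work per task, rounded up so no task exceeds its budget.
    num_tasks = max(1, math.ceil(ifile_bytes / work_per_task_bytes))
    keys = sorted(sampled_keys)
    # num_tasks - 1 split points, taken at evenly spaced positions in the sample.
    split_points = [keys[(i * len(keys)) // num_tasks] for i in range(1, num_tasks)]
    return num_tasks, split_points


def task_for_key(split_points, key):
    """Task i handles keys in [split_points[i-1], split_points[i])."""
    return bisect.bisect_right(split_points, key)


GB = 1024 ** 3
num_tasks, split_points = plan_reduce_tasks(
    ifile_bytes=10 * GB,
    work_per_task_bytes=4 * GB,
    sampled_keys=[40, 0, 80, 20, 60, 10, 30, 50, 70],
)
owners = [task_for_key(split_points, k) for k in (29, 30, 59, 60)]
```

A larger I-file simply yields more tasks and more split points, which is how a skewed partition gets more reducers than a small one. With heavily repeated keys the sample can produce duplicate split points, which a real planner would need to handle.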
![Page 30: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/30.jpg)
Sailfish: Reduce Phase
Objectives
• Avoid specifying # of reducers in a job at job submission time
• Handle skew
• Auto-scale
Implementation
• Gather statistics on I-files to plan reduce phase
  – # of reduce tasks is determined at run-time in a data-dependent manner
• Hierarchical scheme
  – Partition map output into a large number of I-files
  – Assign key-ranges within an I-file to a reduce task
    • Reducers/I-file is data dependent
![Page 31: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/31.jpg)
How many seeks?
• Goal: # of seeks proportional to R
• With Sailfish,
  – A reducer reads input from all chunks of a single I-file
  – Suppose that each I-file has c chunks
  – # of seeks during read is proportional to c * R
    • # of seeks during appends is also proportional to c * R
  – But sorters also cause seeks
    • # of seeks during sorting is proportional to 2 * c * R
• Packing data into as few chunks as possible is critical for I-file effectiveness
![Page 32: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/32.jpg)
Talk Outline
• Motivation
• I-files: A data aggregation mechanism
• Sailfish: Map-Reduce using I-files
• Experimental Evaluation
• Summary and On-going work
![Page 33: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/33.jpg)
Experimental Evaluation
• Cluster comprises ~150 machines (5 racks)
  – 2008-vintage machines
    • 8 cores, 16GB RAM, 4 x 750GB drives/node
  – 1Gbps between any pair of nodes
• Used lzo compression for handling intermediate data (for both Hadoop and Sailfish)
• Evaluations involved:
  – Synthetic benchmark that generates its own data
  – Actual jobs/data run at Yahoo
![Page 34: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/34.jpg)
How did we do? (Synthetic Benchmark)
[Figure: runtime (min.), axis from 0 to 1800, versus intermediate data size (TB) at 1, 2, 4, 8, 16, 32, and 64; series: Sailfish, Hadoop]
![Page 35: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/35.jpg)
How Many Seeks…
Hadoop
• With stock Hadoop, it is proportional to M * R
• # of map tasks generating 64TB data
• M = (64TB / 1GB per map task) = 65536
Sailfish
• With Sailfish, it is proportional to c * R
• 64TB of intermediate data split over 512 I-files
  – Chunks per I-file: ((64TB / 512) / 128MB) = 1024
  – In practice: c varies from 1032 to 1048
• Results show that chunks are packed
  – Chunk allocation policy works well in practice
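The arithmetic on this slide can be checked directly. The sketch assumes binary units (1TB = 1024GB), which is what the slide's result of 65536 implies, and it compares the two seek counts for the same R, something the slides do not state explicitly.

```python
TB = 1024 ** 4
GB = 1024 ** 3
MB = 1024 ** 2

intermediate = 64 * TB

# Stock Hadoop: one map task per 1GB of intermediate data.
map_tasks = intermediate // (1 * GB)                 # M

# Sailfish: data split over 512 I-files, stored in 128MB chunks.
num_ifiles = 512
chunk_size = 128 * MB
chunks_per_ifile = (intermediate // num_ifiles) // chunk_size   # ideal c

# For the same R, M * R versus c * R differ by a factor of M / c.
seek_ratio = map_tasks / chunks_per_ifile

# Observed c ranged from 1032 to 1048; compare the worst case with the ideal.
worst_case_overhead = 1048 / chunks_per_ifile - 1
```

Under these assumptions the per-reducer seek factor drops by 64x, and the observed chunk counts stay within about 2.4% of the ideal packing.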
![Page 36: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/36.jpg)
Sailfish: More data read per seek…
![Page 37: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/37.jpg)
Sailfish: Faster reduce phase…
![Page 38: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/38.jpg)
Sailfish In Practice
• Use actual jobs+datasets that are used in production

| Job Name | Characteristics | Input size (TB) | Int. data size (TB) |
|---|---|---|---|
| LogCount | Data reduction | 1.1 | 0.04 |
| LogProc | Skew in map output | 1.1 | 1.1 |
| LogRead | Skew in reduce input | 1.1 | 1.1 |
| Nday Model | Incr. computation | 3.54 | 3.54 |
| Behavior Model | Big data | 3.6 | 9.47 |
| Click Attribution | Big data | 6.8 | 8.2 |
| Segment Exploder | Data explosion | 14.1 | 25.2 |
![Page 39: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/39.jpg)
How Did We Do…
[Figure: job runtime (min.), axis from 0 to 900, for LogCount, LogProc, LogRead, Nday-Model, BehaviorModel, ClickAttr, and SegmentExploder; series: Sailfish, Hadoop]
![Page 40: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/40.jpg)
Handling Skew In Reducer Input
LogRead job
![Page 41: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/41.jpg)
Fault-tolerance
• Sailfish handles (temporary) loss of intermediate data via recomputes
  – Bookkeeping tracks the map tasks that wrote to the lost block, and those tasks are re-run
• “Scale out” mitigates the impact of data loss
  – 15% increase in run-time for a run with failure
• Described in detail in the paper…
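The bookkeeping idea can be sketched as a map from each block to the set of map tasks that appended to it. This is a hypothetical illustration, not the mechanism from the paper: `ChunkWriterLog`, its method names, and the block identifiers are invented for the example.

```python
from collections import defaultdict


class ChunkWriterLog:
    """Remember which map tasks appended to which block of intermediate data."""

    def __init__(self):
        self.writers = defaultdict(set)   # block id -> set of map task ids

    def note_append(self, block_id, map_task_id):
        self.writers[block_id].add(map_task_id)

    def tasks_to_rerun(self, lost_block_id):
        """Only the tasks that wrote to the lost block need to be recomputed."""
        return sorted(self.writers.get(lost_block_id, set()))


log = ChunkWriterLog()
# Scale-out binding: tasks 0..9 append to block "c1", tasks 10..19 to block "c2".
for task_id in range(20):
    log.note_append("c1" if task_id < 10 else "c2", task_id)

rerun = log.tasks_to_rerun("c2")
untouched = log.tasks_to_rerun("missing-block")
```

Because each block receives data from only a subset of the map tasks, losing one block forces half of the maps to re-run in this toy setup rather than all of them, and those re-runs can proceed in parallel.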
![Page 42: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/42.jpg)
Related Work
• ThemisMR (SOCC’12) addresses same problem as Sailfish
  – Does not (yet) support fault-tolerance: design space is small clusters where failures are rare
  – Design requires reducer input to fit in RAM
• [Starfish] Parameter tuning (for Hadoop)
  – Construct a job profile and use that to tune Hadoop parameters
  – Gains are limited by Hadoop’s intermediate data handling mechanisms
• Lot of work in DB literature on handling skew
  – Run job on a sample of input and collect statistics to construct partition boundaries
  – Use statistics to drive actual run
![Page 43: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/43.jpg)
Summary
• Explore idea of network-wide aggregation to improve disk subsystem performance
• Develop I-files as a data aggregation construct
  – Implement I-files in KFS (a distributed filesystem)
• Use I-files to build Sailfish, an M/R infrastructure
  – Sailfish improves job completion times: 20% to 5x
![Page 44: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/44.jpg)
On-going Work
• Extending Sailfish to support elasticity/preemption (Amoeba, SOCC’12)
• Working on integrating many of the core ideas in Sailfish into Hadoop 2.x (aka YARN)
  – Work started at Yahoo! Labs
  – Being continued in CISL@Microsoft
    • http://issues.apache.org/jira/browse/MAPREDUCE-4584
    • http://issues.apache.org/jira/browse/YARN-45
![Page 45: Sailfish: A Framework For Large Scale Data Processing Sriram Rao CISL@Microsoft Oct. 15, 2012](https://reader034.vdocument.in/reader034/viewer/2022051618/56649d0c5503460f949e0ebe/html5/thumbnails/45.jpg)
Software Available
• Sailfish released as open source project
  – http://code.google.com/p/sailfish