streaming distributed data processing with silk #deim2014
DESCRIPTION
A framework written in Scala for describing distributed data processing programs.
TRANSCRIPT
Streaming Distributed Data Processing with Silk
Taro L. Saito, University of Tokyo
March 3rd, 2014 DEIM2014
xerial.org/silk Twitter @taroleo
Distributed Data Processing
Translate this data processing program into a cluster computing program
[Figure: the dataflow A → f → B → g → C and its cluster version, where A is split into A0, A1, A2, f (map) produces B0, B1, B2, and g (reduce) combines them into C]
Streaming Distributed Data Processing
What is streaming?
Silk: A framework for building and running complex workflows of distributed data processing
[Figure: workflow graph over datasets A to G, connected by functions such as f and g]
Problem Definition
How do we run the distributed data processing while extending the program?
[Figure: workflow graph over datasets A to G, connected by functions such as f and g]
Silk
Describing dataflows in Scala
A dataflow in Silk is a sequence of function calls
Type-safe and concise syntax, easy to learn
Silk[A]: a set of objects of type A
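To make "a dataflow is a sequence of function calls" concrete, here is a minimal sketch. It uses a plain Scala `Seq` as a local stand-in for `Silk[A]`; the `Person` type and the sample data are hypothetical and are not part of Silk.

```scala
object DataflowSketch {
  // Hypothetical record type, used only for this illustration.
  final case class Person(name: String, age: Int)

  // A local Seq stands in for Silk[Person]; Silk would distribute this data.
  val people: Seq[Person] =
    Seq(Person("alice", 31), Person("bob", 17), Person("carol", 45))

  // The dataflow is a chain of typed function calls.
  val adultNames: Seq[String] = people.filter(_.age >= 18).map(_.name)
}
```

Because every step is typed, passing a function of the wrong shape to `map` or `filter` is rejected at compile time.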
Object-Oriented Dataflow Programming
Reusing and overriding dataflow programs
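One way to picture reusing and overriding a dataflow is to package it as a trait and override a single step. The sketch below does this with ordinary Scala collections; the trait and object names are made up for the example and do not come from Silk.

```scala
object OverrideSketch {
  // A dataflow packaged as a trait; Seq stands in for Silk.
  trait WordLengths {
    def input: Seq[String]
    def normalize(s: String): String = s.trim
    def lengths: Seq[Int] = input.map(normalize).map(_.length)
  }

  // Reuses the dataflow unchanged.
  object Base extends WordLengths {
    val input: Seq[String] = Seq(" a b ", "cd")
  }

  // Reuses the same dataflow but overrides one step.
  object NoSpaces extends WordLengths {
    val input: Seq[String] = Seq(" a b ", "cd")
    override def normalize(s: String): String = s.replace(" ", "")
  }
}
```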
Big Data Volumes in Human Genome Analysis
Input: FASTQ file(s), 500GB (50x coverage, 200 million entries), from a DNA sequencer (Illumina, PacBio, etc.)
f: an alignment program
Output: alignment results, 750GB (sequence + alignment data)
Total storage space required: 1.2TB
Computational time required: 1 day (using hundreds of CPUs)
[Figure: Input → f → Output]
University of Tokyo Genome Browser (UTGB)
Varieties of Scientific Data and Analysis
WormTSS: http://wormtss.utgenome.org/
Integrating various data sources, hundreds of data analysis…
Produced Thousands of Data Analysis Charts
Using R, JFreeChart, etc.
Need an automated pipeline to redo the entire analysis in order to answer the paper reviews within a month.
Writing A Dataflow
Apply function f to the input A, then produce the output B
This step may take more than an hour in big data analysis
[Figure: A → f → B]
Program v1:
val B = A.map(f)
Distribution and Fault Tolerance
Resume only B2 = A2.map(f)
[Figure: Program v1 runs f on partitions A0, A1, A2 to produce B0, B1, B2; the task that produces B2 fails and is retried]
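The idea of resuming only the failed partition can be sketched with local collections. This is a simplified illustration and not Silk's recovery mechanism: partitions are plain `Seq`s, failures are captured with `scala.util.Try`, and the names `runAll` and `retryFailed` are invented for the example.

```scala
import scala.util.{Failure, Success, Try}

object RetrySketch {
  // Run f on every partition and record success or failure per partition.
  def runAll[A, B](parts: Vector[Seq[A]], f: A => B): Vector[Try[Seq[B]]] =
    parts.map(p => Try(p.map(f)))

  // Recompute only the failed partitions; successful results are kept as they are.
  def retryFailed[A, B](
      parts: Vector[Seq[A]],
      results: Vector[Try[Seq[B]]],
      f: A => B
  ): Vector[Try[Seq[B]]] =
    parts.zip(results).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }
}
```

With three partitions where only the third fails, `retryFailed` re-evaluates `f` on that partition alone, which corresponds to resuming only B2 = A2.map(f).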
Extending Dataflows
While program v1 is running, we add more code (program v2)
How do we reuse the already computed result (B) to generate C?
[Figure: Program v1 computes A → f → B; Program v2 adds B → g → C]
Marking a Program
Storing intermediate results using variable names
Variable names := program markers!
But variable names are lost after compilation
Extracting the AST and variable names at compile time
Using Scala macros (available since Scala 2.10)
[Figure: Program v1 computes A → f → B; Program v2 adds B → g → C]
val B = A.map(f)
val C = B.map(g)
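The role of variable names as markers can be illustrated without macros. In the sketch below the name is passed by hand as a string, which is an assumption of this example: Silk would extract it at compile time. The `named` helper, the in-memory store, and the `evaluations` counter are all hypothetical.

```scala
import scala.collection.mutable

object MarkerSketch {
  // Results keyed by variable name. The name is passed explicitly here;
  // a macro would capture it from the `val` definition instead.
  private val store = mutable.Map.empty[String, Any]
  var evaluations = 0

  def named[T](name: String)(compute: => T): T =
    store.get(name) match {
      case Some(v) => v.asInstanceOf[T] // reuse the stored result
      case None =>
        evaluations += 1
        val v = compute
        store(name) = v
        v
    }

  val A: Seq[Int] = Seq(1, 2, 3)
  // Program v1
  def programV1(): Seq[Int] = named("B")(A.map(_ + 1))
  // Program v2, added later, reuses B through its name.
  def programV2(): Seq[Int] = named("C")(programV1().map(_ * 10))
}
```

Running program v1 and then program v2 evaluates B only once; the second program finds it under the marker "B".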
Scala Program (AST) to DAG Schedule (Logical Plan)
Translating a program (AST) into a set of Silk operations (DAG)
val B = MapOp(input:A, output:B, function:f)
val C = MapOp(input:B, output:C, function:g)
Operations in Silk can be nested
val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
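The nested-operation notation above can be modeled with a few case classes. This is a sketch that mirrors the slide's notation; the class shapes (`Source`, `MapOp` with a string function label) are assumptions and not Silk's actual classes.

```scala
object DagSketch {
  // Minimal operation nodes mirroring the slide's notation.
  sealed trait Op { def name: String }
  final case class Source(name: String) extends Op
  final case class MapOp(input: Op, name: String, function: String) extends Op

  // val B = A.map(f); val C = B.map(g), expressed as nested operations.
  val A: Op = Source("A")
  val B: Op = MapOp(A, "B", "f")
  val C: Op = MapOp(B, "C", "g")

  // Walk the DAG from the source to the requested node.
  def lineage(op: Op): List[String] = op match {
    case Source(n)       => List(n)
    case MapOp(in, n, _) => lineage(in) :+ n
  }
}
```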
[Figure: Program v1 computes A → f → B; Program v2 adds B → g → C]
Weaving Silks
Data analysis code is independent of weavers
[Figure: Silk[A] (operation DAG) is woven by an in-memory weaver, a cluster weaver, or a Hadoop weaver to produce the result]
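The separation between the operation DAG and the weaver that evaluates it can be sketched as an interpreter behind a trait. The types below are hypothetical and restricted to `Int` data to keep the example short; only an in-memory weaver is shown.

```scala
object WeaverSketch {
  // A tiny operation DAG over Int data (illustrative types only).
  sealed trait Op
  final case class Source(data: Seq[Int]) extends Op
  final case class MapOp(input: Op, f: Int => Int) extends Op
  final case class FilterOp(input: Op, p: Int => Boolean) extends Op

  // A weaver decides how a DAG is evaluated; the DAG itself does not change.
  trait Weaver {
    def weave(op: Op): Seq[Int]
  }

  // Evaluates the DAG directly on local collections.
  object InMemoryWeaver extends Weaver {
    def weave(op: Op): Seq[Int] = op match {
      case Source(data)    => data
      case MapOp(in, f)    => weave(in).map(f)
      case FilterOp(in, p) => weave(in).filter(p)
    }
  }
}
```

A cluster or Hadoop weaver would implement the same `weave` signature with a different execution strategy, leaving the analysis code untouched.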
Cluster Weaver: Logical Plan to Physical Plan on Cluster
Logical plan: GroupByOp(in:people, out:g, key: {_.dept.id})
Physical plan:
[Figure: Silk[people] is scattered into inputs I1 to I3; each input is partitioned by hashing into P1 to P3; the partitions pass through serialization, shuffle, deserialization, and merge sort (stages labeled S1 to S3) to produce R1 to R3]
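The partition-by-hashing and shuffle steps of such a physical plan can be sketched locally. This is an illustration under simplifying assumptions: splits are in-memory `Seq`s, serialization and sorting are omitted, and the `Person` and `Dept` types and the function names are invented for the example.

```scala
object GroupBySketch {
  final case class Dept(id: Int)
  final case class Person(name: String, dept: Dept)

  // Assign each record of one input split to a partition by hashing its key.
  def partition[K](split: Seq[Person], numPartitions: Int)(
      key: Person => K
  ): Map[Int, Seq[Person]] =
    split.groupBy(p => Math.floorMod(key(p).##, numPartitions))

  // "Shuffle": collect partition i from every split, then group by key
  // inside each partition. Equal keys always land in the same partition.
  def groupByKey[K](splits: Seq[Seq[Person]], numPartitions: Int)(
      key: Person => K
  ): Map[K, Seq[Person]] = {
    val partitioned = splits.map(s => partition[K](s, numPartitions)(key))
    (0 until numPartitions).flatMap { i =>
      val merged = partitioned.flatMap(_.getOrElse(i, Seq.empty))
      merged.groupBy(key)
    }.toMap
  }
}
```

Calling `groupByKey(splits, 3)(_.dept.id)` corresponds to the logical plan with key `{_.dept.id}`.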
[Figure: Silk architecture, spanning the local machine and the cluster]
User program: builds workflows
• read file, toSilk
• map, reduce, join, groupBy
• UNIX commands
• etc.
Silk[A] → static optimization → DAG schedule → weave
• Register ClassBox (local ClassBox: classpaths & local jar files)
• Submit schedule
Silk Master: dispatches to Silk Clients
• Dispatches tasks to clients
• Manages master resource table
• Authorizes resource allocation
• Automatic recovery by leader election in ZK
ZooKeeper: ensemble mode (at least 3 ZK instances)
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources
Silk Client: task scheduler, resource monitor, task executor, data server
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data
Tables shown in the figure: resource table (CPU, memory), node table, slice table, task status, ClassBox table
Weaving Silk materializes objects
SilkSingle[A] → weave → A (single object)
SilkSeq[A] → weave → Seq[A] (sequence of objects)
Static Optimization
Tree transformation
map(f).map(g) => map(g ∘ f) (function composition)
map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data)
Pushing down selection
Retrieves only the accessed fields in an object
Analyzing the byte code of functions with ASM
Rewriting logical plans using pattern matching in Scala
Easy to add optimization rules
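The map-fusion rule map(f).map(g) => map(g ∘ f) lends itself to a short pattern-matching sketch. The plan types below are hypothetical and limited to `Int` data; the rewrite fuses every chain of adjacent maps into one.

```scala
object OptimizerSketch {
  sealed trait Op
  final case class Source(data: Seq[Int]) extends Op
  final case class MapOp(input: Op, f: Int => Int) extends Op

  // map(f).map(g) => map(g compose f), applied bottom-up so that
  // a chain of adjacent maps collapses into a single map.
  def fuse(op: Op): Op = op match {
    case MapOp(in, g) =>
      fuse(in) match {
        case MapOp(inner, f) => MapOp(inner, g compose f)
        case other           => MapOp(other, g)
      }
    case src: Source => src
  }

  // Evaluate a plan on local collections.
  def eval(op: Op): Seq[Int] = op match {
    case Source(d)    => d
    case MapOp(in, f) => eval(in).map(f)
  }

  // Number of map operations between the plan root and its source.
  def depth(op: Op): Int = op match {
    case Source(_)    => 0
    case MapOp(in, _) => 1 + depth(in)
  }
}
```

Adding another rule means adding another `case`, which is what makes rule sets of this style easy to extend.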
Run-time Optimization
Adjusting the number of data splits according to the available cluster resources
Multi-core execution
Omega-based task scheduler
Sharing the cluster resource table between nodes
Each node determines how to use the resources
Monitoring actual CPU/memory resources periodically
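A shared resource table where each node decides for itself is commonly implemented with optimistic updates: read a snapshot, compute the change, and commit only if the table is unchanged. The sketch below shows that pattern in a single JVM with an `AtomicReference`; the table layout and the `claim` function are assumptions, and Silk's scheduler may work differently.

```scala
import java.util.concurrent.atomic.AtomicReference
import scala.annotation.tailrec

object OmegaSketch {
  // Shared cluster state: free CPU cores per node (hypothetical layout).
  type ResourceTable = Map[String, Int]
  val table = new AtomicReference[ResourceTable](Map("node1" -> 4, "node2" -> 2))

  // Optimistic claim: take a snapshot, compute the update, and commit it only
  // if nobody changed the table in the meantime; otherwise retry on fresh state.
  @tailrec
  def claim(node: String, cores: Int): Boolean = {
    val snapshot = table.get()
    val free = snapshot.getOrElse(node, 0)
    if (free < cores) false
    else if (table.compareAndSet(snapshot, snapshot.updated(node, free - cores))) true
    else claim(node, cores)
  }
}
```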
UNIX Command Workflows in Silk
c"(UNIX command)"
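A `c"..."` syntax of this kind can be built with a custom Scala string interpolator. The sketch below is an assumption about the general mechanism, not Silk's implementation: here `c` only assembles a command string, and the `run` helper executes it locally with `scala.sys.process`, whereas Silk would turn the command into a dataflow operation.

```scala
object CommandSketch {
  // Hypothetical `c` interpolator: builds a command line from a template.
  implicit class CommandContext(private val sc: StringContext) extends AnyVal {
    def c(args: Any*): String = sc.s(args: _*)
  }

  import scala.sys.process._

  // Runs the command locally and returns its standard output as lines.
  def run(command: String): Seq[String] =
    command.!!.split("\n").toSeq
}
```

After `import CommandSketch._`, an expression such as `c"wc -l $file"` substitutes the Scala value `file` into the command line.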
Buffer Management
Silk frequently uses distributed memory (like Spark)
LArray[A]
Allocates off-heap memory (outside the JVM heap) using sun.misc.Unsafe
GitHub: https://github.com/xerial/larray
Immediate memory deallocation (free), to eliminate OutOfMemoryError and GC stalls
Fast memory allocation: skips zero-filling
Object serialization
Extending msgpack
Scala Pickling: injects serialization/deserialization code
Off-heap objects
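To show what storing data outside the JVM heap looks like, the sketch below uses the standard `java.nio.ByteBuffer.allocateDirect` as a stand-in. It is not LArray: a direct buffer is zero-filled on allocation and is released by the garbage collector, while LArray is described above as skipping zero-filling and supporting an immediate free. The `OffHeapIntArray` class is hypothetical.

```scala
import java.nio.ByteBuffer

object OffHeapSketch {
  // Fixed-size Int array backed by a direct (off-heap) buffer.
  final class OffHeapIntArray(val size: Int) {
    private val buf: ByteBuffer = ByteBuffer.allocateDirect(size * 4)

    // Absolute reads and writes; each Int occupies 4 bytes.
    def apply(i: Int): Int = buf.getInt(i * 4)
    def update(i: Int, v: Int): Unit = { buf.putInt(i * 4, v); () }
  }
}
```

The elements live outside the heap, so the garbage collector does not scan or move them; only the small wrapper object is heap-managed.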
Summary
Silk
A framework for distributed data processing for all data scientists, including non-experts in distributed data processing (e.g. biologists)
Object-oriented data processing programming: reuse, override, and mix-in
Optimizing dataflow programs: similar to query optimization in DBMSs
Analyze data as you write programs!
Database research now enters program optimization.
Future work
Workflow queries: making queries against dataflow programs
Monitoring intermediate results
Multi-user program execution
http://xerial.org/silk