Streaming Distributed Data Processing with Silk

Taro L. Saito, University of Tokyo

[email protected]

March 3rd, 2014 DEIM2014

xerial.org/silk Twitter @taroleo


Distributed Data Processing

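Translate this data processing program into a cluster computing program

[Figure: the program applies f to A to produce B, then g to B to produce C. On a cluster, A is split into A0, A1, A2; f maps each split to B0, B1, B2 (map), and g combines them into C (reduce).]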



Streaming Distributed Data Processing

What is streaming?

Silk: A framework for building and running complex workflows of distributed data processing

[Figure: a dataflow graph over data sets A to G, connected by functions such as f and g.]



Problem Definition

How do we run the distributed data processing while extending the program?

[Figure: a dataflow graph over data sets A to G, connected by functions such as f and g.]



Silk

Describing dataflows in Scala
  A dataflow in Silk is a sequence of function calls

Type-safe and concise syntax, easy to learn
  Silk[A]: a set of objects of type A
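
A minimal sketch of "a dataflow is a sequence of function calls". Plain Scala collections stand in for Silk[A] here, and the Read record type and its values are hypothetical; this is not the Silk API.

```scala
// Plain Scala collections stand in for Silk[A]; this is not the Silk API.
object DataflowSketch {
  // Hypothetical record type, for illustration only
  case class Read(id: String, length: Int)

  val reads: Seq[Read] = Seq(Read("r1", 100), Read("r2", 36), Read("r3", 150))

  // Each step is an ordinary, type-checked function call on the previous result
  val longReads: Seq[Read] = reads.filter(_.length >= 50)
  val ids: Seq[String]     = longReads.map(_.id)
}
```

Because every intermediate value has a static type, a mismatch between steps (for example, mapping a String function over Read values) is rejected at compile time.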




Object-Oriented Dataflow Programming

Reusing and overriding dataflow programs
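
One way to picture reuse and overriding, assuming a dataflow is packaged as a Scala trait whose steps are ordinary members. The trait and object names below are hypothetical, and plain collections again stand in for Silk data sets.

```scala
object OverrideSketch {
  // A dataflow packaged as a trait: each step is an overridable member
  trait SequenceLengths {
    def input: Seq[String]
    def normalize(s: String): String = s.trim
    def lengths: Seq[Int] = input.map(normalize).map(_.length)
  }

  object Default extends SequenceLengths {
    val input = Seq(" ACGT ", "AC")
  }

  // Reuses the whole dataflow and overrides a single step
  object WithoutA extends SequenceLengths {
    val input = Seq(" ACGT ", "AC")
    override def normalize(s: String): String = super.normalize(s).replace("A", "")
  }
}
```

Only normalize changes in WithoutA; the rest of the dataflow is inherited unchanged.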



Big Data Volumes in Human Genome Analysis

Input: FASTQ file(s), 500GB (50x coverage, 200 million entries)
  Produced by a DNA sequencer (Illumina, PacBio, etc.)

f: an alignment program

Output: alignment results, 750GB (sequence + alignment data)

Total storage space required: 1.2TB
Computational time required: 1 day (using hundreds of CPUs)

[Figure: Input → f → Output]

University of Tokyo Genome Browser (UTGB)



Varieties of Scientific Data and Analysis

WormTSS: http://wormtss.utgenome.org/
  Integrating various data sources, hundreds of data analysis…


Produced Thousands of Data Analysis Charts

Using R, JFreeChart, etc.

Need an automated pipeline that redoes the entire analysis, so that the paper review can be answered within a month.



Writing A Dataflow

Apply function f to the input A, then produce the output B
  This step may take more than an hour in big data analysis

[Figure: program v1 applies f to A to produce B]

val B = A.map(f)





Distribution and Fault Tolerance

Resume only B2 = A2.map(f)

[Figure: program v1 maps f over the splits A0, A1, A2 to produce B0, B1, B2. The task producing B2 fails and is retried.]
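
The idea of resuming only the failed split can be sketched as follows. This is an illustration with local collections and scala.util.Try, not Silk's recovery mechanism; the flaky function f and the helper names runAll and resume are hypothetical.

```scala
import scala.util.{Try, Success, Failure}

object RetrySketch {
  // Apply f to each split independently, keeping a per-split outcome
  def runAll[A, B](parts: Vector[Seq[A]], f: A => B): Vector[Try[Seq[B]]] =
    parts.map(p => Try(p.map(f)))

  // Re-run only the splits whose previous attempt failed
  def resume[A, B](parts: Vector[Seq[A]], prev: Vector[Try[Seq[B]]], f: A => B): Vector[Try[Seq[B]]] =
    parts.zip(prev).map {
      case (_, ok @ Success(_)) => ok
      case (p, Failure(_))      => Try(p.map(f))
    }

  // Hypothetical flaky function: fails the first time it sees the value 5
  var failedOnce = false
  var calls = 0
  val f: Int => Int = { x =>
    calls += 1
    if (x == 5 && !failedOnce) { failedOnce = true; throw new RuntimeException("node failure") }
    x * 10
  }

  val A = Vector(Seq(1, 2), Seq(3, 4), Seq(5, 6)) // A0, A1, A2
  val first  = runAll(A, f)        // B0 and B1 succeed, B2 fails
  val second = resume(A, first, f) // recomputes B2 only
}
```

B0 and B1 are carried over from the first attempt, so f runs again only on the elements of A2.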


Extending Dataflows

While program v1 is running, we add more code (program v2)
  How do we reuse the already computed result (B) to generate C?

[Figure: program v1 applies f to A to produce B; program v2 applies g to B to produce C]



Marking a Program

Storing intermediate results using variable names
  variable names := program markers!!

But we lose variable names after compilation

Extracting the AST and variable names at compile time
  Using Scala Macros (since Scala 2.10)

val B = A.map(f)
val C = B.map(g)

[Figure: A → f → B (program v1), B → g → C (program v2)]
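
A rough sketch of why variable names are useful as markers: results stored under a name can be looked up when an extended program runs. The names are passed here as explicit strings; the slide describes obtaining them with Scala macros, which this sketch does not attempt to reproduce. The store and the marked helper are hypothetical.

```scala
import scala.collection.mutable

object MarkerSketch {
  // Intermediate results stored under the name of the variable that holds them
  val store = mutable.Map.empty[String, Seq[Int]]
  var evaluations = 0

  // Compute `body` only if no result is stored under `name` yet
  def marked(name: String)(body: => Seq[Int]): Seq[Int] =
    store.getOrElseUpdate(name, { evaluations += 1; body })

  val A = Seq(1, 2, 3)
  val f: Int => Int = _ + 1
  val g: Int => Int = _ * 2

  // Program v1
  def programV1(): Seq[Int] = marked("B")(A.map(f))
  // Program v2 extends v1; "B" is looked up rather than recomputed
  def programV2(): Seq[Int] = marked("C")(programV1().map(g))
}
```

Running programV1() and then programV2() evaluates each marked step once: B in v1, and only C in v2.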



Scala Program (AST) to DAG Schedule (Logical Plan)

Translating a program (AST) into a set of Silk operations (DAG)
  val B = MapOp(input:A, output:B, function:f)
  val C = MapOp(input:B, output:C, function:g)

Operations in Silk can be nested
  val C = MapOp(input:MapOp(input:A, output:B, function:f), output:C, function:g)
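
The nested-operation idea can be modeled with a small algebraic data type. The case classes below are simplified and hypothetical (functions are represented by their names only); they mirror the MapOp notation on the slide rather than Silk's actual classes.

```scala
object PlanSketch {
  // Hypothetical operation nodes, simplified from the MapOp notation above
  sealed trait Op { def output: String }
  case class Source(output: String) extends Op
  case class MapOp(input: Op, output: String, function: String) extends Op

  // val B = A.map(f); val C = B.map(g)
  val A = Source("A")
  val B = MapOp(A, "B", "f")
  val C = MapOp(B, "C", "g") // nested: the input of C is the whole operation B

  // Walk from the sink back to the source to list the steps in execution order
  def schedule(op: Op): List[String] = op match {
    case Source(name)      => List(s"read $name")
    case MapOp(in, out, f) => schedule(in) :+ s"$out = ${in.output}.map($f)"
  }
}
```

Because C contains B as its input, a scheduler that receives only C can still recover every upstream step.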

[Figure: A → f → B (program v1), B → g → C (program v2)]




Weaving Silks

Data analysis code is independent of weavers

[Figure: a Silk[A] (operation DAG) is woven by an in-memory weaver, a cluster weaver, or a Hadoop weaver to produce the result]
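
A sketch of how the same operation DAG might be handed to different weavers. The Weaver trait, the Int-only operations, and ChunkedWeaver (a local stand-in for a distributed weaver) are all assumptions made for illustration, not Silk's interfaces.

```scala
object WeaverSketch {
  // Hypothetical, simplified operation DAG over Int data
  sealed trait Op
  case class Data(values: Seq[Int]) extends Op
  case class MapOp(input: Op, f: Int => Int) extends Op
  case class FilterOp(input: Op, p: Int => Boolean) extends Op

  // A weaver turns an operation DAG into concrete results
  trait Weaver { def weave(op: Op): Seq[Int] }

  // Evaluates the whole DAG in the local JVM
  object InMemoryWeaver extends Weaver {
    def weave(op: Op): Seq[Int] = op match {
      case Data(vs)        => vs
      case MapOp(in, f)    => weave(in).map(f)
      case FilterOp(in, p) => weave(in).filter(p)
    }
  }

  // Stand-in for a distributed weaver: processes the data in chunks, then concatenates
  class ChunkedWeaver(splits: Int) extends Weaver {
    private def chunks(xs: Seq[Int]): Seq[Seq[Int]] =
      xs.grouped(math.max(1, (xs.size + splits - 1) / splits)).toSeq
    def weave(op: Op): Seq[Int] = op match {
      case Data(vs)        => vs
      case MapOp(in, f)    => chunks(weave(in)).flatMap(_.map(f))
      case FilterOp(in, p) => chunks(weave(in)).flatMap(_.filter(p))
    }
  }

  // The analysis code: written once, with no reference to any weaver
  val plan: Op = FilterOp(MapOp(Data(Seq(1, 2, 3, 4)), _ * 3), _ % 2 == 0)
}
```

The plan value never mentions a weaver, so switching execution back ends does not touch the analysis code.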



Cluster Weaver: Logical Plan to Physical Plan on Cluster

Logical plan GroupByOp(in:people, out:g, key: {_.dept.id})

Physical plan

[Figure: physical plan for the groupBy. Silk[people] is scattered into inputs I1, I2, I3. Each input is partitioned by hashing into P1, P2, P3 and serialized into S1, S2, S3. After the shuffle, each of R1, R2, R3 deserializes and merge-sorts the pieces of its own partition.]
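
The partition and shuffle steps of the physical plan can be imitated locally. This sketch covers only hash partitioning and regrouping; serialization and the merge sort are omitted, and the Person type and helper names are hypothetical.

```scala
object ShuffleSketch {
  // Hypothetical record type; the slide groups people by dept.id
  case class Person(name: String, deptId: Int)

  // Partition: each input split routes its records to one of n partitions by hashing the key
  def partition[K, V](split: Seq[V], n: Int)(key: V => K): Map[Int, Seq[V]] =
    split.groupBy(v => Math.floorMod(key(v).hashCode, n))

  // Shuffle: partition p gathers its records from every split, then groups them by key
  def groupByKey[K, V](splits: Seq[Seq[V]], n: Int)(key: V => K): Seq[Map[K, Seq[V]]] = {
    val partitioned = splits.map(s => partition(s, n)(key))
    (0 until n).map { p =>
      partitioned.flatMap(_.getOrElse(p, Seq.empty)).groupBy(key)
    }
  }

  val people = Seq(
    Seq(Person("a", 1), Person("b", 2)),
    Seq(Person("c", 1), Person("d", 3)))

  val grouped = groupByKey(people, 3)(_.deptId)
}
```

Records with the same key always hash to the same partition, so each key ends up complete in exactly one of the outputs.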

[Figure: Silk cluster architecture. A user program on the local machine builds a Silk[A], which is statically optimized into a DAG schedule and woven on the cluster.]

User program builds workflows
• read file, toSilk
• map, reduce, join, groupBy
• UNIX commands
• etc.

Static optimization, DAG schedule
• Register ClassBox
• Submit schedule

Silk Master (dispatch)
• Dispatches tasks to clients
• Manages master resource table (CPU, memory)
• Authorizes resource allocation
• Automatic recovery by leader election in ZK

Silk Client (Task Scheduler, Resource Monitor, Task Executor, Data Server)
• Submits tasks
• Run-time optimization
• Resource allocation
• Monitoring resource usage
• Launches Web UI
• Manages assigned task status
• Object serialization/deserialization
• Serves slice data

ZooKeeper, ensemble mode (at least 3 ZK instances)
Node Table, Slice Table, Task Status, ClassBox Table
• Leader election
• Collects locations of slices and ClassBox jars
• Watches active nodes
• Watches available resources

ClassBox: local classpaths & local jar files

Weaving Silk materializes objects
• Silk[A] is either SilkSeq[A] or SilkSingle[A]
• weave turns SilkSingle[A] into A (single object)
• weave turns SilkSeq[A] into Seq[A] (sequence of objects)




Static Optimization

Tree transformation
  map(f).map(g) => map(g ∘ f) (function composition)
  map(f).filter(p) => mapWithFilter(f, p) (reduces intermediate data)

Pushing-down selection
  Retrieves only accessed fields in an object
  Analyzing the byte code of functions with ASM

Rewriting logical plans using pattern matching in Scala
  Easy to add optimization rules
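
The map fusion rule can be written as a pattern match over a plan tree. The sketch below uses a hypothetical two-node plan type over Int data and implements only the map(f).map(g) rule; it is not Silk's optimizer.

```scala
object FusionSketch {
  // Hypothetical logical plan nodes over Int data
  sealed trait Op
  case class Data(values: Seq[Int]) extends Op
  case class MapOp(input: Op, f: Int => Int) extends Op

  // Rewrite rule: map(f).map(g) => map(g compose f), applied bottom-up
  def optimize(op: Op): Op = op match {
    case MapOp(in, g) => optimize(in) match {
      case MapOp(in2, f) => MapOp(in2, g compose f)
      case other         => MapOp(other, g)
    }
    case d: Data => d
  }

  // Reference evaluator, used to check that the rewrite preserves results
  def eval(op: Op): Seq[Int] = op match {
    case Data(vs)     => vs
    case MapOp(in, f) => eval(in).map(f)
  }

  // Number of map operations left in the plan
  def depth(op: Op): Int = op match {
    case Data(_)      => 0
    case MapOp(in, _) => 1 + depth(in)
  }

  val plan: Op = MapOp(MapOp(Data(Seq(1, 2, 3)), _ + 1), _ * 2)
}
```

A new optimization rule would be one more case in optimize, which is the sense in which pattern matching makes rules easy to add.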




Run-time Optimization

Adjusting the number of data splits
  According to the available cluster resources
  Multi-core execution

Omega-based task scheduler
  Sharing the cluster resource table between nodes
  Each node determines how to use the resources
  Monitoring actual CPU/memory resources periodically
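
As an illustration of choosing the number of splits from available resources, here is a minimal sketch. The policy (one split per available CPU, bounded by a minimum split size) and all parameter names are assumptions; the slides do not specify Silk's actual rule.

```scala
object SplitSketch {
  // Hypothetical policy: at most one split per available CPU,
  // and never fewer than minRecordsPerSplit records in a split
  def numSplits(records: Int, availableCpus: Int, minRecordsPerSplit: Int): Int =
    math.max(1, math.min(availableCpus, records / minRecordsPerSplit))

  // Cut the data into n roughly equal contiguous splits
  def split[A](data: Seq[A], n: Int): Seq[Seq[A]] = {
    val size = math.max(1, (data.size + n - 1) / n)
    data.grouped(size).toSeq
  }
}
```

With 8 free CPUs, a large input is cut into 8 splits, while a small input stays in a single split to avoid scheduling overhead.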



UNIX Command Workflows in Silk

c"(UNIX Command)"
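
The c"..." syntax has the shape of a Scala custom string interpolator. The sketch below shows how such an interpolator can be defined in general; the Command class, the helper name, and the file name are hypothetical, and the command is only built, not executed. It is not Silk's implementation.

```scala
object CommandSketch {
  // Hypothetical command node; Silk's own representation is not shown on the slide
  case class Command(line: String)

  // Adds a c"..." interpolator to string literals
  implicit class CommandHelper(sc: StringContext) {
    // Builds a Command, splicing the interpolated arguments into the text
    def c(args: Any*): Command = Command(sc.s(args: _*))
  }

  val input = "sample.fastq" // hypothetical file name
  val cmd: Command = c"wc -l $input"
}
```

Since the interpolator returns a value rather than running anything, a command can be placed in a dataflow like any other operation and executed later.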



Buffer Management

Silk frequently uses distributed memory (like Spark)

LArray[A]
  Allocates off-heap memory (outside the JVM heap) using sun.misc.Unsafe
  GitHub: https://github.com/xerial/larray
  Immediate memory deallocation (free)
    To eliminate OutOfMemoryError and GC stalls
  Fast memory allocation
    Skips zero-filling

Object serialization
  Extending msgpack
  Scala Pickling
    Injects ser/deser code
  Off-heap objects
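
To make "off-heap" concrete, this sketch stores values outside the JVM heap using the standard java.nio.ByteBuffer.allocateDirect. It is a stand-in, not the LArray API: unlike what the slide describes for LArray, a direct ByteBuffer is zero-filled on allocation and offers no explicit free.

```scala
import java.nio.ByteBuffer

object OffHeapSketch {
  val n = 4
  val bytesPerLong = 8

  // A direct buffer keeps its contents outside the JVM heap
  val buf: ByteBuffer = ByteBuffer.allocateDirect(n * bytesPerLong)

  // Write and read 64-bit values at explicit byte offsets
  (0 until n).foreach(i => buf.putLong(i * bytesPerLong, i.toLong * i))
  val values: Seq[Long] = (0 until n).map(i => buf.getLong(i * bytesPerLong))
}
```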





Summary

Silk
  A framework for distributed data processing for all data scientists,
  including non-experts in distributed data processing (e.g., biologists)

Object-oriented data processing programming
  Reuse, override and mix-in

Optimizing dataflow programs
  Similar to query optimization in DBMS

Analyze Data as You Write Programs!
  Database research now enters program optimization.

Future work
  Workflow queries
    Making queries against dataflow programs
    Monitoring intermediate results
  Multi-user program execution


http://xerial.org/silk

