What is distributed computing, and why we use Apache Spark

BigData, newborn technologies evolving fast.

Why Apache Spark outruns Apache Hadoop

Andy Petrella, Nextlab
Xavier Tordoir, SilicoCloud

Who are we?

Andy

@Noootsab, I am
@NextLab_be owner
@SparkNotebook creator
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool

Xavier

@xtordoir
SilicoCloud
-> Physics
-> Data analysis
-> genomics
-> scalable systems
-> ...

https://github.com/andypetrella/spark-notebook/

So what...

Part I
● What
  ○ distributed resources
  ○ data
  ○ managers
● Why:
  ○ fastest
  ○ smartest
  ○ biggest
● How:
  ○ Map Reduce
  ○ Limitations
  ○ Extensions

Part II
● Spark
  ○ Model
  ○ Caching and lineage
  ○ Master and Workers
  ○ Core example
● Beyond Processing
  ○ Streaming
  ○ SQL
  ○ GraphX
  ○ MLlib
  ○ Example
● Use cases
  ○ Parallel batch processing of timeseries
  ○ ADAM

Part I: The Distributed Age

What is a distributed environment

Computations need three kinds of resources:
● CPU
● MEM
● Data storage

However, it’s hard to extend each of them at will on a single machine.

Lacking any one of these results in higher response time or reduced accuracy.
Unfortunately, on a single machine it doesn’t matter how parallelized the algorithm is or how optimized the computations are.

If the solution can’t be inside, it must be outside.

Distributed File System

You have 100 nodes in your cluster, but only 1 dataset.
Will you replicate it on all nodes?

Extended case: your dataset is 1 Zettabyte (10⁹ TB)?

The only solution:
● split the file across nodes
● orient the algorithm to access local data subsets

HDFS towards Tachyon

Hadoop Distributed File System
Implements GoogleFS
Stores and reads files split and replicated across nodes
1 ZB file = 8E12 x 128 MB files

IOPs are expensive and require more CPU clocks than DRAM access
Hence... Tachyon: memory-centric distributed file system
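The block count quoted for the 1 ZB file can be checked with a few lines of arithmetic. This is a minimal sketch, assuming decimal units (1 ZB = 10^21 bytes, 1 MB = 10^6 bytes) and a replication factor of 3; the names `block_count` and `REPLICATION` are illustrative, not taken from the slides.

```python
ZETTABYTE = 10**21           # bytes, decimal definition (assumption)
BLOCK_SIZE = 128 * 10**6     # 128 MB blocks, decimal megabytes (assumption)
REPLICATION = 3              # assumed replication factor


def block_count(file_size, block_size=BLOCK_SIZE):
    """Number of blocks needed to store file_size bytes (ceiling division)."""
    return -(-file_size // block_size)


blocks = block_count(ZETTABYTE)
stored_blocks = blocks * REPLICATION

print(f"{blocks:.1e} blocks, {stored_blocks:.1e} block replicas")
```

With these assumptions the count comes out just under 8E12, which matches the order of magnitude on the slide.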

Management

Nodes will fail, jobs cannot.
We need resilience

Resources are generally fewer than required by the algorithm.
We need scheduling

The requirements are fluctuating.
We need elasticity

Mesos and Marathon

Mesos: highly available cluster manager
Nodes: attach or remove them on the fly
Nodes offer resources -- applications accept them
Node crash: the application restarts the assigned tasks

Marathon: meta application on Mesos
Application crash: automatically restarted on a different node
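The offer/accept exchange between nodes and applications can be pictured with a toy scheduler. This is only a sketch of the idea, not the Mesos API: the `Offer` and `Task` classes, the greedy first-fit rule and the node names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Offer:
    node: str
    cpus: int
    mem_gb: int


@dataclass
class Task:
    name: str
    cpus: int
    mem_gb: int


def accept_offers(offers, tasks):
    """Greedy first fit: place each task on the first node with enough room left."""
    remaining = {o.node: [o.cpus, o.mem_gb] for o in offers}
    placement = {}
    for task in tasks:
        for node, (cpus, mem) in remaining.items():
            if cpus >= task.cpus and mem >= task.mem_gb:
                remaining[node] = [cpus - task.cpus, mem - task.mem_gb]
                placement[task.name] = node
                break
    return placement


offers = [Offer("node-a", cpus=4, mem_gb=15), Offer("node-b", cpus=2, mem_gb=8)]
tasks = [Task("t1", 2, 8), Task("t2", 2, 8), Task("t3", 2, 8)]
placement = accept_offers(offers, tasks)
pending = [t.name for t in tasks if t.name not in placement]
```

Tasks left in `pending` simply wait for the next round of offers, which is where scheduling and elasticity meet.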

Why: for everybody and now?

Fastest:
1. Time to result
2. Near real time processing

Runtime is smaller, dev lifecycle is shorter
→ no synchronization hell

It can even be really interactive → console or notebook tools.

No bottlenecks → newly arriving data is readily available for processing

Opens the doors for online models!

Smartest: train more and more models; ensembling lots of them is no longer a problem

More complex modelling can be tackled if required

Reaching a higher level of accuracy is tricky and might require lots and lots of models.

Running a model takes quite some time, especially if the data has to be read every single time.

Example: the Netflix contest winner (AT&T Labs) ensembled 500 models to gain 10% accuracy.
Although it wasn’t possible to use it in production in 2009, today this could change.

Biggest: no need for sampling big datasets

……

That’s it!

How!?

Google papers stimulated the open software community, hence competitive tools now exist.

In the area of computation in distributed environments, there are two disruptive papers:
● Google’s MapReduce
● Berkeley’s Spark

MapReduce (Google white paper, 2004):

Programming model for distributed, data-intensive computations

Helps deal with parallelization, fault tolerance, data distribution, load balancing

Functions:

Map ≅ transform data to key value pairs

Reduce ≅ aggregate key value pairs per key (e.g. sum, max, count)

Mappers and Reducers are sent to the data location (nodes)
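The two functions can be sketched on a single machine with a word count. This is a minimal illustration of the programming model only, not Hadoop code; `map_phase`, `shuffle` and `reduce_phase` are hypothetical names, and the explicit shuffle step stands in for the grouping a real framework does between the two phases.

```python
from collections import defaultdict


def map_phase(line):
    """Map: transform one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as the framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: aggregate the values of one key (here: sum)."""
    return key, sum(values)


lines = [
    "spark outruns hadoop",
    "hadoop implements mapreduce",
    "spark caches data",
]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())
```

In a cluster, each call to `map_phase` would run on the node that holds the corresponding lines.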

Map

Reduce: apply a binary associative operator on all elements

Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables
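Associativity is what lets a reduce run in pieces: each chunk of data can be reduced where it lives, and only the partial results need to be combined. A small sketch, assuming the data is already split into non-empty chunks; `reduce_chunks` is a hypothetical helper name.

```python
from functools import reduce
import operator


def reduce_chunks(op, chunks):
    """Reduce each chunk locally, then reduce the partial results."""
    partials = [reduce(op, chunk) for chunk in chunks if chunk]
    return reduce(op, partials)


chunks = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
flat = [x for chunk in chunks for x in chunk]

total = reduce_chunks(operator.add, chunks)
maximum = reduce_chunks(max, chunks)
```

For an associative operator the chunked result equals the result of reducing the flat list, whatever the chunk boundaries are.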

Hadoop implementation has some limitations

Mappers and Reducers ship functions to the data, while Java is not a functional language

⇒ Composability is difficult and more IO/network operations are required

Iterative algorithms (e.g. stochastic gradient descent) have to read the data at each step (while the data has not changed, only the parameters)

MapReduce on steroids

I) Functional paradigm:
- process built lazily based on simple concepts
- Map and Reduce are two of them

II) Cache data in memory. No more IO on repeated passes.
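Both ideas fit in a toy class: transformations are only recorded until a result is requested, and a cached dataset is read from its source once. This is a sketch under stated assumptions, not Spark's API; `LazyDataset`, `load_from_disk` and `fit_mean` are hypothetical names, and the "disk read" is simulated by a counter.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset."""

    def __init__(self, load, steps=()):
        self._load = load            # zero-argument function that reads the source
        self._steps = tuple(steps)   # recorded transformations, not yet executed
        self._cache_enabled = False
        self._materialized = None

    def map(self, fn):
        # Record the step and return a new dataset; nothing is computed here.
        return LazyDataset(self._load, self._steps + (fn,))

    def cache(self):
        self._cache_enabled = True
        return self

    def collect(self):
        if self._materialized is not None:
            return self._materialized
        data = self._load()
        for fn in self._steps:
            data = [fn(x) for x in data]
        if self._cache_enabled:
            self._materialized = data
        return data


reads = {"count": 0}


def load_from_disk():
    reads["count"] += 1              # count how often the source is read
    return [1.0, 2.0, 3.0, 4.0]


def fit_mean(dataset, steps=5, rate=0.5):
    """Gradient descent on squared error: only the parameter w changes."""
    w = 0.0
    for _ in range(steps):
        data = dataset.collect()
        gradient = sum(w - x for x in data) / len(data)
        w -= rate * gradient
    return w


doubled = LazyDataset(load_from_disk).map(lambda x: 2 * x)
reads_after_definition = reads["count"]   # map() was lazy, nothing read yet

w_uncached = fit_mean(doubled)
reads_uncached = reads["count"]           # one read per iteration

reads["count"] = 0
w_cached = fit_mean(LazyDataset(load_from_disk).map(lambda x: 2 * x).cache())
reads_cached = reads["count"]             # a single read, then memory
```

The iterative fit gives the same parameter either way; only the number of source reads differs.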


Part II: Spark to the Rescue

RDDs

Think of an RDD[T] as an immutable, distributed collection of objects of type T

• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)
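The three properties can be mimicked with a toy class that keeps its data in partitions and remembers which function produced it from its parent, so a lost partition can be recomputed. This is an illustrative sketch, not Spark's implementation: `ToyRDD` and its methods are hypothetical, it evaluates eagerly for brevity, and `lose_partition` only simulates a failure.

```python
class ToyRDD:
    """Partitioned collection that remembers how it was derived (its lineage)."""

    def __init__(self, partitions, parent=None, fn=None):
        self.partitions = partitions   # list of lists; None marks a lost partition
        self.parent = parent           # dataset this one was derived from
        self.fn = fn                   # element-wise function applied to the parent

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Round-robin split of the input across partitions.
        return cls([list(data[i::num_partitions]) for i in range(num_partitions)])

    def map(self, fn):
        # Returns a new dataset; the parent is left untouched.
        mapped = [[fn(x) for x in part] for part in self.partitions]
        return ToyRDD(mapped, parent=self, fn=fn)

    def lose_partition(self, index):
        self.partitions[index] = None  # simulate an executor failure

    def get_partition(self, index):
        if self.partitions[index] is None:
            if self.parent is None:
                raise RuntimeError("source partition lost, no lineage to replay")
            # Rebuild only the missing partition from the parent's partition.
            source = self.parent.get_partition(index)
            self.partitions[index] = [self.fn(x) for x in source]
        return self.partitions[index]

    def collect(self):
        return [x for i in range(len(self.partitions)) for x in self.get_partition(i)]


numbers = ToyRDD.parallelize(list(range(1, 9)), num_partitions=4)
squares = numbers.map(lambda x: x * x)

before = squares.collect()
squares.lose_partition(2)
after = squares.collect()              # partition 2 is recomputed on demand
```

Only the lost partition is rebuilt; the other three are reused as they are.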

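RDD[T]

Data distribution hierarchy:
- RDD[T]
- Executors
- Partitions
- Elements

[Figure: the elements of an RDD[T] grouped into partitions and spread over executors]
Partitions: [ x1, x2 ] [ x10 ] [ x8,x5,x6 ] [ x11 ] [ x14,x13 ] [ x9,x16 ] [ x3 ] [ x7,x12 ] [ x15 ] [ x17,x4 ]
Executors: Executor 1, Executor 2, Executor 3, Executor 4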

Execution

Execution is split into fundamental units: Tasks

Tasks running in parallel are grouped in Stages
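A stage can be pictured as one task per partition, all run in parallel, with the next stage starting once every task has finished. The sketch below uses a local thread pool as a stand-in for cluster cores; `run_stage` is a hypothetical helper, and the barrier between stages is simply the end of the `with` block.

```python
from concurrent.futures import ThreadPoolExecutor


def run_stage(task, partitions, cores=4):
    """Run one task per partition in parallel and wait for all of them."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(task, partitions))


def scale(partition):
    # Stage 0 task: process every element of one partition.
    return [x * 10 for x in partition]


partitions = [[1, 2], [3, 4, 5], [6], [7, 8]]

stage0 = run_stage(scale, partitions)   # one scaled partition per task
stage1 = run_stage(sum, stage0)         # one partial sum per partition
result = sum(stage1)                    # final combination on the driver side
```

In Spark itself, stage boundaries fall where data has to be shuffled between partitions; here they are chosen by hand.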

[Figure: tasks (read/process/write) running on cores, e.g. Task0 on Core1, grouped into Stage0, Stage1 and Stage2]

Master and Workers

Spark Streaming

When you have big fat streams behaving as one single collection

[Figure: along the time axis t, a DStream[T] drawn as a sequence RDD[T] RDD[T] RDD[T] RDD[T] RDD[T]]

DStreams: Discretized Streams (= Sequence of RDDs)
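Discretizing a stream means cutting it into small batches by time interval, each of which can then be processed like an ordinary collection. A minimal sketch, assuming events arrive as (timestamp, value) pairs sorted by timestamp; `discretize` and the sample events are hypothetical.

```python
from itertools import groupby


def discretize(events, batch_interval):
    """Group time-ordered (timestamp, value) events into micro-batches."""
    keyed = groupby(events, key=lambda event: int(event[0] // batch_interval))
    return [(index, [value for _, value in group]) for index, group in keyed]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = discretize(events, batch_interval=1.0)

# Each micro-batch is processed with ordinary collection operations.
sizes = [(index, len(values)) for index, values in batches]
```

In this sketch an interval without events produces no batch at all, which keeps the code short.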

Spark SQL

Mapping: RDD -> “table”, Element Field -> “column”
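The mapping can be illustrated without a cluster: a collection of records becomes a table and each record field becomes a column. The sketch uses Python's built-in `sqlite3` as a stand-in for a SQL engine, so it shows the idea rather than Spark SQL itself; the `Sample` type and the table name are hypothetical, and the four records are the ones printed in the genomics example of this deck.

```python
import sqlite3
from collections import namedtuple

# One record type: its fields become the table's columns.
Sample = namedtuple("Sample", ["sample_id", "population", "cluster"])

records = [
    Sample("NA20332", "ASW", 0),
    Sample("NA20334", "ASW", 2),
    Sample("HG00120", "GBR", 2),
    Sample("NA18560", "CHB", 1),
]

conn = sqlite3.connect(":memory:")
columns = ", ".join(Sample._fields)
conn.execute(f"CREATE TABLE samples ({columns})")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", records)

rows = conn.execute(
    "SELECT population, COUNT(*) FROM samples "
    "GROUP BY population ORDER BY population"
).fetchall()
conn.close()
```

Once the collection is visible as a table, aggregation is expressed declaratively instead of as hand-written map and reduce steps.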

MLLib: Distributed ML

Classification
● linear SVM, logistic regression, classification trees, naive Bayes models

Regression
● SVM, regression trees, linear regression (regularized)

Clustering & dimensionality reduction
● singular value decomposition, PCA, k-means clustering

“The library to teach them all”

GraphX

Connecting the dots

Graph processing at scale.
> Take edges
> Link nodes
> Combine/Send messages
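The three steps can be sketched with connected components computed by message passing: every node sends its current label along its edges, incoming messages are combined with `min`, and the loop stops when no label changes. This is a single-machine illustration of the pattern, not GraphX code; `connected_components` is a hypothetical name.

```python
def connected_components(edges):
    """Label each node with the smallest node id reachable from it."""
    nodes = {node for edge in edges for node in edge}   # link nodes
    label = {node: node for node in nodes}

    changed = True
    while changed:
        changed = False
        inbox = {}
        # Send: each edge carries the label of either endpoint to the other.
        for a, b in edges:
            for src, dst in ((a, b), (b, a)):
                message = label[src]
                # Combine: keep only the smallest message per destination.
                inbox[dst] = min(inbox.get(dst, message), message)
        # Apply the combined messages after all of them were sent.
        for node, message in inbox.items():
            if message < label[node]:
                label[node] = message
                changed = True
    return label


edges = [(1, 2), (2, 3), (4, 5)]
components = connected_components(edges)
```

Nodes that end up with the same label belong to the same component.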

Use case examples

- Parallel batch processing of time series
- Bayesian Network in financial market
- IoT platform (Lambda architecture)
- OpenStreetMap cities topologies classification
- Markov Chain in Land Use/Land Cover prediction
- Genomics: ADAM

Genomics

Biological systems are very complex
One human sequence is 60Gb

ADAM

Credits: AMPLab (UC Berkeley)

Stratification using 1000Genomes

http://www.1000genomes.org/

ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg


Machine Learning model

Clustering: KMeans

ref: http://en.wikipedia.org/wiki/K-means_clustering
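For readers who have not met the algorithm, k-means alternates two steps: assign every point to its closest center, then move each center to the mean of its points. The sketch below is a plain single-machine version with fixed starting centers and a fixed number of iterations, both assumptions made to keep it deterministic; it is not the MLlib implementation, and the sample points are made up.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point, centers):
    """Index of the center nearest to point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def mean(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, centers, iterations=10):
    centers = list(centers)
    for _ in range(iterations):
        # Assignment step: group points by their closest center.
        clusters = [[] for _ in centers]
        for point in points:
            clusters[closest(point, centers)].append(point)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        centers = [mean(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, [closest(point, centers) for point in points]


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(points, centers=[points[0], points[3]])
```

The assignment step is independent per point, which is why the algorithm distributes well over partitions.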

Machine Learning model

MLLib, KMeans

MLLib:
● Machine Learning Algorithms
● Data structures (e.g. Vector)

Mashup

prediction

Sample [NA20332] is in cluster #0 for population Some( ASW)

Sample [NA20334] is in cluster # 2 for population Some( ASW)

Sample [HG00120] is in cluster # 2 for population Some( GBR)

Sample [NA18560] is in cluster # 1 for population Some( CHB)

Mashup

       #0   #1   #2
GBR     0    0   89
ASW    54    0    7
CHB     0   97    0
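A table of this shape is a cross-tabulation of population against predicted cluster. A small sketch of how such counts can be built, using only the four samples printed in the prediction output as input; `crosstab` is a hypothetical helper, so the numbers below are for these four samples, not the full table.

```python
from collections import Counter

# (sample id, population, predicted cluster), from the printed predictions
assignments = [
    ("NA20332", "ASW", 0),
    ("NA20334", "ASW", 2),
    ("HG00120", "GBR", 2),
    ("NA18560", "CHB", 1),
]


def crosstab(assignments, n_clusters):
    """Count samples per (population, cluster) and return one row per population."""
    counts = Counter((population, cluster) for _, population, cluster in assignments)
    populations = sorted({population for _, population, _ in assignments})
    return {
        population: [counts[(population, c)] for c in range(n_clusters)]
        for population in populations
    }


table = crosstab(assignments, n_clusters=3)

for population, row in table.items():
    print(population, *row)
```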

Cluster

40 m3.xlarge
160 cores + 600G

Genomic data and distributed computing

Eggo project (public genomics data in ADAM format on S3)

We…
1000genomes in ADAM format on S3.
Open Source GA4GH Interop services implementation
Machine learning on 1000genomes

http://med-at-scale.s3.amazonaws.com/index.html

http://ga4gh.org/

http://www.slideshare.net/noootsab/lightning-fast-genomics-with-spark-adam-and-scala

The end (of the slides)

Thanks for your attention!

Xavier [email protected]

Andy [email protected]
