What is Distributed Computing, Why we use Apache Spark
![Page 1: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/1.jpg)
BigData, newborn technologies evolving fast. Why Apache Spark outruns Apache Hadoop
Andy Petrella, NextLab
Xavier Tordoir, SilicoCloud
![Page 2: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/2.jpg)
Who are we?
Andy
@Noootsab, I am
@NextLab_be owner
@SparkNotebook creator
@Wajug co-driver
@Devoxx4Kids organizer
Maths & CS
Data lover: geo, open, massive
Fool
Xavier
@xtordoir
SilicoCloud
-> Physics -> Data analysis -> genomics -> scalable systems -> ...
![Page 3: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/3.jpg)
So what...
Part I
● What
○ distributed resources
○ data
○ managers
● Why:
○ fastest
○ smartest
○ biggest
● How:
○ Map Reduce
○ Limitations
○ Extensions
Part II
● Spark
○ Model
○ Caching and lineage
○ Master and Workers
○ Core example
● Beyond Processing
○ Streaming
○ SQL
○ GraphX
○ MLlib
○ Example
● Use cases
○ Parallel batch processing of timeseries
○ ADAM
![Page 4: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/4.jpg)
Part I: The Distributed Age
![Page 5: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/5.jpg)
What is a distributed environment
Computations need three kinds of resources:
● CPU
● MEM
● Data storage
However, it’s hard to extend each of them at will on a single machine.
![Page 6: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/6.jpg)
What is a distributed environment
Lacking any one of these results in higher response times or reduced accuracy.
Unfortunately, it doesn’t matter how well parallelized the algorithm is or how optimized the computations are.
If the solution can’t be inside, it must be outside.
![Page 7: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/7.jpg)
What is a distributed environment
![Page 8: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/8.jpg)
Distributed File System
You have 100 nodes in your cluster, but only 1 dataset.
Will you replicate it on all nodes?
Extended case: your dataset is 1 Zettabyte (10⁹ TB)?
The only solution:
● split the file across the nodes
● gear the algorithm towards accessing local data subsets
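The split-and-process-locally idea can be sketched in plain Python. This is only an illustration under stated assumptions: the "nodes" are ordinary lists in one process, and the names `split_across_nodes` and `local_sum` are hypothetical helpers, not part of any distributed file system.

```python
from typing import List


def split_across_nodes(records: List[int], n_nodes: int) -> List[List[int]]:
    """Deal the records round-robin into n_nodes subsets (one subset per node)."""
    return [records[i::n_nodes] for i in range(n_nodes)]


def local_sum(subset: List[int]) -> int:
    """Work done on a single node: it only touches its own local subset."""
    return sum(subset)


records = list(range(1, 101))                 # toy dataset: the integers 1..100
subsets = split_across_nodes(records, n_nodes=4)

# On a real cluster each call would run on a different node, in parallel.
partials = [local_sum(subset) for subset in subsets]

# Only the small partial results travel over the network to be combined.
total = sum(partials)
print(total)
```

The dataset itself never moves: each node reads its own subset, and only one number per node is sent back.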
![Page 9: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/9.jpg)
HDFS towards Tachyon
Hadoop Distributed File System
Implements the Google File System (GoogleFS) design
Stores and reads files split and replicated across nodes
1 ZB file ≈ 8E12 x 128 MB blocks
Disk I/O operations are expensive and cost far more CPU cycles than DRAM access
Hence... Tachyon: memory-centric distributed file system
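The block count quoted on the slide can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 ZB = 10²¹ bytes, 1 MB = 10⁶ bytes) and a replication factor of 3. Both the block size and the replication factor are configurable in HDFS, so treat the constants as assumptions.

```python
ZETTABYTE = 10**21           # bytes, decimal definition (assumption)
BLOCK_SIZE = 128 * 10**6     # 128 MB per block, decimal (assumption)
REPLICATION = 3              # assumed replication factor

n_blocks = ZETTABYTE // BLOCK_SIZE        # distinct blocks for the file
n_replicas = n_blocks * REPLICATION       # block copies actually stored

print(f"{n_blocks:.2e} blocks, {n_replicas:.2e} stored replicas")
```

With these assumptions the file needs about 7.8 × 10¹² blocks, which the slide rounds to 8E12.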
![Page 10: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/10.jpg)
Management
Nodes will fail, jobs cannot. We need resilience
Resources are generally fewer than required by the algorithm. We need scheduling
The requirements are fluctuating. We need elasticity
![Page 11: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/11.jpg)
Mesos and Marathon
Mesos: highly available cluster manager
● Nodes: attach or remove them on the fly
● Nodes offer resources -- applications accept them
● Node crash: the application restarts the assigned tasks
Marathon: meta application on Mesos
● Application crash: automatically restarted on a different node
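The offer/accept cycle and the restart after a node crash can be mimicked with a toy scheduler. This is not the Mesos or Marathon API: `Offer`, `Task` and `schedule` are hypothetical names, and the first-fit placement rule is an assumption chosen for brevity.

```python
from typing import Dict, List, NamedTuple


class Offer(NamedTuple):
    node: str
    cpus: int
    mem_gb: int


class Task(NamedTuple):
    name: str
    cpus: int
    mem_gb: int


def schedule(tasks: List[Task], offers: List[Offer]) -> Dict[str, str]:
    """First-fit: each task accepts the first offer that still has room for it."""
    free = {o.node: [o.cpus, o.mem_gb] for o in offers}
    placement: Dict[str, str] = {}
    for task in tasks:
        for node, capacity in free.items():
            if capacity[0] >= task.cpus and capacity[1] >= task.mem_gb:
                capacity[0] -= task.cpus      # consume the accepted resources
                capacity[1] -= task.mem_gb
                placement[task.name] = node
                break
    return placement


tasks = [Task("t1", 2, 4), Task("t2", 2, 4), Task("t3", 1, 2)]
placement = schedule(tasks, [Offer("node-a", 4, 8), Offer("node-b", 2, 4)])

# node-b crashes: the application re-submits only the tasks it had placed there,
# using a fresh offer from a node that was attached on the fly.
crashed = "node-b"
lost = [t for t in tasks if placement.get(t.name) == crashed]
placement.update(schedule(lost, [Offer("node-c", 2, 4)]))
print(placement)
```

Tasks on surviving nodes are left alone. Only the work assigned to the crashed node is rescheduled.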
![Page 12: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/12.jpg)
Why: for everybody and now?
Fastest:
1. Time to result
2. Near real time processing
![Page 13: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/13.jpg)
Why for everybody and now
Runtime is smaller, dev lifecycle is shorter → no synchronization-hell
It can even be really interactive → console or notebook tools.
![Page 14: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/14.jpg)
Why for everybody and now
No bottlenecks → newly arriving data is readily available for processing
Opens the doors for online models!
![Page 15: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/15.jpg)
Why for everybody and now
Smartest: train more and more models; ensembling lots of them is no longer a problem
More complex modelling can be tackled if required
![Page 16: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/16.jpg)
Why for everybody and now
Reaching a higher level of accuracy is tricky and might require lots and lots of models.
Running a model takes quite some time, especially if the data has to be read every single time.
Example: the Netflix contest winner (AT&T Labs) ensembled 500 models to gain 10% accuracy.
Although in 2009 it wasn’t possible to use it in production, today this could change.
![Page 17: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/17.jpg)
Why for everybody and now
Biggest: no need for sampling big datasets
……
That’s it!
![Page 18: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/18.jpg)
How!?
Google papers stimulated the open source software community, hence competitive tools now exist.
In the area of computation in distributed environments, there are two disruptive papers:
● Google’s MapReduce
● Berkeley’s Spark
![Page 19: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/19.jpg)
How!?
MapReduce (Google white paper, 2004):
Programming model for distributed data-intensive computations
Helps deal with parallelization, fault tolerance, data distribution, load balancing
![Page 20: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/20.jpg)
How!?
Functions:
Map ≅ transform data to key-value pairs
Reduce ≅ aggregate key-value pairs per key (e.g. sum, max, count)
Mappers and Reducers are sent to the data location (nodes)
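The two functions can be illustrated with the classic word count, written here as a single-process sketch. The helper names `map_phase`, `shuffle` and `reduce_phase` are illustrative, not Hadoop's API. The shuffle step (grouping values by key between the two phases) is made explicit, although a real framework performs it for you.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, Iterable, List, Tuple


def map_phase(line: str) -> List[Tuple[str, int]]:
    """Map: turn one input record into (key, value) pairs."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group all values that share a key (done by the framework in practice)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups: Dict[str, List[int]]) -> Dict[str, int]:
    """Reduce: aggregate the values of each key, here with a sum."""
    return {key: reduce(lambda a, b: a + b, values) for key, values in groups.items()}


lines = ["spark outruns hadoop", "spark caches data", "hadoop reads data"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))
print(counts)
```

Swapping the sum for `max` or a counter changes the aggregation without touching the map side.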
![Page 21: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/21.jpg)
How!?
Map
Reduce: apply a binary associative operator on all elements
Image from RxJava: https://github.com/ReactiveX/RxJava/wiki/Transforming-Observables
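Why the operator has to be associative can be seen with a small experiment: when the data is reduced chunk by chunk and the partial results are reduced again, only an associative operator is guaranteed to give the same answer as a single sequential pass. The partition layout below is an arbitrary assumption.

```python
from functools import reduce
import operator

partitions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]       # arbitrary split of 1..9
flat = [x for part in partitions for x in part]


def chunked_reduce(op, parts):
    """Reduce each partition locally, then reduce the partial results."""
    partials = [reduce(op, part) for part in parts]
    return reduce(op, partials)


# Addition is associative: chunked and sequential results agree.
add_chunked = chunked_reduce(operator.add, partitions)
add_sequential = reduce(operator.add, flat)

# Subtraction is not associative: the two strategies disagree.
sub_chunked = chunked_reduce(operator.sub, partitions)
sub_sequential = reduce(operator.sub, flat)

print(add_chunked, add_sequential, sub_chunked, sub_sequential)
```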
![Page 22: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/22.jpg)
How!?
Hadoop implementation has some limitations
Mappers and Reducers ship functions to data, while Java is not a functional language
⇒ Composability is difficult and more IO/network operations are required
Iterative algorithms (e.g. stochastic gradient) have to read the data at each step (while the data has not changed, only the parameters)
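The cost of re-reading unchanged data can be made visible by counting reads. The sketch uses plain full-batch gradient descent (not the stochastic variant named on the slide) to fit a single parameter, and `read_dataset` is a hypothetical stand-in for an expensive read from distributed storage.

```python
from typing import Callable, List

reads = {"count": 0}          # how many times the "storage" was hit


def read_dataset() -> List[float]:
    """Stand-in for an expensive read from distributed storage."""
    reads["count"] += 1
    return [1.0, 2.0, 3.0, 4.0]


def gradient_descent(load: Callable[[], List[float]], steps: int, lr: float = 0.1) -> float:
    """Minimise the mean of (w - x)^2; the optimum is the mean of the data."""
    w = 0.0
    for _ in range(steps):
        data = load()                                   # data is fetched at every step
        grad = sum(2 * (w - x) for x in data) / len(data)
        w -= lr * grad                                  # only the parameter changes
    return w


# Style 1: every iteration goes back to storage.
w_reread = gradient_descent(read_dataset, steps=50)
reads_without_cache = reads["count"]

# Style 2: read once, keep the data in memory, iterate on the cached copy.
reads["count"] = 0
cached = read_dataset()
w_cached = gradient_descent(lambda: cached, steps=50)
reads_with_cache = reads["count"]

print(reads_without_cache, reads_with_cache, round(w_cached, 3))
```

Both runs compute the same parameter. Only the number of reads differs.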
![Page 23: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/23.jpg)
How!?
MapReduce on steroids
I) Functional paradigm:
- process built lazily based on simple concepts
- Map and Reduce are two of them
II) Cache data in memory. No more repeated IO.
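Both points, lazy construction and in-memory caching, fit in a toy class. `LazyDataset` is a hypothetical name and is not Spark's API: transformations are only recorded, nothing is read until an action (`collect` or `reduce`) runs, and `cache()` keeps the computed result for later actions.

```python
import functools
from typing import Callable, Iterable, List, Optional, Tuple


class LazyDataset:
    """Records transformations and executes them only when an action is called."""

    def __init__(self, source: Callable[[], Iterable], ops: Tuple = ()):
        self._source = source
        self._ops = ops
        self._keep = False
        self._cached: Optional[List] = None

    def map(self, fn: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._ops + (("map", fn),))

    def filter(self, pred: Callable) -> "LazyDataset":
        return LazyDataset(self._source, self._ops + (("filter", pred),))

    def cache(self) -> "LazyDataset":
        self._keep = True                      # remember the result of the next action
        return self

    def collect(self) -> List:
        if self._cached is not None:
            return list(self._cached)          # served from memory, no read
        items: Iterable = self._source()       # the source is read only here
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        result = list(items)
        if self._keep:
            self._cached = result
        return result

    def reduce(self, op: Callable):
        return functools.reduce(op, self.collect())


reads = {"count": 0}


def source() -> Iterable[int]:
    reads["count"] += 1
    return range(1, 11)


squares_of_evens = (
    LazyDataset(source).filter(lambda x: x % 2 == 0).map(lambda x: x * x).cache()
)
reads_before_action = reads["count"]           # still 0: nothing has run yet

total = squares_of_evens.reduce(lambda a, b: a + b)
largest = squares_of_evens.reduce(max)
reads_after_actions = reads["count"]           # the second action reused the cache
print(reads_before_action, total, largest, reads_after_actions)
```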
![Page 24: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/24.jpg)
So what...
Part I
● What
○ distributed resources
○ data
○ managers
● Why:
○ fastest
○ smartest
○ biggest
● How:
○ Map Reduce
○ Limitations
○ Extensions
Part II
● Spark
○ Model
○ Caching and lineage
○ Master and Workers
○ Core example
● Beyond Processing
○ Streaming
○ SQL
○ GraphX
○ MLlib
○ Example (notebook)
● Use cases
○ Parallel batch processing of timeseries
○ ADAM
![Page 25: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/25.jpg)
Part II: Spark to the Rescue
![Page 26: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/26.jpg)
RDDs
Think of an RDD[T] as an immutable, distributed collection of objects of type T
• Resilient => Can be reconstructed in case of failure
• Distributed => Transformations are parallelizable operations
• Dataset => Data loaded and partitioned across cluster nodes (executors)
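The "resilient" property rests on lineage: a derived dataset remembers its parent and the function that produced it, so a lost partition can be recomputed instead of restored from a copy. The class below is a toy model with hypothetical names (`ToyRDD`, `lose`), not Spark's implementation. It assumes the root data is durable, for example replicated in a distributed file system.

```python
from typing import Callable, Dict, List, Optional


class ToyRDD:
    """Immutable, partitioned collection that can rebuild partitions from lineage."""

    def __init__(self, partitions: Optional[List[List[int]]] = None,
                 parent: Optional["ToyRDD"] = None,
                 fn: Optional[Callable[[int], int]] = None):
        self._source = partitions            # only set on the root (assumed durable)
        self._parent = parent                # lineage: where the data comes from
        self._fn = fn                        # lineage: how it is derived
        self._materialized: Dict[int, List[int]] = {}

    def map(self, fn: Callable[[int], int]) -> "ToyRDD":
        return ToyRDD(parent=self, fn=fn)    # a new RDD; the parent is never mutated

    def num_partitions(self) -> int:
        if self._parent is None:
            return len(self._source)
        return self._parent.num_partitions()

    def compute(self, i: int) -> List[int]:
        """Return partition i, recomputing it from the parent if it is missing."""
        if i not in self._materialized:
            if self._parent is None:
                self._materialized[i] = list(self._source[i])
            else:
                self._materialized[i] = [self._fn(x) for x in self._parent.compute(i)]
        return self._materialized[i]

    def lose(self, i: int) -> None:
        """Simulate an executor failure that drops partition i from memory."""
        self._materialized.pop(i, None)

    def collect(self) -> List[int]:
        return [x for i in range(self.num_partitions()) for x in self.compute(i)]


base = ToyRDD(partitions=[[1, 2], [3, 4], [5]])
doubled = base.map(lambda x: x * 2)

before = doubled.collect()
doubled.lose(1)                              # partition 1 disappears
after = doubled.collect()                    # rebuilt from base via the lineage
print(before, after)
```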
![Page 27: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/27.jpg)
RDD[T]
Data distribution hierarchy:
- RDD[T]
- Executors
- Partitions
- Elements
[Figure: 17 elements (x1 to x17) grouped into 10 partitions, e.g. [ x1, x2 ], [ x10 ], [ x8,x5,x6 ], spread over Executor 1 to Executor 4]
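The hierarchy can be reproduced with nested lists. The round-robin placement below is an assumption made for illustration: the figure does not say which partition lives on which executor, and a real scheduler may place partitions differently.

```python
NUM_PARTITIONS = 10
NUM_EXECUTORS = 4

elements = [f"x{i}" for i in range(1, 18)]          # x1 .. x17

# Elements -> partitions (round-robin, assumption).
partitions = [elements[p::NUM_PARTITIONS] for p in range(NUM_PARTITIONS)]

# Partitions -> executors (round-robin, assumption).
executors = {e: [] for e in range(1, NUM_EXECUTORS + 1)}
for index, partition in enumerate(partitions):
    executors[index % NUM_EXECUTORS + 1].append(partition)

for executor, parts in executors.items():
    print(f"Executor {executor}: {parts}")
```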
![Page 28: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/28.jpg)
Execution
Execution is split into fundamental units: Tasks
Tasks running in parallel are grouped in Stages
![Page 29: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/29.jpg)
Execution
[Figure: three cores (Core1, Core2, Core3) run Task0, Task1 and Task2 in parallel; each task does read/process/write, and the pattern repeats for Stage0, Stage1 and Stage2]
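The task/stage split can be sketched with a thread pool: within a stage, one task per partition runs in parallel, and the next stage starts only when every task of the previous one has finished. The three stage functions and the use of `ThreadPoolExecutor` with three workers (standing in for three cores) are assumptions for illustration, not how Spark schedules work internally.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def run_stage(task: Callable, partitions: List, cores: int = 3) -> List:
    """Run one task per partition in parallel and wait for all of them."""
    with ThreadPoolExecutor(max_workers=cores) as pool:
        return list(pool.map(task, partitions))      # results keep partition order


def stage0(partition: List[int]) -> List[int]:
    return [x + 1 for x in partition]                # read / process / write


def stage1(partition: List[int]) -> List[int]:
    return [x * 2 for x in partition]


def stage2(partition: List[int]) -> int:
    return sum(partition)


data = [[1, 2], [3, 4], [5, 6]]                      # three partitions -> three tasks
after_stage0 = run_stage(stage0, data)
after_stage1 = run_stage(stage1, after_stage0)
after_stage2 = run_stage(stage2, after_stage1)
print(after_stage2)
```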
![Page 30: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/30.jpg)
Master and Workers
![Page 31: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/31.jpg)
Spark Streaming
When you have big fat streams behaving as one single collection
[Figure: along the time axis t, a DStream[T] drawn as a sequence of RDD[T] blocks]
DStreams: Discretized Streams (= Sequence of RDDs)
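Discretization simply means cutting the stream into small time windows and treating each window as an ordinary collection. A minimal sketch, assuming events arrive as (timestamp, value) pairs and that empty windows are skipped. `discretize` is a hypothetical helper, not the Spark Streaming API.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def discretize(events: List[Tuple[float, Any]], batch_seconds: float) -> List[List[Any]]:
    """Group events into consecutive time windows; each window is one batch."""
    windows: Dict[int, List[Any]] = defaultdict(list)
    for timestamp, value in events:
        windows[int(timestamp // batch_seconds)].append(value)
    # Windows without events are skipped in this sketch.
    return [windows[key] for key in sorted(windows)]


events = [(0.2, "a"), (0.9, "b"), (1.1, "c"), (2.5, "d"), (2.7, "e")]
batches = discretize(events, batch_seconds=1.0)

# The same per-collection logic is then applied to every batch in turn.
counts_per_batch = [len(batch) for batch in batches]
print(batches, counts_per_batch)
```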
![Page 32: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/32.jpg)
Spark SQL
Mapping: RDD -> “table”, Element Field -> “column”
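The mapping can be imitated with Python's built-in `sqlite3` module, used here purely as a stand-in for Spark SQL: a collection of typed records becomes a table, and each record field becomes a column. The `Sample` record type and the table name are assumptions. The four rows reuse the sample identifiers shown on the prediction slide later in this deck.

```python
import sqlite3
from typing import NamedTuple


class Sample(NamedTuple):
    sample_id: str
    population: str
    cluster_id: int


# The "RDD": a plain collection of records.
records = [
    Sample("NA20332", "ASW", 0),
    Sample("NA20334", "ASW", 2),
    Sample("HG00120", "GBR", 2),
    Sample("NA18560", "CHB", 1),
]

# Collection -> table, record fields -> columns.
conn = sqlite3.connect(":memory:")
columns = ", ".join(Sample._fields)
conn.execute(f"CREATE TABLE samples ({columns})")
conn.executemany("INSERT INTO samples VALUES (?, ?, ?)", records)

rows = conn.execute(
    "SELECT population, COUNT(*) FROM samples GROUP BY population ORDER BY population"
).fetchall()
print(rows)
```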
![Page 33: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/33.jpg)
MLlib: Distributed ML
Classification
● linear SVM, logistic regression, classification trees, naive Bayes models
Regression
● SVM, regression trees, linear regression (regularized)
Clustering & dimensionality reduction
● singular value decomposition, PCA, k-means clustering
“The library to teach them all”
![Page 34: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/34.jpg)
GraphX
Connecting the dots
Graph processing at scale. > Take edges > Link nodes > Combine/Send messages
![Page 35: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/35.jpg)
Use case examples
- Parallel batch processing of time series
- Bayesian Network in financial market
- IoT platform (Lambda architecture)
- OpenStreetMap cities topologies classification
- Markov Chain in Land Use/Land Cover prediction
- Genomics: ADAM
![Page 36: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/36.jpg)
Genomics
Biological systems are very complex
One human sequence is 60Gb
![Page 37: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/37.jpg)
ADAM
Credits: AMPLab (UC Berkeley)
![Page 38: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/38.jpg)
Stratification using 1000Genomes
http://www.1000genomes.org/
ref: http://upload.wikimedia.org/wikipedia/en/e/eb/Genetic_Variation.jpg
![Page 39: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/39.jpg)
Machine Learning model
Clustering: KMeans
ref: http://en.wikipedia.org/wiki/K-means_clustering
![Page 40: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/40.jpg)
Machine Learning model
MLlib, KMeans
MLlib:
● Machine Learning Algorithms
● Data structures (e.g. Vector)
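For readers who have not met k-means, here is a plain-Python sketch of the standard (Lloyd's) iteration: assign each point to its closest center, then move each center to the mean of its points. It is not MLlib's API or implementation. The toy points, the fixed initial centers and the fixed iteration count are assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def squared_distance(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def closest(point: Point, centers: Sequence[Point]) -> int:
    """Index of the center nearest to the point."""
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))


def mean(points: Sequence[Point]) -> Point:
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points: Sequence[Point], centers: Sequence[Point], iterations: int = 10) -> List[Point]:
    centers = list(centers)
    for _ in range(iterations):
        groups: List[List[Point]] = [[] for _ in centers]
        for p in points:                                   # assignment step
            groups[closest(p, centers)].append(p)
        centers = [mean(g) if g else c                     # update step
                   for g, c in zip(groups, centers)]
    return centers


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers = kmeans(points, centers=[points[0], points[3]])
labels = [closest(p, centers) for p in points]
print(centers, labels)
```

On a cluster, the assignment step is a map over the points and the update step is a per-cluster reduce.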
![Page 41: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/41.jpg)
Mashup: prediction
Sample [NA20332] is in cluster #0 for population Some(ASW)
Sample [NA20334] is in cluster #2 for population Some(ASW)
Sample [HG00120] is in cluster #2 for population Some(GBR)
Sample [NA18560] is in cluster #1 for population Some(CHB)
![Page 42: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/42.jpg)
Mashup
| Population | #0 | #1 | #2 |
| --- | --- | --- | --- |
| GBR | 0 | 0 | 89 |
| ASW | 54 | 0 | 7 |
| CHB | 0 | 97 | 0 |
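One way to read the population-by-cluster counts is to ask how often a sample falls in the dominant cluster of its population. The short calculation below uses exactly the counts from the slide. The "purity" summary is an added illustration, not a figure reported by the authors.

```python
# Rows: populations; columns: clusters #0, #1, #2 (counts from the slide).
table = {
    "GBR": [0, 0, 89],
    "ASW": [54, 0, 7],
    "CHB": [0, 97, 0],
}

# Dominant cluster of each population.
majority = {pop: max(range(len(counts)), key=counts.__getitem__)
            for pop, counts in table.items()}

total = sum(sum(counts) for counts in table.values())
in_majority = sum(max(counts) for counts in table.values())
purity = in_majority / total

print(majority, f"{in_majority}/{total} = {purity:.3f}")
```

Only the seven ASW samples that land in cluster #2, next to the GBR samples, fall outside their population's dominant cluster.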
![Page 43: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/43.jpg)
Cluster
40 m3.xlarge
160 cores + 600G
![Page 44: What is Distributed Computing, Why we use Apache Spark](https://reader035.vdocument.in/reader035/viewer/2022062313/55a682e31a28ab42498b4683/html5/thumbnails/44.jpg)
Genomic data and distributed computing
Eggo project (public genomics data in ADAM format on S3)
We…
1000genomes in ADAM format on S3.
Open Source GA4GH Interop services implementation
Machine learning on 1000genomes