![Page 1: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/1.jpg)
Foundations for Scaling Analytics in Apache Spark
Joseph K. Bradley September 19, 2016
® ™
![Page 2: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/2.jpg)
Who am I?
Apache Spark committer & PMC member Software Engineer @ Databricks (ML team) Machine Learning Department @ Carnegie Mellon
2
![Page 3: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/3.jpg)
Talk outline
Intro Apache Spark Machine Learning (and graphs) in Spark
Original implementations: RDDs
Future implementations: DataFrames
3
![Page 4: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/4.jpg)
• General engine for big data computing • Fast & scalable • Easy to use • APIs in Python, Scala, Java & R
4
Apache Spark
SparkSQL Streaming MLlib GraphX
Open source • Apache Software Foundation • 1000+ contributors • 250+ companies & universities
![Page 5: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/5.jpg)
It’s big • Spark beat Hadoop’s Gray Sort record by 3x
with 1/10 as many machines • Largest cluster size of 8000 Nodes (Tencent)
5
![Page 6: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/6.jpg)
MLlib: Spark’s ML library
6
ML tasks Classification Regression Recommendation Clustering Frequent itemsets
Data utilities Featurization Statistics Linear algebra
Workflow utilities Model import/export Pipelines DataFrames Cross validation
Goals Scale-out Standard library Extensible API
Challenges for big data • Iterative algorithms • Diverse algorithmic patterns • Many data types
![Page 7: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/7.jpg)
GraphX and GraphFrames
7
Graph algorithms Connected components PageRank Label propagation …
Graph queries Vertex degrees Subgraphs Motif finding …
Goals Scale-out Standard library Extensible API
Challenges for big data • Iterative algorithms • Many (big) joins • Many data types
![Page 8: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/8.jpg)
Talk outline
Intro Apache Spark Machine Learning (and graphs) in Spark
Original implementations: RDDs
Future implementations: DataFrames
8
![Page 9: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/9.jpg)
Talk outline
Intro Apache Spark Machine Learning (and graphs) in Spark
Original implementations: RDDs
Future implementations: DataFrames
9
![Page 10: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/10.jpg)
10
Map Reduce
master
Resilient Distributed Datasets (RDDs)
val myData: RDD[(String, Vector)] myData.map { _._2 * 0.5 }
![Page 11: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/11.jpg)
Resilient Distributed Datasets (RDDs)
11
![Page 12: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/12.jpg)
Resilient Distributed Datasets (RDDs)
12
Resiliency • Lineage • Caching &
checkpointing
![Page 13: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/13.jpg)
13
Compute gradient (Vector) for each row (training example)
Aggregate gradient
master
ML on RDDs
Broadcast gradient
![Page 14: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/14.jpg)
ML on RDDs: the good
Flexible: GLMs, trees, matrix factorization, etc. Scalable: E.g., Alternating Least Squares on Spotify data (2014) • 50+ million users x 30+ million songs • 50 billion ratings Cost ~ $10 • 32 r3.8xlarge nodes (spot instances) • For rank 10 with 10 iterations, ~1 hour running time.
14
![Page 15: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/15.jpg)
ML on RDDs: the challenges
Maintaining state
Python API
Iterator model
Data partitioning
15
![Page 16: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/16.jpg)
16
master
Maintaining state on master
Current state
![Page 17: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/17.jpg)
17
Maintaining state in RDDs
… … …
Current state
![Page 18: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/18.jpg)
Maintaining state
18
Cons of master • Single point of failure. • Cannot support large state (1 billion parameters) Cons of RDDs • More complex • Lineage becomes a problem à cache & checkpoint Unstated con: Developers have to choose 1 option!
![Page 19: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/19.jpg)
19
Python API (RDD-based)
Spark worker (JVM)
Python
Python
Data stored as Python objects à Serialization overhead
![Page 20: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/20.jpg)
20
Iterator model val rdd0: RDD[(String, Vector)] = …
val rdd1 = rdd0.map { (name, data) => (name.trim, normalizeVec(data))
}
val rdd2 = rdd1.map {
...
}
Arbitrary data types
Black box lambda functions
Iterative processing (especially in ML!)
à Boxed types à JVM object creation & GC
![Page 21: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/21.jpg)
21
Data partitioning: numPartitions
Selecting numPartitions can be critical. • Each task has overhead. • Overhead / parallelism trade-off. Different numPartitions for different jobs: • SQL: 200+ is reasonable • ML: 1 per compute core
![Page 22: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/22.jpg)
22
Data partitioning: co-partitioning
Algorithm • Join • Map • Iterate
Co-partitioning is critical for • ALS (matrix factorization) • Graph algorithms
![Page 23: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/23.jpg)
ML on RDDs: the challenges
Maintaining state (& lineage)
Python API
Iterator model
Data partitioning
23
![Page 24: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/24.jpg)
Talk outline
Intro Apache Spark Machine Learning (and graphs) in Spark
Original implementations: RDDs
Future implementations: DataFrames
24
![Page 25: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/25.jpg)
Talk outline
Intro Apache Spark Machine Learning (and graphs) in Spark
Original implementations: RDDs
Future implementations: DataFrames
25
![Page 26: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/26.jpg)
Spark DataFrames & Datasets
26
dept age name
Bio 48 HSmith
CS 34 ATuring
Bio 43 BJones
Chem 61 MKennedy
Data grouped into named columns
DSL for common tasks • Project, filter, aggregate, join, … • Statistics, n/a values, sketching, … • User-Defined Functions (UDFs) &
Aggregation (UDAFs)
data.groupBy(“dept”).avg(“age”)
Datasets: Strongly typed DataFrames
![Page 27: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/27.jpg)
27
Catalyst query optimizer
SQL
DataFrame
Dataset
Query Plan Optimized Query Plan RDDs
Catalyst transformations
Abstractions of user programs (Trees)
![Page 28: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/28.jpg)
Project Tungsten
Memory management • Off-heap (Java Unsafe API) • Avoid JVM GC • Compressed format
Code generation • Rewrite chain of iterators into single code blocks • Operate directly on compressed format
28
![Page 29: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/29.jpg)
DataFrames in ML and Graphs
API • DataFrame-based API in MLlib (spark.ml package) • GraphFrames (Spark package) Transformation & prediction Training
29
![Page 30: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/30.jpg)
Python API
30
0 2 4 6 8 10
RDD Scala RDD Python
Spark Scala DF Spark Python DF
Time to aggregate 10^6 Int pairs (secs) in Spark 1.4
better
![Page 31: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/31.jpg)
Transformation/prediction with DataFrames
User-Defined Types (UDTs) • Vector (sparse & dense) • Matrix (sparse & dense)
31
User-Defined Functions (UDFs) • Feature transformation • Model prediction
![Page 32: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/32.jpg)
Future work: model training Goal: Port all ML/graph algorithms to run on DataFrames for better speed & scalability. Currently: • Belief propagation • Connected components
32
![Page 33: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/33.jpg)
Catalyst in ML
What’s missing? • Concept of iteration • Handling caching and checkpointing
across many iterations • ML/Graph-specific optimizations for
Catalyst query planner
33
![Page 34: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/34.jpg)
Tungsten in ML
Partly done • Vector/Matrix UDTs • UDFs for some operations What’s missing? • Code generation for critical paths • Closer integration of Vector/Matrix types with Tungsten
34
![Page 35: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/35.jpg)
OOMing
DataFrames automatically spill to disk à Classic pain point of RDDs
35
java.lang.OutOfMemoryError
Goal: Smoothly scale, without custom per-algorithm optimizations
![Page 36: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/36.jpg)
To summarize... MLlib on RDDs • Required custom optimizations MLlib with a DataFrame-based API • Friendly API • Improvements for prediction In the future • Potential for even greater scaling for training • Simpler for non-experts to write new algorithms
36
![Page 37: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/37.jpg)
Get started Get involved • JIRA http://issues.apache.org • mailing lists http://spark.apache.org • Github http://github.com/apache/spark • Spark Packages http://spark-packages.org Learn more • New in Apache Spark 2.0
http://databricks.com/blog/2016/06/01 • MOOCs on EdX http://databricks.com/spark/training
37
Try out Apache Spark 2.0 in Databricks Community Edition http://databricks.com/ce
Many thanks to the community for contributions & support!
![Page 38: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/38.jpg)
Databricks Founded by the creators of Apache Spark Offers hosted service • Spark on EC2 • Notebooks • Visualizations • Cluster management • Scheduled jobs
38
We’re hiring!
![Page 39: Foundations for Scaling Analytics in Apache Spark · 2016-09-21 · Foundations for Scaling Analytics in Apache Spark Joseph K. Bradley September 19, 2016 ® ™](https://reader030.vdocument.in/reader030/viewer/2022041014/5ec4dc2671cf24623f2b06a2/html5/thumbnails/39.jpg)
Thank you! FB group: Databricks at CMU databricks.com/careers