A Look Ahead at Spark 2.0
TRANSCRIPT
A look ahead at Spark 2.0
Reynold Xin (@rxin)
2016-03-30, Strata Conference
About Databricks
Founded by creators of Spark in 2013
Cloud enterprise data platform
• Managed Spark clusters
• Interactive data science
• Production pipelines
• Data governance, security, …
Today’s Talk
Looking back last 12 months
Looking forward to Spark 2.0
• Project Tungsten, Phase 2
• Structured Streaming
• Unifying DataFrame & Dataset
Best resource for learning Spark
A slide from 2013 …
Programmability
WordCount in 50+ lines of Java MR
WordCount in 3 lines of Spark
What is Spark?
Unified engine across data workloads and platforms
SQL, Streaming, ML, Graph, Batch, …
2015: A Great Year for Spark
Most active open source project in (big) data
• 1000+ code contributors
New language: R
Widespread industry support & adoption
“Spark is the Taylor Swift of big data software.”
- Derrick Harris, Fortune
Top Applications
| Application | % of respondents |
| --- | --- |
| Business Intelligence | 68% |
| Data Warehousing | 52% |
| Recommendation | 44% |
| Log Processing | 40% |
| User-Facing Services | 36% |
| Fraud Detection / Security | 29% |
Diverse Runtime Environments

How respondents are running Spark: 51% on a public cloud

Most common Spark deployment environments (cluster managers):
• Standalone mode: 48%
• YARN: 40%
• Mesos: 11%
Spark 2.0
Next major release, coming in May
Builds on all we learned in past 2 years
Versioning in Spark
Example: 1.6.0
• Major version (may change APIs): 1
• Minor version (adds APIs / features): 6
• Patch version (only bug fixes): 0

In reality, we hate breaking APIs! We will not do so except for dependency conflicts (e.g. Guava) and experimental APIs.
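To make the scheme concrete, a minimal sketch in Python (hypothetical helpers, not part of Spark) that splits a version string into its components and applies the rule above, i.e. that APIs may only change across major versions:

```python
def parse_version(version):
    """Split a Spark-style version string into (major, minor, patch)."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch

def api_compatible(a, b):
    """APIs may only change when the major version changes."""
    return parse_version(a)[0] == parse_version(b)[0]

print(parse_version("1.6.0"))            # (1, 6, 0)
print(api_compatible("1.6.0", "1.5.2"))  # True: same major version
print(api_compatible("1.6.0", "2.0.0"))  # False: major version bump
```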
Major Features in 2.0
• Tungsten Phase 2: speedups of 5-10x
• Structured Streaming: real-time engine on SQL/DataFrames
• Unifying Datasets and DataFrames
Datasets & DataFrames
API foundation for the future
Datasets and DataFrames
In 2015, we added DataFrames & Datasets as structured data APIs
• DataFrames are collections of rows with a schema
• Datasets add static types, e.g. Dataset[Person]
• Both run on Tungsten
Spark 2.0 will merge these APIs: DataFrame = Dataset[Row]
Example

```scala
case class User(name: String, id: Int)
case class Message(user: User, text: String)

val dataframe = sqlContext.read.json("log.json")  // DataFrame, i.e. Dataset[Row]
val messages = dataframe.as[Message]              // Dataset[Message]

val users = messages
  .filter(m => m.text.contains("Spark"))
  .map(m => m.user)                               // Dataset[User]

pipeline.train(users)  // MLlib takes either DataFrames or Datasets
```
Benefits
Simpler to understand
• Only kept Dataset separate in 1.x to preserve binary compatibility
Libraries can take data of both forms
With Streaming, same API will also work on streams
Long-Term
RDD will remain the low-level API in Spark
Datasets & DataFrames give richer semantics and optimizations
• New libraries will increasingly use these as the interchange format
• Examples: Structured Streaming, MLlib, GraphFrames
Structured Streaming
How do we simplify streaming?
Integration Example

A streaming engine consumes a stream of (page, time) events:

(home.html, 10:08)
(product.html, 10:09)
(home.html, 10:10)
. . .

and maintains per-minute visit counts in MySQL:

| Page | Minute | Visits |
| --- | --- | --- |
| home | 10:09 | 21 |
| pricing | 10:10 | 30 |
| ... | ... | ... |

What can go wrong?
• Late events
• Partial outputs to MySQL
• State recovery on failure
• Distributed reads/writes
• ...
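The state behind a table like the one above is just a grouped count. A plain-Python sketch of that aggregation (toy data, no Spark), including why a late event is troublesome once results have been written out:

```python
from collections import Counter

# (page, minute) events as they arrive on the stream
events = [
    ("home.html", "10:08"),
    ("product.html", "10:09"),
    ("home.html", "10:09"),
    ("home.html", "10:10"),
]

# Incrementally maintained per-(page, minute) counts -- the state
# behind a table like the MySQL one above
visits = Counter()
for page, minute in events:
    visits[(page, minute)] += 1

print(visits[("home.html", "10:09")])  # 1

# A late event for minute 10:08 silently changes state that may
# already have been written to the external store
visits[("home.html", "10:08")] += 1
print(visits[("home.html", "10:08")])  # 2
```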
Complex Programming Models

• Processing: business logic changes & new operations (windows, sessions)
• Output: how do we define output over time & correctness?
• Data: late arrival, varying distribution over time, …
The simplest way to perform streaming analyticsis not having to reason about streaming.
Spark 1.3: static DataFrames
Spark 2.0: infinite DataFrames

Single API!
Structured Streaming
High-level streaming API built on the Spark SQL engine
• Runs the same queries on DataFrames
• Event time, windowing, sessions, sources & sinks

Unifies streaming, interactive and batch queries
• Aggregate data in a stream, then serve using JDBC
• Change queries at runtime
• Build and apply ML models
See Michael/TD’s talks tomorrow for a deep dive!
Tungsten Phase 2
Can we speed up Spark by 10X?
Demo
Run a join on a large table with 1 billion records and a small table with 1000 records
In Spark 1.6, took 60+ seconds.
In Spark 2.0, took ~3 seconds.
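One classic plan for a large-vs-small join is a broadcast hash join: build a hash table from the small table and stream the large table through it. A plain-Python sketch of the idea (toy data, not Spark's implementation or the source of the numbers above):

```python
# Small dimension table (1,000 rows): would be broadcast to every executor
small = {i: "item_%d" % i for i in range(1000)}

# Large fact table: (item_id, amount) rows streamed through the hash table
large = [(i % 2000, i) for i in range(100_000)]

# One hash lookup per row -- no sort and no shuffle of the large side
joined = [(item_id, amount, small[item_id])
          for item_id, amount in large
          if item_id in small]

print(len(joined))  # 50000: only ids 0..999 hit the 1,000-row table
```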
Query plan: Scan → Filter → Project → Aggregate

select count(*) from store_sales
where ss_item_sk = 1000
Volcano Iterator Model
Standard for 30 years: almost all databases do it
Each operator is an “iterator” that consumes records from its input operator
```scala
class Filter {
  def next(): Boolean = {
    var found = false
    while (!found && child.next()) {
      found = predicate(child.fetch())
    }
    found
  }

  def fetch(): InternalRow = {
    child.fetch()
  }
  // ...
}
```
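The pull-based model above can be sketched end-to-end in runnable form. A Python toy (not Spark's actual operators): each operator pulls rows from its child, paying one virtual call per row:

```python
class Scan:
    """Leaf operator: produces rows from an in-memory table."""
    def __init__(self, rows):
        self.it = iter(rows)
    def __iter__(self):
        return self.it

class Filter:
    """Pulls rows from its child, emits only those matching the predicate."""
    def __init__(self, child, predicate):
        self.child, self.predicate = child, predicate
    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row

class CountAggregate:
    """Drains its child and counts the rows."""
    def __init__(self, child):
        self.child = child
    def result(self):
        return sum(1 for _ in self.child)

store_sales = [1000, 42, 1000, 7, 1000]
plan = CountAggregate(Filter(Scan(store_sales), lambda r: r == 1000))
print(plan.result())  # 3
```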
What if we hire a college freshman to implement this query in Java in 10 mins?

select count(*) from store_sales
where ss_item_sk = 1000

```
var count = 0
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1
  }
}
```
Volcano model (30+ years of database research) vs. college freshman (hand-written code in 10 mins)
High throughput:
• Volcano: 13.95 million rows/sec
• College freshman: 125 million rows/sec

Note: end-to-end, single thread, single column, data originated in Parquet on disk
How does a student beat 30 years of research?
Volcano:
1. Many virtual function calls
2. Data in memory (or cache)
3. No loop unrolling, SIMD, pipelining

Hand-written code:
1. No virtual function calls
2. Data in CPU registers
3. Compiler loop unrolling, SIMD, pipelining
Take advantage of all the information that is known after query compilation
Tungsten Phase 2: Spark as a “Compiler”

The Scan → Filter → Project → Aggregate plan is compiled down to:

```
long count = 0;
for (ss_item_sk in store_sales) {
  if (ss_item_sk == 1000) {
    count += 1;
  }
}
```

Functionality of a general-purpose execution engine, with the performance of a system hand-built just to run your query.
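"Spark as a compiler" means collapsing the operator tree into one fused loop at runtime. A toy code-generation sketch in Python (hypothetical, not Spark's actual Java-based codegen): build the loop's source from the query parameters, compile it, and run it:

```python
def generate_count_where(table_name, value):
    # Emit a fused loop for: select count(*) from <table> where col = <value>
    src = (
        "def query(rows):\n"
        "    count = 0\n"
        "    for ss_item_sk in rows:\n"
        f"        if ss_item_sk == {value}:\n"
        "            count += 1\n"
        "    return count\n"
    )
    namespace = {}
    # Compile the generated source and pull out the query function
    exec(compile(src, "<generated:%s>" % table_name, "exec"), namespace)
    return namespace["query"]

store_sales = [1000, 1, 1000, 2, 3]
query = generate_count_where("store_sales", 1000)
print(query(store_sales))  # 2
```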
Databricks Community Edition
Best place to try & learn Spark.
Today’s talk
Spark has been growing explosively
Spark 2.0 doubles down on what made Spark attractive:
• elegant APIs
• cutting-edge performance
Learn Spark on Databricks Community Edition
• join the beta waitlist at https://databricks.com/
Thank you.@rxin