What’s changed since then?
Open source processing engine and set of libraries
Cloud service based on Spark
1. Users:
2. Hardware: I/O bottleneck ➡ compute
3. Delivery: the public cloud
Pig
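[Chart: 2014 Languages Used vs. 2015 Languages Used. Bar values in the extracted text: 84%, 38%, 38%, 71%, 31%, 58%, 18%; the language labels did not survive extraction.]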
sc.textFile("hdfs://...")
  .map(line => parsePoint(line))
  .filter(p => p.x > 100)
  .count()
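The pipeline above can be mimicked in plain Python to show what the three steps compute. This is only a sketch: it uses lazy iterators rather than Spark, and `parse_point`, the `Point` type, the "x,y" line format, and the sample lines are all hypothetical stand-ins for the elided input.

```python
from typing import Iterable, NamedTuple


class Point(NamedTuple):
    x: float
    y: float


def parse_point(line: str) -> Point:
    # Hypothetical parser: assumes each line looks like "x,y".
    x, y = line.split(",")
    return Point(float(x), float(y))


def count_large_x(lines: Iterable[str], threshold: float = 100) -> int:
    points = map(parse_point, lines)                    # map line => parsePoint(line)
    large = filter(lambda p: p.x > threshold, points)   # filter p => p.x > 100
    return sum(1 for _ in large)                        # count


# Made-up input standing in for the file contents.
sample_lines = ["50,1", "150,2", "101,3", "100,4"]
result = count_large_x(sample_lines)
print(result)
```

As with the original chain, nothing is parsed until the final count pulls values through `map` and `filter`.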
hides
map
filter
groupBy
sort
union
join
leftOuterJoin
rightOuterJoin
reduce
count
fold
reduceByKey
cogroup
cross
zip
sample
take
first
partitionBy
mapWith
pipe
save
...
groupByKey
.map(word => (word, 1))
.groupByKey()
.map { case (k, vs) => (k, vs.sum) }
Materializes all groups as Seq[Int] objects
Then promptly aggregates them
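The cost described here can be illustrated without Spark. The sketch below is plain Python, not the Spark implementation: `count_by_grouping` mirrors the group-then-sum pattern, while `count_by_reducing` folds each value into a running total, which is the idea behind the `reduceByKey` operator listed earlier. The function names, the sample words, and the "values held" counters are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Pair = Tuple[str, int]


def count_by_grouping(pairs: Iterable[Pair]) -> Tuple[Dict[str, int], int]:
    # Collect every value per key first, as groupByKey followed by a sum would.
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    held = sum(len(values) for values in groups.values())  # values kept in memory
    return {key: sum(values) for key, values in groups.items()}, held


def count_by_reducing(pairs: Iterable[Pair]) -> Tuple[Dict[str, int], int]:
    # Combine each value into a running total: one integer per key.
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals), len(totals)


words = "a b a c b a".split()
pairs = [(word, 1) for word in words]
grouped, grouped_held = count_by_grouping(pairs)
reduced, reduced_held = count_by_reducing(pairs)
print(grouped == reduced, grouped_held, reduced_held)
```

Both functions return the same counts; the grouping version holds one entry per input record before summing, while the reducing version holds one entry per distinct key.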
structured data
SIGMOD 2015
[Diagram: SQL, DataFrames, … → Logical Plan → Optimizer → Physical Plan → Code Generator → RDDs]
DataFrames hold rows with a known schema and offer relational ops through a DSL
users = ctx.sql("select * from hive.users")
ca_users = users[users.state == "CA"]
ca_users.count()
ca_users.groupBy("name").avg("age")
ca_users.map(lambda row: row.name.upper())
Expression AST
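The "Expression AST" label refers to the fact that `users.state == "CA"` does not produce a boolean; it produces a tree describing the comparison, which the engine can inspect before running anything. A minimal sketch of that mechanism in plain Python follows. It is not Spark's implementation: the `Column`, `Equals`, and `Frame` classes and the sample rows are hypothetical, and only equality is supported.

```python
from typing import Any, Dict, List


class Equals:
    """AST node for `column == literal`."""

    def __init__(self, column: "Column", literal: Any) -> None:
        self.column = column
        self.literal = literal

    def describe(self) -> str:
        return f"({self.column.name} == {self.literal!r})"

    def evaluate(self, row: Dict[str, Any]) -> bool:
        return row[self.column.name] == self.literal


class Column:
    """A named column; comparing it builds an AST node instead of a bool."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other: Any) -> Equals:  # type: ignore[override]
        return Equals(self, other)


class Frame:
    """Toy stand-in for a DataFrame backed by a list of dict rows."""

    def __init__(self, rows: List[Dict[str, Any]]) -> None:
        self.rows = rows

    def __getattr__(self, name: str) -> Column:
        # Called only for attributes not found normally, e.g. `users.state`.
        if name.startswith("_"):
            raise AttributeError(name)
        return Column(name)

    def __getitem__(self, predicate: Equals) -> "Frame":
        return Frame([row for row in self.rows if predicate.evaluate(row)])

    def count(self) -> int:
        return len(self.rows)


# Made-up rows for illustration.
users = Frame([
    {"name": "Ann", "state": "CA", "age": 31},
    {"name": "Bob", "state": "NY", "age": 40},
    {"name": "Cy", "state": "CA", "age": 25},
])
predicate = users.state == "CA"   # an Equals node, not True/False
ca_users = users[predicate]
print(predicate.describe(), ca_users.count())
```

Because the predicate is data rather than an opaque function, a planner can examine or rewrite it before execution, which a Python lambda passed to `map` does not allow.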
[Chart: Aggregation benchmark, runtime in seconds (axis 0 to 10), comparing RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame R; bar values did not survive extraction.]
Modular API based on scikit-learn
Relational + graph operations
All built on DataFrames: enables cross-library optimization
1. Users:
2. Hardware: I/O bottleneck ➡ compute
3. Delivery: the public cloud
| | 2010 | 2016 | Change |
|---|---|---|---|
| Storage | 50+ MB/s (HDD) | 500+ MB/s (SSD) | 10x |
| Network | 1 Gbps | 10 Gbps | 10x |
| CPU | ~3 GHz | ~3 GHz | |
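The 2010 and 2016 figures above imply the shift from an I/O bottleneck to a compute bottleneck. The short calculation below makes the ratios explicit. It treats the slide's approximate values ("50+", "~3") as exact numbers, which is an assumption, and the dictionary keys are just labels chosen for this sketch.

```python
# (2010 value, 2016 value), taken from the slide and treated as exact.
hardware = {
    "storage (MB/s)": (50, 500),
    "network (Gbps)": (1, 10),
    "cpu (GHz)": (3, 3),
}

# Improvement factor per resource between 2010 and 2016.
improvement = {name: new / old for name, (old, new) in hardware.items()}

# Growth in storage bandwidth relative to clock speed: how much more data
# each CPU cycle can now be fed.
io_growth_per_cycle = improvement["storage (MB/s)"] / improvement["cpu (GHz)"]

for name, factor in improvement.items():
    print(f"{name}: {factor:.0f}x")
print(f"I/O per CPU cycle: {io_growth_per_cycle:.0f}x")
```

With I/O roughly ten times faster and clock speed flat, the same core has about ten times as much data arriving per cycle, so CPU efficiency becomes the limiting factor.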
• Many current systems are 2-10x off peak performance
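[Chart: Results from the Nested Vector Language (NVL) project at MIT, comparing current in-memory systems with hand-tuned code on three workloads: HyPer (database), TensorFlow (Word2Vec), and GraphMat (PageRank); values did not survive extraction.]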
Spark 1.6: 14M rows/s
Spark 2.0: 125M rows/s
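For scale, the ratio between the two throughput figures can be worked out directly. The calculation assumes both numbers were measured on the same workload and hardware, which the slide text does not state.

```python
# Throughput figures from the slide, in rows per second.
spark_1_6_rows_per_s = 14e6
spark_2_0_rows_per_s = 125e6

# Relative speedup of Spark 2.0 over Spark 1.6 on this measurement.
speedup = spark_2_0_rows_per_s / spark_1_6_rows_per_s
print(f"Spark 2.0 vs. 1.6: {speedup:.1f}x")
```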
1. Users:
2. Hardware: I/O bottleneck ➡ compute
3. Delivery: the public cloud
• Multi-tenant
• Fully measured
• Elastic
• Continuously updated
Must design an organization, not a piece of software