Spark meets Telemetry

DESCRIPTION

A talk about Spark and Mozilla Telemetry.

TRANSCRIPT
SPARK MEETS TELEMETRY

Mozlandia 2014
Roberto Agostino Vitillo
TELEMETRY PINGS
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs
• The backend validates and stores pings on S3
TELEMETRY MAP-REDUCE

• Processes pings from S3 using a map-reduce framework written in Python
• https://github.com/mozilla/telemetry-server

```python
import json

def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))
```
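To make the calling convention concrete, here is a toy local driver that mimics how such a framework might invoke the map/reduce pair above. The `Context` class is hypothetical, for illustration only; it is not the actual telemetry-server API.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for the framework's context object:
# collects (key, value) pairs written by map and reduce.
class Context:
    def __init__(self):
        self.results = defaultdict(list)

    def write(self, key, value):
        self.results[key].append(value)

def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))

# Three fake pings, stored as JSON blobs like the real ones on S3.
pings = [json.dumps({'info': {'OS': os}}) for os in ['Linux', 'WINNT', 'Linux']]

map_cx = Context()
for i, ping in enumerate(pings):
    map(i, None, ping, map_cx)       # emit (OS, 1) for each ping

reduce_cx = Context()
for key, values in map_cx.results.items():
    reduce(key, values, reduce_cx)   # sum the counts per OS

print(dict(reduce_cx.results))       # {'Linux': [2], 'WINNT': [1]}
```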
SHORTCOMINGS
• Not distributed, limited to a single machine
• Doesn’t support chains of map/reduce ops
• Doesn’t support SQL-like queries
• Batch-oriented
(Slide: Spark sets a new record in large-scale sorting.)
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
WHAT IS SPARK?
• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop)
• Comes with over 80 distributed operations for grouping, filtering, etc.
• Runs standalone or on Hadoop, Mesos and TaskCluster in the future (right Jonas?)
WHY DO WE CARE?

• In-memory caching
• Interactive command line interface for EDA (think R command line)
• Comes with higher level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
WHY DO WE REALLY CARE?
The easier we make it to get answers, the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK?

• User creates Resilient Distributed Datasets (RDDs), transforms them, and executes them
• RDD operations are compiled to a DAG of operators
• DAG is compiled into stages
• A stage is executed in parallel as a series of tasks
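The key idea behind the steps above is lazy evaluation: transformations only record operators in a logical plan, and nothing runs until an action forces execution. A loose plain-Python analogy (a sketch of the idea, not Spark's actual internals):

```python
# Minimal lazy-pipeline sketch: transformations append operators to a
# recorded plan (the "DAG"); only collect(), the action, executes them.
class ToyRDD:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []            # recorded operators, not yet run

    def map(self, fn):
        return ToyRDD(self.data, self.plan + [('map', fn)])

    def filter(self, fn):
        return ToyRDD(self.data, self.plan + [('filter', fn)])

    def collect(self):                    # the action: run the whole plan
        out = self.data
        for op, fn in self.plan:
            if op == 'map':
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(rdd.plan))   # 2 operators recorded, nothing executed yet
print(rdd.collect())   # [20, 30, 40]
```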
RDD

A parallel dataset with partitions

(Diagram: observations as rows and variables Var A, Var B, Var C as columns, divided across partitions.)
DAG

Logical graph of RDD operations:

```scala
sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
```

(Diagram: read → map → map → reduceByKey, transforming RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)] across partitions P1 to P4.)
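For readers more familiar with Python, the same logical computation (split CSV lines into key/value pairs, then sum values per key) can be written in plain Python. This shows only the computation; it leaves out Spark's partitioning and distribution.

```python
from collections import defaultdict

# Plain-Python equivalent of the pipeline: split each "key,value" line,
# parse the integer field, then sum values by key (the reduceByKey step).
lines = ["a,1", "b,2", "a,3"]

pairs = [(f[0], int(f[1])) for f in (line.split(",") for line in lines)]

totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value          # reduceByKey(_ + _)

print(dict(totals))               # {'a': 4, 'b': 2}
```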
STAGE

(Diagram: the same DAG split at the shuffle boundary into Stage 1, covering read/map/map over partitions P1 to P4, and Stage 2, covering reduceByKey.)
(Diagram: within Stage 1, each partition P1 to P4 becomes one task T1 to T4 running read, map, map, then the shuffle write; a stage is a set of tasks that can run in parallel.)
STAGE

• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution:
  1. Fetch input
  2. Execute
  3. Output result
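To make "serialised and shipped to workers" concrete, here is a hedged sketch using Python's pickle. Spark itself serialises Scala/Java closures and sends them over the network; this only mirrors the idea in one process.

```python
import pickle

# A task pairs a function with its input partition; serialising the pair
# is what lets a scheduler ship the work to a remote worker process.
def task_fn(partition):
    return sum(partition)           # 2. execute

task = (task_fn, [1, 2, 3, 4])      # function + input partition
blob = pickle.dumps(task)           # serialise, as if sending to a worker

fn, data = pickle.loads(blob)       # 1. fetch input (deserialise on worker)
result = fn(data)
print(result)                       # 3. output result: 10
```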
HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it’s started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
• Connect to the machine through ssh
• Clone the starter template and launch the Spark console:

```shell
git clone https://github.com/vitillo/mozilla-telemetry-spark.git
cd mozilla-telemetry-spark && source aws/setup.sh
sbt console
```
• Open http://bit.ly/1wBHHDH
TUTORIAL