Spark meets Telemetry

DESCRIPTION

A talk about Spark and Mozilla Telemetry.

TRANSCRIPT
SPARK MEETS TELEMETRY

Mozlandia 2014
Roberto Agostino Vitillo
TELEMETRY PINGS
• If Telemetry is enabled, a ping is generated for each session
• Pings are sent to our backend infrastructure as JSON blobs
• The backend validates and stores pings on S3
TELEMETRY MAP-REDUCE

• Processes pings from S3 using a map-reduce framework written in Python
• https://github.com/mozilla/telemetry-server

```python
import json

def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))
```
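To make the calling convention concrete, here is a toy local driver that mimics how such a framework might invoke the map/reduce pair above. The `Context` class is hypothetical, for illustration only; it is not the actual telemetry-server API.

```python
import json
from collections import defaultdict

# Hypothetical stand-in for the framework's context object:
# collects (key, value) pairs written by map and reduce.
class Context:
    def __init__(self):
        self.results = defaultdict(list)

    def write(self, key, value):
        self.results[key].append(value)

def map(k, d, v, cx):
    j = json.loads(v)
    os = j['info']['OS']
    cx.write(os, 1)

def reduce(k, v, cx):
    cx.write(k, sum(v))

# Three fake pings, stored as JSON blobs like the real ones on S3.
pings = [json.dumps({'info': {'OS': os}}) for os in ['Linux', 'WINNT', 'Linux']]

map_cx = Context()
for i, ping in enumerate(pings):
    map(i, None, ping, map_cx)       # emit (OS, 1) for each ping

reduce_cx = Context()
for key, values in map_cx.results.items():
    reduce(key, values, reduce_cx)   # sum the counts per OS

print(dict(reduce_cx.results))       # {'Linux': [2], 'WINNT': [1]}
```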
SHORTCOMINGS
• Not distributed, limited to a single machine
• Doesn’t support chains of map/reduce ops
• Doesn’t support SQL-like queries
• Batch-oriented
(Slide: Spark sets a new record in large-scale sorting.)
source: http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html
WHAT IS SPARK?
• In-memory data analytics cluster computing framework (up to 100x faster than Hadoop)
• Comes with over 80 distributed operations for grouping, filtering, etc.
• Runs standalone or on Hadoop, Mesos and TaskCluster in the future (right Jonas?)
WHY DO WE CARE?

• In-memory caching
• Interactive command line interface for EDA (think R command line)
• Comes with higher level libraries for machine learning and graph processing
• Works beautifully on a single machine without tedious setup; doesn’t depend on Hadoop/HDFS
• Scala, Python, Clojure and R APIs are available
WHY DO WE REALLY CARE?
The easier we make it to get answers, the more questions we will ask
MASHUP DEMO
HOW DOES IT WORK?

• User creates Resilient Distributed Datasets (RDDs), transforms them, and executes them
• RDD operations are compiled to a DAG of operators
• DAG is compiled into stages
• A stage is executed in parallel as a series of tasks
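The key idea behind the steps above is lazy evaluation: transformations only record operators in a logical plan, and nothing runs until an action forces execution. A loose plain-Python analogy (a sketch of the idea, not Spark's actual internals):

```python
# Minimal lazy-pipeline sketch: transformations append operators to a
# recorded plan (the "DAG"); only collect(), the action, executes them.
class ToyRDD:
    def __init__(self, data, plan=None):
        self.data = data
        self.plan = plan or []            # recorded operators, not yet run

    def map(self, fn):
        return ToyRDD(self.data, self.plan + [('map', fn)])

    def filter(self, fn):
        return ToyRDD(self.data, self.plan + [('filter', fn)])

    def collect(self):                    # the action: run the whole plan
        out = self.data
        for op, fn in self.plan:
            if op == 'map':
                out = [fn(x) for x in out]
            else:
                out = [x for x in out if fn(x)]
        return out

rdd = ToyRDD([1, 2, 3, 4]).map(lambda x: x * 10).filter(lambda x: x > 15)
print(len(rdd.plan))   # 2 operators recorded, nothing executed yet
print(rdd.collect())   # [20, 30, 40]
```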
RDD

A parallel dataset with partitions

(Diagram: observations as rows and variables Var A, Var B, Var C as columns, divided across partitions.)
DAG

Logical graph of RDD operations:

```scala
sc.textFile("input")
  .map(line => line.split(","))
  .map(line => (line(0), line(1).toInt))
  .reduceByKey(_ + _, 3)
```

(Diagram: read → map → map → reduceByKey, transforming RDD[String] → RDD[Array[String]] → RDD[(String, Int)] → RDD[(String, Int)] across partitions P1 to P4.)
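For readers more familiar with Python, the same logical computation (split CSV lines into key/value pairs, then sum values per key) can be written in plain Python. This shows only the computation; it leaves out Spark's partitioning and distribution.

```python
from collections import defaultdict

# Plain-Python equivalent of the pipeline: split each "key,value" line,
# parse the integer field, then sum values by key (the reduceByKey step).
lines = ["a,1", "b,2", "a,3"]

pairs = [(f[0], int(f[1])) for f in (line.split(",") for line in lines)]

totals = defaultdict(int)
for key, value in pairs:
    totals[key] += value          # reduceByKey(_ + _)

print(dict(totals))               # {'a': 4, 'b': 2}
```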
STAGE

(Diagram: the same DAG split at the shuffle boundary into Stage 1, covering read/map/map over partitions P1 to P4, and Stage 2, covering reduceByKey.)
(Diagram: within Stage 1, each partition P1 to P4 becomes one task T1 to T4 running read, map, map, then the shuffle write; a stage is a set of tasks that can run in parallel.)
STAGE

• Tasks are the fundamental unit of work
• Tasks are serialised and shipped to workers
• Task execution:
  1. Fetch input
  2. Execute
  3. Output result
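To make "serialised and shipped to workers" concrete, here is a hedged sketch using Python's pickle. Spark itself serialises Scala/Java closures and sends them over the network; this only mirrors the idea in one process.

```python
import pickle

# A task pairs a function with its input partition; serialising the pair
# is what lets a scheduler ship the work to a remote worker process.
def task_fn(partition):
    return sum(partition)           # 2. execute

task = (task_fn, [1, 2, 3, 4])      # function + input partition
blob = pickle.dumps(task)           # serialise, as if sending to a worker

fn, data = pickle.loads(blob)       # 1. fetch input (deserialise on worker)
result = fn(data)
print(result)                       # 3. output result: 10
```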
HANDS-ON
1. Visit telemetry-dash.mozilla.org and sign in using Persona.
2. Click “Launch an ad-hoc analysis worker”.
3. Upload your SSH public key (this allows you to log in to the server once it’s started up).
4. Click “Submit”.
5. An Ubuntu machine will be started up on Amazon’s EC2 infrastructure.
• Connect to the machine through ssh
• Clone the starter template and launch the Spark console:

```shell
git clone https://github.com/vitillo/mozilla-telemetry-spark.git
cd mozilla-telemetry-spark && source aws/setup.sh
sbt console
```
• Open http://bit.ly/1wBHHDH
TUTORIAL