using spark in production at localytics - 2015-01-31

‹#›

January 31, 2015

Using Spark in Production

Pete Gamache / @gamache / [email protected]

mailto:[email protected]


Quick Introduction to Localytics

•Localytics is an app marketing service with a strong analytics engine at its core.

•We process billions of events per day in real time, using a backend made of Scala and AWS.

•Every data point we’ve ever received is available for querying, and we integrate analytics insight with marketing automation.

2


Original Architecture

•Rails

•MySQL

• It was 2009 and people did that back then

3


Current Backend Architecture

•Scala services loosely coupled through queues

•“Queuer” service is our front line, and collects and enqueues events sent from mobile or web apps

•“Processor” service handles deduplication, device recognition, and a lot more

•Data ends up in MPP DB (querying) and in S3 (export)

4


Our Processor Service

•Processor service is the place we have traditionally added new functionality

•After four years of this, it’s rather monolithic

•Parts are absolutely critical to our company; other parts are not

•Different parts scale differently

5


Rethinking Our Data Processing

•Treat the Processor service as the “source of truth”

•Build outboard services which consume this truth

•Simplify the existing Processor by moving some logic out

•Add specialized data sources to take burden off MPP DB

6


The Source of Truth

•S3 bucket containing batches of events, filed by date and hour of processing

•Events may (will) be out of order

•A single user’s session may (will) be split across files

•At press time: each file ~500MB of JSON objects; ~500 files per hour (~6 TB/day)

7


Side Note: Lambda Architecture

•Batch and stream processing, at once

•Most of what we do right now is batch processing, where “batch” means “all of our data ever”

•MPP DB provides arbitrary querying of entire data set

•Outboard services can provide a speed layer for simple/common queries

8


Feature Requirements

•Every event we receive has a number of attributes passed along with it •e.g., Country, Timezone, OS Version, etc.

•We want to keep the most recent attribute values for each of our users in a key-value store

•This provides fast real-time access to current truths, rather than sifting through a series of events for point-in-time facts

9


Apache Spark

•Most often envisioned as multi-node, long-lived cluster à la Hadoop

•This is a great fit for many batch-layer tasks

•Spark Streaming makes sliding-window querying very simple

•Spark SQL provides a familiar interface to non-production coders (business intelligence, data science, etc.)

10


A Few Gripes about Spark Clusters

•Weak tooling around cluster spin-up, monitoring, maintenance, scaling

•Tough to scale horizontally, compared to Amazon ELB

•Building fat JARs when you use any non-Spark dependencies at all is a god-damned nightmare

11


I’m Not Kidding About the Fat JARslibraryDependencies ++= Seq( // ("org.apache.spark" %% "spark-sql" % "1.1.0"). exclude("com.typesafe.akka", "akka-actor_2.10"). exclude("com.typesafe.akka", "akka-slf4j_2.10"). exclude("org.mortbay.jetty", "servlet-api"). exclude("javax.transaction", "jta"). exclude("commons-beanutils", "commons-beanutils-core"). exclude("commons-logging", "commons-logging"). exclude("org.slf4j", "jcl-over-slf4j"). exclude("org.slf4j", "slf4j-log4j12"). exclude("commons-collections", "commons-collections"). exclude("com.esotericsoftware.minlog", "minlog"), // ("com.typesafe.play" %% "play-json" % "2.4.0-M2"). exclude("com.typesafe.akka", "akka-slf4j_2.10"). exclude("javax.transaction", "jta"). exclude("commons-logging", "commons-logging"). exclude("org.slf4j", "jcl-over-slf4j").

12


Spark Standalone FTW

•Great for small bites of data (tens of GB)

•Few dependency issues

•Horizontal scale comes easy/free

•Hadoop file input adapters are excellent

•Works great for tests, too!13


A Proposed Product Architecture

•Processor service spits truth into S3 bucket •S3 Event Notifications are posted to an SQS queue •A pool of post-processing servers, each running Spark

standalone, drains the queue. Scale up on queue depth •Each S3 file generates ~25K updates to be applied •Actor pool of HTTP clients applies updates to key-value web

service •Akka ask pattern provides per-server synchronization •Play framework as app container

14


How’s it work?

•Great, that’s how

•Each server ingests 5-10 files at a time (~3-5GB) •Pre-tuning: T_spark=1m20s, 25K updates per minute on

c3.8xlarge EC2 instance (60GB RAM, 32 cores, 10GB network) •Post-tuning: T_spark=1m20s, 75K updates per minute on

c3.2xlarge EC2 instance (15GB RAM, 8 cores, 1GB network)

•That’s a 4x speedup on Spark and 12x on HTTP

15


Tuning

• Ingesting files in batches of 5 or more

•Replacing default Play logger with AsyncLogger

•HTTP tuning •Gave up play-ws for Apache HttpClient •Single-threaded client per actor •Keepalive •Preëmptive HTTP Basic Auth

16


Gotchas and Issues

•Mostly transient Spark issues •Network timeouts •File not found •Too many open files

•Solution: •Akka SupervisorStrategy + actor restart •SQS message timeout ensures data will be processed

17


Future Work

•Establish long-lived Spark Streaming cluster(s) for batch layer

•Create more standalone Spark services

•Disassemble the Processor monolith brick by brick

•???

18

‹#›

If this kind of thing sounds like fun, let’s talk! Email me at [email protected] or visit www.localytics.com/jobs/.

We’re Hiring!

mailto:[email protected]

http://www.localytics.com/jobs/

‹#›

Q&A

‹#›

Now let’s go have a beer.

Thanks!

using spark in production at localytics - 2015-01-31

Data & Analytics