all things open - spark & storm - where & when?

Spark & Storm: When & Where?

Upload: ian-pointer

Post on 19-Jan-2017

TRANSCRIPT

Page 1: All Things Open - Spark & Storm - Where & When?

Spark & Storm: When & Where?

Page 2: All Things Open - Spark & Storm - Where & When?

www.mammothdata.com | @mammothdataco

The Leader in Big Data Consulting

● BI/Data Strategy
  ○ Development of a business intelligence / data architecture strategy.

● Installation
  ○ Installation of Hadoop or relevant technology.

● Data Consolidation
  ○ Load data from diverse sources into a single scalable repository.

● Streaming
  ○ Mammoth will write ingestion and/or analytics that operate on the data as it comes in, and design dashboards, feeds or computer-driven decision-making processes to derive insights and make decisions.

● Visualization Tools
  ○ Mammoth will set up a visualization tool (e.g. Tableau, Pentaho), create initial reports and provide training to the employees who will analyze the data.

Mammoth Data, based in downtown Durham (right above Toast)

Page 3: All Things Open - Spark & Storm - Where & When?

● Lead Consultant on all things DevOps and Spark

● @carsondial on Twitter

Me!

Page 4: All Things Open - Spark & Storm - Where & When?

● Quick overview of Spark Streaming

● Reasons why Spark Streaming can be tricky in practice

● Performance and tuning tips we’ve learnt over the past two years

● …and when to pack it all in and use Storm instead

What This Talk Is About

Page 5: All Things Open - Spark & Storm - Where & When?

This IS WEB SCALE!

Page 6: All Things Open - Spark & Storm - Where & When?

● I kid, Rails!

● (mostly)

Beyond Web Scale

Page 7: All Things Open - Spark & Storm - Where & When?

● Spark & Storm - millions of requests / second on commodity hardware

● Different problems at different scales!

Beyond Web Scale

Page 8: All Things Open - Spark & Storm - Where & When?

● Directed Acyclic Graph Data Processing Engine

● Based around the Resilient Distributed Dataset (RDD) primitive

Spark

Page 9: All Things Open - Spark & Storm - Where & When?

Spark Streaming — Overview

Page 10: All Things Open - Spark & Storm - Where & When?

Spark Streaming — In Production?

● Yes!

● (Alibaba, AutoTrader, Cisco, Netflix, etc.)

Page 11: All Things Open - Spark & Storm - Where & When?

● Streaming by running batches very quickly!

● Batch length: can be as low as 0.5s / batch

● Every X seconds, get Y records (DStream/RDDs)

Spark Streaming — Overview
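The "every X seconds, get Y records" idea can be sketched without Spark at all. What follows is a plain-Scala model, not Spark's API: `MicroBatch`, `Record` and `batches` are names made up for illustration, and the 500 ms interval in the usage note simply mirrors the 0.5s figure above.

```scala
// Plain-Scala model of micro-batching; it does not use Spark itself.
object MicroBatch {
  /** A record tagged with its arrival time in milliseconds. */
  final case class Record(arrivalMs: Long, payload: String)

  /** Groups records into consecutive batches of `batchMs` milliseconds.
    * Returns (batch index, records in that batch) pairs in time order.
    * Intervals with no records are simply absent from the result.
    */
  def batches(records: Seq[Record], batchMs: Long): Seq[(Long, Seq[Record])] =
    records
      .groupBy(r => r.arrivalMs / batchMs) // integer division picks the interval
      .toSeq
      .sortBy(_._1)
}
```

With a 500 ms interval, records arriving at 0, 499, 500 and 1200 ms land in batches 0, 0, 1 and 2. A real DStream differs in at least one respect: it still produces an (empty) RDD for an interval in which nothing arrived.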

Page 12: All Things Open - Spark & Storm - Where & When?

● Using same implementation (mostly) for batch and stream processing (Lambda Architecture hipster points ahoy!)

● Access to the rest of Spark - DataFrames, MLlib, GraphX, etc.

Spark Streaming — Good Things

Page 13: All Things Open - Spark & Storm - Where & When?

● What happens if you can’t process Y records in X seconds?

● What happens if you require sub-second latency?

Spark Streaming — Bad Things!

Page 14: All Things Open - Spark & Storm - Where & When?

Spark Streaming — I’m so sorry.

Page 15: All Things Open - Spark & Storm - Where & When?

● What happens if you can’t process Y records in X seconds?

● Data builds up in executors

● Executors run out of memory…

Spark Streaming — Bad Things!
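A back-of-the-envelope model shows why falling behind is fatal rather than merely slow. This is a plain-Scala sketch under simple assumptions (a constant arrival rate and a constant processing capacity per batch interval); `Backlog` and `afterEachBatch` are illustrative names, not anything Spark provides.

```scala
// Plain-Scala model of a streaming job that cannot keep up.
object Backlog {
  /** Unprocessed records left at the end of each batch interval when
    * `arriving` records show up per interval but only `capacity` of
    * them can be processed in that time.
    */
  def afterEachBatch(arriving: Long, capacity: Long, intervals: Int): Seq[Long] =
    (1 to intervals)
      .scanLeft(0L) { (queued, _) =>
        // carry the old backlog forward, add new arrivals, drain what we can
        math.max(0L, queued + arriving - capacity)
      }
      .tail // drop the initial empty backlog
}
```

With 1,200 records arriving per interval and capacity for 1,000, the backlog after three intervals is 200, 400 and 600 records. It grows without bound, and in this model that backlog is what piles up in executor memory.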

Page 16: All Things Open - Spark & Storm - Where & When?

● “Hey, we forgot to tell you Ops people that we have a major new client adding stuff into the firehose sometime today. That’s fine, right?”

Spark Streaming — Bad Things!

Page 17: All Things Open - Spark & Storm - Where & When?

Spark Streaming — It Will Be Okay

Page 18: All Things Open - Spark & Storm - Where & When?

● As a former Ops person:

● WE WILL REMEMBER.

Spark Streaming — Bad Things!

Page 19: All Things Open - Spark & Storm - Where & When?

● Do you need low latency?

● If so, a 10-minute nap is advisable!

● Everybody else, let’s dive in…

Spark Streaming — Tuning

Page 20: All Things Open - Spark & Storm - Where & When?

Spark Streaming — Tuning

Page 21: All Things Open - Spark & Storm - Where & When?

Spark Streaming — Down In The Hole

Page 22: All Things Open - Spark & Storm - Where & When?

Spark Streaming — Down In The Hole

Page 23: All Things Open - Spark & Storm - Where & When?

● Easiest method — alter the batch window until it’s all fine!

● Tiny batches provide tight execution times!

Spark Streaming — Down In The Hole

Page 24: All Things Open - Spark & Storm - Where & When?

● Use Kafka.

● Data source with the most love (e.g. exactly-once semantics without Write Ahead Logs and receiver-less operation in 1.3+)

● (other sources get the features…eventually)

Spark Streaming — Tuning
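For reference, a receiver-less Kafka stream might be created roughly like this. The sketch assumes the original Kafka 0.8 integration (the `spark-streaming-kafka` artifact introduced around Spark 1.3); later Kafka integrations use a different API. `DirectKafka`, `brokers` and `topics` are placeholder names.

```scala
import kafka.serializer.StringDecoder
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils

object DirectKafka {
  /** Builds a direct (receiver-less) stream of (key, value) string pairs.
    * `brokers` is a comma-separated host:port list, e.g. "host1:9092,host2:9092".
    */
  def stream(ssc: StreamingContext,
             brokers: String,
             topics: Set[String]): InputDStream[(String, String)] = {
    // The direct approach talks to the brokers itself instead of going through a receiver.
    val kafkaParams = Map("metadata.broker.list" -> brokers)
    KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)
  }
}
```

Because no receiver is involved, this style of stream does not tie up a core per receiver, and each batch corresponds to a range of Kafka offsets.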

Page 25: All Things Open - Spark & Storm - Where & When?

● Use Scala.

● CPython = slower in execution

● PyPy is much faster…but…

● New features always come to Scala first.

Spark Streaming — Tuning

Page 26: All Things Open - Spark & Storm - Where & When?

● (or Java if you really must)

Spark Streaming — Tuning

Page 27: All Things Open - Spark & Storm - Where & When?

● Spark Streaming = data receivers + Spark

● spark.cores.max = x * number of receivers

● For Great Data Locality and Parallelism!

Spark Streaming — Cores
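The sizing rule can be written down as a small helper. This is only a sketch: the slide leaves the multiplier `x` open, so it is a parameter here, and `StreamingCores`, `coresMax` and `setting` are illustrative names. Note that `spark.cores.max` applies to standalone and coarse-grained Mesos deployments.

```scala
// Plain-Scala helper for the "x * number of receivers" rule of thumb.
object StreamingCores {
  /** Value for spark.cores.max given the multiplier `x`.
    * Each receiver occupies one core for the life of the job, so `x`
    * must be greater than 1 to leave any cores for processing batches.
    */
  def coresMax(x: Int, receivers: Int): Int = {
    require(receivers > 0, "need at least one receiver")
    require(x > 1, "x = 1 would give every core to a receiver")
    x * receivers
  }

  /** The setting as a key/value pair, e.g. for SparkConf.set or --conf. */
  def setting(x: Int, receivers: Int): (String, String) =
    "spark.cores.max" -> coresMax(x, receivers).toString
}
```

For example, two receivers with `x = 4` gives `spark.cores.max = 8`: two cores pinned to receivers and six left for the batches themselves.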

Page 28: All Things Open - Spark & Storm - Where & When?

● Are you using a foreachRDD loop?

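// dstream is the DStream being processed; foreachRDD is a DStream method
dstream.foreachRDD { rdd =>
  rdd.cache()      // keep this batch in memory while it is reused
  // …
  rdd.unpersist()  // release it once the batch has been handled
}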

Spark Streaming — Caching

Page 29: All Things Open - Spark & Storm - Where & When?

● If you are routing to multiple stores or iterating over an RDD multiple times, using cache() is a quick win

● It really shouldn’t work so well…

Spark Streaming — Caching
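A fuller version of the caching pattern, for the case of writing one batch to two stores, might look like the following. It is a sketch, not the speaker's code: `MultiStore`, `route`, `writeToStoreA` and `writeToStoreB` are hypothetical names, and the two writer functions stand in for whatever sinks you actually use.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

object MultiStore {
  /** Sends every batch of `events` to two sinks, computing it only once. */
  def route(events: DStream[String],
            writeToStoreA: RDD[String] => Unit,
            writeToStoreB: RDD[String] => Unit): Unit =
    events.foreachRDD { rdd =>
      rdd.cache()            // mark for caching; filled in by the first action
      try {
        writeToStoreA(rdd)   // first pass computes the batch and caches it
        writeToStoreB(rdd)   // second pass reads the cached partitions
      } finally {
        rdd.unpersist()      // free executor memory before the next batch
      }
    }
}
```

Without the `cache()` call, each writer would trigger its own evaluation of the batch's lineage. The `try`/`finally` is a design choice so that a failing sink does not leave cached batches behind.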

Page 30: All Things Open - Spark & Storm - Where & When?

● Hurrah for Spark 1.5!

● spark.streaming.backpressure.enabled = true

● Spark dynamically alters incoming data rates (keeping the data in Kafka rather than in the executors)

● Works for all data sources (for once!)

Spark Streaming — Backpressure
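Turning backpressure on is a one-line configuration change. The sketch below sets it on a `SparkConf` along with two optional rate ceilings; `BackpressureConf` and `build` are illustrative names and the numeric limits are placeholders, not recommendations.

```scala
import org.apache.spark.SparkConf

object BackpressureConf {
  def build(appName: String): SparkConf =
    new SparkConf()
      .setAppName(appName)
      // Let Spark adapt the ingestion rate to recent batch processing times (Spark 1.5+).
      .set("spark.streaming.backpressure.enabled", "true")
      // Optional hard ceiling: records per second per receiver.
      .set("spark.streaming.receiver.maxRate", "10000")
      // Optional hard ceiling: records per second per Kafka partition (direct stream).
      .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}
```

The same keys can equally be passed with `--conf` on `spark-submit` or placed in `spark-defaults.conf`.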

Page 31: All Things Open - Spark & Storm - Where & When?

● I really need that low-latency response!

Storm

Page 32: All Things Open - Spark & Storm - Where & When?

● Directed Acyclic Graph Data Processing Engine

Storm

Page 33: All Things Open - Spark & Storm - Where & When?

Spark

“Very Good, Sir”

Page 34: All Things Open - Spark & Storm - Where & When?

Storm

“Here you go!”

Page 35: All Things Open - Spark & Storm - Where & When?

● Stream of tuples

● Bolts

● Spouts

● Topologies

Storm Concepts

Page 36: All Things Open - Spark & Storm - Where & When?

● Unbounded stream of tuples

● Tuples are defined via schema (usual base types plus custom serializers)

Storm — Streams

Page 37: All Things Open - Spark & Storm - Where & When?

● Sources of tuples in a topology

● Read from external sources (e.g. Kafka) and emit the data as tuples

● Can emit multiple streams from a spout!

Storm — Spouts

Page 38: All Things Open - Spark & Storm - Where & When?

● Where your processing happens

● Roll your own aggregations / filtering / windowing

● Bolts can feed into other bolts

● Potentially easier to test than Spark Streaming

● Many Bolt connectors for external sources (e.g. Cassandra, Redis, Hive, etc)

Storm — Bolts

Page 39: All Things Open - Spark & Storm - Where & When?

● The DAG of the spouts and bolts

● Built programmatically in code and submitted to the Storm cluster

● Flux - Do It In YAML (and then complain about whitespace)

Storm — Topologies
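Putting spouts, bolts and a topology together in code could look something like this. It is a sketch that assumes Storm 1.x or later package names (`org.apache.storm`; earlier releases used `backtype.storm`) and borrows `TestWordSpout` from Storm's own testing utilities as a stand-in source. `ExclaimBolt`, `ExclaimTopology` and the component ids are made up for illustration.

```scala
import org.apache.storm.generated.StormTopology
import org.apache.storm.testing.TestWordSpout
import org.apache.storm.topology.{BasicOutputCollector, OutputFieldsDeclarer, TopologyBuilder}
import org.apache.storm.topology.base.BaseBasicBolt
import org.apache.storm.tuple.{Fields, Tuple, Values}

// Bolt: appends "!" to the first field of each incoming tuple and emits it.
class ExclaimBolt extends BaseBasicBolt {
  override def execute(input: Tuple, collector: BasicOutputCollector): Unit =
    collector.emit(new Values(input.getString(0) + "!"))

  // Schema of the tuples this bolt emits: a single field named "word".
  override def declareOutputFields(declarer: OutputFieldsDeclarer): Unit =
    declarer.declare(new Fields("word"))
}

object ExclaimTopology {
  /** Wires spout -> bolt -> bolt into a DAG; the last argument is the parallelism hint. */
  def build(): StormTopology = {
    val builder = new TopologyBuilder
    builder.setSpout("words", new TestWordSpout, Int.box(2))
    builder.setBolt("exclaim", new ExclaimBolt, Int.box(2)).shuffleGrouping("words")
    builder.setBolt("exclaim-again", new ExclaimBolt, Int.box(2)).shuffleGrouping("exclaim")
    builder.createTopology()
  }
}
```

The resulting `StormTopology` is what gets submitted to the cluster, typically via `StormSubmitter.submitTopology` together with a `Config`.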

Page 40: All Things Open - Spark & Storm - Where & When?

● Each bolt or spout runs 'tasks' across the cluster

● How parallelism works in Storm

● Set in topology submission

Storm — Tasks

Page 41: All Things Open - Spark & Storm - Where & When?

● Where the topology runs

● 1 worker = 1 JVM

● Tasks run as threads on a worker

● Storm distributes tasks evenly across the cluster

Storm — Workers
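To make "distributes tasks evenly" concrete, here is a toy model of tasks being spread across worker JVMs. It is a plain-Scala illustration assuming simple round-robin placement. Storm's real scheduler is pluggable and more involved, and `TaskPlacement` and `assign` are made-up names.

```scala
// Toy model only: round-robin placement of tasks onto workers.
object TaskPlacement {
  /** Places task ids 0 until `tasks` onto `workers` JVMs in turn.
    * Returns worker index -> the task ids that would run as threads in it.
    * Workers that receive no task are absent from the map.
    */
  def assign(tasks: Int, workers: Int): Map[Int, Seq[Int]] = {
    require(workers > 0, "need at least one worker")
    (0 until tasks).groupBy(_ % workers)
  }
}
```

Five tasks over two workers come out as tasks 0, 2, 4 on worker 0 and tasks 1, 3 on worker 1, so no worker holds more than one task above any other.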

Page 42: All Things Open - Spark & Storm - Where & When?

● True Streaming

● Tuples are processed as they enter the topology - low latency

● Scales far beyond Spark Streaming (currently)

Storm — Good Things

Page 43: All Things Open - Spark & Storm - Where & When?

● Battle-tested at Twitter & Yahoo!

● Yahoo! has 300-node clusters and is working to support 1000+ nodes

● Single node clocked at over 1.5m tuples / second at Twitter

Storm — Good Things

Page 44: All Things Open - Spark & Storm - Where & When?

● Very DIY (bring your own aggregations, ML, etc)

● Your DAG construction may not be optimal

● Operationally more complex (and Storm WebUI is more primitive)

● Where’s Me REPL?

Storm — Bad Things

Page 45: All Things Open - Spark & Storm - Where & When?

Spark or Storm?

Page 46: All Things Open - Spark & Storm - Where & When?

● SLA on latency?

Spark or Storm?

Page 47: All Things Open - Spark & Storm - Where & When?

● Storm!

● (though simply because it’s possible doesn’t mean you’ll get it!)

Spark or Storm?

Page 48: All Things Open - Spark & Storm - Where & When?

● Insane data needs (e.g. ~100m records/second?)

Spark or Storm?

Page 49: All Things Open - Spark & Storm - Where & When?

● Storm!

● (though, again, it’s not a magic bullet!)

Spark or Storm?

Page 50: All Things Open - Spark & Storm - Where & When?

● For almost anything else? Spark.

● High-level vs. Low-level

● Each new version of Spark delivers improvements!

Spark or Storm?
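The decision slides boil down to a couple of checks, sketched here as a function. The thresholds are assumptions lifted loosely from this talk (0.5s as the smallest batch length mentioned, ~100m records/second as the "insane" mark), and `EngineChoice` and `choose` are illustrative names rather than a real rule engine.

```scala
// Rule-of-thumb encoding of the "Spark or Storm?" slides.
object EngineChoice {
  sealed trait Engine
  case object Spark extends Engine
  case object Storm extends Engine

  /** `latencySlaMs` is None when there is no latency SLA at all. */
  def choose(latencySlaMs: Option[Long], recordsPerSecond: Long): Engine =
    if (latencySlaMs.exists(_ < 500L)) Storm          // tighter than the smallest batch
    else if (recordsPerSecond >= 100000000L) Storm    // "insane" data volumes
    else Spark                                        // almost anything else
}
```

So a 100 ms SLA at modest volume picks Storm, while no SLA at the same volume picks Spark. As the slides say, picking Storm makes the target possible, not guaranteed.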

Page 51: All Things Open - Spark & Storm - Where & When?

● Other frameworks that show promise:
  ○ Flink
  ○ Apex
  ○ Samza
  ○ Heron (Twitter’s not-public Storm replacement)

Other Listing Magazines Are Available

Page 52: All Things Open - Spark & Storm - Where & When?

Questions?