crunch spark with apache from mapreduce to · spark with apache crunch micah whitacre @mkwhit....

94
From MapReduce to Spark with Apache Crunch Micah Whitacre @mkwhit

Upload: others

Post on 22-May-2020

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

From MapReduce to Spark with Apache

CrunchMicah Whitacre

@mkwhit

Page 2: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning
Page 3: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Invested in learning

Page 4: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Invested in learning

Setup production clusters

Page 5: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Invested in learning

Setup production clusters

Tuned everything

Page 6: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Current Strategy

1. Build MR Jobs as needed2. ????3. Profit

Page 7: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

We should switch to

Page 8: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Umm what would it take to switch?

Page 9: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Learn Spark’s API and processing patterns, ...

Page 10: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Refactor all our code, ...

Page 11: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Experiment with how to tune it all again, ...

Page 12: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning
Page 13: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

We don’t use plain MR it won’t be that bad...

Page 15: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning
Page 16: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

How Spark is Known..

Page 17: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

In Memory

How Spark is Known..

Page 18: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

In Memory

100x Faster than MapReduce

SQL, streaming, and complex analytics

How Spark is Known..

Page 19: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

A fast and general engine for large-scale

data processing.

Page 20: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark has an advanced Directed Acyclic Graph execution engine that

supports cyclic data flow and in-memory computing.

Page 21: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark has an advanced Directed Acyclic Graph execution engine that

supports cyclic data flow and in-memory computing.

Page 22: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning
Page 23: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

RDD

Page 24: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

RDD

Resilient Distributed Dataset

Page 25: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Locality Aware Scheduling

Page 26: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Locality Aware Scheduling

Scalability

Page 27: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Fault Tolerant

Locality Aware Scheduling

Scalability

Page 28: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Fault Tolerant

Locality Aware Scheduling

Scalability

Applications with working sets(Parallel ops on intermediate results)

Page 29: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Fault Tolerant

Locality Aware Scheduling

Scalability

Applications with working sets(Parallel ops on intermediate results)

Page 30: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Log Updates

Options?

Distributed Shared Memory + Checkpointing

Page 31: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Log Updates

Options?

Distributed Shared Memory + Checkpointing

Page 32: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Log (coarse-grained)

Updates

Page 33: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Immutable/Read Only

Partitioned

Bad for async updates to shared state

Page 34: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

RDDs lifecycle in memory tied to Spark Application

Page 35: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Transformations

Actions

Page 36: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Transformations

Actions

map, filter, flatmap, union, groupByKey, sample

reduce, collect, count, take

Page 37: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Transformations

Actions

lazily executed

return values to driver

Page 38: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

val sc = new SparkContext(new SparkConf())val charCounts = sc.textFile(args(0)) .flatMap(_.split(" ")) .flatMap(_.toCharArray).map((_, 1))charCounts.collect()// (‘a’, 1)(‘a’, 1)(‘b’, 1)(‘c’, 1)(‘e’, 1)

Page 39: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Apache Crunch Review

Page 40: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Page 41: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

Pipeline

Page 42: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Pipeline p = ...

Page 43: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Pipeline

Targets

Sources

Page 44: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

PCollection<String> values = p.read(source);...values.write(target);

Page 45: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Process Reference

Data

Process Raw

Person Data

Process Raw Data

using Reference

Filter Out Invalid Data

Group Data By Person

Create Person Record

Avro

CSV

CSV

PipelinePCollection

Page 46: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

PCollection<String> values = …PTable<String, Integer> counts = values.parallelDo(fn,ptype);

Page 47: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

Join FilterFn Group By Key MapFn

Pipeline

Page 48: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

Join FilterFn Group By Key MapFn

MRPipeline

Page 49: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Pipeline p = new MRPipeline( Driver.class, hadoopConfig);

Page 50: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

PCollection<String> values = p.read(...);<do processing>p.write(...);p.done();

Page 51: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

Join FilterFn Group By Key MapFn

MRPipeline

Map Reduce Reduce

Page 52: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Here’s what we need to do to switch...

Page 53: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

<dependency> <groupId>org.apache.spark</groupId> <artifactId>spark-core_2.10</artifactId> <version>${sparkVersion}</version> <scope>provided</scope></dependency><dependency> <groupId>org.apache.crunch</groupId> <artifactId>crunch-spark</artifactId> <version>${crunchVersion}</version> <scope>compile</scope></dependency>

Page 54: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Pipeline p = new MRPipeline( Driver.class, hadoopConfig);

Page 55: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Pipeline p = new SparkPipeline( “spark://localhost:7077”, “Spark App Name”);

Page 56: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

hadoop jar myjar.jar com.example.Driver ...

Page 57: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

spark-submit

--class com.example.Driver

--master spark://localhost:7077

...

Page 58: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

Join FilterFn Group By Key MapFn

SparkPipeline

Job 1Stage 1

Stage 2 Stage 3

Page 59: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

That’s not too bad...

Page 60: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Well there are some differences to account for...

Page 61: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning
Page 62: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Crunch with MRPipeline minimizes I/O

Crunch with SparkPipeline defers planning to Spark

Page 63: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

MRPipeline SparkPipeline

Page 64: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

MRPipeline SparkPipeline

Supports multiple writes

Page 65: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

MRPipeline SparkPipeline

Supports multiple writes

Performs multiple writes in same task/stage

Page 66: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

MRPipeline SparkPipeline

Supports multiple writes

Performs multiple writes in same task/stage

Serial writes in separate

task/stages

Page 67: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

FilterFn Group By Key MapFn

Map Reduce

Page 68: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

FilterFn Group By Key MapFn

Job 1

Job 2

SparkPipeline

Page 69: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

FilterFn Group By Key MapFn

Map Reduce

Page 70: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFn

DoFn

FilterFn Group By Key MapFn

Stage 1

Job 2

Job 3

SparkPipeline

Page 71: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFnCompute

Expensive DoFn

DoFn

FilterFn

Page 72: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark is lazy

Action needed for something to happen

Page 73: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFnCompute

Expensive DoFn

DoFn

FilterFn

Job 1

Page 74: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFnCompute

Expensive DoFn

DoFn

FilterFn

Job 2

Page 75: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFnCompute

Expensive DoFn

DoFn

FilterFn

Page 76: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Limit expensive computations

Keep RDDs around for reuse

Page 77: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark supports persisting RDDs in memory

rdd.persist()

Page 78: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DISK_ONLY, DISK_ONLY_2, MEMORY_AND_DISK, MEMORY_AND_DISK_2,

MEMORY_AND_DISK_SER, MEMORY_AND_DISK_SER_2, MEMORY_ONLY,

MEMORY_ONLY_2, MEMORY_ONLY_SER, MEMORY_ONLY_SER_2, NONE, OFF_HEAP

rdd.persist( StorageLevel.MEMORY_ONLY)

Page 79: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

DoFnCompute

Expensive DoFn

DoFn

FilterFn

Job 2

Job 1

Page 80: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

PCollection<String> values = //expensive computationvalues.cache();

Page 81: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

PCollection<String> values = //expensive computationCacheOptions opts = new CacheOptions.Builder() .useDisk(true).useMemory(true) .build();values.cache(opts);

Page 82: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark needs to be able to serialize data

Send data between workers

Persist data in memory or disk

Page 83: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark supported serialization

Java Serializable (and Externalizable)

Kyro Serialization

Page 84: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark recommends Kryo

Extra config on the SparkConfig

Custom serializer

registration

Page 85: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark on Crunch

Hides serialization behind PTypes

Handles complex records like Avro

Page 86: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Spark on Crunch

Hides serialization behind PTypes

Handles complex records like Avro

Page 87: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Additional Topics to Explore

Aggregation sort behaviors

Reusing Crunch Functions in Spark

Page 88: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

With Crunch, we’ll be able to ...

Page 89: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

minimize significant code refactoring,

Page 90: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

shorten learning curve by reusing concepts and API already used to...

Page 91: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

incrementally switch from Spark,

Page 92: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

overall experiment to find where Spark fits best.

Page 94: Crunch Spark with Apache From MapReduce to · Spark with Apache Crunch Micah Whitacre @mkwhit. Invested in learning. Invested in learning Setup production clusters. Invested in learning

Special Thanks...

● Josh Wills - helped come up with content● Sean Owen & Sandy Ryza whose repo I

forked to build examples and experiment