Lightning Fast Cluster Computing

Michael Armbrust - @michaelarmbrust Reflections | Projections 2015

What is Apache Spark?

Fast and general computing engine for clusters created by students at UC Berkeley

•  Makes it easy to process large (GB-PB) datasets
•  Support for Java, Scala, Python, R
•  Libraries for SQL, streaming, machine learning, …
•  100x faster than Hadoop Map/Reduce for some applications

Spark Model

Write programs in terms of transformations on distributed datasets

Resilient Distributed Datasets (RDDs)
>  Collections of objects that can be stored in memory or disk across a cluster
>  Parallel functional transformations (map, filter, …)
>  Automatically rebuilt on failure

Example: Log Mining

Load messages from a log file into memory, then interactively search for the problem

lines = spark.textFile("hdfs://...")                      # Base RDD
errors = lines.filter(lambda x: x.startswith("ERROR"))    # Transformed RDD
messages = errors.map(lambda x: x.split('\t')[2])
messages.cache()

messages.filter(lambda x: "foo" in x).count()             # Action
messages.filter(lambda x: "bar" in x).count()

[Figure: a Driver and Workers; each Worker reads a block of the file (Block 1, Block 2, Block 3), caches its messages (Cache 1, Cache 2, Cache 3), and returns results to the Driver.]

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5-7 sec (vs 170 sec for on-disk data)
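For readers without a cluster at hand, the log-mining steps can be traced on a few in-memory lines using plain Python. This is only a sketch: the sample log lines and their tab-separated layout (level, date, message) are assumptions made for illustration, and ordinary Python lists stand in for RDDs.

```python
# Plain-Python stand-in for the log-mining steps (no Spark needed).
# The sample lines and their tab-separated layout are hypothetical.
sample_lines = [
    "ERROR\t2015-10-01\tfoo failed to start",
    "INFO\t2015-10-01\tall good",
    "ERROR\t2015-10-02\tbar timed out",
    "ERROR\t2015-10-02\tfoo crashed again",
]

# filter: keep only lines that start with "ERROR"
errors = [x for x in sample_lines if x.startswith("ERROR")]

# map: keep the third tab-separated field (the message text)
messages = [x.split("\t")[2] for x in errors]

# The list plays the role of the cached RDD: both searches reuse it.
foo_count = sum(1 for x in messages if "foo" in x)
bar_count = sum(1 for x in messages if "bar" in x)

print(foo_count, bar_count)
```

The point of the cache step is visible even here: `messages` is built once and then searched twice without re-reading the input.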

Fault Tolerance

file.map(lambda rec: (rec.type, 1)) \
    .reduceByKey(lambda x, y: x + y) \
    .filter(lambda kv: kv[1] > 10)  # kv is a (type, count) pair

[Figure: lineage graph: file -> map -> reduce -> filter]

RDDs track lineage info to rebuild lost data
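The idea of rebuilding lost data from lineage can be illustrated with a toy model in plain Python. This is a sketch under simplifying assumptions, not Spark's implementation: the `ToyDataset` class and its attributes are hypothetical names, everything runs on one machine, and "losing" data is simulated by clearing a cached list.

```python
# Toy model of lineage-based recovery; this is not how Spark is implemented.
class ToyDataset:
    def __init__(self, parent=None, transform=None, source=None):
        self.parent = parent        # dataset this one was derived from
        self.transform = transform  # function applied to the parent's data
        self.source = source        # only set for the base dataset
        self.cached = None          # materialized data, may be lost

    def map(self, f):
        return ToyDataset(self, lambda data: [f(x) for x in data])

    def filter(self, pred):
        return ToyDataset(self, lambda data: [x for x in data if pred(x)])

    def compute(self):
        if self.cached is None:     # missing or lost: rebuild from lineage
            if self.parent is None:
                self.cached = list(self.source)
            else:
                self.cached = self.transform(self.parent.compute())
        return self.cached


base = ToyDataset(source=[1, 2, 3, 4, 5, 6])
evens_squared = base.filter(lambda x: x % 2 == 0).map(lambda x: x * x)

first = evens_squared.compute()
evens_squared.cached = None        # simulate losing the data on a failed node
rebuilt = evens_squared.compute()  # recomputed from the recorded steps
print(first, rebuilt)
```

Each dataset stores only a pointer to its parent and the function that produced it, so nothing needs to be replicated up front; the recorded steps are simply replayed when data goes missing.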

Speed-up ML Using Memory

[Figure: chart comparing Hadoop and Spark by number of iterations (1, 5, 10, 20, 30). Hadoop: 110 s / iteration. Spark: first iteration 80 s, further iterations 1 s.]
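Taking the quoted per-iteration figures at face value (110 s per iteration for Hadoop; 80 s for the first Spark iteration and 1 s for each further one), the totals can be worked out with a few lines of arithmetic. The sketch below assumes those figures are exact and constant, which is a simplification; the function names are made up for this example.

```python
# Figures as quoted on the slide; treating them as exact is a simplification.
HADOOP_PER_ITER = 110   # seconds per iteration
SPARK_FIRST = 80        # seconds for the first iteration
SPARK_LATER = 1         # seconds for each further iteration


def hadoop_total(n):
    """Total seconds for n iterations when every iteration costs the same."""
    return HADOOP_PER_ITER * n


def spark_total(n):
    """Total seconds for n iterations when only the first one is slow."""
    return SPARK_FIRST + SPARK_LATER * (n - 1) if n > 0 else 0


for n in (1, 5, 10, 20, 30):
    print(n, hadoop_total(n), spark_total(n))
```

Under these assumptions the gap grows with the iteration count, because the Spark cost after the first pass is dominated by the 1 s in-memory iterations.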

On-Disk Sort Record: Time to sort 100TB

2013 Record: Hadoop    2100 machines    72 minutes
2014 Record: Spark     207 machines     23 minutes

Also sorted 1PB in 4 hours

Source: Daytona GraySort benchmark, sortbenchmark.org

Higher-Level Libraries

Spark Streaming: real-time
Spark SQL: structured data
MLlib: machine learning
GraphX: graph

Seamlessly switch components

# Load data using SQL
points = ctx.sql("select latitude, longitude from tweets")

# Train a machine learning model
model = KMeans.train(points, 10)

# Apply it to a stream
sc.twitterStream(...) \
    .map(lambda t: (model.predict(t.location), 1)) \
    .reduceByWindow("5s", lambda a, b: a + b)

Powerful Stack – Agile Development

[Figure: bar chart of non-test, non-example source lines for Hadoop MapReduce, Storm (Streaming), Impala (SQL), and Giraph (Graph), compared with the Spark components Streaming, SparkSQL, GraphX, and "Your App?".]

Open Source Ecosystem

[Figure: logos grouped as Applications, Environments, and Data Sources]

Over 1000 production users, clusters up to 8000 nodes

Many talks online at spark-summit.org

Spark Community

Get involved on [logo]

Check us out at [logo]

Contribute code through [logo]

Best way to get started is to fix a bug

Don’t forget to write a test!

About Databricks

•  The hardest part of using Spark is managing 100s of machines.

•  Databricks makes this easy

Founded by creators of Spark and remains largest contributor.

Using [logo] to analyze emoji use on Twitter

What’s next for Spark?

+ declarative programming

Create and Run Spark Programs Faster:

•  Write less code
•  Read less data
•  Let the optimizer do the hard work

DataFrame noun – [dey-tuh-freym]

1.  A distributed collection of rows organized into named columns.

2.  An abstraction for selecting, filtering, aggregating and plotting structured data (cf. R, Pandas).

Write Less Code: Compute an Average

// Mapper: emit the second tab-separated field under a single key.
private IntWritable one = new IntWritable(1);
private IntWritable output = new IntWritable();

protected void map(LongWritable key, Text value, Context context)
    throws IOException, InterruptedException {
  String[] fields = value.toString().split("\t");
  output.set(Integer.parseInt(fields[1]));
  context.write(one, output);
}

// Reducer: average all values received for the key.
IntWritable one = new IntWritable(1);
DoubleWritable average = new DoubleWritable();

protected void reduce(IntWritable key, Iterable<IntWritable> values, Context context)
    throws IOException, InterruptedException {
  int sum = 0;
  int count = 0;
  for (IntWritable value : values) {
    sum += value.get();
    count++;
  }
  average.set(sum / (double) count);
  context.write(key, average);
}

# Split each line into fields, then keep a [sum, count] pair per key.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Write Less Code: Compute an Average

Using RDDs

# Split each line into fields, then keep a [sum, count] pair per key.
data = sc.textFile(...).map(lambda line: line.split("\t"))
data.map(lambda x: (x[0], [int(x[1]), 1])) \
    .reduceByKey(lambda x, y: [x[0] + y[0], x[1] + y[1]]) \
    .map(lambda x: [x[0], x[1][0] / x[1][1]]) \
    .collect()

Using DataFrames

from pyspark.sql.functions import avg

sqlCtx.table("people") \
    .groupBy("name") \
    .agg(avg("age")) \
    .collect()

Using SQL

SELECT name, avg(age) FROM people GROUP BY name
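What all three versions compute can be checked by hand with plain Python. The sketch below mirrors the RDD version's approach of carrying a running sum and count per key; the `(name, age)` records are made-up sample data, and a dictionary stands in for the distributed reduce step.

```python
from collections import defaultdict

# Hypothetical (name, age) records standing in for the tab-separated input.
records = [("alice", 30), ("bob", 40), ("alice", 50), ("bob", 20), ("carol", 25)]

# Same idea as the reduceByKey step: keep a running [sum, count] per key.
totals = defaultdict(lambda: [0, 0])
for name, age in records:
    totals[name][0] += age
    totals[name][1] += 1

# Same idea as the final map step: divide sum by count for each key.
averages = {name: s / c for name, (s, c) in totals.items()}
print(averages)
```

Carrying the pair rather than a running average matters because sums and counts can be merged in any order, which is what lets the reduce step run in parallel.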

Not Just Less Code: Faster Implementations

[Figure: bar chart of time to aggregate 10 million int pairs (secs), on an axis from 0 to 10, for RDD Scala, RDD Python, DataFrame Scala, DataFrame Python, and DataFrame SQL.]

Machine Learning Pipelines

from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol="words", outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.01)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

df = sqlCtx.load("/path/to/data")
model = pipeline.fit(df)

[Figure: Pipeline Model: df0 -> tokenizer -> df1 -> hashingTF -> df2 -> lr.model -> df3]

Optimization happens as late as possible, therefore Spark SQL can optimize across functions.

def add_demographics(events):
    u = sqlCtx.table("users")                       # Load Hive table
    return (events
            .join(u, events.user_id == u.user_id)   # Join on user_id
            .withColumn("city", zipToCity(u.zip)))  # udf adds city column (zip assumed to come from users)

events = add_demographics(sqlCtx.load("/data/events", "json"))
training_data = events.where(events.city == "Champaign") \
    .select(events.timestamp).collect()

[Figure: Logical Plan: a filter above a join of the events file and the users table; the join is expensive. Physical Plan: a join of scan (events) with a filter over scan (users), so that only relevant users are joined.]

def add_demographics(events):
    u = sqlCtx.table("users")                       # Load partitioned Hive table
    return (events
            .join(u, events.user_id == u.user_id)   # Join on user_id
            .withColumn("city", zipToCity(u.zip)))  # Run udf to add city column (zip assumed to come from users)

events = add_demographics(sqlCtx.load("/data/events", "parquet"))
training_data = events.where(events.city == "Champaign") \
    .select(events.timestamp).collect()

[Figure: Logical Plan: a filter above a join of the events file and the users table. Physical Plan: a join of scan (events) with a filter over scan (users). Physical Plan with Predicate Pushdown and Column Pruning: a join of optimized scan (events) with optimized scan (users).]
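The difference between a plain scan and an "optimized scan" can be sketched in plain Python. This is an illustration only, not how Spark or Parquet implement it: the table is a list of dictionaries, and the column names, values, and the two scan functions are hypothetical.

```python
# Toy table: a list of dict rows. Column names and values are made up.
events = [
    {"user_id": 1, "timestamp": 100, "city": "Champaign", "payload": "large blob"},
    {"user_id": 2, "timestamp": 101, "city": "Urbana", "payload": "large blob"},
    {"user_id": 3, "timestamp": 102, "city": "Champaign", "payload": "large blob"},
]


def naive_scan(rows):
    """Return every row with every column; filtering happens afterwards."""
    return [dict(r) for r in rows]


def optimized_scan(rows, predicate, columns):
    """Apply the filter during the scan and keep only the requested columns."""
    return [{c: r[c] for c in columns} for r in rows if predicate(r)]


# Without pushdown: scan everything, then filter, then project.
naive = [r["timestamp"] for r in naive_scan(events) if r["city"] == "Champaign"]

# With pushdown and pruning: the scan itself does the filtering and projection.
pruned = optimized_scan(events, lambda r: r["city"] == "Champaign", ["timestamp"])

print(naive, pruned)
```

Both paths yield the same timestamps, but the second never materializes the unused columns or the rows that fail the predicate, which is the saving the optimizer is after.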

Plan Optimization & Execution

[Figure: SQL AST or DataFrame -> Unresolved Logical Plan -> Analysis (using the Catalog) -> Logical Plan -> Logical Optimization -> Optimized Logical Plan -> Physical Planning -> Physical Plans -> Selected Physical Plan -> Code Generation -> RDDs]

DataFrames and SQL share the same optimization/execution pipeline

Writing Rules as Tree Transformations

1.  Find filters on top of projections.
2.  Check that the filter can be evaluated without the result of the project.
3.  If so, switch the operators.

[Figure: two plans, each read top-down.
Original Plan: Project name -> Filter id = 1 -> Project id,name -> People
Filter Push-Down: Project name -> Project id,name -> Filter id = 1 -> People]

Prior Work: Optimizer Generators

Volcano / Cascades:
•  Create a custom language for expressing rules that rewrite trees of relational operators.
•  Build a compiler that generates executable code for these rules.

Cons: Developers need to learn this custom language. Language might not be powerful enough.

Filter Push Down Transformation

// queryPlan is the tree; the block passed to transform is a partial function.
val newPlan = queryPlan transform {
  // Scala pattern matching: find a Filter on top of a Project.
  case f @ Filter(_, p @ Project(_, grandChild))
      // Catalyst attribute reference tracking: check that the filter can be
      // evaluated without the result of the project.
      if (f.references subsetOf grandChild.output) =>
    // Scala copy constructors: if so, switch the order.
    p.copy(child = f.copy(child = grandChild))
}
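The same rule can be sketched outside Catalyst to show how little machinery it needs. The Python version below is an illustration under stated assumptions, not Catalyst's code: the `Relation`, `Project`, and `Filter` classes are simplified stand-ins invented for this example, column references are plain strings, and the tree is rewritten bottom-up by explicit recursion rather than through a generic `transform` method.

```python
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Relation:
    name: str
    output: Tuple[str, ...]          # columns the relation produces


@dataclass(frozen=True)
class Project:
    columns: Tuple[str, ...]
    child: object

    @property
    def output(self):
        return self.columns


@dataclass(frozen=True)
class Filter:
    references: Tuple[str, ...]      # columns the condition reads
    condition: str
    child: object

    @property
    def output(self):
        return self.child.output


def push_down_filter(plan):
    """Rewrite Filter(Project(x)) to Project(Filter(x)) when that is safe."""
    if isinstance(plan, Relation):
        return plan
    # Rewrite the subtree first, then look at this node.
    plan = replace(plan, child=push_down_filter(plan.child))
    if isinstance(plan, Filter) and isinstance(plan.child, Project):
        project, grand_child = plan.child, plan.child.child
        # Safe only if the filter reads nothing the project computes.
        if set(plan.references) <= set(grand_child.output):
            pushed = push_down_filter(replace(plan, child=grand_child))
            return replace(project, child=pushed)
    return plan


people = Relation("People", ("id", "name"))
original = Project(
    ("name",),
    Filter(("id",), "id = 1", Project(("id", "name"), people)),
)
optimized = push_down_filter(original)
print(optimized)
```

`dataclasses.replace` plays the role of Scala's copy constructors here: each rewrite builds a new node and leaves the original plan untouched.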

Optimizing with Rules

[Figure: four plans, each read top-down.
Original Plan: Project name -> Filter id = 1 -> Project id,name -> People
Filter Push-Down: Project name -> Project id,name -> Filter id = 1 -> People
Combine Projection: Project name -> Filter id = 1 -> People
Physical Plan: IndexLookup id = 1, return: name]

Coming Soon: Datasets

•  Type-safe: operate on domain objects with compiled lambda functions
•  Fast: Code-generated encoders for fast serialization
•  Interoperable: Easily convert DataFrames to Datasets without boilerplate

val df = ctx.read.json("people.json")

// Convert to custom objects.
case class Person(name: String, age: Int)
val ds: Dataset[Person] = df.as[Person]
ds.filter(_.age > 30)

// Compute histogram of age by name.
ds.groupBy(_.name).mapGroups {
  case (name, people) =>
    val buckets = new Array[Int](10)  // ten decade buckets, all starting at 0
    people.map(_.age).foreach { a =>
      buckets(a / 10) += 1
    }
    (name, buckets)
}
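The histogram logic can be tried without Spark in plain Python. This is a sketch on made-up records: the names and ages are hypothetical, and it assumes every age is below 100 so that integer division by 10 always lands in one of the ten buckets.

```python
from collections import defaultdict

# Hypothetical (name, age) records; the slide reads these from people.json.
people = [("ann", 34), ("ann", 38), ("bob", 7), ("ann", 61), ("bob", 12)]

# One list of 10 decade buckets per name (ages 0-9, 10-19, ..., 90-99).
histograms = defaultdict(lambda: [0] * 10)
for name, age in people:
    histograms[name][age // 10] += 1

for name, buckets in sorted(histograms.items()):
    print(name, buckets)
```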

Questions?

https://databricks.com/company/careers
https://github.com/apache/spark
