Spark and Shark: High-speed Analytics over Hadoop Data

Sep 12, 2013 @ Yahoo! Tech Conference

Reynold Xin (辛湜), AMPLab, UC Berkeley

DESCRIPTION

In this talk, we present two emerging, popular open source projects: Spark and Shark. Spark is an open source cluster computing system that aims to make data analytics fast, both fast to run and fast to write. It outperforms Hadoop by up to 100x in many real-world applications. Spark programs are often much shorter than their MapReduce counterparts thanks to Spark's high-level APIs and language integration in Java, Scala, and Python. Shark is an analytic query engine built on top of Spark that is compatible with Hive. It can run Hive queries much faster in existing Hive warehouses without modifications. These systems have been adopted by many organizations large and small (e.g. Yahoo, Intel, Adobe, Alibaba, Tencent) to implement data-intensive applications such as ETL, interactive SQL, and machine learning.

TRANSCRIPT

Page 1: 20130912 YTC_Reynold Xin_Spark and Shark

Spark and Shark: High-speed Analytics over Hadoop Data

Sep 12, 2013 @ Yahoo! Tech Conference

Reynold Xin (辛湜), AMPLab, UC Berkeley

Page 2: 20130912 YTC_Reynold Xin_Spark and Shark

The Spark Ecosystem

[Diagram] The Spark stack: Shark (SQL), Spark Streaming, GraphX, and MLBase on top of Spark, over HDFS / Hadoop storage, with Mesos / YARN as the resource manager.

Page 3: 20130912 YTC_Reynold Xin_Spark and Shark

Today’s Talk

[Diagram] The Spark stack: Shark (SQL), Spark Streaming, GraphX, and MLBase on top of Spark, over HDFS / Hadoop storage, with Mesos / YARN as the resource manager.

Page 4: 20130912 YTC_Reynold Xin_Spark and Shark

Apache Spark

Fast and expressive cluster computing system interoperable with Apache Hadoop

Improves efficiency through:
» In-memory computing primitives
» General computation graphs

Improves usability through:
» Rich APIs in Scala, Java, Python
» Interactive shell

Up to 100× faster (2-10× on disk)

Often 5× less code

Page 5: 20130912 YTC_Reynold Xin_Spark and Shark

Shark

An analytic query engine built on top of Spark

» Supports both SQL and complex analytics
» Can be 100× faster than Apache Hive

Compatible with Hive data, metastore, and queries:
» HiveQL
» UDF / UDAF
» SerDes
» Scripts

Page 6: 20130912 YTC_Reynold Xin_Spark and Shark

Community

3000 people attended online training

1100+ meetup members

80+ GitHub contributors

Page 7: 20130912 YTC_Reynold Xin_Spark and Shark
Page 8: 20130912 YTC_Reynold Xin_Spark and Shark

Today’s Talk

[Diagram] The Spark stack: Shark (SQL), Spark Streaming, GraphX, and MLBase on top of Spark, over HDFS / Hadoop storage, with Mesos as the resource manager.

Page 9: 20130912 YTC_Reynold Xin_Spark and Shark

Why a New Framework?

MapReduce greatly simplified big data analysis

But as soon as it got popular, users wanted more:

» More complex, multi-pass analytics (e.g. ML, graph)
» More interactive ad-hoc queries
» More real-time stream processing

Page 10: 20130912 YTC_Reynold Xin_Spark and Shark

Hadoop MapReduce

[Diagram] Iterative: each iteration reads its input from the filesystem and writes its result back to the filesystem before the next iteration starts.

[Diagram] Interactive: each ad-hoc query re-reads the input from disk to select its subset.

Slow due to replication and disk I/O, but necessary for fault tolerance

Page 11: 20130912 YTC_Reynold Xin_Spark and Shark

Spark Programming Model

Key idea: resilient distributed datasets (RDDs)

» Distributed collections of objects that can optionally be cached in memory across the cluster
» Manipulated through parallel operators
» Automatically recomputed on failure

Programming interface:
» Functional APIs in Scala, Java, Python
» Interactive use from the Scala shell

Page 12: 20130912 YTC_Reynold Xin_Spark and Shark

Example: Log Mining

Exposes RDDs through a functional API in Scala

Usable interactively from the Scala shell:

  lines = spark.textFile("hdfs://...")            // Base RDD
  errors = lines.filter(_.startsWith("ERROR"))    // Transformed RDD
  errors.persist()

  errors.filter(_.contains("foo")).count()        // Action
  errors.filter(_.contains("bar")).count()

[Diagram] The master sends tasks to workers and collects results; each worker caches its partition of the errors RDD (Errors 1-3), built from HDFS blocks (Block 1-3).

Result: full-text search of Wikipedia in <1 sec (vs 20 sec for on-disk data)

Result: 1 TB data in 5 sec (vs 170 sec for on-disk data)

Page 13: 20130912 YTC_Reynold Xin_Spark and Shark

Data Sharing in Spark

[Diagram] Iterative: the input is loaded into memory once and shared across iterations.

[Diagram] Interactive: the selected subset is kept in memory and shared between queries.

10-100× faster than network/disk

Page 14: 20130912 YTC_Reynold Xin_Spark and Shark

Fault Tolerance

RDDs track the series of transformations used to build them (their lineage) to recompute lost data

E.g.:

  messages = textFile(...).filter(_.contains("error"))
                          .map(_.split('\t')(2))

Lineage: HadoopRDD (path = hdfs://…) → FilteredRDD (func = _.contains(...)) → MappedRDD (func = _.split(…))
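As a minimal sketch (not from the slides) of inspecting lineage in the Scala shell, the same pipeline can be built and its recorded transformation chain printed with RDD.toDebugString; the path and field index are placeholders:

  // Hypothetical log path; builds the same lineage as the example above.
  val messages = sc.textFile("hdfs://.../logs")
    .filter(_.contains("error"))
    .map(_.split('\t')(2))

  // Prints the chain of parent RDDs (exact names vary by Spark version).
  println(messages.toDebugString)

If a partition of a cached RDD is lost, Spark re-runs only this chain for that partition rather than relying on replicated copies of the data.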

Page 15: 20130912 YTC_Reynold Xin_Spark and Shark

Spark: Expressive API

map, filter, groupBy, sort, union, join, leftOuterJoin, rightOuterJoin,
reduce, count, fold, reduceByKey, groupByKey, cogroup, cross, zip,
sample, take, first, partitionBy, mapWith, pipe, save, ...

and broadcast variables, accumulators…
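As a hedged sketch (not from the slides) of how a few of these operators combine with a broadcast variable and an accumulator; the data set and names are made up:

  // Made-up (region, amount) records.
  val sales = sc.parallelize(Seq(("us", 2.0), ("uk", 1.0), ("us", 3.0)))

  // Broadcast a small lookup table to every worker once.
  val fxRates = sc.broadcast(Map("us" -> 1.0, "uk" -> 1.5))

  // Accumulator counting records whose region is missing from the table.
  val unknown = sc.accumulator(0)

  val totals = sales
    .map { case (region, amount) =>
      val rate = fxRates.value.getOrElse(region, { unknown += 1; 0.0 })
      (region, amount * rate)
    }
    .reduceByKey(_ + _)                  // total per region

  totals.collect().foreach(println)      // the action triggers the computation
  println("unknown regions: " + unknown.value)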

Page 16: 20130912 YTC_Reynold Xin_Spark and Shark

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

Page 17: 20130912 YTC_Reynold Xin_Spark and Shark

Word Count

  val docs = sc.textFile("hdfs://…")

  docs.flatMap(line => line.split("\\s"))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)

Page 18: 20130912 YTC_Reynold Xin_Spark and Shark

Word Count

  val docs = sc.textFile("hdfs://…")

  docs.flatMap(line => line.split("\\s"))
      .map(word => (word, 1))
      .reduceByKey((a, b) => a + b)

  docs.flatMap(_.split("\\s"))
      .map((_, 1))
      .reduceByKey(_ + _)

Page 19: 20130912 YTC_Reynold Xin_Spark and Shark

Spark in Java and Python

Python API:

  lines = spark.textFile(…)

  errors = lines.filter(lambda s: "ERROR" in s)

  errors.count()

Java API:

  JavaRDD<String> lines = spark.textFile(…);

  JavaRDD<String> errors = lines.filter(
    new Function<String, Boolean>() {
      public Boolean call(String s) {
        return s.contains("ERROR");
      }
    });

  errors.count();

Page 20: 20130912 YTC_Reynold Xin_Spark and Shark

Scheduler

General task DAGs

Pipelines functions within a stage

Cache-aware data locality & reuse

Partitioning-aware to avoid shuffles

Launches jobs in ms

[Diagram] An example job DAG of map, union, groupBy, and join operators split into Stages 1-3 over RDDs A-G; previously computed (cached) partitions are not recomputed.
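As a hedged sketch (not from the slides) of the partitioning-aware point: if a pair RDD is hash-partitioned and cached up front, joining against it does not reshuffle it; the data and names are illustrative:

  import org.apache.spark.HashPartitioner

  // Hypothetical lookup table, hash-partitioned once and cached.
  val users = sc.parallelize(Seq((1, "alice"), (2, "bob")))
    .partitionBy(new HashPartitioner(8))
    .cache()

  // Hypothetical events keyed by user id.
  val events = sc.parallelize(Seq((1, "click"), (2, "view"), (1, "buy")))

  // `users` already has a known partitioner, so the join shuffles only
  // `events`; the cached `users` partitions stay where they are.
  val joined = events.join(users)
  joined.collect().foreach(println)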

Page 21: 20130912 YTC_Reynold Xin_Spark and Shark

Example: Logistic Regression

Goal: find best line separating two sets of points

[Plot] A scatter of + and – points, the random initial line, and the target separating line.

Page 22: 20130912 YTC_Reynold Xin_Spark and Shark

Logistic Regression: Prep the Data

  // Start a new SparkContext
  val sc = new SparkContext("spark://master", "LRExp")

  // Build a dictionary for file parsing
  def loadDict(): Map[String, Double] = { … }

  // Broadcast the dictionary to the cluster
  val dict = sc.broadcast( loadDict() )

  // Line parser which uses the dictionary
  def parser(s: String): (Array[Double], Double) = {
    val parts = s.split('\t')
    val y = parts(0).toDouble
    val x = parts.drop(1).map(dict.value.getOrElse(_, 0.0))
    (x, y)
  }

  // Load data in memory once
  val data = sc.textFile("hdfs://d").map(parser).cache()

Page 23: 20130912 YTC_Reynold Xin_Spark and Shark

Logistic Regression: Grad. Desc.

  val sc = new SparkContext("spark://master", "LRExp")

  // Parsing …
  val data: spark.RDD[(Array[Double], Double)] = …

  // Initial parameter vector
  var w = Vector.random(D)

  // Set learning rate
  val lambda = 1.0 / data.count

  // Iterate
  for (i <- 1 to ITERATIONS) {
    // Compute gradient using map & reduce
    val gradient = data.map {
      case (x, y) => (1.0 / (1 + exp(-y * (w dot x))) - 1) * y * x
    }.reduce( (g1, g2) => g1 + g2 )
    // Apply gradient on master
    w -= (lambda / sqrt(i)) * gradient
  }

Page 24: 20130912 YTC_Reynold Xin_Spark and Shark

Logistic Regression Performance

[Chart] Running time (s) vs. number of iterations (1-30) for Hadoop and Spark.

Hadoop: 110 s / iteration

Spark: first iteration 80 s, further iterations 1 s

Page 25: 20130912 YTC_Reynold Xin_Spark and Shark

Spark MLlib (Spark 0.8)

Classification:           Logistic Regression, Linear SVM (+L1, L2)
Regression:               Linear Regression (+Lasso, Ridge)
Collaborative Filtering:  Alternating Least Squares
Clustering:               K-Means
Optimization:             SGD, Parallel Gradient
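As a hedged sketch (not from the slides) of calling the library instead of hand-rolling the gradient loop; this uses the LabeledPoint / LogisticRegressionWithSGD API from later MLlib releases (the 0.8 interface differs slightly), and the tiny data set is made up:

  import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
  import org.apache.spark.mllib.linalg.Vectors
  import org.apache.spark.mllib.regression.LabeledPoint

  // Made-up training set: a 0/1 label plus a feature vector.
  val training = sc.parallelize(Seq(
    LabeledPoint(1.0, Vectors.dense(1.2, 0.5)),
    LabeledPoint(0.0, Vectors.dense(-0.7, -1.1))
  )).cache()

  // Train logistic regression with 20 iterations of SGD.
  val model = LogisticRegressionWithSGD.train(training, 20)

  // Score a new point.
  println(model.predict(Vectors.dense(0.3, 0.1)))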

Page 26: 20130912 YTC_Reynold Xin_Spark and Shark

Spark

[Diagram] The Spark stack: Shark (SQL), Spark Streaming, GraphX, and MLBase on top of Spark, over HDFS / Hadoop storage, with Mesos as the resource manager.

Page 27: 20130912 YTC_Reynold Xin_Spark and Shark

Shark

[Diagram] The Spark stack: Shark (SQL), Spark Streaming, GraphX, and MLBase on top of Spark, over HDFS / Hadoop storage, with Mesos as the resource manager.

Page 28: 20130912 YTC_Reynold Xin_Spark and Shark

Shark

A data analytics system that
» builds on Spark,
» scales out and tolerates worker failures,
» supports low-latency, interactive queries through (optional) in-memory computation,
» supports both SQL and complex analytics,
» is compatible with Hive (storage, serdes, UDFs, types, metadata).

Page 29: 20130912 YTC_Reynold Xin_Spark and Shark

Hive Architecture

[Diagram] Client (CLI, JDBC) → Driver (SQL Parser, Query Optimizer, Physical Plan, Execution) → MapReduce, backed by the Metastore and HDFS.

Page 30: 20130912 YTC_Reynold Xin_Spark and Shark

Shark Architecture

[Diagram] Client (CLI, JDBC) → Driver (SQL Parser, Query Optimizer, Physical Plan, Execution, Cache Mgr.) → Spark, backed by the Metastore and HDFS.

Page 31: 20130912 YTC_Reynold Xin_Spark and Shark

Shark Query Language

HiveQL and a few additional DDL add-ons:

Creating an in-memory table:
» CREATE TABLE src_cached AS SELECT * FROM …

Page 32: 20130912 YTC_Reynold Xin_Spark and Shark

Engine Features

Columnar Memory Store

Data Co-partitioning & Co-location

Partition Pruning based on Statistics (range, bloom filter and others)

Spark Integration

Page 33: 20130912 YTC_Reynold Xin_Spark and Shark

Columnar Memory Store

Simply caching Hive records as JVM objects is inefficient.

Shark employs column-oriented storage.

Row Storage:

  1  john   4.1
  2  mike   3.5
  3  sally  6.4

Column Storage:

  1     2     3
  john  mike  sally
  4.1   3.5   6.4

Benefit: compact representation, CPU-efficient compression, cache locality.
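As a hedged illustration (not from the slides) of the layout difference in Scala terms: a row store keeps one object per record, while a column store keeps one densely packed array per field; the field names mirror the toy table above:

  // Row-oriented: one JVM object (header plus references) per record.
  case class Row(id: Int, name: String, score: Double)
  val rows = Array(Row(1, "john", 4.1), Row(2, "mike", 3.5), Row(3, "sally", 6.4))

  // Column-oriented: one primitive array per column.
  val ids    = Array(1, 2, 3)
  val names  = Array("john", "mike", "sally")
  val scores = Array(4.1, 3.5, 6.4)

  // Scanning a single column touches only that array, which is what gives
  // the cache locality and a natural unit for per-column compression.
  val avgScore = scores.sum / scores.length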

Page 34: 20130912 YTC_Reynold Xin_Spark and Shark

Spark Integration

Unified system for query processing and Spark analytics

Query processing and ML share the same set of workers and caches

Page 35: 20130912 YTC_Reynold Xin_Spark and Shark

Performance

1.7 TB Real Warehouse Data on 100 EC2 nodes

Page 36: 20130912 YTC_Reynold Xin_Spark and Shark

More Information

Download and docs: www.spark-project.org

» Easy to run locally, on EC2, or on Mesos/YARN

Email: [email protected]

Twitter: @rxin

Page 37: 20130912 YTC_Reynold Xin_Spark and Shark

Behavior with Insufficient RAM

[Chart] Iteration time (s) vs. percent of working set in memory: 0% → 68.8 s, 25% → 58.1 s, 50% → 40.7 s, 75% → 29.7 s, 100% → 11.5 s.

Page 38: 20130912 YTC_Reynold Xin_Spark and Shark

Example: PageRank

1. Start each page with a rank of 1
2. On each iteration, update each page's rank to Σ_{i ∈ neighbors} rank_i / |neighbors_i|

  links = // RDD of (url, neighbors) pairs
  ranks = // RDD of (url, rank) pairs

  for (i <- 1 to ITERATIONS) {
    ranks = links.join(ranks).flatMap {
      case (url, (links, rank)) =>
        links.map(dest => (dest, rank / links.size))
    }.reduceByKey(_ + _)
  }
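As a hedged sketch (not from the slides) of how the two input RDDs might be built from an edge-list text file; the path and the one-edge-per-line format are assumptions:

  // Assumed input format: "sourceUrl destUrl" per line.
  val edges = sc.textFile("hdfs://.../links.txt")
    .map { line => val Array(src, dest) = line.split("\\s+"); (src, dest) }

  // Group outgoing links per page and cache, since links is reused every iteration.
  val links = edges.groupByKey().cache()

  // Every page starts with a rank of 1.
  var ranks = links.mapValues(_ => 1.0)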