
Transcript
Page 1: Osd ctw spark

Spark: Next generation cloud computing engine

Wisely Chen

Page 2: Osd ctw spark

Agenda

• What is Spark?

• Next big thing

• How to use Spark?

• Demo

• Q&A

Page 3: Osd ctw spark

Who am I?

• Wisely Chen ( [email protected] )

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• COSCUP 2006, 2012, 2013; OSDC 2007; WebConf 2013; PHPConf 2012; RubyConf 2012

Page 4: Osd ctw spark

Taiwan Data Team

Data Highway

BI Report

Serving API

Data Mart

ETL / Forecast

Machine Learning

Page 5: Osd ctw spark

Machine Learning

Distributed Computing

Big Data

Page 6: Osd ctw spark

Recommendation

Forecast

Page 7: Osd ctw spark

HADOOP

Page 8: Osd ctw spark

Faster ML

Distributed Computing

Bigger Big Data

Page 9: Osd ctw spark

Opinion from Cloudera

• The leading candidate for “successor to MapReduce” today is Apache Spark

• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.

• From http://0rz.tw/y3OfM

Page 10: Osd ctw spark

What is Spark

• From UC Berkeley AMP Lab

• Most active big data open source project since Hadoop

Page 11: Osd ctw spark

Where is Spark?

Page 12: Osd ctw spark

[Diagram: the Hadoop 2.0 stack. HDFS at the bottom, YARN above it, and MapReduce, Storm, HBase, and others running on YARN.]

Page 13: Osd ctw spark

Hadoop Architecture

[Diagram: HDFS = Storage, YARN = Resource Management, MapReduce = Computing Engine, Hive = SQL.]

Page 14: Osd ctw spark

Hadoop vs Spark

[Diagram: both stacks share HDFS and YARN; MapReduce vs Spark as the computing engine, and Hive vs Shark as the SQL layer.]

Page 15: Osd ctw spark

Spark vs Hadoop

• Spark runs on YARN, Mesos, or in standalone mode (see the sketch after this list)

• Spark’s main concept is based on MapReduce

• Spark can read from:

• HDFS (with data locality)

• HBase

• Cassandra
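A minimal sketch of the standalone case, run as a Scala program (the master and namenode URLs below are purely illustrative; reading HBase or Cassandra requires their respective input formats or connectors, which are not shown):

import org.apache.spark.{SparkConf, SparkContext}

// Connect to a standalone Spark master; "yarn-client" or a mesos:// URL
// would work here as well.
val conf = new SparkConf()
  .setAppName("ReadFromHDFS")
  .setMaster("spark://master:7077")
val sc = new SparkContext(conf)

// textFile schedules tasks near the HDFS blocks (data locality).
val lines = sc.textFile("hdfs://namenode:9000/data/input.txt")
println(lines.count())

sc.stop()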

Page 16: Osd ctw spark

More than MapReduce

[Diagram: Spark Core (MapReduce) on HDFS and a resource management system (YARN, Mesos), with Shark (Hive), GraphX (Pregel), MLlib (Mahout), and Streaming (Storm) on top.]

Page 17: Osd ctw spark

Why Spark?

Page 18: Osd ctw spark

“In all the martial arts under heaven, nothing is unbreakable; only speed cannot be defeated.” (天下武功,無堅不破,惟快不破)

Page 19: Osd ctw spark

3X~25X faster than the MapReduce framework

From Matei’s paper: http://0rz.tw/VVqgP

[Bar charts, running time (s): Logistic regression: MR 76, Spark 3. KMeans: MR 106, Spark 33. PageRank: MR 171, Spark 23.]

Page 20: Osd ctw spark

What is Spark

• Apache Spark™ is a very fast and general engine for large-scale data processing

Page 21: Osd ctw spark

Why is Spark so fast?

Page 22: Osd ctw spark

HDFS

• Stores data to network + disk

• Disk and network access is ~100X slower than memory

• Implements fault tolerance

Page 23: Osd ctw spark

MapReduce PageRank

..... read input from HDFS .....
for (int runs = 0; runs < iter_runnumber; runs++) {
    ..............
    isCompleted = runRankCalculation(inPath, lastResultPath);
    ............
}
..... write output to HDFS .....

Page 24: Osd ctw spark

Workflow

MapReduce: Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → … → Iter N RunRank

Spark: Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → … → Iter N RunRank
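A rough sketch of the Spark side of this loop, as run from spark-shell (sc and the pair-RDD implicits come from the shell; parseRank and runRank are hypothetical stand-ins for the RunRank step):

import org.apache.spark.rdd.RDD

// Hypothetical parser: one "url<TAB>rank" pair per line.
def parseRank(line: String): (String, Double) = {
  val parts = line.split("\t")
  (parts(0), parts(1).toDouble)
}

// Placeholder update rule standing in for the real rank calculation.
def runRank(ranks: RDD[(String, Double)]): RDD[(String, Double)] =
  ranks.mapValues(r => r * 0.85 + 0.15)

var ranks = sc.textFile("hdfs://.../input").map(parseRank).cache()
for (i <- 1 to 10) {
  ranks = runRank(ranks).cache()   // Tmp stays in memory, not HDFS
}
ranks.saveAsTextFile("hdfs://.../output")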

Page 25: Osd ctw spark

PageRank on 1 billion URL records: the first iteration takes 200 sec; the 2nd and 3rd iterations take 20 sec each.

Page 26: Osd ctw spark

RDD

• Resilient Distributed Dataset

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations
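For instance, a small spark-shell sketch of building an RDD through parallel transformations and choosing whether it lives in RAM or on disk (the names and sizes here are illustrative):

import org.apache.spark.storage.StorageLevel

val nums    = sc.parallelize(1 to 1000000)    // distributed collection
val squares = nums.map(n => n.toLong * n)     // transformation (lazy)
val evens   = squares.filter(_ % 2 == 0)      // another transformation

// Keep partitions in RAM, spilling to disk when they don't fit.
evens.persist(StorageLevel.MEMORY_AND_DISK)
println(evens.count())                        // action: materializes the RDD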

Page 27: Osd ctw spark

Fault Tolerance

“In all the martial arts under heaven, nothing is unbreakable; only speed cannot be defeated.” (天下武功,無堅不破,惟快不破)

Page 28: Osd ctw spark

RDD

val a = sc.textFile("hdfs://...")                 // RDD a
val b = a.filter(line => line.contains("Spark"))  // RDD b
val c = b.count()                                 // value c

textFile and filter are transformations (they lazily define new RDDs); count is an action (it triggers the actual computation and returns a value).

Page 29: Osd ctw spark

Log mining

val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter(t => t.contains("ERROR"))
           .filter(t => t.contains("2014"))
err.cache()
err.count()

val m  = err.filter(t => t.contains("MYSQL")).count()
val a2 = err.filter(t => t.contains("APACHE")).count()

[Diagram: the driver dispatches tasks to the three workers]

Page 30: Osd ctw spark

Log mining (continued)

[Diagram: each worker reads one HDFS block (Block1, Block2, Block3) into a partition of RDD a]

Page 31: Osd ctw spark

Log mining (continued)

[Diagram: each worker filters its block into a partition of RDD err]


Page 33: Osd ctw spark

Log mining (continued)

[Diagram: err.cache() keeps each worker’s partition of RDD err in memory (Cache1, Cache2, Cache3)]

Page 34: Osd ctw spark

Log mining (continued)

[Diagram: the MYSQL filter computes RDD m directly from the cached partitions]

Page 35: Osd ctw spark

Log mining (continued)

[Diagram: the APACHE filter likewise runs on the cached partitions, with no re-read from HDFS]

Page 36: Osd ctw spark

RDD Cache

The 1st iteration (no cache yet) takes the same time; with the cache populated, it takes 7 sec.

Page 37: Osd ctw spark

RDD Cache

• Data locality

• Cache

A big shuffle (self-joining 5 billion records) takes 20 min; after caching, it takes only 265 ms.
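A quick way to watch the cache effect from spark-shell (the time helper and the path are illustrative; the timings above are the speaker’s):

// Tiny timing helper for interactive experiments.
def time[A](body: => A): A = {
  val t0 = System.nanoTime()
  val result = body
  println(s"took ${(System.nanoTime() - t0) / 1e9} s")
  result
}

val logs = sc.textFile("hdfs://.../big.log")
val errs = logs.filter(_.contains("ERROR")).cache()

time { errs.count() }   // 1st action: reads HDFS and fills the cache
time { errs.count() }   // 2nd action: served from RAM, much faster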

Page 38: Osd ctw spark

Easy to use

• Interactive shell

• Multi-language API

• JVM: Scala, Java

• PySpark: Python

Page 39: Osd ctw spark

Scala Word Count

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 40: Osd ctw spark

Step by Step

• file.flatMap(line => line.split(" ")) => (aaa, bb, cc)

• .map(word => (word, 1)) => ((aaa,1), (bb,1), ...)

• .reduceByKey(_ + _) => ((aaa,123), (bb,23), ...)

Page 41: Osd ctw spark

Java WordCount

JavaRDD<String> file = spark.textFile("hdfs://...");

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});

JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});

JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});

counts.saveAsTextFile("hdfs://...");

Page 42: Osd ctw spark

Java vs Scala

• Scala: file.flatMap(line => line.split(" "))

• Java version:

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});

Page 43: Osd ctw spark

Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Page 44: Osd ctw spark

Highly Recommend

• Scala: latest API features, stable

• Python

• very familiar language

• native libs: NumPy, SciPy

Page 45: Osd ctw spark

FYI

• Combiner: reduceByKey(_ + _)

• Typical WordCount:

groupByKey().mapValues { arr =>
  var r = 0
  arr.foreach { i => r += i }
  r
}

Page 46: Osd ctw spark

WordCount

reduceByKey: combines a lot on the map side (see the sketch below)

Hadoop-style shuffle (groupByKey): sends a lot of data over the network
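To make the contrast concrete, a spark-shell sketch of both shapes of WordCount (same result, different shuffle behavior; the path is illustrative):

val words = sc.textFile("hdfs://...").flatMap(_.split(" "))
val pairs = words.map(w => (w, 1))

// Combines (word, 1) pairs within each map-side partition before shuffling.
val viaReduce = pairs.reduceByKey(_ + _)

// Ships every (word, 1) pair across the network, then sums per key.
val viaGroup = pairs.groupByKey().mapValues(_.sum)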

Page 47: Osd ctw spark

DEMO

Page 48: Osd ctw spark

• Check in on Facebook with the Yahoo! recruiting message to get a Yahoo! bath duck

• Check in on Facebook saying “Yahoo! apps are awesome!!” with a screenshot of the Super Mall or News app, then show the check-in record to get a duck wrist rest or a shopping bag

Page 49: Osd ctw spark

Just memory?

• From Matei’s paper: http://0rz.tw/VVqgP

• HBM: stores data in an in-memory HDFS instance.

• SP : Spark

• HBM’1, SP’1 : first run

• Storage: HDFS with 256 MB blocks

• Node information

• m1.xlarge EC2 nodes

• 4 cores

• 15 GB of RAM

Page 50: Osd ctw spark

100 GB data on a 100-node cluster

[Bar charts, running time (s): Logistic regression: HBM'1 139, HBM 62, SP'1 46, SP 3. KMeans: HBM'1 182, HBM 87, SP'1 82, SP 33.]

Page 51: Osd ctw spark

There is more

• General DAG scheduler

• Control over partitioning and the shuffle (see the sketch after this list)

• Fast driver RPC to launch tasks

• For more info, check http://0rz.tw/jwYwI
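As one concrete example of that partitioning control (the datasets and partition count here are made up): pre-partitioning one side of a join lets Spark reuse that layout instead of reshuffling it.

import org.apache.spark.HashPartitioner

// Hash-partition visits once and cache the partitioned RDD.
val visits = sc.textFile("hdfs://.../visits")
  .map { line => val f = line.split("\t"); (f(0), f(1)) }
  .partitionBy(new HashPartitioner(64))
  .cache()

val pages = sc.textFile("hdfs://.../pages")
  .map { line => val f = line.split("\t"); (f(0), f(1)) }

// Only the pages side is shuffled; visits keeps its partitioning.
val joined = visits.join(pages)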

Page 52: Osd ctw spark
Page 53: Osd ctw spark
