OSDC.tw Spark

Post on 27-Aug-2014


DESCRIPTION

OSDC.tw 2014, Taiwan

TRANSCRIPT

Spark: Next generation cloud computing engine

Wisely Chen

Agenda

• What is Spark?

• Next big thing

• How to use Spark?

• Demo

• Q&A

Who am I?

• Wisely Chen ( thegiive@gmail.com )

• Sr. Engineer on the Yahoo! Taiwan data team

• Loves to promote open source tech

• Hadoop Summit 2013 San Jose

• Jenkins Conf 2013 Palo Alto

• COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012

Taiwan Data Team (diagram): Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning

Machine Learning, Distributed Computing, Big Data, Recommendation, Forecast: HADOOP

Faster ML, Distributed Computing, Bigger Big Data

Opinion from Cloudera

• The leading candidate for "successor to MapReduce" today is Apache Spark

• No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason.

• From http://0rz.tw/y3OfM

What is Spark

• From UC Berkeley AMP Lab

• Most active big data open source project since Hadoop

Where is Spark?

Hadoop 2.0 (diagram): HDFS and YARN, with MapReduce, Storm, HBase and others running on YARN

Hadoop architecture (diagram):

• Storage: HDFS

• Resource Management: YARN

• Computing Engine: MapReduce

• SQL: Hive

Hadoop vs Spark (diagram): same HDFS and YARN layers, with Spark as the computing engine and Shark as the SQL layer in place of Hive

Spark vs Hadoop

• Spark runs on YARN, Mesos, or in standalone mode

• Spark's main concept is based on MapReduce

• Spark can read from

• HDFS: data locality

• HBase

• Cassandra

More than MapReduce (diagram)

• Spark Core: MapReduce

• Shark: Hive

• GraphX: Pregel

• MLlib: Mahout

• Streaming: Storm

• On top of a resource management system (YARN, Mesos) and HDFS

Why Spark?

"In martial arts, nothing is unbreakable; only speed cannot be beaten." (天下武功,無堅不破,惟快不破)

3X~25X faster than the MapReduce framework

From Matei's paper: http://0rz.tw/VVqgP

Chart: running time in seconds, MapReduce (MR) vs Spark

• Logistic regression: MR 76, Spark 3

• KMeans: MR 106, Spark 33

• PageRank: MR 171, Spark 23

What is Spark

• Apache Spark™ is a very fast and general engine for large-scale data processing

Why is Spark so fast?

HDFS

• Stores data on network + disk

• Network/disk speed is ~100X lower than memory

• Implements fault tolerance

MapReduce PageRank

// ..... read input from HDFS .....
for (int runs = 0; runs < iter_runnumber; runs++) {
    // .....
    isCompleted = runRankCalculation(inPath, lastResultPath);
    // .....
}
// ..... write output to HDFS .....

Workflow (diagram)

MapReduce: Input HDFS -> Iter 1 RunRank -> Tmp HDFS -> Iter 2 RunRank -> Tmp HDFS -> ... -> Iter N RunRank

Spark: Input HDFS -> Iter 1 RunRank -> Tmp Mem -> Iter 2 RunRank -> Tmp Mem -> ... -> Iter N RunRank

PageRank on 1 billion URL records: first iteration takes 200 sec, 2nd and 3rd iterations take 20 sec each
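For comparison, a minimal Spark sketch of the same iteration pattern, along the lines of the standard Spark PageRank example rather than the speaker's exact code; the paths, parsing, and iteration count are placeholders. The point is that the working set stays cached in memory between iterations instead of being written back to HDFS:

import org.apache.spark.{SparkConf, SparkContext}

object PageRankSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PageRankSketch"))

    // Placeholder input: one "srcUrl dstUrl" pair per line.
    val links = sc.textFile("hdfs://.../links.txt")
                  .map { l => val p = l.split("\\s+"); (p(0), p(1)) }
                  .groupByKey()
                  .cache()                     // kept in RAM across iterations (the "Tmp Mem" boxes)

    var ranks = links.mapValues(_ => 1.0)
    for (i <- 1 to 10) {                       // iteration count is illustrative
      val contribs = links.join(ranks).values.flatMap {
        case (urls, rank) => urls.map(u => (u, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile("hdfs://.../ranks")   // only the final result goes back to HDFS
    sc.stop()
  }
}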

RDD

• Resilient Distributed Dataset

• Collections of objects spread across a cluster, stored in RAM or on Disk

• Built through parallel transformations
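A small illustrative sketch of those two properties, assuming a live SparkContext sc; the data, partition count, and storage level are arbitrary:

import org.apache.spark.storage.StorageLevel

// Built through parallel transformations...
val nums    = sc.parallelize(1 to 1000000, 8)   // 8 partitions spread across the cluster
val squares = nums.map(n => n.toLong * n)       // lazy transformation

// ...and stored in RAM, spilling to disk if it does not fit.
squares.persist(StorageLevel.MEMORY_AND_DISK)
println(squares.count())                        // action: materializes and caches the RDD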

Fault Tolerance

"In martial arts, nothing is unbreakable; only speed cannot be beaten." (天下武功,無堅不破,惟快不破)

RDD (diagram): RDD a -> transformation -> RDD b -> action -> Value c

val a = sc.textFile("hdfs://....")                    // RDD a
val b = a.filter( line => line.contains("Spark") )    // RDD b  (transformation)
val c = b.count()                                     // Value c (action)

Log mining

val a   = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )

err.cache()
err.count()

val m = err.filter( t => t.contains("MYSQL") ).count()
val a = err.filter( t => t.contains("APACHE") ).count()

(Diagram: the Driver dispatches Tasks to three Workers)

The next slides step through the same code on the cluster (the code block repeats on each slide):

• Each Worker reads one HDFS block (Block1, Block2, Block3) into a partition of RDD a

• The filter transformations produce RDD err on each Worker

• err.cache() keeps RDD err's partitions in memory (Cache1, Cache2, Cache3)

• The later filters for "MYSQL" (RDD m) and "APACHE" are computed from the cached partitions instead of re-reading HDFS

1st run (no cache): takes about the same time

With cache: takes 7 sec

RDD Cache

• Data locality

• Cache

Self join on 5 billion records: a big shuffle takes 20 min; after caching, it takes only 265 ms (a sketch of the pattern follows)
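A hedged sketch of that cache-then-self-join pattern, assuming a hypothetical tab-separated keyed dataset; the path and key extraction are made up:

// Hypothetical keyed dataset: one "key<TAB>value" record per line.
val pairs = sc.textFile("hdfs://.../records")
              .map { line => val f = line.split("\t"); (f(0), f(1)) }
              .cache()                  // keep partitions in RAM, close to the data

pairs.count()                           // first action materializes the cache
val joined = pairs.join(pairs)          // the self join now reads from the cached copy
println(joined.count())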

Easy to use

• Interactive shell

• Multi-language API

• JVM: Scala, Java

• PySpark: Python

Scala Word Count

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Step by Step

• file.flatMap(line => line.split(" "))  =>  (aaa, bb, cc)

• .map(word => (word, 1))  =>  ((aaa,1), (bb,1), ...)

• .reduceByKey(_ + _)  =>  ((aaa,123), (bb,23), ...)

Java WordCount

JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");

Java vs Scala

• Scala: file.flatMap(line => line.split(" "))

• Java version:

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});

Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")

Highly Recommended

• Scala: latest API features, stable

• Python

• very familiar language

• native libs: NumPy, SciPy

FYI

• Combiner: reduceByKey(_ + _)

• Typical WordCount (no combiner):

groupByKey().mapValues { arr =>
  var r = 0; arr.foreach { i => r += i }; r
}

WordCount

reduceByKey: reduces a lot on the map side (see the sketch below)

Hadoop-style shuffle (groupByKey): sends a lot of data over the network
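To make the contrast concrete, a small sketch (the paths are placeholders): both versions produce the same counts, but reduceByKey sums partial counts inside each partition before the shuffle, while groupByKey ships every (word, 1) pair across the network first:

val words = sc.textFile("hdfs://.../text")
              .flatMap(_.split(" "))
              .map(w => (w, 1))

// Combiner-style: partial sums are merged map-side, then shuffled.
val viaReduce = words.reduceByKey(_ + _)

// Hadoop-style shuffle: all the 1s cross the network, then get summed.
val viaGroup  = words.groupByKey().mapValues(_.sum)

viaReduce.saveAsTextFile("hdfs://.../counts_reduce")
viaGroup.saveAsTextFile("hdfs://.../counts_group")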

DEMO

• Check in on FB with a Yahoo! recruiting post and get a Yahoo! bath duck

• Check in on FB saying "The Yahoo! APP is awesome!!" with a screenshot of the Yahoo! Super Mall or News app, and use the check-in to claim a duck wrist rest or a shopping bag

Just memory?

• From Matei's paper: http://0rz.tw/VVqgP

• HBM: stores data in an in-memory HDFS instance

• SP: Spark

• HBM'1, SP'1: first run

• Storage: HDFS with 256 MB blocks

• Node information

• m1.xlarge EC2 nodes

• 4 cores

• 15 GB of RAM

100 GB data on a 100-node cluster

Chart: running time in seconds

• Logistic regression: HBM'1 139, HBM 62, SP'1 46, SP 3

• KMeans: HBM'1 182, HBM 87, SP'1 82, SP 33

There is more

• General DAG scheduler

• Control over partition shuffle (a rough sketch follows below)

• Fast driver RPC to launch tasks

• For more info, check http://0rz.tw/jwYwI
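As a rough illustration of controlling the shuffle partitioning (not from the talk; the dataset, keys, and partition count are made up):

import org.apache.spark.HashPartitioner

// Pre-partition a keyed RDD so later aggregations and joins reuse the layout
// instead of reshuffling it every time.
val events = sc.textFile("hdfs://.../events.csv")
               .map { line => val f = line.split(","); (f(0), f(1)) }
               .partitionBy(new HashPartitioner(100))   // 100 partitions, chosen arbitrarily
               .cache()

val perKey = events.reduceByKey(_ + " " + _)            // reuses the existing partitioning
println(perKey.count())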
