OSDC.tw
Spark: Next generation cloud computing engine
Wisely Chen
Agenda
• What is Spark?
• Next big thing
• How to use Spark?
• Demo
• Q&A
Who am I?
• Wisely Chen ( [email protected] )
• Sr. Engineer in Yahoo![Taiwan] data team
• Loves to promote open source tech
• Hadoop Summit 2013 San Jose
• Jenkins Conf 2013 Palo Alto
• Coscup 2006, 2012, 2013, OSDC 2007, Webconf 2013, PHPConf 2012, RubyConf 2012
Taiwan Data Team
Data Highway
BI Report
Serving API
Data Mart
ETL / Forecast
Machine Learning
Machine Learning
Distributed Computing
Big Data
Recommendation
Forecast
HADOOP
Faster ML
Distributed Computing
Bigger Big Data
Opinion from Cloudera
• The leading candidate for "successor to MapReduce" today is Apache Spark
• "No vendor — no new project — is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason."
• From http://0rz.tw/y3OfM
What is Spark?
• From UC Berkeley AMP Lab
• The most active big data open source project since Hadoop
Where is Spark?

Hadoop 2.0 architecture:
• Storage: HDFS
• Resource Management: YARN
• Computing Engine: MapReduce
• SQL: Hive
• Others: Storm, HBase, ...

Hadoop vs Spark:
• Same storage (HDFS) and resource management (YARN)
• Spark replaces MapReduce as the computing engine, and Shark replaces Hive as the SQL layer
Spark vs Hadoop
• Spark runs on YARN, Mesos, or in standalone mode
• Spark's main concept is based on MapReduce
• Spark can read from:
• HDFS (with data locality)
• HBase
• Cassandra
More than MapReduce
• Spark Core: MapReduce
• Shark: Hive
• Streaming: Storm
• GraphX: Pregel
• MLlib: Mahout
On top of HDFS and a resource management system (YARN, Mesos)
Why Spark?
• "In martial arts, nothing is unbreakable except speed" (Chinese martial-arts proverb)
• 3X~25X faster than the MapReduce framework
• From Matei's paper: http://0rz.tw/VVqgP
Running time (s), MapReduce vs Spark:

                      MR     Spark
Logistic regression   76     3
KMeans                106    33
PageRank              171    23
What is Spark?
• Apache Spark™ is a fast and general engine for large-scale data processing
Why is Spark so fast?
HDFS:
• Stores data on network + disk
• Disk and network are ~100X slower than memory
• Does so to implement fault tolerance
MapReduce PageRank:

…readInputFromHDFS…
for (int runs = 0; runs < iter_runnumber; runs++) {
  …
  isCompleted = runRankCalculation(inPath, lastResultPath);
  …
}
…writeOutputToHDFS…
Workflow

MapReduce:
Input (HDFS) → Iter 1 RunRank → Tmp (HDFS) → Iter 2 RunRank → Tmp (HDFS) → … → Iter N RunRank

Spark:
Input (HDFS) → Iter 1 RunRank → Tmp (Mem) → Iter 2 RunRank → Tmp (Mem) → … → Iter N RunRank
PageRank on 1 billion URL records:
• First iteration takes 200 sec
• 2nd iteration takes 20 sec
• 3rd iteration takes 20 sec
RDD
• Resilient Distributed Dataset
• Collections of objects spread across a cluster, stored in RAM or on Disk
• Built through parallel transformations
Fault Tolerance
RDD

val a = sc.textFile("hdfs://....")                   // Transformation → RDD a
val b = a.filter( line => line.contains("Spark") )   // Transformation → RDD b
val c = b.count()                                    // Action → Value c
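The transformation/action split can be mimicked with plain Scala, no cluster needed: like an RDD transformation, `Iterator.filter` is lazy and does no work until a count-like action forces it. This is a local sketch with made-up sample data, not the Spark API.

```scala
// Plain-Scala sketch of lazy transformations vs. actions.
val lines = Seq("Spark is fast", "plain MapReduce", "I use Spark")

var evaluated = 0
// "Transformation": building the filtered iterator runs nothing yet.
val b = lines.iterator.filter { line => evaluated += 1; line.contains("Spark") }
println(evaluated)   // 0: still lazy, predicate never called

// "Action": counting forces one full pass over the data.
val c = b.size
println(c)           // 2 lines contain "Spark"
println(evaluated)   // 3: predicate ran once per line
```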
Log mining

val a = sc.textFile("hdfs://aaa.com/a.txt")
val err = a.filter( t => t.contains("ERROR") )
           .filter( t => t.contains("2014") )
err.cache()
err.count()
val m = err.filter( t => t.contains("MYSQL") ).count()
val ap = err.filter( t => t.contains("APACHE") ).count()

How the job runs (the original slides step through this frame by frame):
• The driver ships tasks to the workers
• Each worker reads one HDFS block (Block1, Block2, Block3) into its partition of RDD a
• The two filters build RDD err on each worker
• err.cache() plus the first action (err.count()) materializes RDD err in worker memory (Cache1, Cache2, Cache3)
• The MYSQL and APACHE counts then run against the cached partitions instead of re-reading HDFS
• 1st iteration (no cache) takes the same time
• With cache, later runs take 7 sec
RDD Cache
• Data locality
• Caching
Self join on 5 billion records:
• A big shuffle takes 20 min
• After cache, the same query takes only 265 ms
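What cache() buys you can be sketched in plain Scala: materialize an expensive result once and reuse it, instead of recomputing the whole lineage for every action. The function and data below are hypothetical stand-ins, not the Spark API.

```scala
// Local analogue of err.cache(): count how often the "lineage" is recomputed.
var recomputed = 0
def filterErrors(): Seq[String] = {
  recomputed += 1  // stands in for re-reading HDFS and re-running the filters
  Seq("ERROR 2014 MYSQL down", "ERROR 2014 APACHE restart")
}

// Without cache: each "action" recomputes everything.
val m1 = filterErrors().count(_.contains("MYSQL"))
val a1 = filterErrors().count(_.contains("APACHE"))
println(recomputed)  // 2: the lineage ran twice

// With cache: compute once, then both counts hit memory.
recomputed = 0
val cached = filterErrors()
val m2 = cached.count(_.contains("MYSQL"))
val a2 = cached.count(_.contains("APACHE"))
println(recomputed)  // 1: the lineage ran once
```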
Easy to use
• Interactive shell
• Multi-language API
• JVM: Scala, Java
• PySpark: Python
Scala Word Count

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")
Step by Step
• file.flatMap(line => line.split(" ")) => (aaa, bb, cc)
• .map(word => (word, 1)) => ((aaa,1), (bb,1), ...)
• .reduceByKey(_ + _) => ((aaa,123), (bb,23), ...)
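The same three steps can be checked locally with plain Scala collections; this is a sketch where `lines` is made-up sample data and `groupBy` plus `sum` stands in for `reduceByKey`:

```scala
// Local word count with plain Scala collections, mirroring the Spark steps.
val lines = Seq("aaa bb", "aaa cc bb", "aaa")

val words  = lines.flatMap(_.split(" "))   // => List(aaa, bb, aaa, cc, bb, aaa)
val pairs  = words.map(word => (word, 1))  // => List((aaa,1), (bb,1), ...)
val counts = pairs                         // reduceByKey(_ + _) equivalent:
  .groupBy(_._1)                           //   group the pairs by word...
  .map { case (w, ps) => (w, ps.map(_._2).sum) }  // ...and sum the 1s per word

println(counts("aaa"))  // 3
println(counts("bb"))   // 2
println(counts("cc"))   // 1
```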
Java Wordcount

JavaRDD<String> file = spark.textFile("hdfs://...");
JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
});
JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
  public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
});
JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
  public Integer call(Integer a, Integer b) { return a + b; }
});
counts.saveAsTextFile("hdfs://...");
Java vs Scala
• Scala: file.flatMap(line => line.split(" "))
• Java version:

JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
  public Iterable<String> call(String s) {
    return Arrays.asList(s.split(" "));
  }
});
Python

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
Highly Recommended
• Scala: latest API features, stable
• Python:
• very familiar language
• native libs: NumPy, SciPy
FYI
• Combiner: reduceByKey(_ + _) combines values on the map side
• Typical WordCount without it:

groupByKey().mapValues { arr =>
  var r = 0; arr.foreach { i => r += i }; r
}

WordCount shuffle:
• reduceByKey reduces a lot on the map side
• A Hadoop-style shuffle (groupByKey) sends a lot of data over the network
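The difference can be sketched locally in plain Scala with two hypothetical map-side partitions: combining within each partition before the shuffle is exactly what makes reduceByKey ship fewer records.

```scala
// How many (word, count) records would cross the network in each style?
// Two hypothetical map-side partitions of words:
val partitions = Seq(Seq("aaa", "bb", "aaa"), Seq("aaa", "cc"))

// groupByKey style: every single (word, 1) pair is shuffled.
val groupShuffle = partitions.flatMap(_.map(w => (w, 1)))
println(groupShuffle.size)   // 5 records shuffled

// reduceByKey style: combine within each partition first, then shuffle.
val reduceShuffle = partitions.flatMap { part =>
  part.groupBy(identity).map { case (w, ws) => (w, ws.size) }
}
println(reduceShuffle.size)  // 4 records shuffled ("aaa" pre-combined in partition 1)
```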
DEMO
• Check in on FB at the Yahoo! recruiting booth to get a Yahoo! rubber duck
• Check in on FB saying "Yahoo! APP is awesome!!" with a screenshot of the Yahoo! Super Mall or News app to get a duck wrist rest or a shopping bag
Just memory?
• From Matei's paper: http://0rz.tw/VVqgP
• HBM: stores data in an in-memory HDFS instance
• SP: Spark
• HBM'1, SP'1: first run
• Storage: HDFS with 256 MB blocks
• Nodes: m1.xlarge EC2 instances, 4 cores, 15 GB of RAM each
100 GB data on a 100-node cluster, running time (s):

                      HBM'1   HBM   SP'1   SP
Logistic regression   139     62    46     3
KMeans                182     87    82     33
There is more
• General DAG scheduler
• Control over partition shuffle
• Fast driver RPC to launch tasks
• For more info, check http://0rz.tw/jwYwI