OSDC.tw Spark
OSDC.tw 2014, Taiwan

TRANSCRIPT
- Spark: next-generation cloud computing engine. Wisely Chen
- Agenda: What is Spark? Next big thing. How to use Spark? Demo. Q&A
- Who am I? Wisely Chen ( [email protected] ). Sr. Engineer in Yahoo! [Taiwan] data team. Loves to promote open source tech. Speaker at Hadoop Summit 2013 (San Jose), Jenkins Conf 2013 (Palo Alto), COSCUP 2006, 2012, 2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012
- Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning
- Machine Learning, Distributed Computing, Big Data
- Recommendation, Forecast
- HADOOP
- Faster ML, Distributed Computing, Bigger Big Data
- Opinion from Cloudera: "The leading candidate for successor to MapReduce today is Apache Spark. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." From http://0rz.tw/y3OfM
- What is Spark? From UC Berkeley AMP Lab. The most active big data open source project since Hadoop
- Where is Spark?
- Hadoop 2.0: HDFS, YARN, MapReduce, Storm, HBase, others
- Hadoop architecture: HDFS (storage), YARN (resource management), MapReduce (computing engine), Hive (SQL)
- Hadoop vs Spark: same HDFS and YARN underneath; MapReduce vs Spark as the engine, Hive vs Shark on top
- Spark vs Hadoop: Spark runs on YARN, Mesos, or in standalone mode. Spark's main concept is based on MapReduce. Spark can read from HDFS (data locality), HBase, and Cassandra.
- More than MapReduce: Spark Core: MapReduce. Shark: Hive. GraphX: Pregel. MLlib: Mahout. Streaming: Storm. All on HDFS plus a resource management system (YARN, Mesos).
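To show what "one engine, many libraries" looks like in practice, here is a minimal sketch mixing core RDD calls with MLlib's KMeans; the points.txt path and its comma-separated coordinate format are assumptions for illustration, not from the slides:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // Assumed input: points.txt, one comma-separated coordinate vector per line.
    val points = sc.textFile("hdfs://.../points.txt")
      .map(line => Vectors.dense(line.split(",").map(_.toDouble)))
      .cache()                               // MLlib iterates, so caching the input pays off
    val model = KMeans.train(points, 2, 10)  // same engine, same RDD: k = 2 clusters, 10 iterations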
- Why Spark?
- 3X~25X faster than the MapReduce framework. From Matei's paper: http://0rz.tw/VVqgP
  Running time (s): Logistic regression: MR 76, Spark 3. KMeans: MR 106, Spark 33. PageRank: MR 171, Spark 23.
- What is Spark? Apache Spark is a very fast and general engine for large-scale data processing.
- Why is Spark so fast?
- HDFS is ~100X slower than memory: it stores data across network + disk, and network/disk speed is ~100X slower than memory. It does this to implement fault tolerance (replication).
- MapReduce PageRank (pseudocode):
    ..readInputFromHDFS
    for (int runs = 0; runs < iter_runnumber; runs++) {
        ..
        isCompleted = runRankCalculation(inPath, lastResultPath);
    }
    ..writeOutputToHDFS
- Workflow:
  MapReduce: Input (HDFS) -> Iter 1 RunRank -> Tmp (HDFS) -> Iter 2 RunRank -> Tmp (HDFS) -> ... -> Iter N RunRank
  Spark: Input (HDFS) -> Iter 1 RunRank -> Tmp (Mem) -> Iter 2 RunRank -> Tmp (Mem) -> ... -> Iter N RunRank
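To make the Spark column concrete, here is a minimal PageRank-style loop in the spirit of the well-known Spark example; the links.txt path and the one-"url neighbor"-pair-per-line format are assumptions for illustration:

    // Assumed input: links.txt, one "url neighbor" pair per line.
    val links = sc.textFile("hdfs://.../links.txt")
      .map { l => val p = l.split("\\s+"); (p(0), p(1)) }
      .groupByKey()
      .cache()                                // link table stays in memory across iterations
    var ranks = links.mapValues(_ => 1.0)
    for (i <- 1 to 10) {                      // Iter 1..N: Tmp lives in Mem, not HDFS
      val contribs = links.join(ranks).values.flatMap {
        case (urls, rank) => urls.map(u => (u, rank / urls.size))
      }
      ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
    }
    ranks.saveAsTextFile("hdfs://.../ranks")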
- PageRank on 1 billion URL records: first iteration takes 200 sec; 2nd iteration takes 20 sec; 3rd iteration takes 20 sec.
- RDD (Resilient Distributed Dataset): collections of objects spread across a cluster, stored in RAM or on disk. Built through parallel transformations.
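A minimal sketch of that definition, assuming only a running SparkContext `sc` (as in spark-shell):

    import org.apache.spark.storage.StorageLevel

    val nums  = sc.parallelize(1 to 1000000, 8)     // objects spread across 8 partitions on the cluster
    val evens = nums.filter(_ % 2 == 0)             // built through a parallel transformation
    evens.persist(StorageLevel.MEMORY_AND_DISK)     // stored in RAM, spilling to disk if needed
    println(evens.count())                          // action: runs in parallel across the cluster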
- Fault Tolerance
- RDD a -> RDD b -> Value c (Transformation vs Action):
    val a = sc.textFile("hdfs://....")                    // RDD a
    val b = a.filter( line => line.contains("Spark") )    // RDD b: Transformation (lazy)
    val c = b.count()                                     // Value c: Action (triggers the job)
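The fault tolerance promised above comes from lineage: each RDD records the transformations that produced it, so a lost partition is recomputed from its parents rather than restored from a replica. A minimal sketch; toDebugString is a real RDD method, and the elided HDFS path is kept from the slide:

    val a = sc.textFile("hdfs://....")
    val b = a.filter( line => line.contains("Spark") )
    println(b.toDebugString)   // prints b's chain of parent RDDs back to the HDFS file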
- Log mining (the next slides step through this same code):
    val a = sc.textFile("hdfs://aaa.com/a.txt")
    val err = a.filter( t => t.contains("ERROR") )
               .filter( t => t.contains("2014") )
    err.cache()
    err.count()
    val m  = err.filter( t => t.contains("MYSQL") ).count()
    val ap = err.filter( t => t.contains("APACHE") ).count()   // the slide reuses `val a`; renamed here to avoid shadowing
  The slides animate this across a driver and three workers:
  1. The driver ships tasks to the three workers.
  2. Each worker reads its block of a.txt (Block1, Block2, Block3) as a partition of RDD a.
  3. The two filters produce RDD err on each worker.
  4. err.cache() plus err.count() materialize the partitions into Cache1, Cache2, Cache3.
  5. The MYSQL filter (RDD m) runs against the cached partitions, not HDFS.
  6. The APACHE filter likewise reads from the cache; the file is never re-read.
- RDD Cache: the 1st iteration (cache not yet populated) takes the same time as before; once cached, an iteration takes 7 sec.
- RDD Cache: data locality + cache. A big shuffle (self join on 5 billion records) takes 20 min; after cache, it takes only 265 ms.
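A hedged sketch of that self-join scenario; records.txt and its tab-separated key/value format are assumptions for illustration:

    // Assumed input: records.txt, tab-separated key/value lines.
    val pairs = sc.textFile("hdfs://.../records.txt")
      .map { l => val p = l.split("\t"); (p(0), p(1)) }
      .cache()
    pairs.count()                    // first action materializes the cache (the slow pass)
    val joined = pairs.join(pairs)   // the self join now reads local in-memory partitions
    println(joined.count())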
- Easy to use: interactive shell; multi-language API. JVM: Scala, Java. PySpark: Python.
- Scala Word Count:
    val file = spark.textFile("hdfs://...")
    val counts = file.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs://...")
- Step by Step:
    file.flatMap(line => line.split(" "))  => (aaa, bb, cc)
    .map(word => (word, 1))                => ((aaa,1), (bb,1), ...)
    .reduceByKey(_ + _)                    => ((aaa,123), (bb,23))
- Java Wordcount:
    JavaRDD<String> file = spark.textFile("hdfs://...");
    JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
    });
    JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); }
    });
    JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) { return a + b; }
    });
    counts.saveAsTextFile("hdfs://...");
- Java vs Scala:
    Scala: file.flatMap(line => line.split(" "))
    Java version:
    JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); }
    });
- Python:
    file = spark.textFile("hdfs://...")
    counts = file.flatMap(lambda line: line.split(" ")) \
                 .map(lambda word: (word, 1)) \
                 .reduceByKey(lambda a, b: a + b)
    counts.saveAsTextFile("hdfs://...")
- Highly recommended: Scala: latest API features, stable. Python: very familiar language; native libs: NumPy, SciPy.