OSDC.tw Spark


Post on 27-Aug-2014


DESCRIPTION

OSDC.tw 2014 at Taiwan

TRANSCRIPT

  • Spark: next generation cloud computing engine. Wisely Chen
  • Agenda: What is Spark? / The next big thing / How to use Spark / Demo / Q&A
  • Who am I? Wisely Chen ([email protected]), Sr. Engineer in the Yahoo! Taiwan data team. Loves to promote open source tech. Speaker at Hadoop Summit 2013 (San Jose), Jenkins Conf 2013 (Palo Alto), COSCUP 2006/2012/2013, OSDC 2007, WebConf 2013, PHPConf 2012, RubyConf 2012
  • Taiwan Data Team: Data Highway, BI Report, Serving API, Data Mart, ETL / Forecast, Machine Learning
  • Machine Learning, Distributed Computing, Big Data
  • Recommendation Forecast
  • HADOOP
  • Faster ML, distributed computing, bigger big data
  • Opinion from Cloudera: "The leading candidate for successor to MapReduce today is Apache Spark. No vendor, no new project, is likely to catch up. Chasing Spark would be a waste of time, and would delay availability of real-time analytic and processing services for no good reason." From http://0rz.tw/y3OfM
  • What is Spark? From the UC Berkeley AMP Lab; the most active big data open source project since Hadoop
  • Where is Spark?
  • Hadoop 2.0: HDFS, YARN, MapReduce, alongside Storm, HBase, and others
  • Hadoop architecture: HDFS (storage), YARN (resource management), MapReduce (computing engine), Hive (SQL)
  • Hadoop vs Spark: on the same HDFS and YARN, MapReduce is replaced by Spark, and Hive by Shark
  • Spark vs Hadoop: Spark runs on YARN, Mesos, or in standalone mode. Spark's main concept is based on MapReduce. Spark can read from HDFS (with data locality), HBase, and Cassandra
  • More than MapReduce: on top of HDFS and a resource management system (YARN, Mesos), Spark offers Spark Core (MapReduce), Shark (Hive), GraphX (Pregel), MLlib (Mahout), and Streaming (Storm)
  • Why Spark?
  • 3X~25X faster than the MapReduce framework (from Matei's paper: http://0rz.tw/VVqgP). Running time (s): Logistic regression: MR 76 vs Spark 3; KMeans: MR 106 vs Spark 33; PageRank: MR 171 vs Spark 23
  • What is Spark? Apache Spark is a very fast and general engine for large-scale data processing
  • Why is Spark so fast?
  • HDFS is ~100X slower than memory: it stores data over network + disk (memory is ~100X faster than the network), and it implements fault tolerance
  • MapReduce PageRank: ...readInputFromHDFS; for (int runs = 0; runs < iter_runnumber; runs++) { isCompleted = runRankCalculation(inPath, lastResultPath); } ...writeOutputToHDFS
  • Workflow. MapReduce: Input (HDFS) -> Iter 1 RunRank -> Tmp (HDFS) -> Iter 2 RunRank -> Tmp (HDFS) -> ... -> Iter N RunRank. Spark: Input (HDFS) -> Iter 1 RunRank -> Tmp (Mem) -> Iter 2 RunRank -> Tmp (Mem) -> ... -> Iter N RunRank
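The workflow difference above can be sketched in plain, single-machine Python. This is not Spark code: the point is only the data-flow shape the slide describes, where `ranks` stays in memory between iterations instead of being written to and re-read from HDFS. The tiny graph and the damping factor are illustrative assumptions.

```python
# Toy in-memory PageRank loop: no HDFS round-trip between iterations.
links = {                      # page -> outgoing links (illustrative graph)
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
}

def pagerank(links, iterations=20, d=0.85):
    n = len(links)
    ranks = {page: 1.0 / n for page in links}   # kept in memory across iterations
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, outgoing in links.items():
            share = ranks[page] / len(outgoing)
            for target in outgoing:
                contribs[target] += share
        # In MapReduce, this is where the job would write Tmp to HDFS and the
        # next iteration would read it back; here the dict simply gets replaced.
        ranks = {page: (1 - d) / n + d * c for page, c in contribs.items()}
    return ranks

ranks = pagerank(links)
```

Spark's in-memory RDDs give the iterative loop exactly this shape across a cluster, which is why the later iterations on the next slide are so much cheaper than the first.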
  • PageRank on 1 billion URL records: the first iteration takes 200 sec; the 2nd iteration takes 20 sec; the 3rd iteration takes 20 sec
  • RDD (Resilient Distributed Dataset): collections of objects spread across a cluster, stored in RAM or on disk, built through parallel transformations
  • Fault Tolerance
  • val a = sc.textFile("hdfs://...") gives RDD a; val b = a.filter( line => line.contains("Spark") ) gives RDD b (transformations); val c = b.count() gives value c (action)
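The transformation/action split on this slide can be mimicked in a few lines of plain Python (this is a toy, not the Spark API): a transformation like `filter` only records what to do and returns a new dataset object, while an action like `count` forces the recorded pipeline to actually run.

```python
# Minimal sketch of lazy transformations vs eager actions.
class FakeRDD:
    def __init__(self, data, ops=None):
        self.data = data
        self.ops = ops or []          # recorded transformations, not yet run

    def filter(self, pred):           # transformation: returns a new FakeRDD
        return FakeRDD(self.data, self.ops + [("filter", pred)])

    def count(self):                  # action: evaluates the whole pipeline
        items = self.data
        for kind, f in self.ops:
            if kind == "filter":
                items = [x for x in items if f(x)]
        return len(items)

a = FakeRDD(["Spark is fast", "Hadoop MapReduce", "I like Spark"])
b = a.filter(lambda line: "Spark" in line)   # nothing computed yet
c = b.count()                                # computation happens here
```

Because transformations are just a recorded plan, Spark can rebuild a lost partition by replaying the plan, which is where the "resilient" in RDD comes from.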
  • Log mining: val a = sc.textFile("hdfs://aaa.com/a.txt"); val err = a.filter( t => t.contains("ERROR") ).filter( t => t.contains("2014") ); err.cache(); err.count(); val m = err.filter( t => t.contains("MYSQL") ).count(); val apache = err.filter( t => t.contains("APACHE") ).count(). The driver sends tasks to the workers
  • (The same log-mining code is repeated across the next slides to animate execution: the workers first load Block1–Block3 of the file as partitions of RDD a; the filters produce RDD err on each worker; err.cache() pins those partitions in memory as Cache1–Cache3; the later MYSQL and APACHE counts then run against the cached partitions instead of re-reading the blocks from HDFS.)
  • RDD cache: the 1st iteration (no cache) takes the same time as before; with the cache, an iteration takes 7 sec
  • RDD cache combines data locality and caching: a big shuffle (a self join on 5 billion records) takes 20 min; after caching, it takes only 265 ms
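The caching effect on these two slides boils down to: compute the expensive result once, keep it in memory, and let every later action reread memory instead of redoing the work. A toy sketch (plain Python, not the Spark API; `expensive_filter` is an illustrative stand-in for the real scan/shuffle):

```python
# Count how often the expensive work actually runs.
calls = {"n": 0}

def expensive_filter(lines):
    calls["n"] += 1                  # stands in for the 20-min shuffle/scan
    return [l for l in lines if "ERROR" in l]

class CachedDataset:
    def __init__(self, lines):
        self.lines = lines
        self._cache = None

    def errors(self):
        if self._cache is None:      # first action: compute and keep in RAM
            self._cache = expensive_filter(self.lines)
        return self._cache           # later actions: served from memory

logs = ["ERROR disk", "INFO ok", "ERROR net"]
ds = CachedDataset(logs)
first = len(ds.errors())             # triggers the computation
second = len(ds.errors())            # served from cache, no recompute
```

In Spark, err.cache() plays the role of `_cache` here: the first action pays full price, and every subsequent action over `err` is memory-speed.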
  • Easy to use: interactive shell; multi-language API (JVM: Scala, Java; PySpark: Python)
  • Scala word count: val file = spark.textFile("hdfs://...") ; val counts = file.flatMap(line => line.split(" ")).map(word => (word, 1)).reduceByKey(_ + _) ; counts.saveAsTextFile("hdfs://...")
  • Step by step: file.flatMap(line => line.split(" ")) => (aaa, bb, cc) ; .map(word => (word, 1)) => ((aaa,1), (bb,1), ...) ; .reduceByKey(_ + _) => ((aaa,123), (bb,23))
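The three steps above can be replayed with plain Python built-ins so the intermediate shapes are easy to inspect (this is not the PySpark API; the sample lines are made up):

```python
lines = ["aaa bb", "aaa cc aaa"]

# flatMap(line => line.split(" "))  => one flat list of words
words = [w for line in lines for w in line.split(" ")]

# map(word => (word, 1))            => (word, 1) pairs
pairs = [(w, 1) for w in words]

# reduceByKey(_ + _)                => sum the 1s per word
counts = {}
for word, one in pairs:
    counts[word] = counts.get(word, 0) + one

# counts is now {"aaa": 3, "bb": 1, "cc": 1}
```

In Spark the same three steps run partition-by-partition across the cluster, with reduceByKey shuffling pairs so that all counts for one word land on one worker.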
  • Java word count: JavaRDD<String> file = spark.textFile("hdfs://..."); JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } }); JavaPairRDD<String, Integer> pairs = words.map(new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String s) { return new Tuple2<String, Integer>(s, 1); } }); JavaPairRDD<String, Integer> counts = pairs.reduceByKey(new Function2<Integer, Integer, Integer>() { public Integer call(Integer a, Integer b) { return a + b; } }); counts.saveAsTextFile("hdfs://...");
  • Java vs Scala. Scala: file.flatMap(line => line.split(" ")). Java version: JavaRDD<String> words = file.flatMap(new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } });
  • Python: file = spark.textFile("hdfs://...") ; counts = file.flatMap(lambda line: line.split(" ")).map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b) ; counts.saveAsTextFile("hdfs://...")
  • Highly recommended: Scala for the latest API features and stability; Python for a very familiar language with native libs (NumPy, SciPy)