Survey of Spark for Data Pre-Processing and Analytics
TRANSCRIPT
Yannick Pouliot Consulting© 2015 all rights reserved
Yannick Pouliot, [email protected]
8/12/2015
Spark: New Vistas of Computing Power
Spark: An Open-Source, Cluster-Based Distributed Computing Framework
Distributed Computing à la Spark
Here, "distributed computing" means:
• Multi-node cluster: master, slaves
• The paradigm is:
  o Minimize networking by allocating chunks of data to slaves
  o Slaves receive code to run on their subset of the data
• Lots of redundancy
• No communication between nodes
  o Only with the master
• Operates on commodity hardware
Distributed Computing Is The Rage Everywhere … Except In Academia
That Should Disturb You
Some Ideal Applications
(here, using Spark's MLlib library)
The highly distributed nature of Spark means it is ideal for…
• Generating lots of…
  o Trees in a random forest
  o Permutations for computing distributions
  o Samplings (bootstrapping, sub-sampling)
• Cleaning up textual data, e.g.,
  o NLP on EHR records
  o Mapping variant spellings of drug names to UMLS CUIs
• Normalizing datasets
• Computing basic statistics
• Hypothesis testing (see the MLlib sketch after this list)
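As a rough illustration of the basic-statistics and hypothesis-testing items above, a minimal MLlib sketch might look like the following (assuming the Spark shell's SparkContext sc and the Spark 1.x MLlib API; the numbers are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Toy dataset: each Vector is one observation with three features
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

// Column-wise summary statistics, computed across the cluster
val summary = Statistics.colStats(observations)
println(summary.mean)      // per-column means
println(summary.variance)  // per-column variances

// Pearson chi-squared goodness-of-fit test (against a uniform expected distribution)
val observedCounts = Vectors.dense(12.0, 7.0, 11.0)
println(Statistics.chiSqTest(observedCounts))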
Spark = Speed
• The focus of Spark is to make data analytics fast:
  o fast to run code
  o fast to write code
• To run programs faster, Spark provides primitives for in-memory cluster computing
  o A job can load data into memory and query it repeatedly
  o Much quicker than disk-based systems like Hadoop MapReduce (see the caching sketch below)
• To make programming faster, Spark integrates with the Scala programming language
  o Enables manipulation of distributed datasets as if they were local collections
  o Spark can be used interactively to query big data from the Scala interpreter
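As a rough sketch of that "load once, query repeatedly" pattern (illustrative only; the HDFS path is hypothetical and sc is the shell's SparkContext):

// Load a (hypothetical) log file and ask Spark to keep it in cluster memory
val events = sc.textFile("hdfs:///data/events.log").cache()

// The first action materializes the RDD and caches its partitions in memory
val total = events.count()

// Later queries reuse the cached partitions instead of re-reading from disk
val errors = events.filter(_.contains("ERROR")).count()
val warnings = events.filter(_.contains("WARN")).count()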
Marketing-Level Architecture Stack
(Figure: the Spark component stack, running on top of HDFS.)
Architecture: Closer To Reality
Dynamics
Data Sources Integration
Ecosystem
Spark’s Machine Learning Library: MLlib
MLlib: Current Functionality
Spark Programming Model
Spark follows the REPL model: read–eval–print loop
  o Similar to the R and Python shells
  o Ideal for exploratory data analysis
Writing a Spark program typically consists of:
1. Reading some input data into local memory
2. Invoking transformations or actions that operate on a subset of the data in local memory
3. Running those transformations/actions in a distributed fashion across the network (memory or disk)
4. Deciding what actions to undertake next
Best of all, it can all be done within the shell, just like R (or Python); a minimal session is sketched below.
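A minimal shell session following those four steps might look like this (a sketch only; the file name and column layout are hypothetical):

// 1. Read some input data
val lines = sc.textFile("patients.csv")

// 2. Define transformations (lazy; nothing runs yet): parse each line, keep the age column
val ages = lines.map(_.split(",")).map(fields => fields(2).toInt)

// 3. Invoke actions, which run the work in a distributed fashion
val meanAge = ages.sum() / ages.count()

// 4. Inspect the result and decide what to do next
println(meanAge)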
RDD: The Secret Sauce
RDD = Distributed Data Frame
• RDD = "Resilient Distributed Dataset"
  o Similar to an R data frame, but laid out across a cluster of machines as a collection of partitions
    • Partition = subset of the data
  o The Spark master node remembers all transformations applied to an RDD ("lineage")
    • If a partition is lost (e.g., a slave machine goes down), it can easily be reconstructed on some other machine in the cluster
  o "Resilient" = an RDD can always be reconstructed, thanks to lineage tracking
• RDDs are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster
  o val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
    • 4 = number of partitions
    • Partitions are the fundamental unit of parallelism
  o val rdd2 = sc.textFile("linkage")
A small lineage example follows below.
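A small illustration of partitions and lineage, assuming the shell's built-in SparkContext sc:

// Distribute a local collection across 4 partitions
val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
println(rdd.partitions.length)   // 4

// Transformations extend the lineage rather than executing immediately
val doubled = rdd.map(_ * 2)

// toDebugString prints the chain of RDDs Spark would use to recompute a lost partition
println(doubled.toDebugString)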
Partitions = Data Slices
• Partitions are the "slices" a dataset is cut into
• Spark will run one task for each partition
• Typically 2-4 partitions for each CPU in the cluster
  o Normally, Spark tries to set the number of partitions automatically based on your cluster
  o It can also be set manually as a second parameter to parallelize (e.g., sc.parallelize(data, 4)); see the sketch below
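For example (a sketch, again assuming the shell's SparkContext sc):

// Let Spark choose the number of partitions based on the cluster
val auto = sc.parallelize(1 to 1000)
println(auto.partitions.length)

// Or request a specific number of partitions explicitly
val manual = sc.parallelize(1 to 1000, 4)
println(manual.partitions.length)   // 4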
A Brief Word About Scala
• One of the two languages Spark is built on (the other being Java)
• Compiles to Java byte code
• REPL-based, so good for exploratory data analysis
• Pure object-oriented language
• Much more streamlined than Java
  o Far fewer lines of code
• Lots of type inference
• Seamless calls to Java
  o E.g., you might be using a Java method inside a Scala object (see the small example below)
• Parts of the Broad's GATK tooling are written in Scala
Example: Word Count In Spark Using Scala
scala> val hamlet = sc.textFile("/Users/akuntamukkala/temp/gutenburg.txt")
hamlet: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val topWordCount = hamlet.flatMap(str => str.split(" ")).filter(!_.isEmpty).map(word => (word, 1)).reduceByKey(_ + _).map{ case (word, count) => (count, word) }.sortByKey(false)
topWordCount: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at sortByKey at <console>:14

scala> topWordCount.take(5).foreach(x => println(x))
(1044,the)
(730,and)
(679,of)
(648,to)
(511,I)
The chain of RDDs produced above is the RDD lineage.
More On Word Count
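The figure from this slide is not reproduced in the transcript; as a substitute, here is the same pipeline from the previous slide, annotated step by step:

val topWordCount = hamlet
  .flatMap(str => str.split(" "))               // split each line into words
  .filter(!_.isEmpty)                           // drop empty tokens
  .map(word => (word, 1))                       // pair each word with a count of 1
  .reduceByKey(_ + _)                           // sum the counts for each word
  .map { case (word, count) => (count, word) }  // swap so the count becomes the key
  .sortByKey(false)                             // sort by count, descending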
A Quick Tour of Common Spark Functions
Common Transformations
TRANSFORMATION & PURPOSE EXAMPLE & RESULT
filter(func) Purpose: new RDD by selecting those data elements on which func returns true
scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)

map(func) Purpose: return new RDD by applying func on each data element
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val times2 = rdd.map(_ * 2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func) Purpose: similar to map, but func returns a Seq instead of a value; for example, mapping a sentence into a Seq of words
scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)

reduceByKey(func, [numTasks]) Purpose: to aggregate the values of a key using a function; numTasks is an optional parameter to specify the number of reduce tasks
scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

groupByKey([numTasks]) Purpose: to convert (K,V) to (K,Iterable<V>)
scala> val cntWrd = wrdCnt.map{ case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))

distinct([numTasks]) Purpose: eliminate duplicates from an RDD
scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)
Common Actions
ACTION & PURPOSE EXAMPLE & RESULT

count() Purpose: get the number of data elements in the RDD
scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
scala> rdd.count()
Result: Long = 3

collect() Purpose: get all the data elements in an RDD as an array
scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
scala> rdd.collect()
Result: Array[Char] = Array(A, B, c)

reduce(func) Purpose: aggregate the data elements in an RDD using this function, which takes two arguments and returns one
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.reduce(_ + _)
Result: Int = 10

take(n) Purpose: fetch the first n data elements in an RDD; computed by the driver program
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)

foreach(func) Purpose: execute the function for each data element in the RDD; usually used to update an accumulator or to interact with external systems
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x * 10)))
Result: 1*10=10 4*10=40 3*10=30 2*10=20

first() Purpose: retrieve the first data element in the RDD; similar to take(1)
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.first()
Result: Int = 1

saveAsTextFile(path) Purpose: write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS
scala> val hamlet = sc.textFile("/users/akuntamukkala/temp/gutenburg.txt")
scala> hamlet.filter(_.contains("Shakespeare")).saveAsTextFile("/users/akuntamukkala/temp/filtered")
Result: akuntamukkala@localhost~/temp/filtered$ ls
_SUCCESS part-00000 part-00001
Common Set Operations
TRANSFORMATION & PURPOSE EXAMPLE & RESULT

union() Purpose: new RDD containing all elements from the source RDD and the argument
scala> val rdd1 = sc.parallelize(List('A', 'B'))
scala> val rdd2 = sc.parallelize(List('B', 'C'))
scala> rdd1.union(rdd2).collect()
Result: Array[Char] = Array(A, B, B, C)

intersection() Purpose: new RDD containing only the elements common to the source RDD and the argument
scala> rdd1.intersection(rdd2).collect()
Result: Array[Char] = Array(B)

cartesian() Purpose: new RDD with the cross product of all elements from the source RDD and the argument
scala> rdd1.cartesian(rdd2).collect()
Result: Array[(Char, Char)] = Array((A,B), (A,C), (B,B), (B,C))

subtract() Purpose: new RDD created by removing the data elements in the source RDD that are in common with the argument
scala> rdd1.subtract(rdd2).collect()
Result: Array[Char] = Array(A)

join(RDD, [numTasks]) Purpose: when invoked on (K,V) and (K,W), this operation creates a new RDD of (K, (V,W))
scala> val personFruit = sc.parallelize(Seq(("Andy", "Apple"), ("Bob", "Banana"), ("Charlie", "Cherry"), ("Andy", "Apricot")))
scala> val personSE = sc.parallelize(Seq(("Andy", "Google"), ("Bob", "Bing"), ("Charlie", "Yahoo"), ("Bob", "AltaVista")))
scala> personFruit.join(personSE).collect()
Result: Array[(String, (String, String))] = Array((Andy,(Apple,Google)), (Andy,(Apricot,Google)), (Charlie,(Cherry,Yahoo)), (Bob,(Banana,Bing)), (Bob,(Banana,AltaVista)))

cogroup(RDD, [numTasks]) Purpose: when invoked on (K,V) and (K,W), creates a new RDD of (K, (Iterable<V>, Iterable<W>))
scala> personFruit.cogroup(personSE).collect()
Result: Array[(String, (Iterable[String], Iterable[String]))] = Array((Andy,(ArrayBuffer(Apple, Apricot),ArrayBuffer(Google))), (Charlie,(ArrayBuffer(Cherry),ArrayBuffer(Yahoo))), (Bob,(ArrayBuffer(Banana),ArrayBuffer(Bing, AltaVista))))
Spark comes with R binding!
Spark and R: A Marriage Made In Heaven
SparkR: A Package For Computing with Spark
Side Bar: Running SparkR on Amazon AWS
1. Launch a Spark cluster at Amazon:
   ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type c3.2xlarge launch mysparkr
2. Launch SparkR on the cluster:
   chmod u+w /root/spark/
   ./spark/bin/sparkR

AWS is probably the best way for academics to access Spark, given the complexity of deploying the infrastructure themselves.
And Not Just R: Python and Spark
Python Code Example
Questions?