Survey of Spark for Data Pre-Processing and Analytics
TRANSCRIPT
Yannick Pouliot Consulting© 2015 all rights reserved
Yannick Pouliot, [email protected]
8/12/2015
Spark: New Vistas of Computing Power
Spark: An Open-Source, Cluster-Based Distributed Computing Framework
Distributed Computing à la Spark
Here, "distributed computing" means:
• Multi-node cluster: master, slaves
• The paradigm is:
  o Minimize networking by allocating chunks of data to slaves
  o Slaves receive code to run on their subset of the data
• Lots of redundancy
• No communication between nodes
  o Only with the master
• Operates on commodity hardware
Distributed Computing Is The Rage Everywhere … Except In Academia
That Should Disturb You
Some Ideal Applications
(here, using Spark's MLlib library)
The highly distributed nature of Spark means it is ideal for…
• Generating lots of…
  o Trees in a random forest
  o Permutations for computing distributions
  o Samplings (bootstrapping, sub-sampling)
• Cleaning up textual data, e.g.,
  o NLP on EHR records
  o Mapping variant spellings of drug names to UMLS CUIs
• Normalizing datasets
• Computing basic statistics
• Hypothesis testing (see the MLlib sketch after this list)
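As a rough illustration of the basic-statistics and hypothesis-testing items above, a minimal MLlib sketch might look like the following (assuming the Spark shell's SparkContext sc and the Spark 1.x MLlib API; the numbers are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// Toy dataset: each Vector is one observation with three features
val observations = sc.parallelize(Seq(
  Vectors.dense(1.0, 10.0, 100.0),
  Vectors.dense(2.0, 20.0, 200.0),
  Vectors.dense(3.0, 30.0, 300.0)
))

// Column-wise summary statistics, computed across the cluster
val summary = Statistics.colStats(observations)
println(summary.mean)      // per-column means
println(summary.variance)  // per-column variances

// Pearson chi-squared goodness-of-fit test (against a uniform expected distribution)
val observedCounts = Vectors.dense(12.0, 7.0, 11.0)
println(Statistics.chiSqTest(observedCounts))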
Spark = Speed
• The focus of Spark is to make data analytics fast:
  o fast to run code
  o fast to write code
• To run programs faster, Spark provides primitives for in-memory cluster computing
  o A job can load data into memory and query it repeatedly
  o Much quicker than disk-based systems like Hadoop MapReduce (see the caching sketch below)
• To make programming faster, Spark integrates with the Scala programming language
  o Enables manipulation of distributed datasets as if they were local collections
  o Spark can be used interactively to query big data from the Scala interpreter
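As a rough sketch of that "load once, query repeatedly" pattern (illustrative only; the HDFS path is hypothetical and sc is the shell's SparkContext):

// Load a (hypothetical) log file and ask Spark to keep it in cluster memory
val events = sc.textFile("hdfs:///data/events.log").cache()

// The first action materializes the RDD and caches its partitions in memory
val total = events.count()

// Later queries reuse the cached partitions instead of re-reading from disk
val errors = events.filter(_.contains("ERROR")).count()
val warnings = events.filter(_.contains("WARN")).count()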
Marketing-Level Architecture Stack
(Figure: the Spark component stack, running on top of HDFS.)
Architecture: Closer To Reality
Dynamics
Data Sources Integration
Ecosystem
Spark’s Machine Learning Library: MLlib
MLlib: Current Functionality
Spark Programming Model
Spark follows the REPL model: read–eval–print loop
  o Similar to the R and Python shells
  o Ideal for exploratory data analysis
Writing a Spark program typically consists of:
1. Reading some input data into local memory
2. Invoking transformations or actions that operate on a subset of the data in local memory
3. Running those transformations/actions in a distributed fashion across the network (memory or disk)
4. Deciding what actions to undertake next
Best of all, it can all be done within the shell, just like R (or Python); a minimal session is sketched below.
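A minimal shell session following those four steps might look like this (a sketch only; the file name and column layout are hypothetical):

// 1. Read some input data
val lines = sc.textFile("patients.csv")

// 2. Define transformations (lazy; nothing runs yet): parse each line, keep the age column
val ages = lines.map(_.split(",")).map(fields => fields(2).toInt)

// 3. Invoke actions, which run the work in a distributed fashion
val meanAge = ages.sum() / ages.count()

// 4. Inspect the result and decide what to do next
println(meanAge)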
RDD: The Secret Sauce
RDD = Distributed Data Frame
• RDD = "Resilient Distributed Dataset"
  o Similar to an R data frame, but laid out across a cluster of machines as a collection of partitions
    • Partition = subset of the data
  o The Spark master node remembers all transformations applied to an RDD ("lineage")
    • If a partition is lost (e.g., a slave machine goes down), it can easily be reconstructed on some other machine in the cluster
  o "Resilient" = an RDD can always be reconstructed, thanks to lineage tracking
• RDDs are Spark's fundamental abstraction for representing a collection of objects that can be distributed across multiple machines in a cluster
  o val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
    • 4 = number of partitions
    • Partitions are the fundamental unit of parallelism
  o val rdd2 = sc.textFile("linkage")
A small lineage example follows below.
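A small illustration of partitions and lineage, assuming the shell's built-in SparkContext sc:

// Distribute a local collection across 4 partitions
val rdd = sc.parallelize(Array(1, 2, 2, 4), 4)
println(rdd.partitions.length)   // 4

// Transformations extend the lineage rather than executing immediately
val doubled = rdd.map(_ * 2)

// toDebugString prints the chain of RDDs Spark would use to recompute a lost partition
println(doubled.toDebugString)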
Partitions = Data Slices
• Partitions are the "slices" a dataset is cut into
• Spark will run one task for each partition
• Typically 2-4 partitions for each CPU in the cluster
  o Normally, Spark tries to set the number of partitions automatically based on your cluster
  o It can also be set manually as a second parameter to parallelize (e.g., sc.parallelize(data, 4)); see the sketch below
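For example (a sketch, again assuming the shell's SparkContext sc):

// Let Spark choose the number of partitions based on the cluster
val auto = sc.parallelize(1 to 1000)
println(auto.partitions.length)

// Or request a specific number of partitions explicitly
val manual = sc.parallelize(1 to 1000, 4)
println(manual.partitions.length)   // 4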
A Brief Word About Scala
• One of the two languages Spark is built on (the other being Java)
• Compiles to Java byte code
• REPL-based, so good for exploratory data analysis
• Pure object-oriented language
• Much more streamlined than Java
  o Far fewer lines of code
• Lots of type inference
• Seamless calls to Java
  o E.g., you might be using a Java method inside a Scala object (see the small example below)
• Parts of the Broad's GATK tooling are written in Scala
Example: Word Count In Spark Using Scala
scala> val hamlet = sc.textFile("/Users/akuntamukkala/temp/gutenburg.txt")
hamlet: org.apache.spark.rdd.RDD[String] = MappedRDD[1] at textFile at <console>:12

scala> val topWordCount = hamlet.flatMap(str => str.split(" ")).filter(!_.isEmpty).map(word => (word, 1)).reduceByKey(_ + _).map{ case (word, count) => (count, word) }.sortByKey(false)
topWordCount: org.apache.spark.rdd.RDD[(Int, String)] = MapPartitionsRDD[10] at sortByKey at <console>:14

scala> topWordCount.take(5).foreach(x => println(x))
(1044,the)
(730,and)
(679,of)
(648,to)
(511,I)
The chain of RDDs produced above is the RDD lineage.
More On Word Count
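The figure from this slide is not reproduced in the transcript; as a substitute, here is the same pipeline from the previous slide, annotated step by step:

val topWordCount = hamlet
  .flatMap(str => str.split(" "))               // split each line into words
  .filter(!_.isEmpty)                           // drop empty tokens
  .map(word => (word, 1))                       // pair each word with a count of 1
  .reduceByKey(_ + _)                           // sum the counts for each word
  .map { case (word, count) => (count, word) }  // swap so the count becomes the key
  .sortByKey(false)                             // sort by count, descending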
A Quick Tour of Common Spark Functions
Common Transformations
TRANSFORMATION & PURPOSE EXAMPLE & RESULT
filter(func) Purpose: new RDD by selecting those data elements on which func returns true
scala> val rdd = sc.parallelize(List("ABC", "BCD", "DEF"))
scala> val filtered = rdd.filter(_.contains("C"))
scala> filtered.collect()
Result: Array[String] = Array(ABC, BCD)

map(func) Purpose: return new RDD by applying func on each data element
scala> val rdd = sc.parallelize(List(1, 2, 3, 4, 5))
scala> val times2 = rdd.map(_ * 2)
scala> times2.collect()
Result: Array[Int] = Array(2, 4, 6, 8, 10)

flatMap(func) Purpose: similar to map, but func returns a Seq instead of a value; for example, mapping a sentence into a Seq of words
scala> val rdd = sc.parallelize(List("Spark is awesome", "It is fun"))
scala> val fm = rdd.flatMap(str => str.split(" "))
scala> fm.collect()
Result: Array[String] = Array(Spark, is, awesome, It, is, fun)

reduceByKey(func, [numTasks]) Purpose: to aggregate the values of a key using a function; numTasks is an optional parameter to specify the number of reduce tasks
scala> val word1 = fm.map(word => (word, 1))
scala> val wrdCnt = word1.reduceByKey(_ + _)
scala> wrdCnt.collect()
Result: Array[(String, Int)] = Array((is,2), (It,1), (awesome,1), (Spark,1), (fun,1))

groupByKey([numTasks]) Purpose: to convert (K,V) to (K,Iterable<V>)
scala> val cntWrd = wrdCnt.map{ case (word, count) => (count, word) }
scala> cntWrd.groupByKey().collect()
Result: Array[(Int, Iterable[String])] = Array((1,ArrayBuffer(It, awesome, Spark, fun)), (2,ArrayBuffer(is)))

distinct([numTasks]) Purpose: eliminate duplicates from an RDD
scala> fm.distinct().collect()
Result: Array[String] = Array(is, It, awesome, Spark, fun)
Common Actions
ACTION & PURPOSE EXAMPLE & RESULT

count() Purpose: get the number of data elements in the RDD
scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
scala> rdd.count()
Result: Long = 3

collect() Purpose: get all the data elements in an RDD as an array
scala> val rdd = sc.parallelize(List('A', 'B', 'c'))
scala> rdd.collect()
Result: Array[Char] = Array(A, B, c)

reduce(func) Purpose: aggregate the data elements in an RDD using this function, which takes two arguments and returns one
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.reduce(_ + _)
Result: Int = 10

take(n) Purpose: fetch the first n data elements in an RDD; computed by the driver program
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.take(2)
Result: Array[Int] = Array(1, 2)

foreach(func) Purpose: execute the function for each data element in the RDD; usually used to update an accumulator or to interact with external systems
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.foreach(x => println("%s*10=%s".format(x, x * 10)))
Result: 1*10=10 4*10=40 3*10=30 2*10=20

first() Purpose: retrieve the first data element in the RDD; similar to take(1)
scala> val rdd = sc.parallelize(List(1, 2, 3, 4))
scala> rdd.first()
Result: Int = 1

saveAsTextFile(path) Purpose: write the content of the RDD to a text file, or a set of text files, on the local file system or HDFS
scala> val hamlet = sc.textFile("/users/akuntamukkala/temp/gutenburg.txt")
scala> hamlet.filter(_.contains("Shakespeare")).saveAsTextFile("/users/akuntamukkala/temp/filtered")
Result: akuntamukkala@localhost~/temp/filtered$ ls
_SUCCESS part-00000 part-00001
Common Set Operations
TRANSFORMATION & PURPOSE EXAMPLE & RESULT

union() Purpose: new RDD containing all elements from the source RDD and the argument
scala> val rdd1 = sc.parallelize(List('A', 'B'))
scala> val rdd2 = sc.parallelize(List('B', 'C'))
scala> rdd1.union(rdd2).collect()
Result: Array[Char] = Array(A, B, B, C)

intersection() Purpose: new RDD containing only the elements common to the source RDD and the argument
scala> rdd1.intersection(rdd2).collect()
Result: Array[Char] = Array(B)

cartesian() Purpose: new RDD with the cross product of all elements from the source RDD and the argument
scala> rdd1.cartesian(rdd2).collect()
Result: Array[(Char, Char)] = Array((A,B), (A,C), (B,B), (B,C))

subtract() Purpose: new RDD created by removing the data elements in the source RDD that are in common with the argument
scala> rdd1.subtract(rdd2).collect()
Result: Array[Char] = Array(A)

join(RDD, [numTasks]) Purpose: when invoked on (K,V) and (K,W), this operation creates a new RDD of (K, (V,W))
scala> val personFruit = sc.parallelize(Seq(("Andy", "Apple"), ("Bob", "Banana"), ("Charlie", "Cherry"), ("Andy", "Apricot")))
scala> val personSE = sc.parallelize(Seq(("Andy", "Google"), ("Bob", "Bing"), ("Charlie", "Yahoo"), ("Bob", "AltaVista")))
scala> personFruit.join(personSE).collect()
Result: Array[(String, (String, String))] = Array((Andy,(Apple,Google)), (Andy,(Apricot,Google)), (Charlie,(Cherry,Yahoo)), (Bob,(Banana,Bing)), (Bob,(Banana,AltaVista)))

cogroup(RDD, [numTasks]) Purpose: when invoked on (K,V) and (K,W), creates a new RDD of (K, (Iterable<V>, Iterable<W>))
scala> personFruit.cogroup(personSE).collect()
Result: Array[(String, (Iterable[String], Iterable[String]))] = Array((Andy,(ArrayBuffer(Apple, Apricot),ArrayBuffer(Google))), (Charlie,(ArrayBuffer(Cherry),ArrayBuffer(Yahoo))), (Bob,(ArrayBuffer(Banana),ArrayBuffer(Bing, AltaVista))))
Spark comes with R binding!
Spark and R: A Marriage Made In Heaven
SparkR: A Package For Computing with Spark
Side Bar: Running SparkR on Amazon AWS
1. Launch a Spark cluster at Amazon:
   ./spark-ec2 --key-pair=spark-df --identity-file=/Users/code/Downloads/spark-df.pem --region=eu-west-1 -s 1 --instance-type c3.2xlarge launch mysparkr
2. Launch SparkR on the cluster:
   chmod u+w /root/spark/
   ./spark/bin/sparkR

AWS is probably the best way for academics to access Spark, given the complexity of deploying the infrastructure themselves.
And Not Just R: Python and Spark
Python Code Example
Questions?