Apache Spark: Moving on from Hadoop
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Hadoop is unbeatable (?)
https://spark.apache.org/
Hadoop is unbeatable (?)
Google Trends
Hadoop is unbeatable (?)
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
Hadoop is unbeatable (?)
➢ Open source cluster computing
➢ Distributed disk → Distributed memory
➢ Created at UC Berkeley in 2009
➢ Last major release: Dec 2014
https://spark.apache.org/
What is Apache Spark?
➢ Core concept in Spark
➢ Distributed collection of objects in memory
➢ Operations run in parallel on RDDs
➢ Read from file, distributed file system, or parallelize existing collection
Resilient Distributed Dataset (RDD)
➢ RDDs are fault tolerant
➢ Spark maintains a DAG of the operations that produce each RDD, so lost partitions can be recomputed
➢ We can cache RDDs to avoid recomputation
Resilient Distributed Dataset (RDD)
https://spark.apache.org/
Spark Architecture
Spark context: interacts with the cluster
Driver program: main program, coordinates tasks
Cluster manager: assigns resources
Executors: carry out tasks, manage RDD chunks
➢ Main application for a Spark script
➢ Creates Spark context and coordinates executors
➢ Executes instructions in Java/Python/Scala
➢ ONLY operations on RDDs are parallelized; the rest of the driver code runs sequentially
Driver program
➢ We will stick to Scala
➢ Functional programming
➢ Fully interoperable with Java
➢ Shorter code due to Scala abstractions
Programming in Spark
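As a small taste of why Scala keeps Spark code short: the anonymous-function style used throughout the RDD API also works on plain Scala collections, so the snippet below runs without Spark at all. The object name and sample values are only illustrative.

```scala
object ScalaTaste {
  def main(args: Array[String]): Unit = {
    val ages = Array(18, 20, 25, 36)
    // Anonymous function passed to map: the same style Spark's RDD API uses
    val next = ages.map(a => a + 1)
    // Underscore shorthand for a two-argument anonymous function
    val sum = ages.reduce(_ + _)
    println(next.mkString(","))  // 19,21,26,37
    println(sum)                 // 99
  }
}
```

The same `map`/`reduce` calls, with the same lambdas, apply unchanged to RDDs in the Spark shell.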
➢ We have an interactive shell
spark-shell --master local[4]
spark-shell --master yarn-client
Programming in Spark
Number of cores to use in local mode
Use resources from a YARN cluster (e.g., a Hadoop cluster)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/02 11:43:03 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
scala>
Programming in Spark
➢ Special type of object
➢ Interacts with distributed resources:
○ Reads data
○ Adds resources (e.g., jars, files) to the cluster
○ Creates RDDs
○ etc.
➢ In the Spark shell, it is automatically created as sc
Spark context
➢ Load from local filesystem
➢ Load from HDFS
Loading a text file
val Students = sc.textFile( "file:///home/victor.sanchez/students.tsv" )
Students: org.apache.spark.rdd.RDD[String] = file:///home/victor.sanchez/students.tsv MappedRDD[1] at textFile at <console>:12
RDD of Strings
val Students = sc.textFile( "hdfs://localhost/user/victor.sanchez/students.tsv" )
val: final variable, the reference does not change
➢ Take n elements from RDD
➢ Get whole RDD into driver
What’s in my dataset?
val x = Students.take( 3 )
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25)
Array of Strings, LOCAL!! Resides in the driver
val x = Students.collect
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25, 4 Sherlock Holmes M 36, 5 John Watson M 38, 6 Sarah Kerrigan F 21, 7 Bruce Wayne M 32, 8 Tony Stark M 33, 9 Princess Peach F 21, 10 Peter Parker M 23)
Elements to take
If no arguments, no need for ()
➢ Parallelize collection from driver
➢ Broadcast a variable (sent to each worker only once)
Can I go the reverse way?
val myArray = Array( 1, 2, 3, 4, 5 )
myArray: Array[Int] = Array(1, 2, 3, 4, 5)
val myArrayPar = sc.parallelize( myArray )
myArrayPar: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:14
val x = 6
val xBroad = sc.broadcast( x )
xBroad: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(4)
Array creation
➢ Map: Project or generate new data
➢ It really takes an anonymous function as arg:
Basic operations on RDDs
val StudentsF = Students.map( l => l.split( "\t", -1 ) )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map
StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))
For each l in Students, generate its split
(l:String) => l.split( "\t", -1 )
(x:Int, y:Int) => x + y
Left of =>: input parameters; right of =>: output
➢ Map: Project or generate new data
Basic operations on RDDs
def splitWrapped( line: String ): Array[String] = { line.split( "\t", -1 ) }
val StudentsF = Students.map( splitWrapped )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map
StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))
Output
Input
➢ Generate a new RDD of students with an extra field indicating whether the student is under 25 years old
Exercise
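One possible solution, as a sketch: append a field computed from the age column. The helper below is exercised on a local Array so it runs without a cluster, but the same function can be passed to `StudentsF.map` in the Spark shell. The "yes"/"no" encoding of the new field is my choice, not the slides'.

```scala
object Under25Sketch {
  // The same function can be passed to StudentsF.map(...) in the Spark shell
  def addUnder25(s: Array[String]): Array[String] =
    s :+ (if (s(4).toInt < 25) "yes" else "no")

  def main(args: Array[String]): Unit = {
    val local = Array(
      Array("1", "John", "Doe", "M", "18"),
      Array("3", "Lara", "Croft", "F", "25")
    )
    // map appends the computed flag to each row
    val withFlag = local.map(addUnder25)
    withFlag.foreach(a => println(a.mkString("\t")))
  }
}
```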
➢ Foreach: Perform operation over each object
➢ Does not return a new RDD!
Basic operations on RDDs
StudentsF.foreach( x => print( x( 1 ) + " " ) )
John Bruce Tony Princess Peter Mary Lara Sherlock John Sarah (output interleaved with executor INFO log messages, since foreach runs on the executors, not in the driver)
➢ Filter: filter elements fulfilling a condition
Basic operations on RDDs
val StudentsFilt = StudentsF.filter( s => s( 0 ).toInt > 3 )
StudentsFilt: org.apache.spark.rdd.RDD[Array[String]] = FilteredRDD[13] at filter at <console>:16
StudentsFilt.take( 3 )
res13: Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38))
Anon. function
Convert to integer
➢ Distinct: only different objects
Basic operations on RDDs
val StudentsDis = StudentsF.map( s => s( 3 ) ).distinct
StudentsDis: org.apache.spark.rdd.RDD[String] = MappedRDD[18] at distinct at <console>:16
StudentsDis.take( 2 )
res16: Array[String] = Array(F, M)
➢ Fold: Reduce all objects to a single object
➢ Beware: the starting value (dummy) is applied once per partition, not just once
Basic operations on RDDs
val dummyStudent = Array( "12", "Clark", "Kent", "M", "25" )
val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { if ( value( 4 ).toInt > acc( 4 ).toInt) value else acc } )
StudentsFold: Array[String] = Array(5, John, Watson, M, 38)
dummyStudent: starting left operand; acc: left operand
val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { Array( "[" + acc( 0 ) + "-" + value( 0 ) + "]" , acc( 1 ), acc( 2 ), acc( 3 ), acc( 4 ) ) } )
StudentsFold: Array[String] = Array([[12-[[[[12-7]-8]-9]-10]]-[[[[[[12-1]-2]-3]-4]-5]-6]], Clark, Kent, M, 0)
➢ Reduce: Reduce all objects to a single object
Basic operations on RDDs
val StudentsRed = StudentsF.map( s => s( 4 ).toInt ).reduce( _ + _ )
StudentsRed: Int = 267
Binary operator: commutative and associative
➢ Max:
➢ Min:
Basic operations on RDDs
val StudentsMax = StudentsF.map( s => s( 4 ).toInt ).max
StudentsMax: Int = 38
val StudentsMin = StudentsF.map( s => s( 4 ).toInt ).min
StudentsMin: Int = 18
➢ Count:
➢ CountByValue: Count repetitions of elements
Basic operations on RDDs
val StudentsCount = StudentsF.count
StudentsCount: Long = 10
val StudentsCount = StudentsF.map( s => s( 3 ) ).countByValue
StudentsCount: scala.collection.Map[String,Long] = Map(M -> 6, F -> 4)
➢ Count the number of students that are female
Exercise
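A sketch of one solution: filter on the gender column, then count. The predicate is tested below on a local sample of rows; in the Spark shell the equivalent would be `StudentsF.filter( s => s( 3 ) == "F" ).count` (or `countByValue` on the gender column, as on the previous slide).

```scala
object CountFemaleSketch {
  // Same predicate works on the RDD: StudentsF.filter(isFemale).count
  def isFemale(s: Array[String]): Boolean = s(3) == "F"

  def main(args: Array[String]): Unit = {
    val local = Array(
      Array("1", "John", "Doe", "M", "18"),
      Array("2", "Mary", "Doe", "F", "20"),
      Array("3", "Lara", "Croft", "F", "25")
    )
    println(local.count(isFemale))  // 2
  }
}
```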
➢ Sample:
➢ RandomSplit: Splits into random RDDs
Basic operations on RDDs
val StudentsSample = StudentsF.sample( true, 0.5 )
StudentsSample: org.apache.spark.rdd.RDD[Array[String]] =PartitionwiseSampledRDD[33]
StudentsSample.take( 3 )
res18: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(2, Mary, Doe, F, 20))
val StudentsSplit = StudentsF.randomSplit( Array( 0.8, 0.2 ) )
StudentsSplit: Array[org.apache.spark.rdd.RDD[Array[String]]] = Array(PartitionwiseSampledRDD[46]
StudentsSplit( 0 ).collect
res26: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))
With replacement and fraction
Weights for each partition
➢ SortBy: Sort elements according to value
➢ Top: Get largest elements
Basic operations on RDDs
val StudentsSorted = StudentsF.sortBy( x => x( 4 ) )
StudentsSorted.collect
Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), ...
val StudentsTop = StudentsF.map( s => s( 4 ) ).top( 3 )
StudentsTop: Array[String] = Array(38, 36, 33)
Value to sort by (a String here; use x( 4 ).toInt for numeric order)
k elements to select
➢ Union: Two RDDs into one
Basic operations on RDDs
val StudentsUnder25 = StudentsF.filter( s => s( 4 ).toInt < 25 )
val StudentsOver30 = StudentsF.filter( s => s( 4 ).toInt > 30 )
val StudentsUnion = StudentsOver30.union( StudentsUnder25 )
StudentsUnion: org.apache.spark.rdd.RDD[Array[String]] = UnionRDD[75]
StudentsUnion.collect
Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(7, Bruce, Wayne, M, 32), Array(8, Tony, Stark, M, 33), Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))
➢ Intersection: Common elements in two RDDs
Basic operations on RDDs
val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsIntersect = StudentsUnder35.intersection( StudentsOver25 )
StudentsIntersect: org.apache.spark.rdd.RDD[Int] = MappedRDD[92]
StudentsIntersect.collect
res31: Array[Int] = Array(32, 33)
➢ Subtract: Elements in a RDD not in the other
Basic operations on RDDs
val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsSub = StudentsUnder35.subtract( StudentsOver25 )
StudentsSub: org.apache.spark.rdd.RDD[Int] = MappedRDD[12]
StudentsSub.collect
res0: Array[Int] = Array(18, 20, 21, 21, 23, 25)
➢ Tuples in Scala:
➢ Pair RDDs → RDDs with tuples (key, value)
Pair RDDs
val myTuple = ( 13, "Bob", "Squarepants", "M", 10 )
myTuple._1
res6: Int = 13
Tuple creation
Access fields
val PairStudents = StudentsF.map( s => ( s( 3 ), s ) )
PairStudents.take( 3 )
res8: Array[(String, Array[String])] = Array((M,Array(1, John, Doe, M, 18)), (F,Array(2, Mary, Doe, F, 20)), (F,Array(3, Lara, Croft, F, 25)))
➢ Join:
Operations on Pair RDDs
val PairStudentsId = StudentsF.map( s => ( s( 0 ), s ) )
val PairGrades = GradesF.map( g => ( g( 0 ), g ) )
val StudentGrades = PairStudentsId.join( PairGrades )
StudentGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Array[String]))]
StudentGrades.take( 3 )
res13: Array[(String, (Array[String], Array[String]))] = Array((4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Math, 2.3))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Biology, 6.7))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Engineering, 8.0))))
Prepare key, value structure
Output is (key, (value1, value2))
➢ Left Join:
Operations on Pair RDDs
val auxRDD = sc.parallelize( Array( Array( "0", "Dummy", "Student", "M", "10" ), Array( "1", "John", "Doe", "M", "18" ) ) )
val auxPairRDD = auxRDD.map( a => ( a( 0 ), a ) )
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades )
auxGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] = FlatMappedValuesRDD[34]
auxGrades.take( 2 )
res23: Array[(String, (Array[String], Option[Array[String]]))] = Array((0,(Array(0, Dummy, Student, M, 10),None)), (1,(Array(1, John, Doe, M, 18),Some([Ljava.lang.String;@30d4fbf))))
Option = None or a value
String representation of the non-empty value in the Option
➢ Left Join (cont):
Operations on Pair RDDs
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades ).map( p => ( p._1, ( p._2._1, if(!p._2._2.isEmpty) p._2._2.get ) ) )
auxGrades.take( 2 )
res26: Array[(String, (Array[String], Any))] = Array((0,(Array(0, Dummy, Student, M, 10),())), (1,(Array(1, John, Doe, M, 18),Array(1, Math, 5.6))))
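The `if( !opt.isEmpty ) opt.get` pattern above loses type information: the result type becomes `Any` (and `()` when the Option is empty). A cleaner alternative, shown here on plain `Option` values so it runs without Spark, is `getOrElse` with an explicit default; the same call works on the `Option[Array[String]]` side of a `leftOuterJoin`. The empty-array default is my choice.

```scala
object OptionSketch {
  def main(args: Array[String]): Unit = {
    val some: Option[Array[String]] = Some(Array("1", "Math", "5.6"))
    val none: Option[Array[String]] = None
    // getOrElse keeps the result typed as Array[String] instead of Any
    println(some.getOrElse(Array.empty[String]).mkString(","))
    println(none.getOrElse(Array.empty[String]).length)  // 0
  }
}
```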
➢ reduceByKey: To single object by key
Operations on Pair RDDs
val RedKeys = PairStudents.map({case (k,v) => ( k, v( 4 ).toInt ) }).reduceByKey( _ + _ )
RedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[84]
RedKeys.take( 2 )
res33: Array[(String, Int)] = Array((F,87), (M,180))
More than 1 line in anon function
Pattern matching
Result is a RDD
➢ foldByKey: To single object by key
Operations on Pair RDDs
val foldedKeys = PairStudents.map({case(k,v) => (k,v(4).toInt)}).foldByKey(0)((a,b) => Math.max(a,b))
foldedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[92]
foldedKeys.collect
res30: Array[(String, Int)] = Array((F,25), (M,38))
Left parameter
Function
➢ groupByKey: group values with same key
Operations on Pair RDDs
val groupedKeys = PairStudents.groupByKey
groupedKeys: org.apache.spark.rdd.RDD[(String, Iterable[Array[String]])] = ShuffledRDD[93]
groupedKeys.collect
res35: Array[(String, Iterable[Array[String]])] = Array((F,CompactBuffer([Ljava.lang.String;@31788c16, [Ljava.lang.String;@613511b9, [Ljava.lang.String;@631eba8a, [Ljava.lang.String;@7668ecdc)), (M,CompactBuffer([Ljava.lang.String;@62969c3f, [Ljava.lang.String;@dec1eaa, [Ljava.lang.String;@8d1320a, [Ljava.lang.String;@5e2c330b, [Ljava.lang.String;@27cb477a, [Ljava.lang.String;@12c1aeff)))
groupedKeys.map( { case (k,v) => ( k, v.map( x => "(" + x.mkString( "," ) + ")" ) ) } ).take( 2 )
res40: Array[(String, Iterable[String])] = Array((F,List((2,Mary,Doe,F,20), (3,Lara,Croft,F,25), (6,Sarah,Kerrigan,F,21), (9,Princess,Peach,F,21))), (M,List((1,John,Doe,M,18), (4,Sherlock,Holmes,M,36), (5,John,Watson,M,38), (7,Bruce,Wayne,M,32), (8,Tony,Stark,M,33), (10,Peter,Parker,M,23))))
String repr of Iterable[Array[String]]
➢ Really useful for AI and ML
➢ For loop example
var StudentsLoop = StudentsF.map( s => ( s( 0 ).toInt, s( 1 ), s( 2 ) ) )
for( i <- 1 to 10 ) {
  StudentsLoop = StudentsLoop.map( { case (id,name,surname) => ( id+1, name, surname ) } )
}
StudentsLoop.collect
res43: Array[(Int, String, String)] = Array((11,John,Doe), (12,Mary,Doe), (13,Lara,Croft), (14,Sherlock,Holmes), (15,John,Watson), (16,Sarah,Kerrigan), (17,Bruce,Wayne), (18,Tony,Stark), (19,Princess,Peach), (20,Peter,Parker))
Looping!
Non final variable
➢ Persist RDDs in memory/disk
➢ Other levels of persistence:
○ MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, etc.
Caching
PairStudents.cache
import org.apache.spark.storage.StorageLevel
GradesF.persist( StorageLevel.MEMORY_AND_DISK )
➢ Store to local or HDFS
Saving RDDs
PairStudents.map( x => ( x._1, "(" + x._2.mkString( "," ) + ")" ) ).saveAsTextFile( "file:///home/victor.sanchez/res" )
PairStudents.map( x => ( x._1, "(" + x._2.mkString( "," ) + ")" ) ).saveAsTextFile( "hdfs:///user/victor.sanchez/res" )
Trick to convert Array[String] properly to String
package es.upv.dsic.iarfid.haia
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
object mySparkScript {
def average(data: Iterable[Double]): Double = { data.reduceLeft( _ + _ )/data.size }
def main( args: Array[String] ) {
val sc = new SparkContext( ( new SparkConf() ).setAppName( "MY SPARK SCRIPT" ) )
val Grades = sc.textFile( args( 0 ) ).map( l => l.split( "\t", -1 ) ).map( g => ( g( 1 ), g( 2 ).toDouble ) )
val GradesGr = Grades.groupByKey.map( g => ( g._1, average( g._2 ) ) )
GradesGr.saveAsTextFile( args( 1 ) )
}
}
Scripting
➢ package line: the package
➢ import lines: imports
➢ average: support method
➢ object mySparkScript: singleton object
➢ main: main method
➢ args: program arguments
Compiling Spark code
➢ Scala code is compiled to Java bytecode
➢ sbt is a build tool for Scala and Java
➢ sbt can help us manage our dependencies
➢ A Spark cluster needs a fat jar → the sbt-assembly plugin can build one!
Spark project example
build.sbt : main .sbt file (Scala code that builds your source!)
lib/ : extra libraries
project/
  plugins.sbt : plugins needed by sbt to compile your source
  build.scala : how to compile your main .sbt
src/
  main/ : your project code
    resources/ : additional files for your jar
  test/ : test sources
target/ : output jar for your project
import AssemblyKeys._
assemblySettings
name := "haia"
version := "1.0"
scalaVersion := "2.10.4"
organization := "es.upv"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
)
jarName in assembly := {
name.value + ".jar"
}
outputPath in assembly := {
file( "target/" + (jarName in assembly).value )
}
Main sbt file example
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
case "unwanted.txt" => MergeStrategy.discard
case PathList( "META-INF", ".*pom.properties" ) => MergeStrategy.first
case x => old(x)
}
}
➢ Fat jar?
○ A jar with all of the jar files it depends on
○ Workers need all dependencies
○ The sbt-assembly plugin can generate fat jars
➢ Generating a fat jar:
sbt assembly
Generating a fat jar
spark-submit --class es.upv.dsic.iarfid.haia.mySparkScript --master yarn-cluster target/haia.jar hdfs:///user/victor.sanchez/grades.tsv hdfs:///user/victor.sanchez/spark_submit_ex
How to execute Spark code from jar
Singleton object to execute
Fat jar file
Program parameters
➢ Simulated annealing → Optimization method
➢ Multi-point → Exploring from different points
➢ Function to optimize:
Exercise: Multi-point simulated annealing
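A sketch of the structure (not a full solution): run independent annealing chains from several starting points and keep the best result. Everything below is an assumption — the objective f(x) = (x - 2)² to minimize (the real one is on the slide), the cooling schedule, and the neighbour step. In Spark, the local `startPoints.map(anneal).minBy(...)` would become `sc.parallelize( startPoints ).map( anneal ).reduce( (a,b) => if (a._2 < b._2) a else b )`.

```scala
import scala.util.Random

object MultiPointSA {
  // Hypothetical objective to minimize; the real one is given on the slide
  def f(x: Double): Double = (x - 2.0) * (x - 2.0)

  // One independent annealing chain from a given start point
  def anneal(start: Double): (Double, Double) = {
    val rnd = new Random(start.hashCode)
    var x = start
    var temp = 1.0
    while (temp > 1e-3) {
      val candidate = x + (rnd.nextDouble() - 0.5)  // random neighbour
      val delta = f(candidate) - f(x)
      // Accept improvements always; worse moves with probability exp(-delta/temp)
      if (delta < 0 || rnd.nextDouble() < math.exp(-delta / temp)) x = candidate
      temp *= 0.99                                  // geometric cooling
    }
    (x, f(x))
  }

  def main(args: Array[String]): Unit = {
    val startPoints = Array(-10.0, 0.0, 5.0, 10.0)
    // In Spark: sc.parallelize(startPoints).map(anneal).reduce(...)
    val best = startPoints.map(anneal).minBy(_._2)
    println(f"best x = ${best._1}%.3f, f = ${best._2}%.5f")
  }
}
```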
Single point simulated annealing
Time to work!
➢ Spark is still a relatively young technology
➢ Unexpected out-of-memory exceptions
➢ Memory issues are difficult to debug in Spark
➢ Avoid out-of-memory scenarios:
○ Use object serialization (Java or Kryo)
○ Choose data structures wisely
○ Increase parallelism (spark.default.parallelism)
○ Avoid groupBy operations → prefer reduceBy
○ More memory for shuffles (spark.shuffle.spill=false or a higher spark.shuffle.memoryFraction)
A final advice on Spark
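For illustration, the tuning knobs above can be passed to spark-submit via --conf. The values below are placeholders, not recommendations; tune them for your own cluster and Spark 1.x version.

```shell
# Hypothetical example values; tune for your own cluster
spark-submit \
  --class es.upv.dsic.iarfid.haia.mySparkScript \
  --master yarn-cluster \
  --conf spark.default.parallelism=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.memoryFraction=0.4 \
  target/haia.jar <input> <output>
```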
Hadoop ecosystem vs Spark
Hadoop ecosystem:
➢ Disk-based parallelization
➢ No looping
➢ More mature project
➢ Many organizations use it
Spark:
➢ Memory-based parallelization
➢ Looping (nice for AI and ML)
➢ Still taking its initial steps
➢ Changing all Hadoop code has a cost
Extra information
➢ http://spark.apache.org/
➢ Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau et al. Ed. O’Reilly
➢ StackOverflow