Apache Spark: Moving on from Hadoop
Víctor Sánchez Anguix
Universitat Politècnica de València
MSc in Artificial Intelligence, Pattern Recognition, and Digital Image
Course 2014/2015
Apache Spark: Moving on from Hadoop. MSc. in Artificial Intelligence, Pattern Recognition and Digital Image
Hadoop is unbeatable (?)
https://spark.apache.org/
Hadoop is unbeatable (?)
Google Trends
Hadoop is unbeatable (?)
http://www.datanami.com/2014/03/06/apache_spark_3_real-world_use_cases/
Hadoop is unbeatable (?)
➢ Open source cluster computing
➢ Distributed disk → Distributed memory
➢ Created at UC Berkeley in 2009
➢ Last major release: Dec 2014
https://spark.apache.org/
What is Apache Spark?
➢ Core concept in Spark
➢ Distributed collection of objects in memory
➢ Operations run in parallel on RDDs
➢ Read from file, distributed file system, or parallelize existing collection
Resilient Distributed Dataset (RDD)
➢ RDDs are fault tolerant
➢ Spark maintains a DAG of the operations that produce each RDD, so lost partitions can be recomputed
➢ We can cache RDDs to avoid recomputation
Resilient Distributed Dataset (RDD)
https://spark.apache.org/
Spark Architecture
Spark context: interacts with the cluster
Driver program: main program, coordinates tasks
Cluster manager: assigns resources
Executors: carry out tasks, manage RDD chunks
➢ Main application for a Spark script
➢ Creates Spark context and coordinates executors
➢ Executes instructions in Java/Python/Scala
➢ ONLY operations on RDDs are parallelized; the rest of the driver code runs sequentially
Driver program
➢ We will stick to Scala
➢ Functional programming
➢ Fully interoperable with Java
➢ Shorter code due to Scala abstractions
Programming in Spark
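As a small taste of why Scala keeps Spark code short: the anonymous-function style used throughout the RDD API also works on plain Scala collections, so the snippet below runs without Spark at all. The object name and sample values are only illustrative.

```scala
object ScalaTaste {
  def main(args: Array[String]): Unit = {
    val ages = Array(18, 20, 25, 36)
    // Anonymous function passed to map: the same style Spark's RDD API uses
    val next = ages.map(a => a + 1)
    // Underscore shorthand for a two-argument anonymous function
    val sum = ages.reduce(_ + _)
    println(next.mkString(","))  // 19,21,26,37
    println(sum)                 // 99
  }
}
```

The same `map`/`reduce` calls, with the same lambdas, apply unchanged to RDDs in the Spark shell.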
➢ We have an interactive shell
spark-shell --master local[4]
spark-shell --master yarn-client
Programming in Spark
Number of cores to use in local mode
Use resources from a YARN cluster (e.g., a Hadoop cluster)
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.2.0
/_/
Using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_25)
Type in expressions to have them evaluated.
Type :help for more information.
15/02/02 11:43:03 INFO repl.SparkILoop: Created spark context..
Spark context available as sc.
scala>
Programming in Spark
➢ Special type of object
➢ Interacts with distributed resources:
○ Reads data
○ Adds resources (e.g., jars, files) to the cluster
○ Creates RDDs
○ etc.
➢ In the Spark shell, it is automatically created as sc
Spark context
➢ Load from local filesystem
➢ Load from HDFS
Loading a text file
val Students = sc.textFile( "file:///home/victor.sanchez/students.tsv" )
Students: org.apache.spark.rdd.RDD[String] = file:///home/victor.sanchez/students.tsv MappedRDD[1] at textFile at <console>:12
RDD of Strings
val Students = sc.textFile( "hdfs://localhost/user/victor.sanchez/students.tsv" )
val: final variable, the reference does not change
➢ Take n elements from RDD
➢ Get whole RDD into driver
What’s in my dataset?
val x = Students.take( 3 )
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25)
Array of Strings, LOCAL!! Resides in the driver
val x = Students.collect
x: Array[String] = Array(1 John Doe M 18, 2 Mary Doe F 20, 3 Lara Croft F 25, 4 Sherlock Holmes M 36, 5 John Watson M 38, 6 Sarah Kerrigan F 21, 7 Bruce Wayne M 32, 8 Tony Stark M 33, 9 Princess Peach F 21, 10 Peter Parker M 23)
Elements to take
If no arguments, no need for ()
➢ Parallelize collection from driver
➢ Broadcast a variable (sent to each worker only once)
Can I go the reverse way?
val myArray = Array( 1, 2, 3, 4, 5 )
myArray: Array[Int] = Array(1, 2, 3, 4, 5)
val myArrayPar = sc.parallelize( myArray )
myArrayPar: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[4] at parallelize at <console>:14
val x = 6
val xBroad = sc.broadcast( x )
xBroad: org.apache.spark.broadcast.Broadcast[Int] = Broadcast(4)
Array creation
➢ Map: Project or generate new data
➢ It really takes an anonymous function as arg:
Basic operations on RDDs
val StudentsF = Students.map( l => l.split( "\t", -1 ) )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map
StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))
For each l in Students, generate its split
(l:String) => l.split( "\t", -1 )
(x:Int, y:Int) => x + y
Left of =>: input parameters; right of =>: output
➢ Map: Project or generate new data
Basic operations on RDDs
def splitWrapped( line: String ): Array[String] = { line.split( "\t", -1 ) }
val StudentsF = Students.map( splitWrapped )
StudentsF: org.apache.spark.rdd.RDD[Array[String]] = MappedRDD[6] at map
StudentsF.take( 2 )
res6: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20))
Output
Input
➢ Generate a new RDD of students with an extra field indicating whether the student is under 25 years old
Exercise
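One possible solution, as a sketch: append a field computed from the age column. The helper below is exercised on a local Array so it runs without a cluster, but the same function can be passed to `StudentsF.map` in the Spark shell. The "yes"/"no" encoding of the new field is my choice, not the slides'.

```scala
object Under25Sketch {
  // The same function can be passed to StudentsF.map(...) in the Spark shell
  def addUnder25(s: Array[String]): Array[String] =
    s :+ (if (s(4).toInt < 25) "yes" else "no")

  def main(args: Array[String]): Unit = {
    val local = Array(
      Array("1", "John", "Doe", "M", "18"),
      Array("3", "Lara", "Croft", "F", "25")
    )
    // map appends the computed flag to each row
    val withFlag = local.map(addUnder25)
    withFlag.foreach(a => println(a.mkString("\t")))
  }
}
```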
➢ Foreach: Perform operation over each object
➢ Does not return a new RDD!
Basic operations on RDDs
StudentsF.foreach( x => print( x( 1 ) + " " ) )
John Bruce Tony Princess Peter Mary Lara Sherlock John Sarah (output interleaved with executor INFO log messages, since foreach runs on the executors, not in the driver)
➢ Filter: filter elements fulfilling a condition
Basic operations on RDDs
val StudentsFilt = StudentsF.filter( s => s( 0 ).toInt > 3 )
StudentsFilt: org.apache.spark.rdd.RDD[Array[String]] = FilteredRDD[13] at filter at <console>:16
StudentsFilt.take( 3 )
res13: Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38))
Anon. function
Convert to integer
➢ Distinct: only different objects
Basic operations on RDDs
val StudentsDis = StudentsF.map( s => s( 3 ) ).distinct
StudentsDis: org.apache.spark.rdd.RDD[String] = MappedRDD[18] at distinct at <console>:16
StudentsDis.take( 2 )
res16: Array[String] = Array(F, M)
➢ Fold: Reduce all objects to a single object
➢ Beware: the starting value (dummy) is applied once per partition, not just once
Basic operations on RDDs
val dummyStudent = Array( "12", "Clark", "Kent", "M", "25" )
val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { if ( value( 4 ).toInt > acc( 4 ).toInt) value else acc } )
StudentsFold: Array[String] = Array(5, John, Watson, M, 38)
dummyStudent: starting left operand; acc: left operand
val StudentsFold = StudentsF.fold( dummyStudent )( (acc,value) => { Array( "[" + acc( 0 ) + "-" + value( 0 ) + "]" , acc( 1 ), acc( 2 ), acc( 3 ), acc( 4 ) ) } )
StudentsFold: Array[String] = Array([[12-[[[[12-7]-8]-9]-10]]-[[[[[[12-1]-2]-3]-4]-5]-6]], Clark, Kent, M, 0)
➢ Reduce: Reduce all objects to a single object
Basic operations on RDDs
val StudentsRed = StudentsF.map( s => s( 4 ).toInt ).reduce( _ + _ )
StudentsRed: Int = 267
Binary operator: commutative and associative
➢ Max:
➢ Min:
Basic operations on RDDs
val StudentsMax = StudentsF.map( s => s( 4 ).toInt ).max
StudentsMax: Int = 38
val StudentsMin = StudentsF.map( s => s( 4 ).toInt ).min
StudentsMin: Int = 18
➢ Count:
➢ CountByValue: Count repetitions of elements
Basic operations on RDDs
val StudentsCount = StudentsF.count
StudentsCount: Long = 10
val StudentsCount = StudentsF.map( s => s( 3 ) ).countByValue
StudentsCount: scala.collection.Map[String,Long] = Map(M -> 6, F -> 4)
➢ Count the number of students that are female
Exercise
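A sketch of one solution: filter on the gender column, then count. The predicate is tested below on a local sample of rows; in the Spark shell the equivalent would be `StudentsF.filter( s => s( 3 ) == "F" ).count` (or `countByValue` on the gender column, as on the previous slide).

```scala
object CountFemaleSketch {
  // Same predicate works on the RDD: StudentsF.filter(isFemale).count
  def isFemale(s: Array[String]): Boolean = s(3) == "F"

  def main(args: Array[String]): Unit = {
    val local = Array(
      Array("1", "John", "Doe", "M", "18"),
      Array("2", "Mary", "Doe", "F", "20"),
      Array("3", "Lara", "Croft", "F", "25")
    )
    println(local.count(isFemale))  // 2
  }
}
```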
➢ Sample:
➢ RandomSplit: Splits into random RDDs
Basic operations on RDDs
val StudentsSample = StudentsF.sample( true, 0.5 )
StudentsSample: org.apache.spark.rdd.RDD[Array[String]] =PartitionwiseSampledRDD[33]
StudentsSample.take( 3 )
res18: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(2, Mary, Doe, F, 20))
val StudentsSplit = StudentsF.randomSplit( Array( 0.8, 0.2 ) )
StudentsSplit: Array[org.apache.spark.rdd.RDD[Array[String]]] = Array(PartitionwiseSampledRDD[46]
StudentsSplit( 0 ).collect
res26: Array[Array[String]] = Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(5, John, Watson, M, 38), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))
With replacement and fraction
Weights for each partition
➢ SortBy: Sort elements according to value
➢ Top: Get largest elements
Basic operations on RDDs
val StudentsSorted = StudentsF.sortBy( x => x( 4 ) )
StudentsSorted.collect
Array(Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), ...
val StudentsTop = StudentsF.map( s => s( 4 ) ).top( 3 )
StudentsTop: Array[String] = Array(38, 36, 33)
Value to sort by (a String here; use x( 4 ).toInt for numeric order)
k elements to select
➢ Union: Two RDDs into one
Basic operations on RDDs
val StudentsUnder25 = StudentsF.filter( s => s( 4 ).toInt < 25 )
val StudentsOver30 = StudentsF.filter( s => s( 4 ).toInt > 30 )
val StudentsUnion = StudentsOver30.union( StudentsUnder25 )
StudentsUnion: org.apache.spark.rdd.RDD[Array[String]] = UnionRDD[75]
StudentsUnion.collect
Array[Array[String]] = Array(Array(4, Sherlock, Holmes, M, 36), Array(5, John, Watson, M, 38), Array(7, Bruce, Wayne, M, 32), Array(8, Tony, Stark, M, 33), Array(1, John, Doe, M, 18), Array(2, Mary, Doe, F, 20), Array(6, Sarah, Kerrigan, F, 21), Array(9, Princess, Peach, F, 21), Array(10, Peter, Parker, M, 23))
➢ Intersection: Common elements in two RDDs
Basic operations on RDDs
val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsIntersect = StudentsUnder35.intersection( StudentsOver25 )
StudentsIntersect: org.apache.spark.rdd.RDD[Int] = MappedRDD[92]
StudentsIntersect.collect
res31: Array[Int] = Array(32, 33)
➢ Subtract: Elements in a RDD not in the other
Basic operations on RDDs
val StudentsUnder35 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a < 35 )
val StudentsOver25 = StudentsF.map( s => s( 4 ).toInt ).filter( a => a > 25 )
val StudentsSub = StudentsUnder35.subtract( StudentsOver25 )
StudentsSub: org.apache.spark.rdd.RDD[Int] = MappedRDD[12]
StudentsSub.collect
res0: Array[Int] = Array(18, 20, 21, 21, 23, 25)
➢ Tuples in Scala:
➢ Pair RDDs → RDDs with tuples (key, value)
Pair RDDs
val myTuple = ( 13, "Bob", "Squarepants", "M", 10 )
myTuple._1
res6: Int = 13
Tuple creation
Access fields
val PairStudents = StudentsF.map( s => ( s( 3 ), s ) )
PairStudents.take( 3 )
res8: Array[(String, Array[String])] = Array((M,Array(1, John, Doe, M, 18)), (F,Array(2, Mary, Doe, F, 20)), (F,Array(3, Lara, Croft, F, 25)))
➢ Join:
Operations on Pair RDDs
val PairStudentsId = StudentsF.map( s => ( s( 0 ), s ) )
val PairGrades = GradesF.map( g => ( g( 0 ), g ) )
val StudentGrades = PairStudentsId.join( PairGrades )
StudentGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Array[String]))]
StudentGrades.take( 3 )
res13: Array[(String, (Array[String], Array[String]))] = Array((4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Math, 2.3))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Biology, 6.7))), (4,(Array(4, Sherlock, Holmes, M, 36),Array(4, Engineering, 8.0))))
Prepare key, value structure
Output is (key, (value1, value2))
➢ Left Join:
Operations on Pair RDDs
val auxRDD = sc.parallelize( Array( Array( "0", "Dummy", "Student", "M", "10" ), Array( "1", "John", "Doe", "M", "18" ) ) )
val auxPairRDD = auxRDD.map( a => ( a( 0 ), a ) )
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades )
auxGrades: org.apache.spark.rdd.RDD[(String, (Array[String], Option[Array[String]]))] = FlatMappedValuesRDD[34]
auxGrades.take( 2 )
res23: Array[(String, (Array[String], Option[Array[String]]))] = Array((0,(Array(0, Dummy, Student, M, 10),None)), (1,(Array(1, John, Doe, M, 18),Some([Ljava.lang.String;@30d4fbf))))
Option = None or a value
String representation of the non-empty value in the Option
➢ Left Join (cont):
Operations on Pair RDDs
val auxGrades = auxPairRDD.leftOuterJoin( PairGrades ).map( p => ( p._1, ( p._2._1, if(!p._2._2.isEmpty) p._2._2.get ) ) )
auxGrades.take( 2 )
res26: Array[(String, (Array[String], Any))] = Array((0,(Array(0, Dummy, Student, M, 10),())), (1,(Array(1, John, Doe, M, 18),Array(1, Math, 5.6))))
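The `if( !opt.isEmpty ) opt.get` pattern above loses type information: the result type becomes `Any` (and `()` when the Option is empty). A cleaner alternative, shown here on plain `Option` values so it runs without Spark, is `getOrElse` with an explicit default; the same call works on the `Option[Array[String]]` side of a `leftOuterJoin`. The empty-array default is my choice.

```scala
object OptionSketch {
  def main(args: Array[String]): Unit = {
    val some: Option[Array[String]] = Some(Array("1", "Math", "5.6"))
    val none: Option[Array[String]] = None
    // getOrElse keeps the result typed as Array[String] instead of Any
    println(some.getOrElse(Array.empty[String]).mkString(","))
    println(none.getOrElse(Array.empty[String]).length)  // 0
  }
}
```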
➢ reduceByKey: To single object by key
Operations on Pair RDDs
val RedKeys = PairStudents.map({case (k,v) => ( k, v( 4 ).toInt ) }).reduceByKey( _ + _ )
RedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[84]
RedKeys.take( 2 )
res33: Array[(String, Int)] = Array((F,87), (M,180))
More than 1 line in anon function
Pattern matching
Result is a RDD
➢ foldByKey: To single object by key
Operations on Pair RDDs
val foldedKeys = PairStudents.map({case(k,v) => (k,v(4).toInt)}).foldByKey(0)((a,b) => Math.max(a,b))
foldedKeys: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[92]
foldedKeys.collect
res30: Array[(String, Int)] = Array((F,25), (M,38))
Left parameter
Function
➢ groupByKey: group values with same key
Operations on Pair RDDs
val groupedKeys = PairStudents.groupByKey
groupedKeys: org.apache.spark.rdd.RDD[(String, Iterable[Array[String]])] = ShuffledRDD[93]
groupedKeys.collect
res35: Array[(String, Iterable[Array[String]])] = Array((F,CompactBuffer([Ljava.lang.String;@31788c16, [Ljava.lang.String;@613511b9, [Ljava.lang.String;@631eba8a, [Ljava.lang.String;@7668ecdc)), (M,CompactBuffer([Ljava.lang.String;@62969c3f, [Ljava.lang.String;@dec1eaa, [Ljava.lang.String;@8d1320a, [Ljava.lang.String;@5e2c330b, [Ljava.lang.String;@27cb477a, [Ljava.lang.String;@12c1aeff)))
groupedKeys.map( { case (k,v) => ( k, v.map( x => "(" + x.mkString( "," ) + ")" ) ) } ).take( 2 )
res40: Array[(String, Iterable[String])] = Array((F,List((2,Mary,Doe,F,20), (3,Lara,Croft,F,25), (6,Sarah,Kerrigan,F,21), (9,Princess,Peach,F,21))), (M,List((1,John,Doe,M,18), (4,Sherlock,Holmes,M,36), (5,John,Watson,M,38), (7,Bruce,Wayne,M,32), (8,Tony,Stark,M,33), (10,Peter,Parker,M,23))))
String repr of Iterable[Array[String]]
➢ Really useful for AI and ML
➢ For loop example
var StudentsLoop = StudentsF.map( s => ( s( 0 ).toInt, s( 1 ), s( 2 ) ) )
for( i <- 1 to 10 ) {
  StudentsLoop = StudentsLoop.map( { case (id,name,surname) => ( id+1, name, surname ) } )
}
StudentsLoop.collect
res43: Array[(Int, String, String)] = Array((11,John,Doe), (12,Mary,Doe), (13,Lara,Croft), (14,Sherlock,Holmes), (15,John,Watson), (16,Sarah,Kerrigan), (17,Bruce,Wayne), (18,Tony,Stark), (19,Princess,Peach), (20,Peter,Parker))
Looping!
Non final variable
➢ Persist RDDs in memory/disk
➢ Other levels of persistence:
○ MEMORY_ONLY, MEMORY_AND_DISK, MEMORY_ONLY_SER, MEMORY_AND_DISK_SER, DISK_ONLY, etc.
Caching
PairStudents.cache
import org.apache.spark.storage.StorageLevel
GradesF.persist( StorageLevel.MEMORY_AND_DISK )
➢ Store to local or HDFS
Saving RDDs
PairStudents.map( x => ( x._1, "(" + x._2.mkString( "," ) + ")" ) ).saveAsTextFile( "file:///home/victor.sanchez/res" )
PairStudents.map( x => ( x._1, "(" + x._2.mkString( "," ) + ")" ) ).saveAsTextFile( "hdfs:///user/victor.sanchez/res" )
Trick to convert Array[String] properly to String
package es.upv.dsic.iarfid.haia
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext._
object mySparkScript {
def average(data: Iterable[Double]): Double = { data.reduceLeft( _ + _ )/data.size }
def main( args: Array[String] ) {
val sc = new SparkContext( ( new SparkConf() ).setAppName( "MY SPARK SCRIPT" ) )
val Grades = sc.textFile( args( 0 ) ).map( l => l.split( "\t", -1 ) ).map( g => ( g( 1 ), g( 2 ).toDouble ) )
val GradesGr = Grades.groupByKey.map( g => ( g._1, average( g._2 ) ) )
GradesGr.saveAsTextFile( args( 1 ) )
}
}
Scripting
➢ package line: the package
➢ import lines: imports
➢ average: support method
➢ object mySparkScript: singleton object
➢ main: main method
➢ args: program arguments
Compiling Spark code
➢ Scala code is compiled to Java bytecode
➢ sbt is a build tool for Scala and Java
➢ sbt can help us manage our dependencies
➢ A Spark cluster needs a fat jar → the sbt-assembly plugin can build one!
Spark project example
build.sbt : main .sbt file (Scala code that builds your source!)
lib/ : extra libraries
project/
  plugins.sbt : plugins needed by sbt to compile your source
  build.scala : how to compile your main .sbt
src/
  main/ : your project code
    resources/ : additional files for your jar
  test/ : test sources
target/ : output jar for your project
import AssemblyKeys._
assemblySettings
name := "haia"
version := "1.0"
scalaVersion := "2.10.4"
organization := "es.upv"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.1.0" % "provided"
)
jarName in assembly := {
name.value + ".jar"
}
outputPath in assembly := {
file( "target/" + (jarName in assembly).value )
}
Main sbt file example
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
case "unwanted.txt" => MergeStrategy.discard
case PathList( "META-INF", ".*pom.properties" ) => MergeStrategy.first
case x => old(x)
}
}
➢ Fat jar?
○ A jar with all of the jar files it depends on
○ Workers need all dependencies
○ The sbt-assembly plugin can generate fat jars
➢ Generating a fat jar:
sbt assembly
Generating a fat jar
spark-submit --class es.upv.dsic.iarfid.haia.mySparkScript --master yarn-cluster target/haia.jar hdfs:///user/victor.sanchez/grades.tsv hdfs:///user/victor.sanchez/spark_submit_ex
How to execute Spark code from jar
Singleton object to execute
Fat jar file
Program parameters
➢ Simulated annealing → Optimization method
➢ Multi-point → Exploring from different points
➢ Function to optimize:
Exercise: Multi-point simulated annealing
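A sketch of the structure (not a full solution): run independent annealing chains from several starting points and keep the best result. Everything below is an assumption — the objective f(x) = (x - 2)² to minimize (the real one is on the slide), the cooling schedule, and the neighbour step. In Spark, the local `startPoints.map(anneal).minBy(...)` would become `sc.parallelize( startPoints ).map( anneal ).reduce( (a,b) => if (a._2 < b._2) a else b )`.

```scala
import scala.util.Random

object MultiPointSA {
  // Hypothetical objective to minimize; the real one is given on the slide
  def f(x: Double): Double = (x - 2.0) * (x - 2.0)

  // One independent annealing chain from a given start point
  def anneal(start: Double): (Double, Double) = {
    val rnd = new Random(start.hashCode)
    var x = start
    var temp = 1.0
    while (temp > 1e-3) {
      val candidate = x + (rnd.nextDouble() - 0.5)  // random neighbour
      val delta = f(candidate) - f(x)
      // Accept improvements always; worse moves with probability exp(-delta/temp)
      if (delta < 0 || rnd.nextDouble() < math.exp(-delta / temp)) x = candidate
      temp *= 0.99                                  // geometric cooling
    }
    (x, f(x))
  }

  def main(args: Array[String]): Unit = {
    val startPoints = Array(-10.0, 0.0, 5.0, 10.0)
    // In Spark: sc.parallelize(startPoints).map(anneal).reduce(...)
    val best = startPoints.map(anneal).minBy(_._2)
    println(f"best x = ${best._1}%.3f, f = ${best._2}%.5f")
  }
}
```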
Single point simulated annealing
Time to work!
➢ Spark is still a relatively young technology
➢ Unexpected out-of-memory exceptions
➢ Memory issues are difficult to debug in Spark
➢ Avoid out-of-memory scenarios:
○ Use object serialization (Java or Kryo)
○ Choose data structures wisely
○ Increase parallelism (spark.default.parallelism)
○ Avoid groupBy operations → prefer reduceBy
○ More memory for shuffles (spark.shuffle.spill=false or a higher spark.shuffle.memoryFraction)
A final advice on Spark
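For illustration, the tuning knobs above can be passed to spark-submit via --conf. The values below are placeholders, not recommendations; tune them for your own cluster and Spark 1.x version.

```shell
# Hypothetical example values; tune for your own cluster
spark-submit \
  --class es.upv.dsic.iarfid.haia.mySparkScript \
  --master yarn-cluster \
  --conf spark.default.parallelism=200 \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.shuffle.memoryFraction=0.4 \
  target/haia.jar <input> <output>
```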
Hadoop ecosystem vs Spark
Hadoop ecosystem:
➢ Disk-based parallelization
➢ No looping
➢ More mature project
➢ Many organizations use it
Spark:
➢ Memory-based parallelization
➢ Looping (nice for AI and ML)
➢ Still taking its initial steps
➢ Changing all Hadoop code has a cost
Extra information
➢ http://spark.apache.org/
➢ Learning Spark: Lightning-Fast Big Data Analysis. Holden Karau et al. Ed. O’Reilly
➢ StackOverflow