Introduction to Spark with Scala


Page 1: Introduction to Spark with Scala

Introduction to Spark with Scala

Himanshu Gupta
Software Consultant
Knoldus Software LLP

Page 2: Introduction to Spark with Scala

Who am I?

Himanshu Gupta (@himanshug735)

Software Consultant at Knoldus Software LLP

Spark & Scala enthusiast

Page 3: Introduction to Spark with Scala

Agenda

● What is Spark?

● Why do we need Spark?

● Brief introduction to RDD

● Brief introduction to Spark Streaming

● How to install Spark?

● Demo

Page 4: Introduction to Spark with Scala

What is Apache Spark?

A fast and general engine for large-scale data processing, with libraries for SQL, streaming, and advanced analytics.

Page 5: Introduction to Spark with Scala

Spark History

2009: Project begins at UC Berkeley's AMPLab

2010: Open sourced

2013: Enters the Apache Incubator; Cloudera support; Spark Summit 2013

2014: Becomes an Apache top-level project; Spark Summit 2014

2015: DataFrames

Page 6: Introduction to Spark with Scala

Spark Stack

Img src - http://spark.apache.org/

Page 7: Introduction to Spark with Scala

Fastest Growing Open Source Project

Img src - https://databricks.com/blog/2015/03/31/spark-turns-five-years-old.html

Page 8: Introduction to Spark with Scala

Agenda

● What is Spark?

● Why do we need Spark?

● Brief introduction to RDD

● Brief introduction to Spark Streaming

● How to install Spark?

● Demo

Page 9: Introduction to Spark with Scala

Code Size

Img src - http://spark-summit.org/wp-content/uploads/2013/10/Zaharia-spark-summit-2013-matei.pdf

Page 10: Introduction to Spark with Scala

Word Count Example

Hadoop MapReduce (Java):

public static class WordCountMapClass extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    String line = value.toString();
    StringTokenizer itr = new StringTokenizer(line);
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      output.collect(word, one);
    }
  }
}

public static class WordCountReduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {

  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}

The same word count in Spark (Scala):

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Page 11: Introduction to Spark with Scala

Daytona GraySort Record

Data to sort: 100 TB

Hadoop (2013): 2100 nodes, 72 minutes

Spark (2014): 206 nodes, 23 minutes

Img src - http://www.slideshare.net/databricks/new-directions-for-apache-spark-in-2015

Page 12: Introduction to Spark with Scala

Runs Everywhere

Img src - http://spark.apache.org/

Page 13: Introduction to Spark with Scala

Who is using Apache Spark?

Img src - http://www.slideshare.net/datamantra/introduction-to-apache-spark-45062010

Page 14: Introduction to Spark with Scala

Agenda

● What is Spark?

● Why do we need Spark?

● Brief introduction to RDD

● Brief introduction to Spark Streaming

● How to install Spark?

● Demo

Page 15: Introduction to Spark with Scala

Brief Introduction to RDD

RDD stands for Resilient Distributed Dataset: a fault-tolerant, distributed collection of objects.

In Spark, all work is expressed in one of three ways:
1) Creating new RDD(s)
2) Transforming existing RDD(s)
3) Calling operations (actions) on RDD(s) to compute a result

Page 16: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)

This is the Spark configuration

Page 17: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)

This is the Spark Context

Contd...


Page 19: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")

Extract lines from the text file

Contd...

Page 20: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_,1))

Map lines to (word, 1) pairs

Contd...

Page 21: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_,1))
val wordCountRDD = words.reduceByKey(_ + _)

Reduce by key to build the word-count RDD

Contd...

Page 22: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_,1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

collect returns the (word, count) pairs to the driver and starts the computation

Contd...

Page 23: Introduction to Spark with Scala

Example (RDD)

val master = "local"
val conf = new SparkConf().setMaster(master)
val sc = new SparkContext(conf)
val lines = sc.textFile("demo.txt")
val words = lines.flatMap(_.split(" ")).map((_,1))
val wordCountRDD = words.reduceByKey(_ + _)
val wordCount = wordCountRDD.collect

flatMap, map, and reduceByKey are transformations (lazy); collect is an action (triggers the computation)
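Assembled from the fragments above, a complete, minimal sketch of the word count as a standalone application could look like this (the WordCount object name, the setAppName value, and the println loop are assumptions, not from the slides; Spark refuses to create a SparkContext without an application name):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local")
      .setAppName("WordCount") // required before a SparkContext can be created

    val sc = new SparkContext(conf)

    val lines = sc.textFile("demo.txt")
    val words = lines.flatMap(_.split(" ")).map((_, 1)) // transformation: lazy
    val wordCountRDD = words.reduceByKey(_ + _)         // transformation: still lazy
    val wordCount = wordCountRDD.collect()              // action: runs the job

    wordCount.foreach { case (word, count) => println(s"$word: $count") }
    sc.stop()
  }
}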

Page 24: Introduction to Spark with Scala

Agenda

● What is Spark?

● Why do we need Spark?

● Brief introduction to RDD

● Brief introduction to Spark Streaming

● How to install Spark?

● Demo

Page 25: Introduction to Spark with Scala

Brief Introduction to Spark Streaming

Img src - http://spark.apache.org/

Page 26: Introduction to Spark with Scala

How Spark Streaming Works

Img src - http://spark.apache.org/

Page 27: Introduction to Spark with Scala

Why Do We Need Spark Streaming?

High Level API:

TwitterUtils.createStream(...)
  .filter(_.getText.contains("Spark"))
  .countByWindow(Seconds(10), Seconds(5)) // counting tweets in a sliding window

Fault Tolerant:

Integration: integrated with Spark SQL, MLlib, GraphX...

Img src - http://spark.apache.org/
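For readers without the Twitter connector (TwitterUtils lives in the separate spark-streaming-twitter module), here is a self-contained sketch of the same sliding-window idea over a plain socket stream; the socket source, app name, and checkpoint path are assumptions for illustration:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WindowDemo {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread for the receiver, one for processing
    val conf = new SparkConf().setMaster("local[2]").setAppName("WindowDemo")
    // 5s batch interval; window and slide durations must be multiples of it
    val ssc = new StreamingContext(conf, Seconds(5))
    ssc.checkpoint("checkpoint") // windowed counting keeps running state, so checkpointing is required

    ssc.socketTextStream("localhost", 9999)
       .filter(_.contains("Spark"))
       .countByWindow(Seconds(10), Seconds(5)) // 10s window, recomputed every 5s
       .print()

    ssc.start()
    ssc.awaitTermination()
  }
}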

Page 28: Introduction to Spark with Scala

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)

Specify the Spark configuration

Page 29: Introduction to Spark with Scala

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))

Set up the StreamingContext

Contd...

Page 30: Introduction to Spark with Scala

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)

This is the ReceiverInputDStream

[Diagram: the lines DStream as a sequence of RDDs, one per batch interval (time 0-1, 1-2, 2-3, 3-4)]

Contd...

Page 31: Introduction to Spark with Scala

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))

Creates a DStream (a sequence of RDDs)

[Diagram: each RDD in the lines DStream is mapped to an RDD in the words/pairs DStream]

Contd...

Page 32: Introduction to Spark with Scala

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

Groups the DStream by words

[Diagram: lines DStream → map → words/pairs DStream → groupBy → wordCount DStream, batch by batch]

Contd...

Page 33: Introduction to Spark with Scala

Example (Spark Streaming)

val master = "local"
val conf = new SparkConf().setMaster(master)
val ssc = new StreamingContext(conf, Seconds(10))
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" ")).map((_, 1))
val wordCounts = words.reduceByKey(_ + _)

ssc.start()

Start streaming & computation
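Put together as a runnable application, the example looks like the sketch below. The print() output step, awaitTermination(), the local[2] master, and the object/app names are additions the slides stop short of; with a single-threaded local master the receiver would occupy the only core and no batches would ever be processed:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingWordCount {
  def main(args: Array[String]): Unit = {
    // local[2]: one thread receives from the socket, the other processes batches
    val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))

    val lines = ssc.socketTextStream("localhost", 9999)
    val words = lines.flatMap(_.split(" ")).map((_, 1))
    val wordCounts = words.reduceByKey(_ + _)

    wordCounts.print()     // a DStream needs at least one output operation
    ssc.start()            // start receiving and processing
    ssc.awaitTermination() // keep the application alive
  }
}

To feed it data, run nc -lk 9999 in another terminal and type some words.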

Page 34: Introduction to Spark with Scala

Agenda

● What is Spark?

● Why do we need Spark?

● Brief introduction to RDD

● Brief introduction to Spark Streaming

● How to install Spark?

● Demo

Page 35: Introduction to Spark with Scala

How to Install Spark?

Download Spark from:

http://spark.apache.org/downloads.html

Extract it to a suitable directory.

If you downloaded the source package, go to the directory via a terminal and build it with the following command:

mvn -DskipTests clean package

Now Spark is ready to run in interactive mode:

./bin/spark-shell
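As a quick sanity check inside the shell (a sketch: spark-shell pre-creates the SparkContext as sc, and README.md is just a file that ships with the Spark distribution):

scala> val lines = sc.textFile("README.md")
scala> lines.filter(_.contains("Spark")).count()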

Page 36: Introduction to Spark with Scala

sbt Setup

name := "Spark Demo"

version := "1.0"

scalaVersion := "2.10.5"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"      % "1.2.1",
  "org.apache.spark" %% "spark-streaming" % "1.2.1",
  "org.apache.spark" %% "spark-sql"       % "1.2.1",
  "org.apache.spark" %% "spark-mllib"     % "1.2.1"
)
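With this build file in place, a typical workflow is to package the project and hand the jar to spark-submit (a sketch: the WordCount class is the example from the RDD section, and the jar path is what sbt derives from the name and version above; adjust both to your project):

sbt package
./bin/spark-submit \
  --class WordCount \
  --master local \
  target/scala-2.10/spark-demo_2.10-1.0.jar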

Page 37: Introduction to Spark with Scala

Agenda

● What is Spark?

● Why do we need Spark?

● Brief introduction to RDD

● Brief introduction to Spark Streaming

● How to install Spark?

● Demo

Page 38: Introduction to Spark with Scala

Demo

Page 39: Introduction to Spark with Scala

Download Code

https://github.com/knoldus/spark-scala

Page 40: Introduction to Spark with Scala

References

http://spark.apache.org/

http://spark-summit.org/2014

http://spark.apache.org/docs/latest/quick-start.html

http://stackoverflow.com/questions/tagged/apache-spark

https://www.youtube.com/results?search_query=apache+spark

http://apache-spark-user-list.1001560.n3.nabble.com/

http://www.slideshare.net/paulszulc/apache-spark-101-in-50-min

Page 41: Introduction to Spark with Scala

Presenter:
[email protected]
@himanshug735

Organizer:
@Knolspeak
http://www.knoldus.com
http://blog.knoldus.com

Thanks