CloudCamp Chicago lightning talk "Spark: A Quick Ignition" - Matthew Kemp, architect...

Upload: cloudcamp-chicago

Posted on 27-Jul-2015

Spark: A Quick Ignition
Matthew Kemp

What is Spark?

Provides distributed processing
Main unit of abstraction is the RDD
Can be used with cluster managers like Mesos or YARN
Supports Java, Python and Scala
https://spark.apache.org/

Resilient Distributed Dataset

Can be created from:
  Files or HDFS
  In-memory iterable
  Cassandra or SQL tables

Transformations
  Lazily create a new RDD from an existing one

Actions
  Usually return a value and force computation of the RDD
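The lazy/eager split can be sketched in plain Python without a cluster. This is only an analogy: `LazyDataset` is a hypothetical toy class, not part of Spark, and it ignores distribution and fault tolerance entirely. It shows that chaining transformations does no work until an action is called.

```python
class LazyDataset:
    """Hypothetical toy stand-in for an RDD (not Spark's implementation)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection, e.g. a list
        self._steps = steps    # recorded transformations: iterator -> iterator

    # Transformations: record the step and return a new dataset; no work yet.
    def map(self, fn):
        step = lambda it: (fn(x) for x in it)
        return LazyDataset(self._source, self._steps + (step,))

    def filter(self, pred):
        step = lambda it: (x for x in it if pred(x))
        return LazyDataset(self._source, self._steps + (step,))

    def _run(self):
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return it

    # Actions: run the recorded steps and return a value.
    def collect(self):
        return list(self._run())

    def count(self):
        return sum(1 for _ in self._run())


calls = []  # records every element the map function actually sees

def traced_square(x):
    calls.append(x)
    return x * x

squares = LazyDataset([1, 2, 3, 4]).map(traced_square).filter(lambda x: x % 2 == 0)
calls_before_action = len(calls)  # still 0: nothing has been computed
result = squares.collect()        # the action forces the computation
```

After `collect()` returns, `calls` holds every input element, which is the point at which the mapped function was actually applied.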

Transformations

Some examples: filter, map, flatMap, distinct, union, intersection, join, reduceByKey
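For readers new to these names, here is a rough plain-Python picture of what three of them compute. It is a local analogy only: `reduce_by_key` is a hypothetical helper written for this sketch, and the two input lines are made-up sample data.

```python
from itertools import chain

lines = ["to be or", "not to be"]  # made-up sample input

# map: exactly one output element per input element (here, one list per line).
mapped = [line.split() for line in lines]

# flatMap: each input may produce several outputs, flattened into one sequence.
flat_mapped = list(chain.from_iterable(line.split() for line in lines))

def reduce_by_key(pairs, fn):
    """Hypothetical helper: combine all values that share a key using fn."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc

# reduceByKey-style aggregation over (word, 1) pairs.
counts = reduce_by_key(((word, 1) for word in flat_mapped), lambda a, b: a + b)
```

The difference between `mapped` (a list of lists) and `flat_mapped` (a flat list of words) is why the word count example uses flatMap for the split step.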

Actions

Some examples: reduce, collect, take, count, foreach, saveAsTextFile

Example: Word Count

input -> map() -> flatMap() -> reduceByKey() -> map() -> output

Example: Word Count

#!/bin/python
import re
import string

# Matches any single punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    # Clean each line, emit (word, 1) pairs, sum the counts per word,
    # then write one "word,count" line per word.
    sc.textFile(in_file_name) \
        .map(lambda line: regex.sub(' ', line).strip().lower()) \
        .flatMap(lambda line: [(word, 1) for word in line.split()]) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda pair: '%s,%s' % pair) \
        .saveAsTextFile(out_file_name)

Example: Alternate Word Count

#!/bin/python
import re
import string

# Matches any single punctuation character.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def word_count(sc, in_file_name, out_file_name):
    # Same result as the previous version, with one small transformation per step.
    sc.textFile(in_file_name) \
        .map(lambda line: regex.sub(' ', line)) \
        .map(lambda line: line.strip()) \
        .map(lambda line: line.lower()) \
        .flatMap(lambda line: line.split()) \
        .map(lambda word: (word, 1)) \
        .reduceByKey(lambda a, b: a + b) \
        .map(lambda pair: '%s,%s' % pair) \
        .saveAsTextFile(out_file_name)
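To sanity-check either version on a small file without starting Spark, the same cleaning and counting steps can be run locally. This is a sketch, not part of the talk: `local_word_count` and `to_csv_lines` are hypothetical names, the sample lines are made up, and the output is sorted only to make it easy to compare.

```python
import re
import string
from collections import Counter

# Same punctuation pattern as the Spark examples.
regex = re.compile('[%s]' % re.escape(string.punctuation))

def local_word_count(lines):
    """Hypothetical local equivalent of the Spark pipeline's counting steps."""
    counts = Counter()
    for line in lines:
        cleaned = regex.sub(' ', line).strip().lower()  # the map() steps
        counts.update(cleaned.split())                  # flatMap() + reduceByKey()
    return counts

def to_csv_lines(counts):
    """Format counts as 'word,count' lines, sorted by word."""
    return ['%s,%s' % (word, n) for word, n in sorted(counts.items())]

sample = ["An arm, an ARM!", "All are able."]  # made-up sample input
csv_lines = to_csv_lines(local_word_count(sample))
```

Passing an open file object instead of `sample` works the same way, since the function only iterates over lines.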

Running the Example

$ pyspark
...
Using Python version 2.7.2 (default)
SparkContext available as sc.
>>> from word_count import word_count
>>> word_count(sc, 'text.txt', 'text_counts')

The Results From Spark

a,23
able,1
about,6
above,1
accept,1
accuse,1
ago,2
alarm,2
all,7
although,1
always,2
an,1
and,26
anger,1
another,1
any,2
anyone,1
arches,1
are,1
arm,1
armour,1
as,7
assistant,2
...

A (Bad) Shell Version

#!/bin/bash
text=$(cat ${1} | tr "[:punct:]" " " | \
    tr "[:upper:]" "[:lower:]")
parsed=(${text})
for w in ${parsed[@]}; do echo ${w}; done | sort | uniq -c

The Results From the Shell

23 a
1 able
6 about
1 above
1 accept
1 accuse
2 ago
2 alarm
7 all
1 although
2 always
1 an
26 and
1 anger
1 another
2 any
1 anyone
1 arches
1 are
1 arm
1 armour
7 as
2 assistant
...
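The two outputs carry the same counts in different layouts ("word,count" from Spark, "count word" from `uniq -c`). A small sketch for checking that they agree is below; the parser names are hypothetical, the three rows are copied from the results above, and the leading padding on the shell lines is an assumption about how `uniq -c` pads its counts.

```python
def parse_spark_line(line):
    """Parse a 'word,count' line as written by the Spark example."""
    word, count = line.rsplit(',', 1)
    return word, int(count)

def parse_uniq_line(line):
    """Parse a 'count word' line as printed by `uniq -c` (leading spaces allowed)."""
    count, word = line.split(None, 1)
    return word.strip(), int(count)

# First three rows from each result listing.
spark_lines = ['a,23', 'able,1', 'about,6']
shell_lines = ['     23 a', '      1 able', '      6 about']

spark_counts = dict(parse_spark_line(l) for l in spark_lines)
shell_counts = dict(parse_uniq_line(l) for l in shell_lines)

# Words whose counts differ, or that appear in only one of the outputs.
mismatches = {
    word
    for word in set(spark_counts) | set(shell_counts)
    if spark_counts.get(word) != shell_counts.get(word)
}
```

An empty `mismatches` set means both versions agree on every word in the compared rows.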

Our Use Case

[Diagram: two 3rd party inputs, each with a distinct() step; two join() steps; a 1st party input; then union(), distinct() and foreach()]

Questions?

Contact [email protected]

@mattkemp

/in/matthewkemp