Scala and Big Data in ICM: Scoobi, Scalding, Spark, Stratosphere (Scalar 2014)


DESCRIPTION

The Tale of the Glorious Lambdas & the Were-Clusterz

TRANSCRIPT

THE TALE OF THE GLORIOUS LAMBDAS

& THE WERE-CLUSTERZ

Mateusz Fedoryszak, matfed@icm.edu.pl
Michał Oniszczuk, micon@icm.edu.pl


More than the weather forecast.

MUCH MORE…

WE SPY ON SCIENTISTS

RAW DATA

COMMON MAP OF ACADEMIA

HADOOP: How to read millions of papers?

IN ONE PICTURE: MapReduce

WORD COUNT IS THE NEW HELLO WORLD

WORD COUNT IN VANILLA MAP-REDUCE

package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }
}

WHAT SHOULD A WORD COUNT LOOK LIKE?

val lines = List("ala ma kota", "kot ma ale")

val words = lines.flatMap(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.length))

counts.foreach(println)

SCOOBI, SCALDING: MapReduce the right way, with lambdas.

WORD COUNT IN PURE SCALA

val lines = List("ala ma kota", "kot ma ale")

val words = lines.flatMap(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.size))

counts.foreach(println)

WORD COUNT IN SCOOBI

val lines = fromTextFile("hdfs://in/...")

val words = lines.mapFlatten(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.size))

counts
  .toTextFile("hdfs://out/...", overwrite = true)
  .persist()

BEHIND THE SCENES

val lines = fromTextFile("hdfs://in/...")
val words = lines.mapFlatten(_.split(" "))
val groups = words.groupBy(identity)
val counts = groups.map(x => (x._1, x._2.length))

counts
  .toTextFile("hdfs://out/...", overwrite = true)
  .persist()

[Diagram: the flatMap, groupBy and map steps above are compiled into a pipeline of MapReduce stages (map, reduce, map, reduce, map, reduce)]

SCOOBI SNACKS

– Joins, group-by, etc. baked in
– Static type checking with custom data types and IO
– One lang to rule them all (and it's THE lang)
– Easy local testing
– REPL

WHICH ONE IS THE FRAMEWORK?

Scoobi                  Scalding
Pure Scala              Cascading wrapper
Developed by NICTA      Developed by Twitter
Strongly typed API      Field-based and strongly typed API

Has cooler logo
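For comparison, a word count in Scalding's typed API might look like the sketch below. This is not from the original slides; the WordCountJob name and the input/output arguments are hypothetical.

import com.twitter.scalding._

// Hypothetical job: run with the usual Scalding Tool runner, passing --input and --output.
class WordCountJob(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))              // read lines from a text file
    .flatMap(_.split("\\s+"))                          // split each line into words
    .map(word => (word, 1L))                           // pair every word with a count of 1
    .sumByKey                                          // group by word and sum the counts
    .write(TypedTsv[(String, Long)](args("output")))   // write (word, count) pairs as TSV
}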

THE NEW BIG DATA ZOO

Most slides are by Matei Zaharia from the Spark team

SPARK IDEA

MAPREDUCE PROBLEMS…

[Diagram: with MapReduce, every iteration of an iterative job reads its input from HDFS and writes its result back to HDFS, and every interactive query (query 1, 2, 3, ...) re-reads the same input from HDFS before producing its result]

… SOLVED WITH SPARK

[Diagram: with Spark, the input is read from HDFS once and kept in distributed memory; iterations and queries (query 1, 2, 3, ...) then run against memory instead of re-reading HDFS]

RESILIENT DISTRIBUTED DATASETS (RDDS)

Restricted form of distributed shared memory
» Partitioned data
» Higher-level operations (map, filter, join, …)
» No side effects

Efficient fault recovery using lineage
» List of operations
» Recompute lost partitions on failure
» No cost if nothing fails
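A minimal sketch of what lineage looks like in practice, assuming the Spark shell (where the SparkContext sc is predefined); the variable names are ours, not from the slides:

// Build an RDD through a few transformations; nothing runs yet.
val nums    = sc.parallelize(1 to 1000000, numSlices = 8)  // partitioned data
val evens   = nums.filter(_ % 2 == 0)
val squares = evens.map(n => n.toLong * n)

// The lineage (parallelize -> filter -> map) is what Spark keeps around;
// lost partitions are recomputed from it on failure.
println(squares.toDebugString)

// Only an action triggers the actual distributed computation.
println(squares.count())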

API

Scala, Python, Java

+ REPL

map, reduce, filter, groupBy, join, …
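To tie this API back to the earlier examples, a word count in Spark might look like the sketch below (ours, not from the slides; it assumes the Spark shell with sc predefined, and the HDFS paths are placeholders in the same style as the slides):

val lines  = sc.textFile("hdfs://in/...")                 // placeholder input path
val words  = lines.flatMap(_.split(" "))
val counts = words.map(word => (word, 1)).reduceByKey(_ + _)

counts.saveAsTextFile("hdfs://out/...")                   // placeholder output path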

SPARK EXAMPLES

EXAMPLE: LOG MINING

Load error messages from a log into memory, then interactively search for various patterns

lines = spark.textFile("hdfs://...")
errors = lines.filter(_.startsWith("ERROR"))
messages = errors.map(_.split('\t')(2))
cachedMsgs = messages.cache()

cachedMsgs.filter(_.contains("foo")).count
cachedMsgs.filter(_.contains("bar")).count

[Diagram, built up over several slides: the master ships tasks to workers holding HDFS blocks 1, 2 and 3; each worker caches its partition of the messages (Cache 1, 2, 3) and sends results back, so subsequent queries run against the in-memory caches rather than HDFS]

1 TB of data in 5-7 sec (vs 170 sec for on-disk data)
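The slide code is Spark-shell style; a self-contained version might look like the sketch below. The LogMining object and the local master setting are our assumptions, the log path is elided as on the slides, and the transformations follow the slide.

import org.apache.spark.{SparkConf, SparkContext}

object LogMining {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("log-mining").setMaster("local[*]"))

    val lines      = sc.textFile("hdfs://...")              // log path elided on the slide
    val errors     = lines.filter(_.startsWith("ERROR"))    // keep only error lines
    val messages   = errors.map(_.split('\t')(2))           // third tab-separated field
    val cachedMsgs = messages.cache()                        // keep the messages in cluster memory

    // Interactive-style queries now hit the cached data instead of HDFS.
    println(cachedMsgs.filter(_.contains("foo")).count())
    println(cachedMsgs.filter(_.contains("bar")).count())

    sc.stop()
  }
}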

PAGERANK PERFORMANCE

[Chart: time per iteration (s); Hadoop 170.75, Spark 23.01]
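PageRank is the iterative case from the earlier "MapReduce problems" slide: the link table is reused every iteration, so caching it in memory is what removes the per-iteration HDFS round trip. A standard Spark-style sketch (ours, not from the slides; it assumes the Spark shell with sc predefined, and the paths are elided as on the slides):

// Adjacency list: each input line is "sourceUrl destUrl".
val links = sc.textFile("hdfs://...")
  .map { line => val p = line.split("\\s+"); (p(0), p(1)) }
  .groupByKey()
  .cache()                                    // reused every iteration, so keep it in memory

var ranks = links.mapValues(_ => 1.0)

for (_ <- 1 to 10) {
  val contribs = links.join(ranks).values.flatMap { case (urls, rank) =>
    urls.map(url => (url, rank / urls.size))  // spread each page's rank over its out-links
  }
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}

ranks.saveAsTextFile("hdfs://...")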

SPARK LIBRARIES

SPARK’S ZOO

– Spark
– Spark Streaming (real-time)
– GraphX (graph)
– Shark (SQL)
– MLlib (machine learning)
– BlinkDB

ALL IN ONE

val points = sc.runSql[Double, Double](
  "select latitude, longitude from historic_tweets")

val model = KMeans.train(points, 10)

sc.twitterStream(...)
  .map(t => (model.closestCenter(t.location), 1))
  .reduceByWindow("5s", _ + _)

SPARK CONCLUSION

• In memory processing

• Libraries

• Increasingly popular

USEFUL LINKS

• spark.apache.org

• spark-summit.org: videos & online hands-on tutorials

STRATOSPHERE

Like Spark, but less popular and less mature.

CONCLUSION

• Big data today is where RDBMSs were in the '80s

• Scala goes well with big data

THANK YOU!

Q&A

SCALABILITY

[Charts: iteration time (s) vs number of machines (25, 50, 100) for Logistic Regression and K-Means, comparing Hadoop, HadoopBinMem and Spark]

INSUFFICIENT RAM

[Chart: iteration time (s) vs fraction of the working set in memory (0, 0.25, 0.5, 0.75, 1); roughly 68.8, 58.1, 40.7, 29.7 and 11.5 s respectively]

PERFORMANCE

[Charts: SQL response time (s) for Hive, Impala (disk), Impala (mem), Shark (disk) and Shark (mem); graph-processing response time (min) for Hadoop, Giraph and GraphX; streaming throughput (MB/s/node) for Storm and Spark Streaming]
