Transcript
Page 1: Big Data Analytics with Scala at SCALA.IO 2013

Big Data Analytics with Scala

Sam BESSALAH @samklr

Page 2: Big Data Analytics with Scala at SCALA.IO 2013

What is Big Data Analytics?

It’s about doing aggregations and running complex models on large datasets, offline, in real time, or both.

Page 3: Big Data Analytics with Scala at SCALA.IO 2013

Lambda Architecture

A blueprint for a Big Data analytics architecture.

Page 4: Big Data Analytics with Scala at SCALA.IO 2013
Page 5: Big Data Analytics with Scala at SCALA.IO 2013
Page 6: Big Data Analytics with Scala at SCALA.IO 2013
Page 7: Big Data Analytics with Scala at SCALA.IO 2013
Page 8: Big Data Analytics with Scala at SCALA.IO 2013

Map Reduce redux

map : (K, V) → List[(Km, Vm)]
in Scala : T => List[(K, V)]

reduce : (Km, List[Vm]) → List[(Kr, Vr)]
in Scala : (K, List[V]) => List[(K, V)]
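To make these shapes concrete, here is a minimal local sketch of word count written against the two signatures above, using plain Scala collections in place of a real MapReduce runtime:

// map emits (word, 1) for each word; reduce sums the values per key.
def mapper(line: String): List[(String, Int)] =
  line.split("\\s+").toList.map(word => (word, 1))

def reducer(key: String, values: List[Int]): List[(String, Int)] =
  List((key, values.sum))

val lines = List("hello world", "hello scala")
val counts = lines
  .flatMap(mapper)                                        // map phase
  .groupBy(_._1)                                          // shuffle: group values by key
  .toList
  .flatMap { case (k, kvs) => reducer(k, kvs.map(_._2)) } // reduce phase
// counts contains ("hello", 2), ("world", 1), ("scala", 1)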

Page 9: Big Data Analytics with Scala at SCALA.IO 2013
Page 10: Big Data Analytics with Scala at SCALA.IO 2013

Big Data "Hello World" : Word Count

Page 11: Big Data Analytics with Scala at SCALA.IO 2013

Enter Cascading

Page 12: Big Data Analytics with Scala at SCALA.IO 2013
Page 13: Big Data Analytics with Scala at SCALA.IO 2013

Word Count Redux

(Flat)Map-Reduce

Page 14: Big Data Analytics with Scala at SCALA.IO 2013

SCALDING

class WordCount(args: Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line: String => line.split("\\s+") }
    .groupBy('word) { group => group.size }
    .write(Tsv(args("output")))
}

Page 15: Big Data Analytics with Scala at SCALA.IO 2013

SCALDING : Clustering with Mahout

// Streaming k-means from Mahout, driven by a Scalding pipe.
lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt,
  (10e-6).asInstanceOf[Float])

var count = 0 // 'var', since it is mutated below

val sloppyClusters =
  TextLine(args("input"))
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
      cl.cluster(cent); cl
    }
    .flatMap(c => c.iterator.asScala.toIterable)

Page 16: Big Data Analytics with Scala at SCALA.IO 2013

SCALDING : Clustering with Mahout

val finalClusters = sloppyClusters.groupAll
  .mapValueStream { centList =>
    lazy val bclusterer = new BallKMeans(
      new BruteSearch(new EuclideanDistanceMeasure),
      args("numclusters").toInt, 100)
    bclusterer.cluster(centList.toList.asJava)
    bclusterer.iterator.asScala
  }
  .values

Page 17: Big Data Analytics with Scala at SCALA.IO 2013

Scalding

- Two APIs: the Fields-based API and the Typed API
- Fields API: project, map, discard, groupBy, …
- Typed API: TypedPipe[T], works like scala.collection.Iterator[T] (a minimal sketch follows below)
- Matrix library
- ALGEBIRD: abstract algebra library … we’ll talk about it later
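For comparison with the Fields-based word count earlier, here is a minimal sketch of the same job against the Typed API (TypedPipe and TypedTsv are part of scalding-core):

import com.twitter.scalding._

class TypedWordCount(args: Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))   // TypedPipe[String], one line each
    .flatMap(_.split("\\s+"))
    .groupBy(identity)                      // group equal words together
    .size                                   // (word, count) pairs
    .write(TypedTsv[(String, Long)](args("output")))
}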

Page 18: Big Data Analytics with Scala at SCALA.IO 2013
Page 19: Big Data Analytics with Scala at SCALA.IO 2013

STORM

Page 20: Big Data Analytics with Scala at SCALA.IO 2013

- Distributed, fault-tolerant, real-time stream computation engine.
- Four concepts:
- Streams: infinite sequences of tuples
- Spouts: sources of streams
- Bolts: process and produce streams. Can do filtering, aggregations, joins, …
- Topologies: define a flow or network of spouts and bolts (a minimal wiring sketch follows)
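A minimal wiring sketch in Scala against the 2013-era backtype.storm API; SentenceSpout, SplitSentence and WordCountBolt are hypothetical components standing in for your own spout and bolts:

import backtype.storm.{Config, LocalCluster}
import backtype.storm.topology.TopologyBuilder
import backtype.storm.tuple.Fields

val builder = new TopologyBuilder
builder.setSpout("sentences", new SentenceSpout, 2)  // spout: source of the stream
builder.setBolt("split", new SplitSentence, 4)       // bolt: emits one word per tuple
  .shuffleGrouping("sentences")
builder.setBolt("count", new WordCountBolt, 4)       // bolt: keeps running counts
  .fieldsGrouping("split", new Fields("word"))       // same word -> same task

new LocalCluster().submitTopology("wordcount", new Config, builder.createTopology())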

Page 21: Big Data Analytics with Scala at SCALA.IO 2013
Page 22: Big Data Analytics with Scala at SCALA.IO 2013

Streaming Word Count

Page 23: Big Data Analytics with Scala at SCALA.IO 2013

Trident

TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
    .each(new Fields("sentence"), new Split(), new Fields("word"))
    .groupBy(new Fields("word"))
    .persistentAggregate(new Factory(), new Count(), new Fields("count"))
    .parallelismHint(6);

Page 24: Big Data Analytics with Scala at SCALA.IO 2013

ScalaStorm by Evan Chan

class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t: Tuple) = t matchSeq {
    case Seq(line: String) =>
      line.split(" ").foreach { word => using anchor t emit (word) }
      t ack
  }
}

Page 25: Big Data Analytics with Scala at SCALA.IO 2013
Page 26: Big Data Analytics with Scala at SCALA.IO 2013

SummingBird

Write your job once and run it on both Storm and Hadoop.

Page 27: Big Data Analytics with Scala at SCALA.IO 2013

def wordCount[P <: Platform[P]](
    source: Producer[P, String],
    store: P#Store[String, Long]) =
  source
    .flatMap { line => line.split("\\s+").map(_ -> 1L) }
    .sumByKey(store)

Page 28: Big Data Analytics with Scala at SCALA.IO 2013

SummingBird

trait Platform[P <: Platform[P]] {
  type Source[+T]
  type Store[-K, V]
  type Sink[-T]
  type Service[-K, +V]
  type Plan[T]
}

Page 29: Big Data Analytics with Scala at SCALA.IO 2013

On Storm

- Source[+T] : Spout[(Long, T)]
- Store[-K, V] : StormStore[K, V]
- Sink[-T] : (T => Future[Unit])
- Service[-K, +V] : StormService[K, V]
- Plan[T] : StormTopology

Page 30: Big Data Analytics with Scala at SCALA.IO 2013

Type Safety

Page 31: Big Data Analytics with Scala at SCALA.IO 2013

SummingBird dependencies

• Storehaus

• Chill

• Scalding

• Algebird

• Tormenta

Page 32: Big Data Analytics with Scala at SCALA.IO 2013

But

- Can only aggregate values whose combining operation is associative: monoids!

trait Monoid[V] {
  def zero: V
  def aggregate(left: V, right: V): V
}
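A minimal sketch of why associativity matters (the SumMonoid instance below is hypothetical): partial aggregates computed on different workers, or in different batches, can be merged in any grouping and still give the same answer:

object SumMonoid extends Monoid[Long] {
  def zero: Long = 0L
  def aggregate(left: Long, right: Long): Long = left + right
}

// Partial counts produced independently (e.g. per worker or per batch) ...
val workerA = List(3L, 5L).foldLeft(SumMonoid.zero)(SumMonoid.aggregate) // 8
val workerB = List(7L).foldLeft(SumMonoid.zero)(SumMonoid.aggregate)     // 7
// ... merge into the same total regardless of grouping order.
val total = SumMonoid.aggregate(workerA, workerB)                        // 15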

Page 33: Big Data Analytics with Scala at SCALA.IO 2013
Page 34: Big Data Analytics with Scala at SCALA.IO 2013

Clustering with Mahout redux

def streamClustering[P <: Platform[P]](
    source: Producer[P, String],
    store: P#Store[_, _]) = {

  lazy val clust = new StreamingKMeans(
    new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
    args("sloppyclusters").toInt,
    (10e-6).asInstanceOf[Float])

  var count = 0

  val sloppyClusters = source
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
      cl.cluster(cent); cl
    }
    .flatMap(c => c.iterator.asScala.toIterable)

Page 35: Big Data Analytics with Scala at SCALA.IO 2013

Clustering with Mahout redux (cont.)

  val finalClusters = sloppyClusters.groupAll
    .mapValueStream { centList =>
      lazy val bclusterer = new BallKMeans(
        new BruteSearch(new EuclideanDistanceMeasure),
        args("numclusters").toInt, 100)
      bclusterer.cluster(centList.toList.asJava)
      bclusterer.iterator.asScala
    }
    .values
    .saveTo(store)
}

Page 36: Big Data Analytics with Scala at SCALA.IO 2013

APACHE SPARK

Page 37: Big Data Analytics with Scala at SCALA.IO 2013

What is Spark?

• Fast and expressive cluster computing system, compatible with Apache Hadoop but an order of magnitude faster

• Improves efficiency through:
  - General execution graphs
  - In-memory storage

• Improves usability through:
  - Rich APIs in Java, Scala, Python
  - Interactive shell

Page 38: Big Data Analytics with Scala at SCALA.IO 2013

Key idea

• Write programs in terms of transformations on distributed datasets

• Concept: resilient distributed datasets (RDDs)
  - Collections of objects spread across a cluster
  - Built through parallel transformations (map, filter, etc.)
  - Automatically rebuilt on failure
  - Controllable persistence (e.g. caching in RAM)

Page 39: Big Data Analytics with Scala at SCALA.IO 2013

Example: Word Count
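The code for this slide was not captured in the transcript; here is a minimal sketch of the usual Spark word count, assuming an existing SparkContext named sc:

val counts = sc.textFile("hdfs://...")     // RDD[String], one element per line
  .flatMap(line => line.split("\\s+"))     // RDD[String], one element per word
  .map(word => (word, 1))                  // RDD[(String, Int)]
  .reduceByKey(_ + _)                      // sum the counts per word
counts.saveAsTextFile("hdfs://...")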

Page 40: Big Data Analytics with Scala at SCALA.IO 2013

Other RDD Operators

• map

• filter

• groupBy

• sort

• union

• join

• leftOuterJoin

• rightOuterJoin
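A small sketch exercising a few of the pair-RDD operators above (sc is an existing SparkContext; the data is made up for illustration):

val users  = sc.parallelize(Seq((1, "ana"), (2, "bob")))
val visits = sc.parallelize(Seq((1, "page1"), (1, "page2"), (3, "page3")))

visits.join(users).collect()           // inner join: only key 1 survives
visits.leftOuterJoin(users).collect()  // keeps (3, ("page3", None))
users.union(users).distinct().collect()
users.map(_._2).filter(_.startsWith("a")).collect()  // Array("ana")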

Page 41: Big Data Analytics with Scala at SCALA.IO 2013

Example: Log Mining

Load error messages from a log into memory, then interactively search for various patterns:

val lines = spark.textFile("hdfs://...")               // Base RDD
val errors = lines.filter(s => s.startsWith("ERROR"))  // Transformed RDD
val messages = errors.map(s => s.split("\t"))
messages.cache()

messages.filter(s => s.contains("foo")).count()        // Action
messages.filter(s => s.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to three workers; each worker reads one HDFS block (Block 1–3), caches its partition of messages (Cache 1–3), and sends results back to the driver.]

Result: full-text search of Wikipedia in 0.5 sec (vs 20 sec for on-disk data)

Result: scaled to 1 TB data in 5 sec (vs 180 sec for on-disk data)

Page 42: Big Data Analytics with Scala at SCALA.IO 2013

Fault Recovery

RDDs track lineage information that can be used to efficiently recompute lost data.

Ex:

val msgs = textFile.filter(_.startsWith("ERROR"))
                   .map(_.split("\t"))

[Lineage diagram: HDFS File → filter (func = _.contains(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]

Page 43: Big Data Analytics with Scala at SCALA.IO 2013

Spark Streaming

- Extends Spark's capabilities to large-scale stream processing.
- Scales to 100s of nodes and achieves second-scale latencies
- Efficient and fault-tolerant stateful stream processing
- Simple batch-like API for implementing complex algorithms

Page 44: Big Data Analytics with Scala at SCALA.IO 2013

Discretized Stream Processing

[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]

- Chop up the live stream into batches of X seconds
- Spark treats each batch of data as an RDD and processes it using RDD operations
- Finally, the processed results of the RDD operations are returned in batches
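A minimal sketch of this batching model in code, against the Spark 0.8-era API (the host and port are made up): a StreamingContext with a 2-second batch interval turns a socket stream into a sequence of RDDs:

import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext("local[2]", "BatchSketch", Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)  // live data stream
lines.count().print()                                // one count per 2-second batch
ssc.start()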

Page 45: Big Data Analytics with Scala at SCALA.IO 2013

Discretized Stream Processing

- Batch sizes as low as ½ second, latency of about 1 second
- Potential for combining batch processing and stream processing in the same system

Page 46: Big Data Analytics with Scala at SCALA.IO 2013

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream: a sequence of RDDs representing a stream of data

[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, t+1, t+2) is stored in memory as an RDD (immutable, distributed).]

Page 47: Big Data Analytics with Scala at SCALA.IO 2013

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

val hashTags = tweets.flatMap(status => getTags(status))

transformation: modify the data in one DStream to create another DStream

[Diagram: a flatMap on every batch (@ t, t+1, t+2) of the tweets DStream creates the new hashTags DStream ([#cat, #dog, …]); new RDDs are created for every batch.]

Page 48: Big Data Analytics with Scala at SCALA.IO 2013

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })

foreach: do whatever you want with the processed data. Write to a database, update an analytics UI, whatever you want.

[Diagram: flatMap produces each hashTags batch; a foreach then runs on every batch's RDD.]

Page 49: Big Data Analytics with Scala at SCALA.IO 2013

Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")

output operation: pushes data to external storage; here, every batch is saved to HDFS

[Diagram: flatMap produces each hashTags batch; a save then writes every batch's RDD to HDFS.]

Page 50: Big Data Analytics with Scala at SCALA.IO 2013

Window-based Transformations

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap(status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()

window(window length, sliding interval): a sliding window over the DStream; here, hashtag counts over the last minute, recomputed every 5 seconds.

[Diagram: a sliding window of the given length moves over the DStream of data, advancing by the sliding interval.]

Page 51: Big Data Analytics with Scala at SCALA.IO 2013

Compute Top-K IP Addresses

val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), …)
val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY, …)
val addresses = stream.map(ipAddress => ipAddress.getText)

// Count-Min Sketch monoid from Algebird
val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
var globalCMS = cms.zero // mutated as batches arrive
val mm = new MapMonoid[Long, Int]() // init

val topAddresses = addresses
  .mapPartitions(ids => ids.map(id => cms.create(id)))
  .reduce(_ ++ _)

Page 52: Big Data Analytics with Scala at SCALA.IO 2013

topAddresses.foreach { rdd =>
  if (rdd.count() != 0) {
    // Top-K within this batch
    val partial = rdd.first()
    val partialTopK = partial.heavyHitters
      .map(id => (id, partial.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)

    // Merge into the running global sketch, then extract the global Top-K
    globalCMS ++= partial
    val globalTopK = globalCMS.heavyHitters
      .map(id => (id, globalCMS.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)

    println(globalTopK.mkString("[", ",", "]"))
  }
}

Page 53: Big Data Analytics with Scala at SCALA.IO 2013

Multi-purpose analytics stack

- Batch processing: Spark
- Ad-hoc queries: Spark + Shark
- Stream processing: Spark + Spark Streaming
- Plus: MLbase, GraphX, BlinkDB, Tachyon

Page 54: Big Data Analytics with Scala at SCALA.IO 2013

SPARK + SPARK STREAMING

- Almost identical API for batch and streaming (see the sketch below)
- Single platform with fewer moving parts
- Order of magnitude faster
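A small sketch of the "almost identical API" point, assuming an existing SparkContext sc and StreamingContext ssc (the socket source stands in for any stream):

// Batch: word counts over a file (RDD API)
val batchCounts = sc.textFile("hdfs://...")
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)

// Streaming: the same pipeline over a live stream (DStream API)
val streamCounts = ssc.socketTextStream("localhost", 9999)
  .flatMap(_.split("\\s+"))
  .map((_, 1))
  .reduceByKey(_ + _)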

Page 55: Big Data Analytics with Scala at SCALA.IO 2013

References

- Sam Ritchie, "SummingBird: Streaming MapReduce at Twitter": https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
- Chris Severs, Vitaly Gordon, "Scalable Machine Learning with Scala": http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
- Apache Spark: http://spark.incubator.apache.org
- Matei Zaharia, "Parallel Programming with Spark"
