Big Data Analytics with Scala
Sam BESSALAH @samklr
What is Big Data Analytics?
It’s about running aggregations and
complex models on large datasets, offline, in
real time, or both.
Lambda Architecture
Blueprint for a Big Data analytics
architecture
MapReduce redux
map : (Km, Vm) → List(Km, Vm)
in Scala : T => List[(K, V)]
reduce : (Km, List(Vm)) → List(Kr, Vr)
in Scala : (K, List[V]) => List[(K, V)]
Big Data "Hello World" : Word Count
Enter Cascading
Word Count Redux
(Flat)Map-Reduce
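Before jumping to Cascading and Scalding, the same (flat)map/reduce shape can be seen on plain Scala collections; a minimal local sketch, not a distributed job :

// Word count as (flat)map + reduce over local Scala collections
val lines = List("hello big data", "hello scala")

val counts: Map[String, Int] =
  lines
    .flatMap(_.split("\\s+"))   // "map" phase : emit one token per word
    .groupBy(identity)          // "shuffle" : group identical words together
    .map { case (word, occs) => (word, occs.size) } // "reduce" : count each group

// counts == Map("hello" -> 2, "big" -> 1, "data" -> 1, "scala" -> 1)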
SCALDING
class WordCount(args : Args) extends Job(args) {
  TextLine(args("input"))
    .flatMap('line -> 'word) { line : String => line.split("\\s+") }
    .groupBy('word) { group => group.size }
    .write(Tsv(args("output")))
}
SCALDING : Clustering with Mahout
lazy val clust = new StreamingKMeans(
  new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
  args("sloppyclusters").toInt,
  (10e-6).toFloat)

var count = 0

val sloppyClusters =
  TextLine(args("input"))
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
      cl.cluster(cent); cl
    }
    .flatMap(c => c.iterator.asScala.toIterable)
SCALDING : Clustering with Mahout
val finalClusters = sloppyClusters.groupAll
  .mapValueStream { centList =>
    lazy val bclusterer = new BallKMeans(
      new BruteSearch(new EuclideanDistanceMeasure),
      args("numclusters").toInt, 100)
    bclusterer.cluster(centList.toList.asJava)
    bclusterer.iterator.asScala
  }
  .values
Scalding
- Two APIs : a Field-based API and a Typed API
- Field API : project, map, discard, groupBy…
- Typed API : TypedPipe[T], works like
scala.collection.Iterator[T] (see the sketch below)
- Matrix library
- ALGEBIRD : abstract algebra library… we’ll
talk about it later
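For comparison with the Field API job above, a minimal sketch of word count with the Typed API, in the spirit of the canonical Scalding example :

import com.twitter.scalding._

class TypedWordCount(args : Args) extends Job(args) {
  TypedPipe.from(TextLine(args("input")))  // a TypedPipe[String] of lines
    .flatMap(_.split("\\s+"))              // Iterator-like : one word per element
    .groupBy(identity)                     // group identical words
    .size                                  // count each group
    .write(TypedTsv[(String, Long)](args("output")))
}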
STORM
- Distributed, fault-tolerant, real-time stream
computation engine.
- Four concepts :
- Streams : infinite sequences of tuples
- Spouts : sources of streams
- Bolts : process and produce streams.
Can do filtering, aggregations, joins, …
- Topologies : define a flow or network of
spouts and bolts.
Streaming Word Count
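A sketch of how spouts and bolts wire into a streaming word-count topology with Storm’s core API; SentenceSpout, SplitSentence and WordCountBolt are assumed user-defined components :

val builder = new TopologyBuilder
builder.setSpout("sentences", new SentenceSpout)  // spout : source of the stream
builder.setBolt("split", new SplitSentence)
  .shuffleGrouping("sentences")                   // bolt : sentences -> words
builder.setBolt("count", new WordCountBolt)
  .fieldsGrouping("split", new Fields("word"))    // same word always hits the same task
StormSubmitter.submitTopology("word-count", new Config, builder.createTopology)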
Trident
TridentTopology topology = new TridentTopology();
TridentState wordCounts = topology.newStream("spout1", spout)
  .each(new Fields("sentence"), new Split(), new Fields("word"))
  .groupBy(new Fields("word"))
  .persistentAggregate(new Factory(), new Count(), new Fields("count"))
  .parallelismHint(6);
ScalaStorm by Evan Chan
class SplitSentence extends StormBolt(outputFields = List("word")) {
  def execute(t : Tuple) = t matchSeq {
    case Seq(line : String) =>
      line.split(" ").foreach { word => using anchor t emit (word) }
      t ack
  }
}
SummingBird
Write your job once and run it on Storm and
Hadoop
def wordCount[P <: Platform[P]](
    source : Producer[P, String],
    store : P#Store[String, Long]) =
  source
    .flatMap { line => line.split("\\s+").map(_ -> 1L) }
    .sumByKey(store)
SummingBird
trait Platform[P <: Platform[P]] {
type Source[+T]
type Store[-K, V]
type Sink[-T]
type Service[-K, +V]
type Plan[T]
}
On Storm
- Source[+T] : Spout[(Long, T)]
- Store[-K, V] : StormStore [K, V]
- Sink[-T] : (T => Future[Unit])
- Service[-K, +V] : StormService[K,V]
- Plan[T] : StormTopology
Type Safety
SummingBird dependencies
• Storehaus
• Chill
• Scalding
• Algebird
• Tormenta
But
- Can only aggregate values whose combining
operation is associative : Monoids!

trait Monoid[V] {
  def zero : V
  def aggregate(left : V, right : V) : V
}
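A minimal instance of the trait above, Long under addition (Algebird names the combining operation plus); associativity is what lets partial aggregates be merged in any order :

object LongSum extends Monoid[Long] {
  def zero = 0L
  def aggregate(left : Long, right : Long) = left + right
}

// Two "machines" each fold their partition, then the partials are merged :
val (left, right) = List(3L, 1L, 4L, 1L, 5L).splitAt(2)
val total = LongSum.aggregate(
  left.foldLeft(LongSum.zero)(LongSum.aggregate),
  right.foldLeft(LongSum.zero)(LongSum.aggregate))
// total == 14, however the data was partitioned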
Clustering with Mahout redux
def streamClustering[P <: Platform[P]](
    source : Producer[P, String],
    store : P#Store[_, _]) = {

  lazy val clust = new StreamingKMeans(
    new FastProjectionSearch(new EuclideanDistanceMeasure, 5, 10),
    args("sloppyclusters").toInt,
    (10e-6).toFloat)

  var count = 0

  val sloppyClusters = source
    .map { str =>
      val vec = str.split("\t").map(_.toDouble)
      val cent = new Centroid(count, new DenseVector(vec))
      count += 1
      cent
    }
    .unorderedFold[StreamingKMeans, Centroid](clust) { (cl, cent) =>
      cl.cluster(cent); cl
    }
    .flatMap(c => c.iterator.asScala.toIterable)
SUMMINGBIRD : Clustering with Mahout
  val finalClusters = sloppyClusters.groupAll
    .mapValueStream { centList =>
      lazy val bclusterer = new BallKMeans(
        new BruteSearch(new EuclideanDistanceMeasure),
        args("numclusters").toInt, 100)
      bclusterer.cluster(centList.toList.asJava)
      bclusterer.iterator.asScala
    }
    .values
    .saveTo(store)
}
APACHE SPARK
What is Spark?
• Fast and expressive cluster computing system,
compatible with Apache Hadoop but an order of
magnitude faster
• Improves efficiency through:
-General execution graphs
-In-memory storage
• Improves usability through:
-Rich APIs in Java, Scala, Python
-Interactive shell
Key idea
• Write programs in terms of transformations on distributed
datasets
• Concept: resilient distributed datasets (RDDs)
- Collections of objects spread across a cluster
- Built through parallel transformations (map, filter, etc)
- Automatically rebuilt on failure
- Controllable persistence (e.g. caching in RAM)
Example: Word Count
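The canonical RDD word count; a minimal sketch assuming sc is a SparkContext and the paths are placeholders :

val counts = sc.textFile("hdfs://...")
  .flatMap(line => line.split("\\s+"))  // transformation : lines -> words
  .map(word => (word, 1))               // transformation : word -> (word, 1)
  .reduceByKey(_ + _)                   // transformation : sum counts per word
counts.saveAsTextFile("hdfs://...")     // action : materialize the result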
Other RDD Operators
• map
• filter
• groupBy
• sort
• union
• join
• leftOuterJoin
• rightOuterJoin
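For instance, a small sketch of join on pair RDDs (sc assumed in scope, data inlined for illustration) :

val visits = sc.parallelize(Seq(("index.html", "1.2.3.4"),
                                ("about.html", "3.4.5.6")))
val pageNames = sc.parallelize(Seq(("index.html", "Home"),
                                   ("about.html", "About")))

visits.join(pageNames).collect()
// Array((index.html, (1.2.3.4, Home)), (about.html, (3.4.5.6, About)))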
Example: Log Mining
Load error messages from a log into memory,
then interactively search for various patterns
val lines = spark.textFile("hdfs://...")
val errors = lines.filter(s => s.startsWith("ERROR"))
val messages = errors.map(s => s.split("\t"))
messages.cache()

messages.filter(s => s.contains("foo")).count()
messages.filter(s => s.contains("bar")).count()
. . .

[Diagram: the driver ships tasks to workers; each worker scans its cached HDFS block (Cache 1-3) and returns results. Base RDD → transformed RDD → action.]
Result: full-text search of Wikipedia in 0.5 sec (vs 20 s for on-disk data)
Result: scaled to 1 TB data in 5 sec (vs 180 sec for on-disk data)
Fault Recovery
RDDs track lineage information that can be
used to efficiently recompute lost data
Ex :

val msgs = textFile.filter(_.startsWith("ERROR"))
                   .map(_.split("\t"))

[Lineage: HDFS File → filter (func = _.startsWith(...)) → Filtered RDD → map (func = _.split(...)) → Mapped RDD]
Spark Streaming
- Extends Spark to large-scale stream
processing.
- Scales to hundreds of nodes and achieves second-scale
latencies
- Efficient, fault-tolerant, stateful stream processing
- Simple batch-like API for implementing complex
algorithms
Discretized Stream Processing
[Diagram: live data stream → Spark Streaming → batches of X seconds → Spark → processed results]
Chop up the live stream into batches of X seconds.
Spark treats each batch of data as an RDD and processes it using RDD operations.
Finally, the processed results of the RDD operations are returned in batches.
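A sketch of that batching in code, assuming the early Spark Streaming API; the master string, host, and port are placeholders :

val ssc = new StreamingContext("local[2]", "BatchedStream", Seconds(2))
val lines = ssc.socketTextStream("localhost", 9999)  // DStream : one RDD every 2 seconds
val words = lines.flatMap(_.split("\\s+"))           // ordinary RDD-style ops, per batch
ssc.start()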
Discretized Stream Processing
Batch sizes as low as ½ second, latency of about 1 second
Potential for combining batch processing and streaming processing in the same system
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()

DStream : a sequence of RDDs representing a stream of data
[Diagram: the Twitter Streaming API feeds the tweets DStream; each batch (@ t, t+1, t+2) is stored in memory as an immutable, distributed RDD]
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
transformation : modify data in one DStream to create another DStream
[Diagram: a flatMap on every batch of the tweets DStream creates the new hashTags DStream ([#cat, #dog, …]); new RDDs are created for every batch]
Example – Get hashtags from Twitter

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.foreach(hashTagRDD => { ... })
foreach: do whatever you want with the processed data
[Diagram: flatMap on each batch of the tweets DStream produces the hashTags DStream; foreach then runs on every resulting batch]
Write to database, update analytics UI, do whatever you want
Example – Get hashtags from Twitter
val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
hashTags.saveAsHadoopFiles("hdfs://...")
output operation: to push data to external storage
[Diagram: flatMap on each batch of the tweets DStream produces the hashTags DStream; every batch is saved to HDFS]
Window-based Transformations

val tweets = ssc.twitterStream()
val hashTags = tweets.flatMap (status => getTags(status))
val tagCounts = hashTags.window(Minutes(1), Seconds(5)).countByValue()
[Diagram: a sliding window operation over the DStream, parameterized by window length and sliding interval]
Compute Top-K IP addresses
val ssc = new StreamingContext(master, "AlgebirdCMS", Seconds(10), ...)
val stream = ssc.kafkaStream(None, filters, StorageLevel.MEMORY_ONLY, ...)
val addresses = stream.map(ipAddress => ipAddress.getText)

val cms = new CountMinSketchMonoid(EPS, DELTA, SEED, PERC)
var globalCMS = cms.zero
val mm = new MapMonoid[Long, Int]() // init

val topAddresses = addresses
  .mapPartitions(ids => ids.map(id => cms.create(id)))
  .reduce(_ ++ _)

topAddresses.foreach(rdd => {
  if (rdd.count() != 0) {
    val partial = rdd.first()
    val partialTopK = partial.heavyHitters
      .map(id => (id, partial.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    globalCMS ++= partial
    val globalTopK = globalCMS.heavyHitters
      .map(id => (id, globalCMS.frequency(id).estimate))
      .toSeq.sortBy(_._2).reverse.slice(0, TOPK)
    println(globalTopK.mkString("[", ",", "]"))
  }
})
Multi-purpose analytics stack
- Ad-hoc queries : Spark + Shark
- Batch processing : Spark
- Stream processing : Spark + Spark Streaming
- Also in the stack : MLbase, GraphX, BlinkDB, Tachyon
SPARK
SPARK STREAMING
- Almost identical API for batch and streaming
- Single platform with fewer moving parts
- Order of magnitude faster
References

Sam Ritchie : SummingBird
https://speakerdeck.com/sritchie/summingbird-streaming-mapreduce-at-twitter
Chris Severs, Vitaly Gordon : Scalable Machine Learning with Scala
http://slideshare.net/VitalyGordon/scalable-and-flexible-machine-learning-with-scala-linkedin
Apache Spark : http://spark.incubator.apache.org
Matei Zaharia : Parallel Programming with Spark