Scalding Presentation
Scalding, Scala, MapReduce. 24th Hadoop London Meetup
![Page 1: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/1.jpg)
MapReduce with Scalding
Antonios Chalkiopoulos
24th Big Data London Meetup
Scalding.io
![Page 2: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/2.jpg)
$ whoami
http://scalding.io
http://github.com/scalding-io
@chalkiopoulos
![Page 3: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/3.jpg)
My recent achievement..
![Page 4: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/4.jpg)
![Page 5: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/5.jpg)
What are we gonna talk about..?
![Page 6: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/6.jpg)
![Page 7: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/7.jpg)
A Scala API on top of Cascading
![Page 8: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/8.jpg)
But what is Cascading?
![Page 9: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/9.jpg)
A few years ago I started on a fresh Big Data team…
Story!!
![Page 10: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/10.jpg)
How do we efficiently develop MapReduce jobs for our new Hadoop cluster?
![Page 11: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/11.jpg)
MapReduce Techs
[Diagram: abstraction stack with Java MapReduce on top of Hadoop]
![Page 12: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/12.jpg)
Java MapReduce word count example
![Page 13: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/13.jpg)
MapReduce Techs
[Diagram: abstraction stack showing Hadoop, Java MapReduce, Pig, Hive, Cascading, and others]
![Page 14: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/14.jpg)
The promise of Cascading
![Page 15: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/15.jpg)
[1] A simple, high-level Java API for MapReduce that is easy to understand and work with.
![Page 16: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/16.jpg)
[2] Extensions to MANY platforms
![Page 17: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/17.jpg)
Cascading connects to:
NoSQL databases: MongoDB, Cassandra, HBase, Accumulo, …
SQL databases
Hadoop filesystem
Local filesystem
In-memory systems: Redis, Memcached, …
Search platforms: ElasticSearch, Solr, …
![Page 18: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/18.jpg)
How does it work?
![Page 19: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/19.jpg)
A pipeline architecture
![Page 20: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/20.jpg)
where tuples flow through pipes
[Diagram: data -> source tap -> pipe carrying Tuple1, Tuple2 -> sink tap -> data]
![Page 21: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/21.jpg)
[Diagram: pipes from the Log files and Customer Data sources merge into Log & Customer and flow on to Results and Final Results]
![Page 22: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/22.jpg)
Cascading Example
![Page 23: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/23.jpg)
Word count in Cascading
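    // Imports assume the Cascading 1.x package layout.
    import java.util.Properties;
    import cascading.flow.Flow;
    import cascading.flow.FlowConnector;
    import cascading.operation.Aggregator;
    import cascading.operation.Function;
    import cascading.operation.aggregator.Count;
    import cascading.operation.regex.RegexGenerator;
    import cascading.pipe.Each;
    import cascading.pipe.Every;
    import cascading.pipe.GroupBy;
    import cascading.pipe.Pipe;
    import cascading.scheme.Scheme;
    import cascading.scheme.TextLine;
    import cascading.tap.Hfs;
    import cascading.tap.SinkMode;
    import cascading.tap.Tap;
    import cascading.tuple.Fields;

    public class WordCount {
      public static void main(String[] args) {
        Properties properties = new Properties();
        FlowConnector.setApplicationJarClass(properties, WordCount.class);
        Scheme sourceScheme = new TextLine(new Fields("line"));
        Scheme sinkScheme = new TextLine(new Fields("word", "count"));
        Tap source = new Hfs(sourceScheme, args[0]);
        Tap sink = new Hfs(sinkScheme, args[1], SinkMode.REPLACE);
        Pipe assembly = new Pipe("Word Count");
        String regex = "(?<!\\pL)(?=\\pL)[^ ]*(?<=\\pL)(?!\\pL)"; // one match per word
        Function function = new RegexGenerator(new Fields("word"), regex);
        assembly = new Each(assembly, new Fields("line"), function); // line -> words
        assembly = new GroupBy(assembly, new Fields("word"));
        Aggregator count = new Count(new Fields("count"));
        assembly = new Every(assembly, count); // count each group
        FlowConnector flowConnector = new FlowConnector(properties);
        Flow flow = flowConnector.connect("word-count", source, sink, assembly);
        flow.complete();
      }
    }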
70% less boilerplate code
But still some infrastructure code
![Page 24: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/24.jpg)
![Page 25: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/25.jpg)
No boilerplate code at all
Functional
Robust & Scalable
Runs on the JVM
![Page 26: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/26.jpg)
Here it comes
[Diagram: the abstraction stack of Hadoop, Java MapReduce, Pig, Hive, Cascading, and others, now with Scalding added]
![Page 27: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/27.jpg)
The power of Scala on top of Cascading
![Page 28: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/28.jpg)
Scala fits naturally with data
![Page 29: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/29.jpg)
Word count in Scalding
    import com.twitter.scalding._

    class WordCountJob(args: Args) extends Job(args) {
      TextLine("input.txt").read
        .flatMap('line -> 'word) { line: String => line.split("\\s+") } // Map phase
        .groupBy('word) { _.size }                                       // Reduce phase
        .write(Tsv("results.tsv"))
    }
4 lines of code!
Code that developers enjoy writing
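To see why the job above splits into a map phase and a reduce phase, here is a minimal sketch of the same logic on plain Scala collections. It does not use Scalding at all; the object name `WordCountSketch` and its helper names are made up for illustration, and the empty-string filter is an addition that the slide's job does not have.

```scala
object WordCountSketch {
  // Map phase: each line is split into words (one record in, many records out).
  // The nonEmpty filter is an extra safeguard against blank lines and leading spaces.
  def mapPhase(lines: Seq[String]): Seq[String] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty)

  // Reduce phase: identical words are grouped and each group is counted.
  def reducePhase(words: Seq[String]): Map[String, Int] =
    words.groupBy(identity).map { case (word, group) => word -> group.size }

  def wordCount(lines: Seq[String]): Map[String, Int] =
    reducePhase(mapPhase(lines))

  def main(args: Array[String]): Unit =
    wordCount(Seq("cool Scala cool", "Scala on Hadoop")).toSeq.sorted.foreach(println)
}
```

On a cluster the grouping step is where data gets shuffled between machines, which is roughly why `groupBy` marks the boundary between the two phases.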
![Page 30: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/30.jpg)
Who is using it?
Many many others…
![Page 31: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/31.jpg)
Scalding…
…open sourced by Twitter in 2011
…has more than 100 open source contributors
…exposes the right abstractions
…maximizes expressiveness
…promotes extensibility
…adds new capabilities to Cascading
![Page 32: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/32.jpg)
Core Concepts
![Page 33: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/33.jpg)
Sources & Sinks
    Tsv("data.tsv", ('productID, 'price, 'quantity))
      .read
      .write(UnpackedAvroSource("data.avro"))
Tsv, Csv, Osv, Avro, Parquet, …
![Page 34: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/34.jpg)
Map Operations
    pipe1.filter('age) { age: Int => age > 18 }
    pipe1.map('price -> 'withVAT) { price: Double => price * 1.2 }
    pipe1.project('name, 'surname)
15 map operations, translated into map phases
![Page 35: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/35.jpg)
Join operations
    pipe1.joinWithSmaller('productId -> 'productId, pipe2)
    pipe1.joinWithLarger('productId -> 'productId, pipe2)
    pipe1.joinWithTiny('productId -> 'productId, pipe2)
Optimize by hinting the relative sizes
Supports Left, Right, Inner, Outer Joins
    pipe1
      .joinWithSmaller('productId -> 'productId, pipe2,
        joiner = new LeftJoin)
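Why does a size hint matter? One way to picture the `joinWithTiny` case: when one side fits in memory, it can be turned into a lookup table and the large side streamed past it once, with no shuffle. The sketch below shows that idea on plain Scala collections. It is not Scalding's implementation; `TinyJoinSketch`, `Sale`, and `Product` are hypothetical names, and it assumes `productId` is unique on the tiny side.

```scala
object TinyJoinSketch {
  final case class Sale(productId: Int, quantity: Int)
  final case class Product(productId: Int, name: String)

  // Inner join: the tiny side becomes an in-memory lookup table and the
  // large side is streamed once. Assumes productId is unique in `tiny`.
  def joinWithTinySide(large: Iterator[Sale], tiny: Seq[Product]): Iterator[(Sale, Product)] = {
    val lookup: Map[Int, Product] = tiny.map(p => p.productId -> p).toMap
    large.flatMap(sale => lookup.get(sale.productId).map(product => (sale, product)))
  }

  // Left join variant: every sale is kept, with None when no product matches.
  def leftJoinWithTinySide(large: Iterator[Sale], tiny: Seq[Product]): Iterator[(Sale, Option[Product])] = {
    val lookup: Map[Int, Product] = tiny.map(p => p.productId -> p).toMap
    large.map(sale => (sale, lookup.get(sale.productId)))
  }
}
```

When neither side fits in memory this trick is not available, which is where the smaller/larger hints come in.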
![Page 36: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/36.jpg)
Group operations
    val pipe = Tsv("input", ('shopId, 'itemId, 'quantity))
      .groupBy('shopId) {
        _.sum[Long]('quantity -> 'totalSoldItems)
      }
      .write(Tsv("results.tsv"))
.groupBy: group by particular fields
.groupAll: group all data
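The difference between the two grouping styles can be sketched on plain Scala collections. This is only an illustration of the semantics, not Scalding code; `GroupSketch` and `Sale` are hypothetical names chosen to mirror the fields on the slide.

```scala
object GroupSketch {
  final case class Sale(shopId: String, itemId: String, quantity: Long)

  // Like groupBy('shopId) with a sum: one total per distinct shop.
  def totalSoldItemsPerShop(sales: Seq[Sale]): Map[String, Long] =
    sales.groupBy(_.shopId).map { case (shop, rows) => shop -> rows.map(_.quantity).sum }

  // Like groupAll with a sum: a single group that contains every record.
  def totalSoldItems(sales: Seq[Sale]): Long =
    sales.map(_.quantity).sum
}
```

A single global group means all records end up in one place, so the group-all style is best kept for small aggregates.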
![Page 37: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/37.jpg)
Pipe operations
    val p = (pipe1 ++ pipe2)        // Concatenate 2 pipes
      .debug                        // Print sample data to screen
      .addTrap(Tsv("bogus_lines"))  // Dirty data are recorded
Simple pipe operations
![Page 38: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/38.jpg)
Connect with external systems
![Page 39: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/39.jpg)
Scalding + Hive
    class HiveExample(args: Args) extends Job(args) {
      val USER_SCHEMA = List('userId, 'username, 'photo)
      HiveSource("myHiveTable", SinkMode.KEEP)
        .withHCatScheme(osvInputScheme(fields = USER_SCHEMA))
        .write(Tsv("outputFromHive"))
    }
Define the schema
Query HCatalog
Read directly from HDFS
![Page 40: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/40.jpg)
Scalding + ElasticSearch
    val schema = List('number, 'product, 'description)
    val readES = ElasticSearchTap("localhost", 9200, "index/firstType", "", schema)
      .read
      .write(Tsv("data/es-out.tsv"))
    val writeES = Tsv("data.tsv")
      .read
      .write(ElasticSearchTap("localhost", 9200, "index/secondType", "", schema))
Read from ElasticSearch in one line!
Also index new data in ES
![Page 41: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/41.jpg)
Design patterns
![Page 42: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/42.jpg)
Dependency Injection
Late bound
External Operations
![Page 43: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/43.jpg)
How about defining external operations?
    val pipe1 = Tsv("omniture.tsv", OMNITURE_SCHEMA)
      .read
      .ETLOmnitureData
      .calculateOmnitureUserStats
      .joinWithCustomerDB('userId -> 'userId, customerPipe)
      .write(Tsv("omniture-results.tsv"))
Custom operations:
Re-usable modular code
Single responsibility
Testability
Full code: http://bit.ly/1pNSUKf
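How can a pipe grow methods like `ETLOmnitureData`? A common Scala way to do this is an implicit class that wraps the data type and adds named operations to it. The sketch below shows the pattern on a plain `Seq` standing in for a pipe. It is an assumption about the general technique, not the code behind the link above; `ExternalOperationsSketch`, `Visit`, and both operations are hypothetical.

```scala
object ExternalOperationsSketch {
  final case class Visit(userId: String, pageViews: Int)

  // Stand-in for a pipe; a real job would wrap a Scalding pipe instead.
  type Records = Seq[Visit]

  // The implicit class adds domain-specific operations to Records,
  // so a job reads as a chain of named business steps.
  implicit class VisitOperations(val records: Records) {
    // Hypothetical ETL step: drop records with no user or no page views.
    def dropInvalid: Records =
      records.filter(v => v.userId.nonEmpty && v.pageViews > 0)

    // Hypothetical aggregation step: total page views per user.
    def pageViewsPerUser: Map[String, Int] =
      records.groupBy(_.userId).map { case (user, vs) => user -> vs.map(_.pageViews).sum }
  }
}
```

With `import ExternalOperationsSketch._` in scope, a caller can write `visits.dropInvalid.pageViewsPerUser`, and each operation can be tested on its own with a small in-memory input.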
![Page 44: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/44.jpg)
Scalding Testing
![Page 45: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/45.jpg)
Testing challenges in the context of MR
Acceptance Tests
Unit – Component Tests
System Tests
Integration Tests
Scalding enables testing in every layer & TDD
![Page 46: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/46.jpg)
Example
    import com.twitter.scalding._
    import org.scalatest.FlatSpec
    import org.scalatest.matchers.ShouldMatchers

    class TsvWordCountJobTest extends FlatSpec
      with ShouldMatchers with TupleConversions {

      "WordCountJob" should "count words" in {
        JobTest(new WordCountJob(_))
          .arg("input", "inFile")
          .arg("output", "outFile")
          .source(TextLine("inFile"), List("0" -> "cool Scala cool"))
          .sink[(String, Int)](Tsv("outFile")) { out =>
            out.toList should contain ("cool" -> 2)
          }
          .run
          .finish
      }
    }
Replaces taps with in-memory collections and asserts the expected output
![Page 47: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/47.jpg)
Monitoring
![Page 48: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/48.jpg)
“Driven takes Cascading application development to the next level with management and monitoring capabilities for your apps”
http://driven.cascading.io
![Page 49: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/49.jpg)
Collects telemetry data and exposes it through a web UI
![Page 50: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/50.jpg)
Advanced Concepts
![Page 51: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/51.jpg)
Scalding adds:
Typed API
Matrix API
Graphs
Machine Learning Algorithms
![Page 52: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/52.jpg)
What is the future like?
![Page 53: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/53.jpg)
So far…
[Diagram: abstraction stack]
![Page 54: Scalding Presentation](https://reader035.vdocument.in/reader035/viewer/2022062702/554a11ffb4c9058c5d8b4b50/html5/thumbnails/54.jpg)
Real Time, Batch, Hybrid
[Diagram: abstraction stack with Summingbird at the top; Storm, Tez, and Spark also appear]
Summingbird
A unified API for everything
Enables the Lambda architecture
Questions?