Scalding by Adform Research, Alex Gryzlov

Quick Guide

Upload: vasil-remeniuk

Post on 15-Jul-2015

What is Scalding?

• Scala wrapper for Cascading

What is Cascading?

Tap / Pipe / Sink abstraction over Map / Reduce in Java

What is Scalding?

• Scala wrapper for Cascading

• Just like working with in-memory collections!

TextLine( args("input") )
  .flatMap('line -> 'word) { line : String => tokenize(line) }
  .groupBy('word) { _.size }
  .write( Tsv( args("output") ) )

• No more scripting and UDFs!
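For comparison, here is a minimal sketch of the same word count on ordinary in-memory Scala collections. It only illustrates the "just like collections" claim and is not taken from the talk: the `InMemoryWordCount` object and this `tokenize` are assumptions, since the slide does not show how `tokenize` is defined.

```scala
object InMemoryWordCount {
  // Hypothetical tokenizer (the slide's tokenize is not shown):
  // lowercase, drop punctuation, split on whitespace.
  def tokenize(text: String): Seq[String] =
    text.toLowerCase
      .replaceAll("[^a-z0-9\\s]", "")
      .split("\\s+")
      .filter(_.nonEmpty)
      .toSeq

  // Same shape as the pipeline above: flatMap to words, group by word, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
}
```

The Scalding version reads the same way, except that the input and output are taps (`TextLine`, `Tsv`) and the work is planned as Map / Reduce steps.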

Hands on

• Clone the skeleton repository

• Get IntelliJ IDEA and the Scala plugin

• Open the project

• Compile, wait for dependencies to download

• Create a run configuration …

• Create a specs2 configuration for tests

Run the WordCountJob in local mode with the given input and output

Building and Deploying

• Get sbt

• sbt assembly produces a jar file in target/scala-2.10

• sbt s3-upload builds the jar and uploads it to S3

• Configure TeamCity

Running on EMR

• hadoop fs -get s3://dev-adform-temp-results/wordcount-job.jar job.jar

• hadoop jar job.jar \
    com.twitter.scalding.Tool \
    com.adform.dspr.WordCountJob \
    --hdfs \
    --input s3://adform-dsp-metadata/countries/countries.txt \
    --output s3://dev-adform-temp-results/wordcount

  com.twitter.scalding.Tool: entry class
  com.adform.dspr.WordCountJob: Scalding job class
  --hdfs: run in HDFS mode
  --input, --output: job parameters
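To show how `--input` and `--output` end up as `args("input")` and `args("output")` inside the job, here is a toy sketch of `--key value` parsing. It is an assumption-laden illustration and not Scalding's actual `Args` implementation: `ToyArgs` and `parse` are hypothetical names.

```scala
object ToyArgs {
  // Hypothetical, simplified parser: every "--key" starts a new entry and the
  // following tokens (if any) become its values. A bare flag such as "--hdfs"
  // maps to an empty list.
  def parse(argv: Seq[String]): Map[String, List[String]] =
    argv.foldLeft(List.empty[(String, List[String])]) {
      case (acc, token) if token.startsWith("--") =>
        (token.drop(2), Nil) :: acc
      case ((key, values) :: rest, token) =>
        (key, values :+ token) :: rest
      case (Nil, token) =>
        // Positional tokens before any "--key" are stored under the empty key.
        ("", List(token)) :: Nil
    }.toMap
}
```

With this sketch, `ToyArgs.parse(Seq("--hdfs", "--input", "in.txt"))("input")` yields `List("in.txt")`.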

Under the covers

• sbt "run-main \
    com.twitter.scalding.Tool \
    com.adform.dspr.WordCountJob \
    --hdfs \
    --tool.graph \
    --input dummy --output dummy"

• dot -Tpng com.adform.dspr.WordCountJob0.dot -o logical_plan.png

• dot -Tpng com.adform.dspr.WordCountJob0_steps.dot -o mr_plan.png

Development

• Different APIs:

    • Fields – everything is a string

    • Typed – working with classes, e.g. Request/Transaction

Development

• Fields:

    • No need to parse columns

    • Redundant

    • No IDE support like auto-completion

• Typed:

    • All benefits of types

    • More manual work with parsing
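The "manual work with parsing" on the typed side might look roughly like the sketch below. The `Request` fields here are hypothetical (the real Request/Transaction classes are not shown in the slides); the point is only that each record type needs its own hand-written parser before the typed operations can be used.

```scala
import scala.util.Try

// Hypothetical record type: the field names and types are assumptions.
final case class Request(timestamp: Long, country: String, price: Double)

object Request {
  // Parse one tab-separated line; returns None for malformed input
  // (wrong column count or non-numeric values).
  def parse(line: String): Option[Request] =
    line.split("\t", -1) match {
      case Array(ts, country, price) =>
        Try(Request(ts.toLong, country, price.toDouble)).toOption
      case _ =>
        None
    }
}
```

Once lines are parsed, fields are accessed as `request.country` instead of by a symbol such as `'country`, which is where the type checking and IDE auto-completion come from.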

Resources

• https://github.com/twitter/scalding

• https://github.com/twitter/scalding/tree/develop/tutorial

• https://github.com/twitter/scalding/wiki

• http://www.slideshare.net/AntwnisChalkiopoulos/scalding-presentation

• http://www.slideshare.net/ktoso/scalding-the-notsobasics-scaladays-2014

• https://gitz.adform.com/dspr/data-processing/tree/develop/jobs/process-logs-rtb

My Experience

• Running the job locally is a HUGE time saver

• Programming in Scala is amazing (no more UDFs)

• Type safety, IDE support!

• Debugging !!!!111

• More optimal job plans

My Experience

• A lot of configuring and googling random issues

• Scarce documentation, had to read source code

• IntelliJ is slow

• Boilerplate code for parsing data

Use cases

• Easy jobs → Hive

• Non-trivial jobs → Scalding

• Optional: Scalding is nice for matrix calculations. Twitter also provides many monoids (algorithms) for good approximations, e.g. HyperLogLog, CountMinSketch, etc. (see Algebird).
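Why a monoid suits Map / Reduce can be sketched with a toy Count-Min Sketch: each partition builds its own sketch and the partial sketches are merged with an associative `plus`. This is a simplified illustration written for this guide and not Algebird's implementation or API; `Cms`, `add`, `estimate` and `plus` are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

// Toy Count-Min Sketch: `depth` rows of `width` counters, one hash seed per row.
final case class Cms(depth: Int, width: Int, table: Vector[Vector[Long]]) {
  // Counter index for `item` in a given row; the row number is the hash seed.
  private def bucket(row: Int, item: String): Int =
    java.lang.Math.floorMod(MurmurHash3.stringHash(item, row), width)

  // Increment one counter per row.
  def add(item: String): Cms =
    copy(table = table.zipWithIndex.map { case (counters, row) =>
      val b = bucket(row, item)
      counters.updated(b, counters(b) + 1L)
    })

  // Minimum over the rows: never below the true count, may overestimate on collisions.
  def estimate(item: String): Long =
    table.zipWithIndex.map { case (counters, row) => counters(bucket(row, item)) }.min

  // Monoid "plus": element-wise sum of two sketches of the same shape.
  def plus(other: Cms): Cms = {
    require(depth == other.depth && width == other.width, "sketch shapes must match")
    copy(table = table.zip(other.table).map { case (a, b) =>
      a.zip(b).map { case (x, y) => x + y }
    })
  }
}

object Cms {
  // Monoid "zero": a sketch with all counters set to 0.
  def empty(depth: Int, width: Int): Cms =
    Cms(depth, width, Vector.fill(depth)(Vector.fill(width)(0L)))
}
```

Because `plus` is associative and `empty` is its identity, merging per-partition sketches gives the same result as sketching the whole data set in one pass.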

process-logs-rtb

• Had to hack Scalding:

    • WritableMultiSinkTap

    • Records

    • CompressedTsv

    • ModelKryoInstantiator

• Uses typed API

• Helpers like FluentJob
