Scala and Hadoop @ eBay
What we will cover
• Polymorphic Function Values
• Higher Kinded/Recursive Types
• Cokleislis Star Operators
• Scala Macros
I have no clue what those things are
What we will ACTUALLY cover
• Why Scala
• Why Hadoop
• How we use Scala with Hadoop
• Lots of CODE!
Why Scala?
• JVM
• **Functional**
• Expressive
• How to convince your boss?
Someone on Hacker News said Scala sucks
• Compile Times
• You changed List again?
• Complicated
• Leads to Madness
Madness?

    trait Lazy[+T, P] {
      var creationParameters: P = None.asInstanceOf[P]
      lazy val lazyThing: Either[Throwable, T] =
        try { Right(create(creationParameters)) }
        catch { case e => Left(e) }
      def get(createParams: P): Either[Throwable, T] = {
        creationParameters = createParams
        lazyThing
      }
      def create(params: P): T
    }
Madness?

    def getSingleInstance[T, P](params: P)(implicit lazyCreator: Lazy[T, P]): T = {
      lazyCreator.get(params) match {
        case Right(successValue) => successValue
        case Left(exception) => throw new StackException(exception)
      }
    }
This is used by ONE client class
• Show some self-restraint
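For a single caller, the same initialize-once, capture-the-failure behavior can usually be had with far less machinery. A minimal sketch using only the standard library's `lazy val` and `scala.util.Try`; `ExpensiveClient`, `ClientHolder`, `createClient`, and the endpoint string are hypothetical names that do not appear in the slides:

```scala
import scala.util.Try

// Hypothetical resource; stands in for whatever the one client class needs.
final case class ExpensiveClient(endpoint: String)

object ClientHolder {
  // Counts how often the factory runs, only to make the laziness visible.
  var creations = 0

  // Hypothetical factory; any exception it throws is captured by Try below.
  private def createClient(endpoint: String): ExpensiveClient = {
    creations += 1
    ExpensiveClient(endpoint)
  }

  // Evaluated at most once, on first access; a failure becomes a Failure value.
  lazy val client: Try[ExpensiveClient] = Try(createClient("localhost:9000"))
}
```

Callers read `ClientHolder.client` and decide for themselves whether to pattern match on the result or call `.get`, with no implicit parameter or mutable creation state involved.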
Hadoop
• void map(K1 key, V1 value, OutputCollector<K2, V2> output, Reporter reporter)
• void reduce(K2 key, Iterator<V2> values, OutputCollector<K3, V3> output, Reporter reporter)
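The two signatures above describe a map phase that emits key/value pairs and a reduce phase that receives every value emitted for one key. A rough in-memory sketch of that contract, using plain Scala collections rather than the Hadoop API; `MiniMapReduce`, `run`, and `wordCount` are illustrative names and not part of Hadoop:

```scala
object MiniMapReduce {
  // Simulates map -> shuffle (group by key) -> reduce on local collections.
  def run[K1, V1, K2, V2, K3, V3](
      input: Seq[(K1, V1)],
      mapper: (K1, V1) => Seq[(K2, V2)],
      reducer: (K2, Iterator[V2]) => Seq[(K3, V3)]
  ): Seq[(K3, V3)] = {
    // Map phase: every input record may emit zero or more pairs.
    val mapped: Seq[(K2, V2)] = input.flatMap { case (k, v) => mapper(k, v) }
    // Shuffle phase: collect all values that share a key.
    val shuffled: Map[K2, Seq[V2]] =
      mapped.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Reduce phase: one call per key, with an iterator over its values.
    shuffled.toSeq.flatMap { case (k, vs) => reducer(k, vs.iterator) }
  }

  // Word count: the mapper emits (word, 1), the reducer sums the ones.
  def wordCount(lines: Seq[String]): Map[String, Int] = {
    val input = lines.zipWithIndex.map { case (line, offset) => (offset.toLong, line) }
    run[Long, String, String, Int, String, Int](
      input,
      (_, line) => line.split("\\s+").filter(_.nonEmpty).toSeq.map(w => (w, 1)),
      (word, counts) => Seq((word, counts.sum))
    ).toMap
  }
}
```

On a real cluster the shuffle is performed by the framework across machines; here it is just a `groupBy`, which is enough to show why the reducer sees an iterator of values per key.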
BIG NUMBERS
• Petabytes of data
• 1k+ node Hadoop cluster
• Multi-billion dollar merchandising business
• Lots of users and items
How should I use Map Reduce?
• Raw map reduce
• Pig
• Hive
• Cascading
• Scoobi
• Scalding
Decision Time
• “And every one that heareth these sayings of mine (great software engineers of the past), and doeth them not, shall be likened unto a foolish man, which built his house upon the sand.”
• “And the rain descended, and the floods came, and the winds blew, and beat upon that house; and it fell: and great was the fall of it.”
I believe!
• Scalding combines the best of Pig and Cascading
Good Pig

    A = LOAD 'input' AS (x, y, z);
    B = FILTER A BY x > 5;
    DUMP B;
    C = FOREACH B GENERATE y, z;
    STORE C INTO 'output';
// do joins and group by also
Bad Pig
DEFINE NV_terms `perl nv_terms2.pl` ship('$scripts/nv_terms2.pl');
i5 = stream i4 through NV_terms as (leafcat:chararray, name:chararray, name1:chararray);
i7 = foreach i5 generate leafcat, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name) as name, com.ebay.pigudf.sic.RtlUDF(0,0,0,'$site_id',name1) as name1;
Other Pig Issues
• Scheduling and DAG creation
Cascading Rocks!
• What is it?
• Supports large workflows and reusable components
  – DAG generation
  – Parallel Executions
Cascading code in Scala
val masterPipe = new FilterURLEncodedStrings(masterPipe, "sqr")
masterPipe = new FilterInappropriateQueries(masterPipe, "sqr")
masterPipe = new GroupBy(masterPipe, CFields("user_id", "epoch_ts", "sqr"), sortFields)
Someone should really code review this
Cascading Issues
This page intentionally left blank
Scalding Time

    import com.twitter.scalding._

    class WordCountJob(args : Args) extends Job(args) {
      TextLine( args("input") )
        .flatMap('line -> 'word) { line : String => tokenize(line) }
        .groupBy('word) { _.size }
        .write( Tsv( args("output") ) )

      // Split a piece of text into individual words.
      def tokenize(text : String) : Array[String] = {
        // Lowercase each word and remove punctuation.
        text.toLowerCase.replaceAll("[^a-zA-Z0-9\\s]", "").split("\\s+")
      }
    }
Scalding @ eBay
• Boilerplate reduction
• Extensibility
• New hires
Practical Scalding Use
• Pimp my pimp
• Code generated boilerplate
• Cascades
• Traps
• Testing!
    class eBayJob(args: Args) extends Job(args) with PipeBoilerPlate {
      implicit def pipe2eBayRichPipe(pipe: Pipe) = new eBayRichPipe(pipe)

      class eBayRichPipe(pipe: Pipe) extends RichPipe(pipe) with CommonFunctions

      trait CommonFunctions {
        import Dsl._
        import RichPipe.assignName
        def pipe: Pipe
        def reallyComplexFunction(field: Fields, param: Long) = {
          //mind blowing code here
        }
      }
    }
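The job above relies on the enrichment ("pimp my library") pattern: an implicit conversion wraps a `Pipe` so that domain-specific methods appear to be defined on it. A self-contained sketch of the same pattern on a plain `List`, with no Scalding dependency; `PurchaseSyntax`, `Purchase`, `RichPurchases`, `totalByUser`, and `atLeast` are made-up names for illustration only:

```scala
object PurchaseSyntax {
  final case class Purchase(userId: String, amount: Double)

  // Implicit class: the compiler wraps any List[Purchase] on demand,
  // so the extra methods read as if they were defined on List itself.
  implicit class RichPurchases(purchases: List[Purchase]) {
    // Sum of purchase amounts per user.
    def totalByUser: Map[String, Double] =
      purchases.groupBy(_.userId).map { case (user, ps) => user -> ps.map(_.amount).sum }

    // Keep only purchases of at least the given amount.
    def atLeast(min: Double): List[Purchase] =
      purchases.filter(_.amount >= min)
  }
}
```

After `import PurchaseSyntax._`, a call such as `purchases.atLeast(5.0).totalByUser` chains in the same fluent way as the pipe operations on the slides, which is the main readability payoff of the pattern.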
    CheckoutTransactionsPipe(/* default path logic */)
      .project(/* fields I need */)
      .countUserInteractions(/* params */)
      .doScoreCalculation(/* params */)
      .doConfidenceCalculation(/* params */)
Seems a bit too readable for Scala
Collaborative Filtering
• Typically hard to run on large datasets
Structured Data Importance
• Do people shop by brand?
[Bar chart: Handbags and Purses. Attributes on the horizontal axis: Bag Depth, Bag Height, Bag Length, Brand, Color, Country of Manufacture, Material, Shade, Size, Strap Drop, Style. Vertical axis from 0 to 1.2. Series: Supply.]
Markov Chains
• Investigation of buying patterns in ~50 lines of code

    val purchases = "firsttime" :: x.take(500).toList
    val pairs = purchases zip purchases.tail
    val grouped = pairs.groupBy(x => x._1.toString + "-" + x._2.toString)
    val sizes = (grouped map { x => x._1 -> x._2.size }).toList
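The snippet above counts consecutive purchase pairs; turning those counts into transition probabilities takes one more normalization step. A small sketch under assumptions: the `"firsttime"` start state comes from the slide, while `MarkovSketch`, `transitionProbabilities`, and the sample categories in the usage note are invented for illustration:

```scala
object MarkovSketch {
  // Estimates P(next | current) from one ordered purchase history.
  def transitionProbabilities(history: Seq[String]): Map[(String, String), Double] = {
    // Prepend the artificial start state used on the slide.
    val purchases = "firsttime" +: history
    val pairs = purchases.zip(purchases.tail)
    // Count each (current, next) transition.
    val counts: Map[(String, String), Int] =
      pairs.groupBy(identity).map { case (pair, occurrences) => pair -> occurrences.size }
    // Total transitions leaving each state, used to normalize the counts.
    val outgoing: Map[String, Int] =
      pairs.groupBy(_._1).map { case (state, ps) => state -> ps.size }
    counts.map { case ((from, to), n) => (from, to) -> n.toDouble / outgoing(from) }
  }
}
```

For a history such as `Seq("shoes", "bags", "shoes", "shoes")`, the state `shoes` has two outgoing transitions, so `shoes -> bags` and `shoes -> shoes` each get probability 0.5. Keying on a tuple instead of a concatenated string avoids ambiguity when a category name itself contains a hyphen.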
Mining Search Queries
• 20+ billion user queries - give me the top ones per user
[Diagram labels: De-Dupe, Rank, Validate, Sample Data]
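The de-dupe and rank steps can be illustrated on a small in-memory collection. This is only a sketch of the logic, not the Scalding job that would be needed at a scale of billions of queries; `TopQueries`, `QueryEvent`, and `topPerUser` are hypothetical names, and the tie-breaking rule is an assumption:

```scala
object TopQueries {
  final case class QueryEvent(userId: String, query: String, epochTs: Long)

  // De-dupe identical (user, query, timestamp) events, rank each user's
  // queries by frequency, and keep the top n per user.
  def topPerUser(events: Seq[QueryEvent], n: Int): Map[String, Seq[String]] =
    events.distinct
      .groupBy(_.userId)
      .map { case (user, userEvents) =>
        val ranked = userEvents
          .groupBy(_.query)
          .toSeq
          .map { case (query, hits) => (query, hits.size) }
          // Most frequent first; ties broken alphabetically for determinism.
          .sortBy { case (query, count) => (-count, query) }
        user -> ranked.take(n).map(_._1)
      }
}
```

In a distributed job the same shape would appear as a group by user followed by a sorted take within each group, so that no single machine has to hold every user's queries.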
Automation
[Diagram labels: Hadoop Proxy, Batch Database Load Machines, Cassandra, Jenkins, MySQL, Mongo]
Questions?
www.ebaynyc.com