Using Scalding for Data-Driven Product Development

Sasha Ovsankin, LinkedIn

Presented to Scala By The Bay, Aug 9, 2014

/summary

Data-Driven Product Development

Scalding = Hadoop + Scala

/data-driven

Your Amazing Service

Value, Data

/data-driven/linkedin

“Online” World

Web Applications

NoSQL Data Stores

“Offline” World (Hadoop)

Hadoop Jobs

Tracking/logging

Analytics

Data Products

Messaging

Message delivery

Databases

/linkedin/big-data/links

• “LinkedIn Big Data Ecosystem”
  – http://lnkd.in/big-data-ecosystem
• Grid Operations
  – http://lnkd.in/gridops2013

/scalding

http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:

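    import com.twitter.scalding._

    // Counts how often each whitespace-separated word occurs in the input.
    class WordCountJob(args: Args) extends Job(args) {
      TextLine(args("input"))        // reads each input line into the 'line field
        .flatMap('line -> 'word) { line: String => line.split("""\s+""") }
        .groupBy('word) { _.size }   // one output row per word, with its count
        .write(Tsv(args("output")))
    }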

• Succinct and powerful
• High level of abstraction
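To make the "similar to Scala collections" point concrete, here is a minimal sketch of the same word count written against an in-memory `Seq` using only the standard library. The object name `CollectionsWordCount` and the sample input are illustrative assumptions, not part of the talk; the extra `filter` only drops the empty token that `split` produces for lines with leading whitespace.

```scala
object CollectionsWordCount {
  // Same steps as the Scalding job, applied to an in-memory Seq:
  // split lines into words, group identical words, count each group.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(line => line.split("""\s+"""))
      .filter(_.nonEmpty)
      .groupBy(word => word)
      .map { case (word, occurrences) => word -> occurrences.size }

  def main(args: Array[String]): Unit =
    wordCount(Seq("to be or not to be", "to do"))
      .toSeq
      .sortBy(_._1)
      .foreach(println)
}
```

The shape of the pipeline (flatMap, groupBy, size) carries over to the Scalding job; what changes is the source, the sink, and the fact that Scalding plans the steps as Hadoop jobs.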

/data-driven/problem/scaling

• Problem: Scaling
• Solution
  – Distributed processing
  – High-level description of algorithms
  – Functional programming

…/solution/scalding

../problem/complexity

• Problem: Complexity
• Solution
  – Consistent way of organizing data
    • Self-describing data formats (Avro)
    • File organization
  – Type safety
  – Modularization
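The type-safety and modularization bullets can be sketched in plain Scala. This is an illustration under assumptions, not LinkedIn's code: `PageView`, its fields, and the stage names are hypothetical, and a case class stands in for a record whose shape would in practice come from an Avro schema.

```scala
object TypedPipeline {
  // Hypothetical record type; stands in for a schema-described Avro record.
  final case class PageView(memberId: Long, page: String)

  // Each stage is a small function that can be tested and reused on its own.
  def onlyPage(page: String)(views: Seq[PageView]): Seq[PageView] =
    views.filter(_.page == page)

  def viewsPerMember(views: Seq[PageView]): Map[Long, Int] =
    views.groupBy(_.memberId).map { case (id, vs) => id -> vs.size }

  // Stages combine into a larger flow; a misspelled field name or a
  // mismatched stage type is rejected by the compiler, not at run time.
  def profileViewsPerMember(views: Seq[PageView]): Map[Long, Int] =
    viewsPerMember(onlyPage("profile")(views))
}
```

Compared with passing untyped field names between steps, mistakes in this style surface before a job is ever submitted to the cluster.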

…/solution/scalding

/linkedin/hadoop/practices

• All online data end up in HDFS
  – Avro encoding is standard
• Production Process
  – CI/Automatic Build
    • More info forthcoming
  – Production Review
  – Operations and Monitoring
    • More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
  • More info at http://lnkd.in/big-data-ecosystem

../solution/scala/killer-argument

• Map & reduce -- primitives

    scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
    res20: Int = 333833500
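A self-contained version of the one-liner, as a sketch: the REPL session reports an `Int`, so this assumes `pow` was an Int-valued helper, replaced here by a hypothetical `square`. The second method is an added illustration of why these primitives suit a cluster: because `+` is associative, each chunk can be reduced independently and the partial results combined. The names `MapReducePrimitives`, `square`, and `chunkSize` are not from the talk.

```scala
object MapReducePrimitives {
  // Assumed Int-valued stand-in for the slide's `pow(_, 2)`.
  def square(x: Int): Int = x * x

  // The slide's expression: map every element, then reduce with +.
  // `reduce` requires a non-empty input.
  def sumOfSquares(xs: Seq[Int]): Int =
    xs.map(square).reduce(_ + _)

  // Same result computed chunk by chunk, the way independent workers would:
  // reduce each chunk on its own, then reduce the partial sums.
  // Requires a non-empty input and chunkSize > 0.
  def sumOfSquaresPartitioned(xs: Seq[Int], chunkSize: Int): Int =
    xs.grouped(chunkSize)
      .map(chunk => chunk.map(square).reduce(_ + _))
      .reduce(_ + _)
}
```

For example, `MapReducePrimitives.sumOfSquares(1 to 1000)` and `MapReducePrimitives.sumOfSquaresPartitioned(1 to 1000, 128)` both evaluate to 333833500, matching the REPL result on the slide.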

/linkedin/scalding/status

• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
  – Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
  – Scala[Scalding] < Scala[ _ ]

/summary

Data-Driven Product Development

/linkedin/join-us

• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
  – http://linkedin.com/careers

Questions?
