Using Scalding for Data-Driven Product Development at LinkedIn

Post on 27-Jan-2015

Category:

Technology

DESCRIPTION

My talk at the Scala By The Bay conference: http://www.scalabythebay.org/schedule.html

TRANSCRIPT

Using Scalding for Data-Driven Product Development

Sasha Ovsankin, LinkedIn

Presented to Scala By The Bay, Aug 9, 2014

/summary

Data-Driven Product Development

Scalding = Hadoop + Scala

/data-driven

Your Amazing Service

Value Data

/data-driven/linkedin

[Diagram: “Online” World (Web Applications, NoSQL Data Stores); ETL; “Offline” World (Hadoop) (HDFS, Hadoop Jobs). Remaining labels: Tracking/logging, Analytics, Data Products, Messaging, Message delivery, Databases.]

/linkedin/big-data/links

• “LinkedIn Big Data Ecosystem”: http://lnkd.in/big-data-ecosystem

• Grid Operations: http://lnkd.in/gridops2013

/scalding

http://github.com/twitter/scalding
• Scala-based DSL for Map/Reduce jobs
• Built on Cascading, a stable and mature Hadoop framework
• Uses an API similar to Scala collections:

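import com.twitter.scalding._

class WordCountJob(args : Args) extends Job(args) {
  TextLine( args("input") )
    // split each line on whitespace, emitting one 'word per token
    .flatMap('line -> 'word) { line : String => line.split("""\s+""") }
    // count the occurrences of each word
    .groupBy('word) { _.size }
    .write( Tsv( args("output") ) )
}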

• Succinct and powerful
• High level of abstraction
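To illustrate the "API similar to Scala collections" point, here is a minimal sketch of the same word count written against an ordinary in-memory Seq, with no Hadoop or Scalding involved. The object and method names (LocalWordCount, wordCount) are made up for this illustration and do not come from the talk.

```scala
object LocalWordCount {
  // Same shape as the Scalding job: flatMap lines into words, group, count.
  // Runs on a plain Scala collection instead of a Hadoop pipe.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.split("""\s+"""))   // one element per whitespace-separated token
      .filter(_.nonEmpty)            // drop empty tokens from leading whitespace
      .groupBy(identity)             // word -> all occurrences of that word
      .map { case (word, occurrences) => word -> occurrences.size }
}
```

For example, LocalWordCount.wordCount(Seq("a b a", "b c")) gives a count of 2 for "a", 2 for "b" and 1 for "c". The flatMap and groupBy steps line up one-to-one with the Scalding job above.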

/data-driven/problem/scaling

• Problem: Scaling
• Solution
  – Distributed processing
  – High-level description of algorithms
  – Functional programming

…/solution/scalding

../problem/complexity

• Problem: Complexity
• Solution
  – Consistent way of organizing data
    • Self-describing data formats (Avro)
    • File organization
  – Type safety
  – Modularization
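As a rough sketch of what type safety and modularization can look like in Scala, the snippet below models a record as a case class and puts one small aggregation in a reusable function. Everything here is hypothetical: the PageView type, its fields, and viewsPerPage are invented for illustration and are not taken from the talk or from LinkedIn's schemas.

```scala
object TypedRecords {
  // Hypothetical record type, standing in for a class generated from an Avro schema.
  final case class PageView(memberId: Long, pageKey: String)

  // A small reusable module: counts views per page key.
  def viewsPerPage(views: Seq[PageView]): Map[String, Int] =
    views.groupBy(_.pageKey).map { case (key, group) => key -> group.size }
}
```

With typed records, a misspelled field such as `_.pageKye` is rejected by the compiler instead of failing at runtime on the cluster.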

…/solution/scalding

/linkedin/hadoop/practices

• All online data end up in HDFS
  – Avro encoding is standard
• Production Process
  – CI/Automatic Build
    • More info forthcoming
  – Production Review
  – Operations and Monitoring
    • More info at http://lnkd.in/gridops2013
• Result: Thousands of jobs running in production
  • More info at http://lnkd.in/big-data-ecosystem

../solution/scala/killer-argument

• Map & reduce -- primitives
scala> (1 to 1000) map { pow(_,2) } reduce { _ + _ }
res20: Int = 333833500
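The slide does not show how pow is defined. Note that scala.math.pow returns a Double, so an Int result implies some Int-valued pow was in scope. The sketch below sidesteps that by squaring with x * x, and cross-checks the result against the closed form n(n+1)(2n+1)/6. The names SumOfSquares, viaMapReduce and closedForm are made up for this illustration.

```scala
object SumOfSquares {
  // Map each number to its square, then reduce by addition.
  // Stays within Int range for n = 1000; larger n would need Long.
  def viaMapReduce(n: Int): Int =
    (1 to n).map(x => x * x).reduce(_ + _)

  // Closed form for 1^2 + 2^2 + ... + n^2, computed in Long to avoid overflow.
  def closedForm(n: Int): Long =
    n.toLong * (n + 1) * (2L * n + 1) / 6
}
```

For n = 1000 both functions give 333833500, matching the REPL output on the slide.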

/linkedin/scalding/status

• Started >1 year ago
• Thousands of production LOC written in Scalding by our team
  – Pretty happy with readability, maintainability and tooling support
• Dozens of flows are currently in production, and counting
• Created Scalding user group
• Growing interest
• Learning:
  – Scala[Scalding] < Scala[ _ ]

/summary

Data-Driven Product Development

Scalding = Hadoop + Scala

/linkedin/join-us

• Work on unique and interesting problems
• Be part of a great engineering community
• Use the latest tools and technologies
• Help connect the world’s professionals to help them become more productive and successful
• We are looking for amazing people interested in Software Engineering and Data Science
  – http://linkedin.com/careers

Questions?
