Harel Ben Attia

Senior Software Engineer

River

A data workflow management system

– Tens of Billions of Recommendations per month
– Most major publishers in the World
– Hundreds of GBs of new data every day

Context

• Data Processing Workflows

• Multiple Types of Processing
– Rollups, Grouping, Filtering, Algorithm Calculations

• Multiple Stages of Processing
– Using the output of other processes as input

Problems

• Dependency “Management”
– Hardcoded into code/scripts
– Time-based using cron or another scheduler

• Logic is scattered around the system
– Developers need to take care of monitoring, alerts, permissions etc.
– Multiple Locations of Execution

Data Processing Management Infrastructure

• Execution Management
– Full Execution History and Filtering
– Monitoring and Actionable Alerting
– Automatic Retries
– Web UI

• Ease of Development
– Declarative Data Processing Definitions
– Decentralized

• Shared Data, separate development
– JobLogs

• Data Driven Dependencies
– Why?

Ops / NOC

Developers

Other Approaches

[Diagram: two ways of chaining jobs A, B, C and J, labeled Option 1 and Option 2; the Option 2 drawing carries a label t]

D Fails
D sends email

Developer of D still works here

Where is the code?

Other Approaches

2am is a great hour for troubleshooting!

Data from C is missing…

C = The data of C is all there!

Other Approaches

A B CD …

X:37 seems like a good time… C never finished after X:30 anyway

Job J has been working for more than a week before the incident

Other Approaches

Need to rerun processes B, C and D

• Without running A again?
• Without colliding with ongoing executions?
• Which hours failed?
• How to run all of them for the specific hours?

Other Approaches

“A will never take more than 15 minutes, so X:20 is more than enough”

A WILL eventually take longer

Other Approaches

• Execution Management
– Full Execution History + Filtering and Searching
– Monitoring and Actionable Alerting
– Automatic Retries
– Web UI
– JobLogs

• Ease of Development
– Declarative Data Processing Definitions
– Decentralized

• Shared Data, separate development

• Data Driven Dependencies
– Why? Robustness, Reliability, Parallelism

What? When?

Where? How?

Execution Layer – the “What”

• Importing from MySQL to Hive
• Hive Queries
• JDBC Queries
• Transfer data from Hive into MySQL and to Cassandra
• Running External Commands: MapReduce, Java, bash, Legacy code, etc.

Every data processing task is called a Job

A Job can contain multiple Steps

Jobs use Parameters
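The Job / Step / Parameter vocabulary above can be pictured with a minimal sketch. It is not River's actual data model: the class names, the `hourlyRollup` job, and the two step names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, str]


@dataclass
class Step:
    """One unit of work inside a job (hypothetical shape, not River's API)."""
    name: str
    action: Callable[[Params], None]


@dataclass
class Job:
    """A data processing task made of ordered steps that share parameters."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, params: Params) -> List[str]:
        completed = []
        for step in self.steps:
            step.action(params)  # every step sees the same job parameters
            completed.append(step.name)
        return completed


log: List[str] = []
job = Job("hourlyRollup", [
    Step("import", lambda p: log.append(f"import {p['date']} {p['hour']}")),
    Step("aggregate", lambda p: log.append(f"aggregate {p['date']} {p['hour']}")),
])
done = job.run({"date": "2012-10-10", "hour": "15"})
```

Here the parameters identify which slice of data the job handles, so the same job definition can be run for any date and hour.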

Scheduling Layer – the “When”

Events that describe Data Availability

Each job registers to an event, which will trigger its execution

Each job emits an event at job completion

Events that are time dependent
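A small sketch of the event-driven idea described above: jobs register to events, and each job emits an event when it completes. The `ToyScheduler` class and the event names are assumptions made for this example and do not describe River's internals.

```python
from collections import defaultdict, deque


class ToyScheduler:
    """Hypothetical in-memory scheduler: events trigger jobs, jobs emit events."""

    def __init__(self):
        # event name -> list of (job name, event emitted on completion or None)
        self.subscribers = defaultdict(list)
        self.queue = deque()
        self.history = []

    def register(self, event, job_name, emits=None):
        self.subscribers[event].append((job_name, emits))

    def fire(self, event):
        self.queue.append(event)

    def run(self):
        while self.queue:
            event = self.queue.popleft()
            for job_name, emits in self.subscribers[event]:
                self.history.append(job_name)  # stand-in for real execution
                if emits is not None:
                    self.queue.append(emits)   # completion makes new data available
        return self.history


scheduler = ToyScheduler()
scheduler.register("rawDataAvailable", "A", emits="aDone")
scheduler.register("aDone", "B", emits="bDone")
scheduler.register("bDone", "C")
scheduler.fire("rawDataAvailable")
order = scheduler.run()
```

Because B starts only when A's completion event arrives, no guess about how long A takes is encoded anywhere.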

The “How” and the “Where”

Both handled by the infrastructure

• Integration to other systems
• Connecting to Hive/Hadoop/Cassandra
• Connecting to JDBC Databases
• Retries, throttling, timeouts
• Logical names to all data sources, e.g. “readOnlyDataWarehouse”, “productionCassandra”
• Monitoring and Alerts: centralized management, email notifications and dashboards
• Location of Execution: actual location is hidden from the developer/ops
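One way to picture logical data source names is a lookup table owned by the infrastructure, so job definitions never contain physical addresses. The sketch below assumes such a table; the connection details, the `kind` field, and the `resolve` helper are all invented for illustration.

```python
# Hypothetical registry kept by the infrastructure, not by job authors.
# The URLs and hosts are placeholders, not real endpoints.
DATA_SOURCES = {
    "readOnlyDataWarehouse": {
        "kind": "jdbc",
        "url": "jdbc:mysql://dwh.example.invalid/reports",
    },
    "productionCassandra": {
        "kind": "cassandra",
        "hosts": ["cassandra.example.invalid"],
    },
}


def resolve(logical_name):
    """Translate a logical name into connection details."""
    try:
        return DATA_SOURCES[logical_name]
    except KeyError:
        raise LookupError(f"unknown data source: {logical_name}") from None


target = resolve("productionCassandra")
```

Moving a database then means changing one registry entry, not every job that reads from it.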

River UI

Restart Job
Fail Job and Dependents
Download JobLog

Monitoring Dashboard

Steps only contain what needs to be done

sourceDB = “productionDatabase”
sourceTable = “myRawData”
targetCluster = “onlineHadoopCluster”
targetHiveTable = “rawDataTable”
Filter = “date=#handledDate#”

Copy Data From JDBC to Hive
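The step definition above uses a `#handledDate#` placeholder. A minimal sketch of how such placeholders could be filled from job parameters follows; the `substitute` helper and the dictionary form of the step are assumptions, and River's real definition syntax and resolution rules may differ.

```python
import re


def substitute(template, params):
    """Replace every #name# placeholder with the matching parameter value."""
    def replace(match):
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return params[name]
    return re.sub(r"#(\w+)#", replace, template)


# The step from the slide, written as a plain dictionary for this example.
step = {
    "sourceDB": "productionDatabase",
    "sourceTable": "myRawData",
    "targetCluster": "onlineHadoopCluster",
    "targetHiveTable": "rawDataTable",
    "Filter": "date=#handledDate#",
}
resolved = {key: substitute(value, {"handledDate": "2012-10-10"})
            for key, value in step.items()}
```

Failing loudly on a missing parameter is a design choice of this sketch; it keeps a step from silently copying data with an unresolved filter.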

A bit more about triggers

Triggers have parameters as well

Date=2012-10-10,hour=15
Date=2012-10-10,hour=19

Parameters Propagate through jobs and to other triggers
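Parameter propagation can be sketched as a completion event that carries the parameters of the trigger that started the job. The `Trigger` type, the helper functions, and the trigger names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trigger:
    """An event plus the parameters describing which data became available."""
    name: str
    params: tuple  # sorted (key, value) pairs, so triggers are hashable


def make_trigger(name, **params):
    return Trigger(name, tuple(sorted(params.items())))


def complete_job(incoming, emits):
    """Return the trigger a job emits on completion, with the same parameters."""
    return Trigger(emits, incoming.params)


t15 = make_trigger("rawDataLoaded", Date="2012-10-10", hour="15")
t19 = make_trigger("rawDataLoaded", Date="2012-10-10", hour="19")
done15 = complete_job(t15, "hourlyRollupDone")
done19 = complete_job(t19, "hourlyRollupDone")
```

Each hour flows through the chain as its own trigger, so hour 15 and hour 19 can be processed, retried, or rerun independently.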

Developer’s Point-of-View

Automatic Retries

Parameters Pass-through

Topology

[Diagram: Trigger Manager, Trigger Queue, Execution Queue, Execution Manager, Spring Batch and Spring Batch DB, with interfaces to External Systems: Hive/Hadoop Interface, OS Interface, Cassandra Interface, JDBC Interface]

Dependencies for detailed example


[Diagram: trigger T1 (Date=2012-01-02, hour=03) with Job1 and Job2; trigger T3 with Job3]

Success Example

[Diagram: Job1 and Job2 run with Date=2012-01-02, hour=03; T3 (Date=2012-01-02, hour=03) is emitted (from Job1) and (from Job2)]


Failure Example

[Diagram: Job2 (Date=2012-01-02, hour=03); T3 (Date=2012-01-02, hour=03)]

Notable Features

• Parameter Enrichment
– Example: #beginningOfMonth

• Precondition Expressions
– Example: isLastDayOfMonth(#handleDate)

• Data Comparison Capabilities
– Data Validations
– Supports Tolerance: Absolute and Percentage margins

• Command Line and Java Clients
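The enrichment, precondition, and tolerance features listed above can be illustrated with plain functions. This is a sketch under assumptions: the Python names mirror the slide's `#beginningOfMonth` and `isLastDayOfMonth` examples, but the signatures and the way the two tolerance margins combine are guesses, not River's documented behavior.

```python
import calendar
from datetime import date


def beginning_of_month(d):
    """Parameter enrichment: derive the first day of the month from a date."""
    return d.replace(day=1)


def is_last_day_of_month(d):
    """Precondition: true only on the final day of the month."""
    return d.day == calendar.monthrange(d.year, d.month)[1]


def within_tolerance(expected, actual, absolute=0.0, percentage=0.0):
    """Data comparison: pass if the difference fits either margin (assumed rule)."""
    diff = abs(expected - actual)
    return diff <= absolute or diff <= abs(expected) * percentage / 100.0


handled = date(2012, 10, 10)
month_start = beginning_of_month(handled)
run_monthly = is_last_day_of_month(handled)
rows_match = within_tolerance(1000, 1015, percentage=2)
```

A validation step could use a check like `within_tolerance` to compare row counts between a source table and its copy before emitting a completion event.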

River at Outbrain

• 6 River Instances Running
• 5 Teams
• ~4100 Jobs running every day
• ~50 Different Job Types

• Job Failures due to environment issues have almost no overhead

• Automatic restarts of jobs when data arrives late

Future Plans

• Multiple Dependencies
• Offline Job Testing Capabilities
• Improved DSL for Job Definitions
• Support for Master/Worker River machines
• Job Priorities
• Analysis Tools

Outbrain is working on Open Sourcing River

Illustration by Chris Whetzel

Questions

Thank You

Harel Ben Attia
harel@outbrain.com
@harelba on Twitter

http://www.linkedin.com/in/harelba
