Thank you. Harel Ben Attia, Senior Software Engineer. River: A data workflow management system
![Page 1: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/1.jpg)
Thank you
![Page 2: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/2.jpg)
Harel Ben Attia
Senior Software Engineer
River
A data workflow management system
![Page 3: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/3.jpg)
– Tens of Billions of Recommendations per month
– Most major publishers in the World
– Hundreds of GBs of new data every day
![Page 4: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/4.jpg)
Context
• Data Processing Workflows
• Multiple Types of Processing
– Rollups, Grouping, Filtering, Algorithm Calculations
• Multiple Stages of Processing
– Using the output of other processes as input
![Page 5: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/5.jpg)
Problems
• Dependency “Management”
– Hardcoded into code/scripts
– Time-based using cron or another scheduler
• Logic is scattered around the system
– Developers need to take care of monitoring, alerts, permissions etc.
– Multiple Locations of Execution
![Page 6: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/6.jpg)
River
Data Processing Management Infrastructure
![Page 7: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/7.jpg)
River
• Execution Management
– Full Execution History and Filtering
– Monitoring and Actionable Alerting
– Automatic Retries
– Web UI
– JobLogs
• Ease of Development
– Declarative Data Processing Definitions
– Decentralized: shared data, separate development
• Data Driven Dependencies
– Why?
Ops / NOC
Developers
![Page 8: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/8.jpg)
[Diagram: jobs A, B and C, with a new job J placed on a time axis t; two placements are labelled Option 1 and Option 2]
Other Approaches
![Page 9: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/9.jpg)
[Diagram: Option 2, with jobs A, B, C and J on a time axis t]
Other Approaches
![Page 10: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/10.jpg)
D fails
D sends email
Developer of D still works here
Where is the code?
Other Approaches
![Page 11: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/11.jpg)
2am is a great hour for troubleshooting!
D = Data from C is missing…
C = The data of C is all there!
Other Approaches
![Page 12: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/12.jpg)
[Diagram: jobs A, B, C, D … on a time axis t]
X:37 seems like a good time… C never finished after X:30 anyway
Job J has been working for more than a week before the incident
Other Approaches
![Page 13: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/13.jpg)
Need to rerun processes B, C and D
• Without running A again?
• Without colliding with ongoing executions?
• Which hours failed?
• How to run all of them for the specific hours?
Other Approaches
![Page 14: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/14.jpg)
[Diagram: jobs A and J on a time axis t starting at X:00]
“A will never take more than 15 minutes, so X:20 is more than enough”
A WILL eventually take longer
Other Approaches
![Page 15: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/15.jpg)
River
• Execution Management
– Full Execution History + Filtering and Searching
– Monitoring and Actionable Alerting
– Automatic Retries
– Web UI
– JobLogs
• Ease of Development
– Declarative Data Processing Definitions
– Decentralized: shared data, separate development
• Data Driven Dependencies
– Why? Robustness, Reliability, Parallelism
![Page 16: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/16.jpg)
River
What? When?
Where? How?
![Page 17: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/17.jpg)
Execution Layer – the “What”
• Importing from MySQL to Hive
• Hive Queries
• JDBC Queries
• Transfer data from Hive into MySQL and to Cassandra
• Running External Commands: MapReduce, Java, bash, Legacy code, etc.
Every data processing task is called a Job
A Job can contain multiple Steps
Jobs use Parameters
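A minimal sketch of the Job / Step / Parameters relationship described on this slide. The class shapes, the job name `importRawData` and the step bodies are assumptions made for illustration; the slides do not show River's actual definitions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Params = Dict[str, str]


@dataclass
class Step:
    """One unit of work inside a job (hypothetical shape)."""
    name: str
    action: Callable[[Params], str]


@dataclass
class Job:
    """A data processing task made of ordered steps."""
    name: str
    steps: List[Step] = field(default_factory=list)

    def run(self, params: Params) -> List[str]:
        # Every step receives the same job parameters.
        return [step.action(params) for step in self.steps]


# Hypothetical job with two steps; real steps would call Hive, JDBC, etc.
import_job = Job(
    name="importRawData",
    steps=[
        Step("import", lambda p: f"import MySQL -> Hive for {p['date']}"),
        Step("rollup", lambda p: f"Hive rollup query for {p['date']}"),
    ],
)

results = import_job.run({"date": "2012-10-10"})
```

Here `results` holds one entry per step, in step order, each built from the same `date` parameter.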
![Page 18: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/18.jpg)
Scheduling Layer – the “When”
Events that describe Data Availability
Each job registers to an event, which will trigger its execution
Each job emits an event at job completion
Events that are time dependent
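The register-and-emit cycle above can be sketched as a small in-memory dispatcher. This is an illustration under assumptions, not River's implementation: the `TriggerManager` class here is a toy, and the event and job names (`rawDataArrived`, `rollupJob`, and so on) are hypothetical. A time-dependent event would simply be fired by a timer instead of by a job.

```python
from collections import deque
from typing import Callable, Deque, Dict, List, NamedTuple, Tuple

Params = Dict[str, str]


class Registration(NamedTuple):
    job_name: str
    work: Callable[[Params], None]
    emits: str  # event fired when the job completes


class TriggerManager:
    """Toy dispatcher: events wake up the jobs registered to them."""

    def __init__(self) -> None:
        self._registrations: Dict[str, List[Registration]] = {}
        self._pending: Deque[Tuple[str, Params]] = deque()
        self.history: List[str] = []

    def register(self, event: str, job_name: str,
                 work: Callable[[Params], None], emits: str) -> None:
        self._registrations.setdefault(event, []).append(
            Registration(job_name, work, emits))

    def fire(self, event: str, params: Params) -> None:
        self._pending.append((event, params))

    def run(self) -> None:
        while self._pending:
            event, params = self._pending.popleft()
            for reg in self._registrations.get(event, []):
                reg.work(params)
                self.history.append(reg.job_name)
                # Completion of a job is itself an event.
                self.fire(reg.emits, params)


manager = TriggerManager()
manager.register("rawDataArrived", "rollupJob", lambda p: None, emits="rollupReady")
manager.register("rollupReady", "exportJob", lambda p: None, emits="exportDone")
manager.fire("rawDataArrived", {"date": "2012-10-10", "hour": "15"})
manager.run()
```

No job mentions another job by name: `exportJob` only knows that it needs `rollupReady`, which is what makes the dependency data driven rather than time driven.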
![Page 19: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/19.jpg)
The “How” and the “Where”
Both handled by the infrastructure
• Integration to other systems
– Connecting to Hive/Hadoop/Cassandra
– Connecting to JDBC Databases
– Retries, throttling, timeouts
– Logical names to all data sources: “readOnlyDataWarehouse”, “productionCassandra”
• Monitoring and Alerts
– Centralized Management, email notifications and dashboards
• Location of Execution
– Actual location is hidden from the developer/ops
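One way to picture logical data source names is a registry owned by the infrastructure, which jobs query by name only. The two logical names come from the slide; the `DataSource` shape, the `resolve` helper and the addresses are made up for this sketch.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class DataSource:
    kind: str
    url: str


# Hypothetical physical addresses, maintained centrally by ops.
REGISTRY: Dict[str, DataSource] = {
    "readOnlyDataWarehouse": DataSource(
        "hive", "jdbc:hive2://dwh.example.internal:10000/default"),
    "productionCassandra": DataSource(
        "cassandra", "cassandra.example.internal:9042"),
}


def resolve(logical_name: str) -> DataSource:
    """Map a logical name used in a job definition to a physical source."""
    try:
        return REGISTRY[logical_name]
    except KeyError:
        raise LookupError(f"unknown data source: {logical_name}") from None


warehouse = resolve("readOnlyDataWarehouse")
```

Moving the warehouse to another cluster then means changing one registry entry, with no change to any job definition.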
![Page 20: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/20.jpg)
River UI
Restart Job
Fail Job and Dependents
Download JobLog
![Page 21: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/21.jpg)
Monitoring Dashboard
![Page 22: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/22.jpg)
Monitoring Dashboard
![Page 23: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/23.jpg)
Steps
Steps only contain what needs to be done
sourceDB = “productionDatabase”
sourceTable = “myRawData”
targetCluster = “onlineHadoopCluster”
targetHiveTable = “rawDataTable”
Filter = “date=#handledDate#”
Copy Data From JDBC to Hive
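The `#handledDate#` placeholder in the step above suggests that parameter values are substituted into the step definition before it runs. A minimal sketch of such a substitution, assuming placeholders always have the form `#name#`; the `render` function is hypothetical and not part of River.

```python
import re
from typing import Dict

# Assumed placeholder syntax: a parameter name wrapped in '#' characters.
PLACEHOLDER = re.compile(r"#(\w+)#")


def render(template: str, params: Dict[str, str]) -> str:
    """Replace every #name# placeholder with the matching parameter value."""
    def replace(match: "re.Match[str]") -> str:
        name = match.group(1)
        if name not in params:
            raise KeyError(f"missing parameter: {name}")
        return params[name]

    return PLACEHOLDER.sub(replace, template)


step = {
    "sourceDB": "productionDatabase",
    "sourceTable": "myRawData",
    "targetCluster": "onlineHadoopCluster",
    "targetHiveTable": "rawDataTable",
    "Filter": "date=#handledDate#",
}

rendered = {key: render(value, {"handledDate": "2012-10-10"})
            for key, value in step.items()}
```

After rendering, `rendered["Filter"]` is `date=2012-10-10`, and values without placeholders are left unchanged.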
![Page 24: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/24.jpg)
A bit more about triggers
Triggers have parameters as well
Date=2012-10-10,hour=15 Date=2012-10-10,hour=19
Parameters Propagate through jobs and to other triggers
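As a rough illustration of propagation, the sketch below parses a trigger's parameters from the `key=value,key=value` form shown on the slide and carries them forward to the trigger a job emits. Both helper functions and the extra `source` value are hypothetical.

```python
from typing import Dict, Optional


def parse_trigger_params(text: str) -> Dict[str, str]:
    """Parse 'Date=2012-10-10,hour=15' into a dictionary."""
    params: Dict[str, str] = {}
    for pair in text.split(","):
        key, _, value = pair.partition("=")
        params[key.strip()] = value.strip()
    return params


def propagate(incoming: Dict[str, str],
              produced: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Parameters for the emitted trigger: incoming ones plus any the job adds."""
    outgoing = dict(incoming)  # copy, so the incoming trigger is not modified
    outgoing.update(produced or {})
    return outgoing


incoming = parse_trigger_params("Date=2012-10-10,hour=15")
outgoing = propagate(incoming, {"source": "job1"})  # 'source' is a made-up extra
```

A downstream job triggered by `outgoing` still sees `Date` and `hour`, so it processes exactly the hour that the upstream job finished.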
![Page 25: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/25.jpg)
Developer’s Point-of-View
Automatic Retries
Parameters Pass-through
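Automatic retries can be pictured as a wrapper that the infrastructure puts around every job, so that developers do not write retry loops themselves. This is a generic sketch with assumed defaults (three attempts, no delay), not River's actual retry policy.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def run_with_retries(task: Callable[[], T], max_attempts: int = 3,
                     delay_seconds: float = 0.0) -> T:
    """Run task, retrying on any exception up to max_attempts times."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return task()
        except Exception as error:
            last_error = error
            if attempt < max_attempts:
                time.sleep(delay_seconds)  # wait before the next attempt
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


attempts = {"count": 0}


def flaky_import() -> str:
    """Hypothetical job that fails twice (environment issue), then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("cluster unavailable")
    return "imported"


outcome = run_with_retries(flaky_import, max_attempts=3)
```

The job code itself (`flaky_import`) contains no retry logic; the wrapper alone decides how many attempts are made.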
![Page 26: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/26.jpg)
River Topology
[Diagram components: External Systems, Trigger Queue, Trigger Manager, Execution Queue, Execution Manager, Spring Batch, Spring Batch DB, Hive/Hadoop Interface, OS Interface, Cassandra Interface, JDBC Interface]
![Page 27: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/27.jpg)
Dependencies for detailed example
![Page 28: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/28.jpg)
River Topology: Success Example
[Diagram on the River topology. Labels: T1 (Date=2012-01-02, hour=03); Job1, Job2; T2 (from Job1), T2 (from Job2); Job3; T3 (Date=2012-01-02, hour=03)]
![Page 29: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/29.jpg)
River Topology: Failure Example
[Diagram on the River topology. Labels: Job2 (Date=2012-01-02, hour=03); T3 (Date=2012-01-02, hour=03); Job3; UI]
![Page 30: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/30.jpg)
Notable Features
• Parameter Enrichment
– Example: #beginningOfMonth
• Precondition Expressions
– Example: isLastDayOfMonth(#handleDate)
• Data Comparison Capabilities
– Data Validations
– Supports Tolerance: Absolute and Percentage margins
• Command Line and Java Clients
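The first three features can be sketched as plain functions. The names mirror the slide's examples, but their exact semantics in River are not shown, so everything here is an assumption; in particular, treating a comparison as valid when it falls within either the absolute or the percentage margin is a choice made for this sketch.

```python
import calendar
from datetime import date


def beginning_of_month(day: date) -> date:
    """Parameter enrichment: derive the first day of the month."""
    return day.replace(day=1)


def is_last_day_of_month(day: date) -> bool:
    """Precondition: true only on the final day of the month."""
    last_day = calendar.monthrange(day.year, day.month)[1]
    return day.day == last_day


def within_tolerance(expected: float, actual: float,
                     absolute: float = 0.0, percentage: float = 0.0) -> bool:
    """Data comparison: accept a difference up to the larger allowed margin."""
    allowed = max(absolute, abs(expected) * percentage / 100.0)
    return abs(expected - actual) <= allowed


handled = date(2012, 10, 10)
month_start = beginning_of_month(handled)
run_monthly_job = is_last_day_of_month(handled)
counts_match = within_tolerance(expected=1000, actual=1004, absolute=5)
```

A monthly job could use `is_last_day_of_month` as its precondition and `beginning_of_month` to compute the range it aggregates, while `within_tolerance` would compare row counts between a source and a target table.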
![Page 31: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/31.jpg)
River at Outbrain
• 6 River Instances Running
• 5 Teams
• ~4100 Jobs running every day
• ~50 Different Job Types
• Job Failures due to environment issues have almost no overhead
• Automatic restarts of jobs when data arrives late
![Page 32: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/32.jpg)
Future Plans
• Multiple Dependencies
• Offline Job Testing Capabilities
• Improved DSL for Job Definitions
• Support for Master/Worker River machines
• Job Priorities
• Analysis Tools
Outbrain is working on Open Sourcing River
Illustration by Chris Whetzel
![Page 33: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/33.jpg)
Questions
![Page 34: Thank you. Harel Ben Attia Senior Software Engineer River A data workflow management system](https://reader036.vdocument.in/reader036/viewer/2022070403/56649f2a5503460f94c44427/html5/thumbnails/34.jpg)
Thank You
@harelba on Twitter
Harel Ben Attia
http://www.linkedin.com/in/harelba