Building and managing complex dependencies pipeline using Apache Oozie
Purshotam Shah ([email protected])
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer
Agenda
1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work
Why Oozie?
• Out-of-box support for multiple job types
  – Java, shell, distcp
  – MapReduce (Pipes, streaming)
  – Pig, Hive, Spark
• Highly scalable
• High availability
  – Hot-Hot with rolling upgrades
  – Load balanced
• Hue integration
[Diagram: Oozie and the components it works with: HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]
Deployment Architecture at Yahoo
[Diagram labels]
• Load Balancer: submit request, request redirection
• Oozie Server 1 and Oozie Server 2: inter-server communication for log streaming, sharelib update, etc.
• ZooKeeper (Curator): lock management
• Oracle RAC
• Hadoop Cluster, HBase, HCatalog
• Security labels on the links: https + kerberos / cookie-based auth; https + kerberos; kerberos
Scale at Yahoo
• Deployed on all clusters (production, non-production)
  – One instance per cluster
• 75 products / 2000+ projects
  – 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  – Between 1 and 8 actions; avg. 4 actions/workflow
  – Extreme use case: submit 100-200 workflow jobs per min
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  – Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
  – 99% of workflow jobs are kicked off from coordinators
97 bundle jobs daily (June 2016, one busy cluster)
Data Pipelines
• Advertisement: Ad Exchange, Ad Latency, Search Advertising
• Content: Content Management, Content Optimization, Content Personalization, Flickr Video
• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting
Data Pipelines
• Email: Anti Spam, Content, Retargeting
• Data Intelligence: Research, Dashboards & Reports, Forecasting
• Data Management: Audience Pipeline
Use Case - Data pipeline
Large Scale Data Pipeline Requirements
• Administrative
  – One should be able to start, stop, and pause all related pipelines, or part of them, at the same time
• Dependency Management
  – BCP support
  – Data is not guaranteed; start processing even if only partial data is available
  – Mandatory and optional feeds
Large Scale Data Pipeline Requirements
• Multiple Providers
  – If data is available from multiple providers, I want to specify the provider priority
  – Combining datasets from multiple providers
• SLA Management
  – Monitor pipeline processing to take immediate action in case of failures or SLA misses
  – Pipeline owners should get notified if an SLA is missed
Bundle
• The Bundle system allows the user to define and execute a loosely coupled set of coordinators. The coordinators depend on each other, but the dependency is enforced only via their inputs and outputs.
• A bundle can be used to start, stop, suspend, resume, or rerun the whole pipeline.
Complex dependencies
OOZIE-1976 : Specifying coordinator input datasets in more logical ways
BCP Support
Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or dataset B is available.
<input-logic>
    <or name="AorB">
        <data-in dataset="A"/>
        <data-in dataset="B"/>
    </or>
</input-logic>
Minimum availability processing
Sometimes we want to start processing even if only partial data is available.
<input-logic>
    <data-in dataset="A" min="4"/>
</input-logic>
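The `min` rule can be read as a simple count over the dataset instances the action asks for. A minimal Python sketch of that reading follows; the function name `min_satisfied` and the list-of-flags representation are assumptions made for illustration, not Oozie's implementation.

```python
from typing import Sequence


def min_satisfied(available: Sequence[bool], minimum: int) -> bool:
    """Return True once at least `minimum` instances are available.

    `available` holds one flag per requested dataset instance
    (a hypothetical representation, not an Oozie data structure).
    """
    return sum(1 for flag in available if flag) >= minimum


# Six instances requested, four have landed so far, checked against min="4".
print(min_satisfied([True, True, False, True, True, False], 4))
```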
Optional feeds
Dataset B is optional: Oozie will start processing as soon as A is available. The input will include the data from A and whatever is available from B.
<input-logic>
    <and name="optional">
        <data-in dataset="A"/>
        <data-in dataset="B" min="0"/>
    </and>
</input-logic>
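One way to picture the optional-feed rule is as an AND over per-feed minimums, where a feed without `min` needs every requested instance and `min="0"` never blocks. The Python sketch below models only that idea; `resolve_and`, the `Feed` tuple, and the instance names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# dataset name -> (instances available right now, min attribute or None)
Feed = Tuple[List[str], Optional[int]]


def resolve_and(feeds: Dict[str, Feed], expected: int) -> Optional[List[str]]:
    """Return the instances to process, or None if any feed is not ready.

    Assumption: a feed without `min` needs all `expected` instances,
    while min=0 makes the feed optional.
    """
    resolved: List[str] = []
    for _name, (instances, minimum) in feeds.items():
        needed = expected if minimum is None else minimum
        if len(instances) < needed:
            return None
        resolved.extend(instances)
    return resolved


# A is complete, B has delivered only one of two instances but is optional.
feeds = {"A": (["a1", "a2"], None), "B": (["b1"], 0)}
print(resolve_and(feeds, expected=2))
```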
Priority Among Dataset Instances
A takes precedence over B, and B takes precedence over C.
<input-logic>
    <or name="AorBorC">
        <data-in dataset="A"/>
        <data-in dataset="B"/>
        <data-in dataset="C"/>
    </or>
</input-logic>
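The precedence rule amounts to "take the first available dataset in the order listed". A small Python sketch of that selection, with the hypothetical helper `resolve_or` standing in for whatever Oozie does internally:

```python
from typing import List, Optional, Tuple


def resolve_or(candidates: List[Tuple[str, bool]]) -> Optional[str]:
    """Pick the first available dataset; list order encodes priority.

    `candidates` pairs a dataset name with an availability flag
    (an illustrative representation only).
    """
    for name, ready in candidates:
        if ready:
            return name
    return None  # nothing available yet, keep waiting


# A has not arrived, B and C have: B wins because it is listed before C.
print(resolve_or([("A", False), ("B", True), ("C", True)]))
```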
Wait for primary
Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.
<input-logic>
    <or name="AorB">
        <data-in dataset="A" wait="120"/>
        <data-in dataset="B"/>
    </or>
</input-logic>
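The behaviour described for wait-for-primary can be sketched as below. This models only the slide's description: the unit of `wait` (assumed here to be minutes) and the point from which the wait is measured are assumptions, and `choose_source` is a hypothetical helper, not an Oozie function.

```python
from typing import Optional


def choose_source(a_ready: bool, b_ready: bool,
                  minutes_waited: int, wait_minutes: int = 120) -> Optional[str]:
    """Prefer primary A; fall back to secondary B only after the wait expires."""
    if a_ready:
        return "A"
    if b_ready and minutes_waited >= wait_minutes:
        return "B"
    return None  # keep waiting for the primary


# B is ready early, but the action keeps waiting for A until 120 minutes pass.
print(choose_source(a_ready=False, b_ready=True, minutes_waited=30))
print(choose_source(a_ready=False, b_ready=True, minutes_waited=120))
```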
Combining Dataset From Multiple Providers
The combine function will first check instances from A, then go to B for whatever is missing in A.
<data-in name="A" dataset="dataset_A">
    <start-instance>${coord:current(-5)}</start-instance>
    <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<data-in name="B" dataset="dataset_B">
    <start-instance>${coord:current(-5)}</start-instance>
    <end-instance>${coord:current(-1)}</end-instance>
</data-in>
<input-logic>
    <combine name="AB">
        <data-in dataset="A"/>
        <data-in dataset="B"/>
    </combine>
</input-logic>
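The combine behaviour can be pictured as a per-instance lookup: take each instance from A when it exists, otherwise from B. The Python sketch below illustrates that; `combine`, the integer instance offsets, and the rule that an instance missing from both providers blocks the action are assumptions for illustration.

```python
from typing import Dict, Iterable, Optional, Set


def combine(instances: Iterable[int], in_a: Set[int],
            in_b: Set[int]) -> Optional[Dict[int, str]]:
    """Map every requested instance to the provider that supplies it.

    A is checked first; B fills whatever A is missing. Returns None
    if some instance is available from neither provider.
    """
    resolved: Dict[int, str] = {}
    for inst in instances:
        if inst in in_a:
            resolved[inst] = "A"
        elif inst in in_b:
            resolved[inst] = "B"
        else:
            return None
    return resolved


# Offsets -5..-1 mirror coord:current(-5) to coord:current(-1).
print(combine(range(-5, 0), in_a={-5, -4, -2}, in_b={-3, -2, -1}))
```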
Monitoring
• Configure to receive notifications
  – Email action
  – HTTP notifications for job status change
  – Email notification for SLA misses
  – JMS notification for SLA events
• By polling: CLI/REST API monitoring
  – Single job monitoring
  – Bulk monitoring for bundles and coordinators
  – SLA monitoring
Monitoring
• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  – oozie.coord.action.notification.url
  – oozie.coord.action.notification.proxy
• Job status change notification for workflows
  – oozie.wf.workflow.notification.url
  – oozie.wf.workflow.notification.proxy
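As a rough illustration of how the notification URL properties might be used, the Python sketch below builds the two URL values and parses an incoming callback. Oozie substitutes tokens such as `$jobId`, `$actionId`, and `$status` in the configured URL; the endpoint paths, the host `monitor.example.com`, and both helper functions are hypothetical.

```python
from typing import Dict, Tuple
from urllib.parse import parse_qs, urlparse


def notification_properties(base_url: str) -> Dict[str, str]:
    """Build job properties that point Oozie at a monitoring endpoint."""
    return {
        "oozie.wf.workflow.notification.url":
            f"{base_url}/wf?jobId=$jobId&status=$status",
        "oozie.coord.action.notification.url":
            f"{base_url}/coord?actionId=$actionId&status=$status",
    }


def parse_callback(url: str) -> Tuple[str, str]:
    """Extract (job or action id, status) from a received callback URL."""
    query = parse_qs(urlparse(url).query)
    ident = query.get("jobId") or query.get("actionId")
    if not ident or "status" not in query:
        raise ValueError("callback carries no id or status")
    return ident[0], query["status"][0]


for key, value in notification_properties("http://monitor.example.com").items():
    print(f"{key}={value}")
print(parse_callback("http://monitor.example.com/wf?jobId=wf-123&status=KILLED"))
```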
Job Monitoring - polling
• Supported for both CLI and web service
  – Single job monitoring
  – Bulk job monitoring
• Multiple filter parameters, such as
  – Bundle name, bundle id, username, startcreatedtime, endcreatedtime
• Multiple job statuses, such as
  – oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
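A polling script could assemble the bulk filter programmatically rather than by hand. The Python sketch below only builds the command line and does not run it; `bulk_filter` and `bulk_command` are hypothetical helpers, and a real invocation also needs the Oozie server URL (via `-oozie` or the `OOZIE_URL` environment variable).

```python
import shlex
from typing import List


def bulk_filter(bundle: str, statuses: List[str]) -> str:
    """Join the bundle name and action statuses into one ';'-separated filter."""
    parts = [f"bundle={bundle}"] + [f"actionstatus={s}" for s in statuses]
    return ";".join(parts)


def bulk_command(bundle: str, statuses: List[str]) -> List[str]:
    """Return the argv list for `oozie jobs -bulk <filter>`."""
    return ["oozie", "jobs", "-bulk", bulk_filter(bundle, statuses)]


# shlex.join quotes the filter so that ';' is not treated as a shell separator.
print(shlex.join(bulk_command("bundle-app-1", ["RUNNING", "FAILED"])))
```

Passing the argv list straight to `subprocess.run` would avoid shell quoting altogether.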
SLA Monitoring
• Oozie can actively track SLAs on a job's start time, end time, and duration
• Access/filter SLA info via
  – Web-console dashboard
  – REST API
  – JMS messages
  – Email alert
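The three SLA dimensions (start time, end time, duration) can be illustrated with a simplified check on a finished job. The event names mirror Oozie's SLA event statuses, but the comparison logic and the function `sla_events` are an assumption-laden sketch, not how Oozie computes them.

```python
from datetime import datetime, timedelta
from typing import List


def sla_events(expected_start: datetime, expected_end: datetime,
               expected_duration: timedelta,
               actual_start: datetime, actual_end: datetime) -> List[str]:
    """Classify a completed job against its start, end, and duration SLAs."""
    events = []
    events.append("START_MET" if actual_start <= expected_start else "START_MISS")
    events.append("END_MET" if actual_end <= expected_end else "END_MISS")
    duration = actual_end - actual_start
    events.append("DURATION_MET" if duration <= expected_duration
                  else "DURATION_MISS")
    return events


# Started and ended on time, but ran 55 minutes against a 45-minute budget.
print(sla_events(
    expected_start=datetime(2016, 6, 1, 1, 0),
    expected_end=datetime(2016, 6, 1, 2, 0),
    expected_duration=timedelta(minutes=45),
    actual_start=datetime(2016, 6, 1, 0, 55),
    actual_end=datetime(2016, 6, 1, 1, 50),
))
```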
SLA dashboard – tabular view
SLA dashboard – Graph view
Monitoring Limitations
• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi grid view
Data pipeline monitoring use case from Y!
Case-1
• Set up a cron job that periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system
  – Pages and sends a mobile alert to the on-call person
  – Sends an email alert
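The cron-driven check in Case-1 could look roughly like the Python sketch below. It covers only the filtering and routing step: fetching the SLA records from Oozie is left out, and the record keys (`id`, `appName`, `slaStatus`), the channel names, and both functions are assumptions rather than the team's actual system.

```python
from typing import Dict, List, Tuple

SlaRecord = Dict[str, str]


def sla_misses(records: List[SlaRecord]) -> List[SlaRecord]:
    """Keep only the records whose SLA status is MISS."""
    return [rec for rec in records if rec.get("slaStatus") == "MISS"]


def build_alerts(records: List[SlaRecord]) -> List[Tuple[str, str]]:
    """Turn every SLA miss into a page for on-call plus an email alert."""
    alerts = []
    for rec in sla_misses(records):
        message = f"SLA miss: {rec['id']} ({rec.get('appName', 'unknown app')})"
        alerts.append(("page", message))
        alerts.append(("email", message))
    return alerts


# Hypothetical records as a polling script might hold them after one pull.
polled = [
    {"id": "coord-1@12", "appName": "audience-hourly", "slaStatus": "MISS"},
    {"id": "coord-2@40", "appName": "billing-daily", "slaStatus": "MET"},
]
for channel, text in build_alerts(polled):
    print(channel, text)
```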
Case-1
Case-2
Divided into four sections
• SLA Details
• Error jobs
• Long Running Jobs
• Running jobs
SLA information
SLA-status
Long Waiting jobs
Long Waiting jobs – missing dependencies
Error Jobs
Running job details
Job explorer
Feeds - jobs
Validation job
• The data pipeline also periodically runs validation jobs to validate the output
• Different pipelines have different validation requirements
• One example of a validation job is to validate the number of click impressions against billing details
Alert
Reprocessing
• One of the biggest requirements of a pipeline is to reprocess the whole dependent DAG
• Oozie does not track dependencies between coordinator jobs
• This makes it very difficult to rerun the whole pipeline for a particular nominal time
Reprocessing
• To work around this Oozie limitation, they have built a job dependency DAG
• It is very similar to the job explorer -> feed lookup feature
• Job explorer -> feed lookup is based on the output produced by coordinator jobs
• The job dependency DAG is based on the input to jobs
• Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file
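Given such a stored dependency list, finding everything that must be rerun after a failure is a graph traversal. The Python sketch below assumes the dependencies have already been parsed into a mapping from each job to its upstream jobs; the job names and the function `downstream_jobs` are hypothetical, and the result is in breadth-first order, which is not guaranteed to be a topological order for every DAG.

```python
from collections import defaultdict, deque
from typing import Dict, List


def downstream_jobs(depends_on: Dict[str, List[str]], failed: str) -> List[str]:
    """List every job that directly or indirectly consumes `failed`'s output."""
    # Invert the "job -> upstream jobs" mapping into "job -> consumers".
    consumers = defaultdict(list)
    for job, upstream in depends_on.items():
        for parent in upstream:
            consumers[parent].append(job)

    order, seen, queue = [], {failed}, deque([failed])
    while queue:
        current = queue.popleft()
        for child in consumers[current]:
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


deps = {
    "sessionize": ["ingest"],
    "report": ["sessionize"],
    "billing": ["ingest"],
    "unrelated": [],
}
print(downstream_jobs(deps, "ingest"))
```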
Reprocessing
• Rerun the failed action and all dependent coordinator jobs
  – Easy to do
  – Cons: difficult to monitor
• Create a new coordinator for the timeline that has failed
  – Easy to monitor
Reprocessing
Reprocessing
Consolidate SLA Monitoring
Future Work
• Oozie unit testing framework
  – No unit tests now; jobs are tested directly by running in staging
• Coordinator dependency management
  – Better reprocessing
• Aperiodic and incremental processing
  – Managed through workarounds
Oozie BOF at Ballroom B
THANK YOU
Purshotam Shah ([email protected])
Sr. Software Engineer, Yahoo Hadoop team