Post on 07-Jan-2017

Building and managing complex dependencies pipeline using Apache Oozie

Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer

Agenda

1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work

Why Oozie?

• Out-of-the-box support for multiple job types
  – Java, shell, distcp
  – MapReduce: Pipes, streaming
  – Pig, Hive, Spark
• Highly scalable
• High availability
  – Hot-hot with rolling upgrades
  – Load balanced
• Hue integration

[Figure: Oozie alongside the Hadoop components it integrates with: HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]

Deployment Architecture at Yahoo

[Diagram: clients submit requests to a load balancer, which redirects them to Oozie Server 1 or Oozie Server 2. The two servers communicate with each other for log streaming, sharelib updates, etc., use ZooKeeper (via Curator) for lock management, and connect to Oracle RAC and to the Hadoop cluster, HBase, and HCatalog. Security labels on the links: HTTPS + Kerberos / cookie-based auth; HTTPS + Kerberos; Kerberos.]

Scale at Yahoo

• Deployed on all clusters (production, non-production)
  – One instance per cluster
• 75 products / 2000+ projects
  – 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  – Between 1 and 8 actions; average 4 actions per workflow
  – Extreme use case: 100-200 workflow jobs submitted per minute
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  – Frequency: 5, 10, 15 minutes, hourly, daily, weekly, monthly (25%: < 15 min)
  – 99% of workflow jobs are kicked off from coordinators
• 97 bundle jobs daily (June 2016, one busy cluster)

Data Pipelines

• Advertisement: Ad Exchange, Ad Latency, Search Advertising
• Content: Content Management, Content Optimization, Content Personalization, Flickr Video
• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting
• Email: Anti Spam, Content, Retargeting
• Data Intelligence: Research, Dashboards & Reports, Forecasting
• Data Management: Audience Pipeline

Use Case - Data pipeline

Large Scale Data Pipeline Requirements

• Administrative
  – One should be able to start, stop, and pause all related pipelines, or part of them, at the same time
• Dependency Management
  – BCP support
  – Data is not guaranteed; start processing even if only partial data is available
  – Mandatory and optional feeds
• Multiple Providers
  – If data is available from multiple providers, I want to specify the provider priority
  – Combining datasets from multiple providers
• SLA Management
  – Monitor pipeline processing to take immediate action in case of failures or SLA misses
  – Pipeline owners should get notified if an SLA is missed

Bundle

The Bundle system lets the user define and execute a set of loosely coupled coordinators. The coordinators depend on each other, but the dependency is enforced only through their inputs and outputs.

A bundle can be used to start, stop, suspend, resume, or rerun a whole pipeline.

Complex dependencies

OOZIE-1976: Specifying coordinator input datasets in more logical ways

BCP Support

Pull data from A or B by specifying the dataset as AorB. The action starts running as soon as either dataset A or dataset B is available.

<input-logic>
  <or name="AorB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </or>
</input-logic>

Minimum availability processing

Sometimes we want to start processing even if only partial data is available.

<input-logic>
  <data-in dataset="A" min="4"/>
</input-logic>
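The effect of a minimum-availability threshold can be illustrated with a toy model. The Python sketch below is not Oozie code: `is_satisfied` and the instance labels are hypothetical, and it assumes `min` is the number of requested dataset instances that must exist before the action may start.

```python
def is_satisfied(available, requested, minimum=None):
    """Return True when enough of the requested instances are available.

    Toy model (assumption): `minimum` plays the role of the `min`
    attribute; when it is omitted, every requested instance is required.
    """
    needed = len(requested) if minimum is None else minimum
    found = sum(1 for instance in requested if instance in available)
    return found >= needed


# Six requested instances of dataset A, of which four have arrived.
requested = ["A/00", "A/01", "A/02", "A/03", "A/04", "A/05"]
available = {"A/00", "A/01", "A/03", "A/05"}

print(is_satisfied(available, requested))             # all six required
print(is_satisfied(available, requested, minimum=4))  # four are enough
```

With `minimum=0` the same check always passes, which is the idea behind treating a feed as optional.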

Optional feeds

Dataset B is optional: Oozie starts processing as soon as A is available. The action includes the data from A and whatever is available from B.

<input-logic>
  <and name="optional">
    <data-in dataset="A"/>
    <data-in dataset="B" min="0"/>
  </and>
</input-logic>

Priority Among Dataset Instances

A has precedence over B, and B has precedence over C.

<input-logic>
  <or name="AorBorC">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
    <data-in dataset="C"/>
  </or>
</input-logic>
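As a rough illustration of precedence among the alternatives, the Python sketch below picks the first dataset, in listed order, that is available. It is a toy model rather than Oozie's implementation, and `resolve_or` is a hypothetical name.

```python
def resolve_or(datasets, available):
    """Return the first dataset, in listed order, that is available.

    Toy model (assumption): listing order encodes priority, so an
    earlier dataset always wins over a later one. Returns None when
    none of the alternatives is available yet.
    """
    for name in datasets:
        if name in available:
            return name
    return None


priority = ["A", "B", "C"]
print(resolve_or(priority, {"B", "C"}))  # A is missing, so B is chosen
print(resolve_or(priority, {"A", "C"}))  # A is present and wins
print(resolve_or(priority, set()))       # nothing available yet
```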

Wait for primary

Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.

<input-logic>
  <or name="AorB">
    <data-in dataset="A" wait="120"/>
    <data-in dataset="B"/>
  </or>
</input-logic>
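The fallback behaviour can be sketched as a small decision function. This is a toy model under stated assumptions, not Oozie's logic: `pick_source` is a hypothetical name, the wait is assumed to be in minutes, and `elapsed_minutes` is assumed to be measured from whenever the action began waiting for its inputs.

```python
def pick_source(elapsed_minutes, available, primary="A", secondary="B", wait=120):
    """Choose a data source, preferring the primary.

    Toy model (assumption): the secondary source is considered only
    after `wait` minutes have elapsed without the primary arriving.
    Returns None while the action should keep waiting.
    """
    if primary in available:
        return primary
    if elapsed_minutes >= wait and secondary in available:
        return secondary
    return None


print(pick_source(10, {"B"}))        # still inside the wait window
print(pick_source(120, {"B"}))       # window expired, fall back to B
print(pick_source(10, {"A", "B"}))   # primary is available, use it
```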

Combining Dataset From Multiple Providers

The combine function first checks the instances from A, then goes to B for whatever is missing in A.

<data-in name="A" dataset="dataset_A">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<data-in name="B" dataset="dataset_B">
  <start-instance>${coord:current(-5)}</start-instance>
  <end-instance>${coord:current(-1)}</end-instance>
</data-in>

<input-logic>
  <combine name="AB">
    <data-in dataset="A"/>
    <data-in dataset="B"/>
  </combine>
</input-logic>
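The per-instance fallback described above can be sketched as follows. This Python toy model is not Oozie code: `combine` is a hypothetical helper, and the integer offsets stand in for the instances `current(-5)` through `current(-1)`.

```python
def combine(instances, available_a, available_b):
    """Resolve each instance from A if present, otherwise from B.

    Toy model (assumption): returns a dict mapping each instance to the
    provider that supplies it, or None if some instance is missing from
    both providers (the action would keep waiting).
    """
    resolved = {}
    for instance in instances:
        if instance in available_a:
            resolved[instance] = "A"
        elif instance in available_b:
            resolved[instance] = "B"
        else:
            return None
    return resolved


instances = [-5, -4, -3, -2, -1]
available_a = {-5, -4, -2}
available_b = {-3, -2, -1}

print(combine(instances, available_a, available_b))
```

Instance -2 exists in both providers and is taken from A, which matches the "A first, B for the gaps" order.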

Monitoring

• Configure to receive notifications
  – Email action
  – HTTP notifications for job status changes
  – Email notification for SLA misses
  – JMS notification for SLA events
• By polling
  – CLI/REST API monitoring
    • Single job monitoring
    • Bulk monitoring for bundles and coordinators
    • SLA monitoring

Monitoring

• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  – oozie.coord.action.notification.url
  – oozie.coord.action.notification.proxy
• Job status change notification for workflows
  – oozie.wf.workflow.notification.url
  – oozie.wf.workflow.notification.proxy

Job Monitoring - polling

• Supported for both CLI and web service
  – Single job monitoring
  – Bulk job monitoring
• Multiple filter parameters, such as bundle name, bundle id, username, startcreatedtime, endcreatedtime
• Multiple job statuses in one query, such as
  oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'

SLA Monitoring

• Oozie can actively track SLAs on a job's start time, end time, and duration
• Access/filter SLA info via
  – Web-console dashboard
  – REST API
  – JMS messages
  – Email alert

SLA dashboard – tabular view

SLA dashboard – Graph view

Monitoring Limitations

• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi grid view

Data pipeline monitoring use case from Y!

Case-1

• Set up a cron job that periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system
  › Pages and sends a mobile alert to the on-call person
  › Sends an email alert
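The filtering step of such a cron job might look roughly like the Python sketch below. It is illustrative only: the record fields (`id`, `appName`, `nominalTime`, `slaStatus`) and the `MISS` value are assumptions about the shape of the pulled SLA data, the job names are made up, and both the fetch from Oozie and the hand-off to the internal monitoring system are left out.

```python
def find_sla_misses(sla_records):
    """Return the records whose (assumed) slaStatus field is MISS."""
    return [record for record in sla_records if record.get("slaStatus") == "MISS"]


def build_alert(record):
    """Format a one-line alert message for the monitoring system."""
    return "SLA miss: {appName} ({id}), nominal time {nominalTime}".format(**record)


# Hard-coded sample standing in for whatever the cron job pulled from Oozie.
sla_records = [
    {"id": "job-1", "appName": "audience-hourly",
     "nominalTime": "2016-06-01T01:00Z", "slaStatus": "MET"},
    {"id": "job-2", "appName": "click-billing",
     "nominalTime": "2016-06-01T02:00Z", "slaStatus": "MISS"},
    {"id": "job-3", "appName": "audience-hourly",
     "nominalTime": "2016-06-01T03:00Z", "slaStatus": "IN_PROCESS"},
]

misses = find_sla_misses(sla_records)
alerts = [build_alert(record) for record in misses]
for alert in alerts:
    print(alert)
```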

Case-2

Divided into four sections:
• SLA details
• Error jobs
• Long running jobs
• Running jobs

SLA information

SLA-status

Long Waiting jobs

Long Waiting jobs – missing dependencies

Error Jobs

Running job details

Job explorer

Feeds - jobs

Validation job

The data pipeline also periodically runs validation jobs to validate its output. Different pipelines have different validation requirements. One example of a validation job is checking the number of click impressions against the billing details.

Alert

Reprocessing

One of the biggest requirements of a pipeline is the ability to reprocess a whole dependent DAG.

Oozie does not track dependencies between coordinator jobs directly; they are linked only through their input and output datasets. This makes it very difficult to rerun the whole pipeline for a particular nominal time.

Reprocessing

To work around this Oozie limitation, they built a job dependency DAG. It is very similar to the job explorer -> feed lookup feature, except that the feed lookup is based on the outputs produced by coordinator jobs, whereas the job dependency DAG is based on the inputs to jobs. There is currently no UI for this: they parse the Oozie jobs daily and store the dependencies in a text file.
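A dependency DAG of this kind can be sketched in a few lines of Python. This is a minimal illustration, not the tool described in the talk: the coordinator names and datasets are made up, and it assumes each coordinator's input and output dataset names have already been extracted from its definition.

```python
from collections import deque


def build_dependency_graph(coordinators):
    """Map each coordinator to the coordinators that consume its outputs."""
    producers = {}
    for name, spec in coordinators.items():
        for dataset in spec["outputs"]:
            producers[dataset] = name

    graph = {name: set() for name in coordinators}
    for name, spec in coordinators.items():
        for dataset in spec["inputs"]:
            producer = producers.get(dataset)
            if producer is not None and producer != name:
                graph[producer].add(name)
    return graph


def downstream(graph, start):
    """Return every coordinator reachable from `start`, breadth first."""
    order, visited, queue = [], {start}, deque([start])
    while queue:
        current = queue.popleft()
        for consumer in sorted(graph[current]):
            if consumer not in visited:
                visited.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


# Hypothetical pipeline: names and datasets are for illustration only.
coordinators = {
    "ingest": {"inputs": ["raw"], "outputs": ["clean"]},
    "aggregate": {"inputs": ["clean"], "outputs": ["hourly"]},
    "report": {"inputs": ["hourly", "clean"], "outputs": ["dashboard"]},
    "audit": {"inputs": ["raw"], "outputs": ["audit_log"]},
}

graph = build_dependency_graph(coordinators)
print(downstream(graph, "ingest"))
```

Given a failed coordinator, `downstream` lists the jobs that would need to be rerun for the same nominal time.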

Reprocessing

• Rerun the failed action and all dependent coordinator jobs
  – Easy to do
  – Cons: difficult to monitor
• Create a new coordinator for the timeline that failed
  – Easy to monitor

Consolidate SLA Monitoring

Future Work

• Oozie unit testing framework
  – No unit tests today; jobs are tested directly by running them in staging
• Coordinator dependency management
  – Better reprocessing
• Aperiodic and incremental processing
  – Currently managed through workarounds

Oozie BOF at Ballroom B

Thank you

Purshotam Shah (purushah@yahoo-inc.com)
Sr. Software Engineer, Yahoo Hadoop team