Building and managing complex dependencies pipeline using Apache Oozie Purshotam Shah ([email protected]) Sr. Software Engineer, Yahoo Hadoop team Apache Oozie PMC member and committer

Upload: hadoop-summit

Post on 07-Jan-2017


TRANSCRIPT

Page 1: Building and managing complex dependencies pipeline using Apache Oozie

Building and managing complex dependencies pipeline using Apache Oozie

Purshotam Shah ([email protected])
Sr. Software Engineer, Yahoo Hadoop team
Apache Oozie PMC member and committer

Page 2: Building and managing complex dependencies pipeline using Apache Oozie

Agenda

1. Oozie at Yahoo
2. Data Pipelines
3. SLA and Monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work

Page 3: Building and managing complex dependencies pipeline using Apache Oozie

Why Oozie?

• Out-of-box support for multiple job types
  – Java, shell, DistCp
  – MapReduce (Pipes, Streaming)
  – Pig, Hive, Spark
• Highly scalable
• High availability
  – Hot-Hot with rolling upgrades
  – Load balanced
• Hue integration

[Diagram: Oozie and the components it works with: HBase, Pig, Hive, Spark, YARN, HDFS, Hue, HCatalog]

Page 4: Building and managing complex dependencies pipeline using Apache Oozie

Deployment Architecture at Yahoo

[Diagram: submit requests go to a Load Balancer, which redirects them to Oozie Server 1 or Oozie Server 2. The two servers communicate with each other for log streaming, sharelib updates, etc. Other components shown: ZooKeeper/Curator (lock management), Oracle RAC, and the Hadoop cluster with HBase and HCatalog. Connections are labeled with their security: https + kerberos / cookie-based auth, https + kerberos, or kerberos.]

Page 5: Building and managing complex dependencies pipeline using Apache Oozie

Scale at Yahoo

• Deployed on all clusters (production, non-production)
  – One instance per cluster
• 75 products / 2000+ projects
  – 255 monthly users
• 90,00 workflow jobs daily (June 2016, one busy cluster)
  – Between 1-8 actions; avg. 4 actions/workflow
  – Extreme use case: submit 100-200 workflow jobs per min
• 2,277 coordinator jobs daily (June 2016, one busy cluster)
  – Frequency: 5, 10, 15 mins, hourly, daily, weekly, monthly (25%: < 15 min)
  – 99% of workflow jobs are kicked off by coordinators
• 97 bundle jobs daily (June 2016, one busy cluster)

Page 6: Building and managing complex dependencies pipeline using Apache Oozie

Agenda

1. Oozie at Yahoo
2. Data Pipelines
3. SLA and monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work

Page 7: Building and managing complex dependencies pipeline using Apache Oozie

Data Pipelines

• Advertisement: Ad Exchange, Ad Latency, Search Advertising
• Content: Content Management, Content Optimization, Content Personalization, Flickr Video
• Targeting: Audience Targeting, Behavioral Targeting, Partner Targeting, Retargeting, Web Targeting

Page 8: Building and managing complex dependencies pipeline using Apache Oozie

Data Pipelines

Anti Spam, Content, Retargeting

Research, Dashboards & Reports, Forecasting

Email Data Intelligence Data Management

Audience Pipeline

Page 9: Building and managing complex dependencies pipeline using Apache Oozie

Use Case - Data pipeline


Page 10: Building and managing complex dependencies pipeline using Apache Oozie

Large Scale Data Pipeline Requirements

• Administrative
  – One should be able to start, stop and pause all related pipelines, or part of them, at the same time
• Dependency Management
  – BCP support
  – Data is not guaranteed; start processing even if only partial data is available
  – Mandatory and optional feeds

Page 11: Building and managing complex dependencies pipeline using Apache Oozie

Large Scale Data Pipeline Requirements

• Multiple Providers
  – If data is available from multiple providers, I want to specify the provider priority
  – Combining datasets from multiple providers
• SLA Management
  – Monitor pipeline processing to take immediate action in case of failures or SLA misses
  – Pipeline owners should get notified if an SLA is missed

Page 12: Building and managing complex dependencies pipeline using Apache Oozie

Bundle

• The Bundle system allows the user to define and execute a loosely coupled set of coordinators. The coordinators may depend on each other, but the dependency is enforced only through their inputs and outputs.
• A bundle can be used to start/stop/suspend/resume/rerun the whole pipeline.

Page 13: Building and managing complex dependencies pipeline using Apache Oozie


Complex dependencies

OOZIE-1976 : Specifying coordinator input datasets in more logical ways

Page 14: Building and managing complex dependencies pipeline using Apache Oozie

BCP Support

Pull data from A or B. Specify the dataset as AorB. The action will start running as soon as either dataset A or B is available.

    <input-logic>
        <or name="AorB">
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </or>
    </input-logic>


Page 15: Building and managing complex dependencies pipeline using Apache Oozie

Minimum availability processing

Sometimes we want to start processing even if only partial data is available.

    <input-logic>
        <data-in dataset="A" min="4"/>
    </input-logic>
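The `min` check can be pictured with a small Python model. This is only a sketch of the behaviour described on the slide, not Oozie's implementation: `min_satisfied` is a hypothetical helper, and the count of six requested instances is an assumption made for the example.

```python
from typing import Sequence


def min_satisfied(instance_available: Sequence[bool], minimum: int) -> bool:
    """Return True once at least `minimum` of the requested instances exist."""
    return sum(1 for ok in instance_available if ok) >= minimum


# Six requested instances of dataset A (assumed count); four have arrived.
arrived = [True, True, False, True, True, False]
print(min_satisfied(arrived, 4))
```

With `min="4"`, the action in this model becomes runnable as soon as any four instances are present, without waiting for the remaining two.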

Page 16: Building and managing complex dependencies pipeline using Apache Oozie

Optional feeds

Dataset B is optional; Oozie will start processing as soon as A is available. It will include the data from A and whatever is available from B.

    <input-logic>
        <and name="optional">
            <data-in dataset="A"/>
            <data-in dataset="B" min="0"/>
        </and>
    </input-logic>
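A rough Python model of the optional-feed rule, under the assumption that A is mandatory in full and B contributes whatever has arrived. The function name `resolve_optional` and the instance labels are hypothetical and exist only for illustration; this is not how Oozie evaluates the expression internally.

```python
from typing import List, Optional, Set


def resolve_optional(a_instances: List[str], b_instances: List[str],
                     available: Set[str]) -> Optional[List[str]]:
    """Return the instances to process, or None while mandatory A is incomplete."""
    if not all(inst in available for inst in a_instances):
        return None  # A is mandatory: keep waiting
    # B is optional (min=0): include only the instances that already exist.
    return list(a_instances) + [inst for inst in b_instances if inst in available]


# A is complete and only one of B's two instances has arrived.
print(resolve_optional(["A/01", "A/02"], ["B/01", "B/02"],
                       {"A/01", "A/02", "B/02"}))
```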

Page 17: Building and managing complex dependencies pipeline using Apache Oozie

Priority Among Dataset Instances

A takes precedence over B, and B takes precedence over C.

    <input-logic>
        <or name="AorBorC">
            <data-in dataset="A"/>
            <data-in dataset="B"/>
            <data-in dataset="C"/>
        </or>
    </input-logic>
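The precedence rule amounts to "first listed, first chosen". A minimal Python sketch, assuming availability is known as a simple set of dataset names (`resolve_or` is a hypothetical helper, not part of Oozie):

```python
from typing import Iterable, Optional, Set


def resolve_or(datasets: Iterable[str], available: Set[str]) -> Optional[str]:
    """Return the first listed dataset that is available, or None to keep waiting."""
    for name in datasets:
        if name in available:
            return name
    return None


# B and C have arrived but A has not, so B wins over C.
print(resolve_or(["A", "B", "C"], {"B", "C"}))
```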


Page 18: Building and managing complex dependencies pipeline using Apache Oozie

Wait for primary

Sometimes we want to give preference to the primary data source and switch to the secondary only after waiting for a specific amount of time.

    <input-logic>
        <or name="AorB">
            <data-in dataset="A" wait="120"/>
            <data-in dataset="B"/>
        </or>
    </input-logic>
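The fallback decision can be sketched in Python as follows. The slide does not state the unit of `wait` or the point from which waiting is measured; the sketch assumes minutes counted from when the action started waiting. `choose_source` is a hypothetical name, and this is a model of the described behaviour rather than Oozie's code.

```python
from typing import Optional


def choose_source(a_ready: bool, b_ready: bool,
                  elapsed: int, wait: int = 120) -> Optional[str]:
    """Prefer primary A; fall back to secondary B only after `wait` has elapsed."""
    if a_ready:
        return "A"
    if b_ready and elapsed >= wait:
        return "B"
    return None  # keep waiting for the primary


# B is ready early, but the action still holds out for A.
print(choose_source(a_ready=False, b_ready=True, elapsed=30))
# After the wait has passed, B is accepted.
print(choose_source(a_ready=False, b_ready=True, elapsed=120))
```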


Page 19: Building and managing complex dependencies pipeline using Apache Oozie

Combining Dataset From Multiple Providers

The combine function will first check instances from A and then go to B for whatever is missing in A.

    <data-in name="A" dataset="dataset_A">
        <start-instance>${coord:current(-5)}</start-instance>
        <end-instance>${coord:current(-1)}</end-instance>
    </data-in>

    <data-in name="B" dataset="dataset_B">
        <start-instance>${coord:current(-5)}</start-instance>
        <end-instance>${coord:current(-1)}</end-instance>
    </data-in>

    <input-logic>
        <combine name="AB">
            <data-in dataset="A"/>
            <data-in dataset="B"/>
        </combine>
    </input-logic>
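A small Python model of the combine rule, assuming each instance is identified by its `coord:current` offset (-5 to -1) and that availability per provider is known as a set of offsets. The function `combine` here is a hypothetical illustration, not Oozie's implementation.

```python
from typing import Dict, List, Optional, Set


def combine(instances: List[int], a: Set[int], b: Set[int]) -> Optional[Dict[int, str]]:
    """Pick a provider per instance: A first, then B. None if any instance is missing."""
    chosen: Dict[int, str] = {}
    for inst in instances:
        if inst in a:
            chosen[inst] = "A"
        elif inst in b:
            chosen[inst] = "B"
        else:
            return None  # neither provider has this instance yet
    return chosen


# A is missing offsets -3 and -2; B fills the gap.
print(combine([-5, -4, -3, -2, -1], a={-5, -4, -1}, b={-3, -2, -1}))
```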


Page 20: Building and managing complex dependencies pipeline using Apache Oozie

Agenda

1. Oozie at Yahoo
2. Data Pipelines
3. SLA and monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work

Page 21: Building and managing complex dependencies pipeline using Apache Oozie

Monitoring

• Configure to receive notifications
  – Email action
  – HTTP notifications for job status change
  – Email notification for SLA misses
  – JMS notification for SLA events
• By polling
  – CLI/REST API monitoring
    • Single job monitoring
    • Bulk monitoring for bundles and coordinators
    • SLA monitoring

Page 22: Building and managing complex dependencies pipeline using Apache Oozie

Monitoring

• An email action can be added to a workflow to send mail
• Job status change notification for coordinator actions
  – oozie.coord.action.notification.url
  – oozie.coord.action.notification.proxy
• Job status change notification for workflows
  – oozie.wf.workflow.notification.url
  – oozie.wf.workflow.notification.proxy

Page 23: Building and managing complex dependencies pipeline using Apache Oozie

Job Monitoring - polling

• Supported for both CLI and web service
• Single job monitoring
• Bulk job monitoring
  – Multiple parameters, such as bundle name, bundle id, username, startcreatedtime, endcreatedtime
  – Multiple job statuses, such as
    oozie jobs -bulk 'bundle=bundle-app-1;actionstatus=RUNNING;actionstatus=FAILED'
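When the bulk filter is generated by a script, it helps to build it as one semicolon-separated argument so the shell does not split it. The following is a minimal sketch; `bulk_filter` is a hypothetical helper, and only the `bundle` and `actionstatus` keys shown on the slide are used.

```python
import shlex
from typing import Iterable


def bulk_filter(bundle: str, action_statuses: Iterable[str]) -> str:
    """Build the value passed to `oozie jobs -bulk` as a single string."""
    parts = [f"bundle={bundle}"]
    parts += [f"actionstatus={status}" for status in action_statuses]
    return ";".join(parts)


# Assemble the command as an argument list; shlex.join quotes it for display.
command = ["oozie", "jobs", "-bulk", bulk_filter("bundle-app-1", ["RUNNING", "FAILED"])]
print(shlex.join(command))
```

Passing `command` as a list to a process launcher avoids shell quoting issues altogether; the quoted form is only needed when the command is typed or logged.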

Page 24: Building and managing complex dependencies pipeline using Apache Oozie

SLA Monitoring

• Oozie can actively track SLAs on jobs' start time, end time, and duration
• Access/filter SLA info via
  – Web-console dashboard
  – REST API
  – JMS messages
  – Email alert

Page 25: Building and managing complex dependencies pipeline using Apache Oozie


SLA dashboard – tabular view

Page 26: Building and managing complex dependencies pipeline using Apache Oozie


SLA dashboard – Graph view

Page 27: Building and managing complex dependencies pipeline using Apache Oozie

Agenda

1. Oozie at Yahoo
2. Data Pipelines
3. SLA and monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work

Page 28: Building and managing complex dependencies pipeline using Apache Oozie

Monitoring Limitations

• User view
• BCP SLA support
• No color coding
• Paging/on-call
• Threshold
• Consolidated email
• Multi-grid view

Page 29: Building and managing complex dependencies pipeline using Apache Oozie


Data pipeline monitoring use case from Y!

Page 30: Building and managing complex dependencies pipeline using Apache Oozie

Case-1

• Set up a cron job that periodically pulls SLA information from Oozie
• If there is any SLA miss, a notification is sent to the internal monitoring system
  › Pages and sends a mobile alert to the on-call person
  › Sends an email alert
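The core of such a cron job is a filter over the polled SLA records. The sketch below assumes the records have already been fetched and parsed into dictionaries; the field names `id` and `slaStatus`, the `MISS` value, the sample job ids, and the `notify` callback are assumptions made for illustration, and the internal monitoring system is represented only by that callback.

```python
from typing import Callable, Dict, Iterable, List


def find_sla_misses(records: Iterable[Dict[str, str]]) -> List[str]:
    """Return the ids of jobs whose SLA status is MISS (assumed field names)."""
    return [rec["id"] for rec in records if rec.get("slaStatus") == "MISS"]


def alert_on_misses(records: Iterable[Dict[str, str]],
                    notify: Callable[[str], None]) -> int:
    """Call `notify` once per missed job and return how many alerts were sent."""
    missed = find_sla_misses(records)
    for job_id in missed:
        notify(f"SLA missed for {job_id}")
    return len(missed)


# Hypothetical records, as a polling script might hold them after parsing.
sample = [
    {"id": "job-1", "slaStatus": "MET"},
    {"id": "job-2", "slaStatus": "MISS"},
]
alert_on_misses(sample, notify=print)
```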

Page 31: Building and managing complex dependencies pipeline using Apache Oozie


Case-1

Page 32: Building and managing complex dependencies pipeline using Apache Oozie

Case-2

Divided into four sections:
• SLA details
• Error jobs
• Long-running jobs
• Running jobs

Page 33: Building and managing complex dependencies pipeline using Apache Oozie


SLA information

Page 34: Building and managing complex dependencies pipeline using Apache Oozie


SLA-status

Page 35: Building and managing complex dependencies pipeline using Apache Oozie


Long Waiting jobs

Page 36: Building and managing complex dependencies pipeline using Apache Oozie


Long Waiting jobs – missing dependencies

Page 37: Building and managing complex dependencies pipeline using Apache Oozie


Error Jobs

Page 38: Building and managing complex dependencies pipeline using Apache Oozie


Running job details

Page 39: Building and managing complex dependencies pipeline using Apache Oozie


Job explorer

Page 40: Building and managing complex dependencies pipeline using Apache Oozie


Feeds - jobs

Page 41: Building and managing complex dependencies pipeline using Apache Oozie

Validation job

• Data pipelines also periodically run validation jobs to validate their output
• Different pipelines have different validation requirements. One example of a validation job is to validate the number of click impressions against billing details.

Page 42: Building and managing complex dependencies pipeline using Apache Oozie


Alert

Page 43: Building and managing complex dependencies pipeline using Apache Oozie

Reprocessing

• One of the biggest requirements of a pipeline is the ability to reprocess the whole dependent DAG.
• Oozie has no notion of dependencies between coordinator jobs; they are linked only through the datasets they consume and produce. This makes it very difficult to rerun the whole pipeline for a particular nominal time.

Page 44: Building and managing complex dependencies pipeline using Apache Oozie

Reprocessing

• To work around this Oozie limitation, they have built a job dependency DAG.
• It is very similar to the job explorer -> feed lookup feature.
  – Job explorer -> feed lookup is based on the output produced by coordinator jobs.
  – The job dependency DAG is based on the input to jobs.
• Currently there is no UI for this; they parse Oozie jobs daily and store the dependencies in a text file.
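One way to picture such a job dependency DAG is sketched below in Python. It assumes each job's input and output datasets have already been parsed into a dictionary; the job and dataset names, the dictionary layout, and both function names are hypothetical, and the slides do not describe the actual format that is stored.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set

JobIO = Dict[str, Dict[str, List[str]]]


def build_downstream(jobs: JobIO) -> Dict[str, Set[str]]:
    """Map each job to the jobs that consume one of its output datasets."""
    producers: Dict[str, Set[str]] = defaultdict(set)
    for job, io in jobs.items():
        for dataset in io["outputs"]:
            producers[dataset].add(job)
    downstream: Dict[str, Set[str]] = defaultdict(set)
    for job, io in jobs.items():
        for dataset in io["inputs"]:
            for producer in producers.get(dataset, ()):
                downstream[producer].add(job)
    return downstream


def jobs_to_rerun(failed: str, downstream: Dict[str, Set[str]]) -> List[str]:
    """Breadth-first walk from the failed job over everything that depends on it."""
    order, seen, queue = [], {failed}, deque([failed])
    while queue:
        job = queue.popleft()
        order.append(job)
        for nxt in sorted(downstream.get(job, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order


jobs = {
    "ingest":  {"inputs": [],        "outputs": ["raw"]},
    "clean":   {"inputs": ["raw"],   "outputs": ["clean"]},
    "billing": {"inputs": ["raw"],   "outputs": []},
    "report":  {"inputs": ["clean"], "outputs": []},
}
print(jobs_to_rerun("ingest", build_downstream(jobs)))
```

The walk returns every job reachable from the failed one, which is the set that would need to be rerun for the affected nominal time.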

Page 45: Building and managing complex dependencies pipeline using Apache Oozie

Reprocessing

• Rerun the failed action and all dependent coordinator jobs
  – Easy to do
  – Con: difficult to monitor
• Create a new coordinator for the time range that failed
  – Easy to monitor

Page 46: Building and managing complex dependencies pipeline using Apache Oozie


Reprocessing

Page 47: Building and managing complex dependencies pipeline using Apache Oozie


Reprocessing

Page 48: Building and managing complex dependencies pipeline using Apache Oozie


Consolidate SLA Monitoring

Page 49: Building and managing complex dependencies pipeline using Apache Oozie

Agenda

1. Oozie at Yahoo
2. Data Pipelines
3. SLA and monitoring
4. Monitoring Limitations and User monitoring systems
5. Future Work

Page 50: Building and managing complex dependencies pipeline using Apache Oozie

Future Work

• Oozie unit testing framework
  – No unit tests now; jobs are tested directly by running them in staging
• Coordinator dependency management
  – Better reprocessing
• Aperiodic and incremental processing
  – Currently managed through workarounds

Page 51: Building and managing complex dependencies pipeline using Apache Oozie


Oozie BOF at Ballroom B

Page 52: Building and managing complex dependencies pipeline using Apache Oozie

THANK YOU

Purshotam Shah ([email protected])
Sr. Software Engineer, Yahoo Hadoop team