Apache Hadoop India Summit 2011 talk "Oozie – Workflow for Hadoop" by Andreas Neumann

Upload: yahoo-developer-network

Posted on 15-Jan-2015

TRANSCRIPT

Page 1:

Andreas Neumann

Oozie – Workflow for Hadoop

Page 2:

Who Am I?

Dr. Andreas Neumann

Software Architect, Yahoo!

anew <at> yahoo-inc <dot> com

At Yahoo! (2008-present)

- Grid architecture

- Content Platform

- Research

At IBM (2000-2008)

- Database (DB2) Development

- Enterprise Search

Page 3:

Oozie Overview

Main Features

– Execute and monitor workflows in Hadoop

– Periodic scheduling of workflows

– Trigger execution by data availability

– HTTP and command line interface + Web console

Adoption

– ~100 users on the mailing list since launch on GitHub

– In production at Yahoo!, running >200K jobs/day

Page 4:

Oozie Workflow Overview

Purpose:

Execution of workflows on the Grid

[Architecture diagram: Oozie runs as a Tomcat web-app backed by a DB, exposes a WS API to clients, and submits work to Hadoop/Pig/HDFS]

Page 5:

Oozie Workflow

[Workflow diagram: a Directed Acyclic Graph of Jobs — start, Java Main, M/R streaming job, decision (with MORE and ENOUGH branches), fork, Pig job and M/R job in parallel, join, FS job, and end, connected by OK transitions]

Page 6:

Oozie Workflow Example

<workflow-app name='wordcount-wf'>
  <start to='wordcount'/>
  <action name='wordcount'>
    <map-reduce>
      <job-tracker>foo.com:9001</job-tracker>
      <name-node>hdfs://bar.com:9000</name-node>
      <configuration>
        <property>
          <name>mapred.input.dir</name>
          <value>${inputDir}</value>
        </property>
        <property>
          <name>mapred.output.dir</name>
          <value>${outputDir}</value>
        </property>
      </configuration>
    </map-reduce>
    <ok to='end'/>
    <error to='kill'/>
  </action>
  <kill name='kill'/>
  <end name='end'/>
</workflow-app>

[Diagram: Start → M-R wordcount → End on OK; → Kill on Error]

Page 7:

Oozie Workflow Nodes

• Control Flow:

– start/end/kill

– decision

– fork/join

• Actions:

– map-reduce

– pig

– hdfs

– sub-workflow

– java – run custom Java code
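As an illustration, a java action node can follow the same pattern as the map-reduce action shown earlier; a minimal sketch (the main class and argument are hypothetical):

```xml
<action name='process'>
  <java>
    <job-tracker>foo.com:9001</job-tracker>
    <name-node>hdfs://bar.com:9000</name-node>
    <main-class>com.example.Process</main-class>  <!-- hypothetical class -->
    <arg>${inputDir}</arg>
  </java>
  <ok to='end'/>
  <error to='kill'/>
</action>
```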

Page 8:

Oozie Workflow Application

An HDFS directory containing:

– Definition file: workflow.xml

– Configuration file: config-default.xml

– App files: lib/ directory with JAR and SO files

– Pig Scripts
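A hypothetical layout of such an application directory (all file names other than workflow.xml, config-default.xml, and lib/ are illustrative):

```
wordcount-wf/
├── workflow.xml          (workflow definition)
├── config-default.xml    (default configuration)
├── lib/
│   ├── wordcount.jar     (job classes)
│   └── libnative.so      (native code)
└── wordcount.pig         (Pig script, if used)
```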

Page 9:

Running an Oozie Workflow Job

Application Deployment:

$ hadoop fs -put wordcount-wf hdfs://bar.com:9000/usr/abc/wordcount

Workflow Job Parameters:

$ cat job.properties
oozie.wf.application.path = hdfs://bar.com:9000/usr/abc/wordcount
inputDir = /usr/abc/input-data
outputDir = /user/abc/output-data

Job Execution:

$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-W

Page 10:

Monitoring an Oozie Workflow Job

Workflow Job Status:

$ oozie job -info 1-20090525161321-oozie-xyz-W
------------------------------------------------------------------------
Workflow Name : wordcount-wf
App Path      : hdfs://bar.com:9000/usr/abc/wordcount
Status        : RUNNING

Workflow Job Log:

$ oozie job -log 1-20090525161321-oozie-xyz-W

Workflow Job Definition:

$ oozie job -definition 1-20090525161321-oozie-xyz-W

Page 11:

Oozie Coordinator Overview

Purpose:

– Coordinated execution of workflows on the Grid

– Workflows are backwards compatible

[Architecture diagram: an Oozie client calls the WS API of the Oozie Coordinator, which checks data availability and triggers Oozie Workflow; both run in Tomcat and submit jobs to Hadoop]

Page 12:

Oozie Application Lifecycle

[Lifecycle diagram: between start and end, the Oozie Coordinator Engine materializes a coordinator job into actions Action0, Action1, …, ActionN at nominal times 0*f, 1*f, …, N*f; on action create/start, each action submits a workflow (A → B, C) to the Oozie Workflow Engine]

Page 13:

Use Case 1: Time Triggers

• Execute your workflow every 15 minutes (CRON)

[Timeline: workflow instances at 00:15, 00:30, 00:45, 01:00, …]

Page 14:

Example 1: Run Workflow every 15 mins

<coordinator-app name="coord1" start="2009-01-08T00:00Z" end="2010-01-01T00:00Z"
                 frequency="15" xmlns="uri:oozie:coordinator:0.1">
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>key1</name><value>value1</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
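The frequency above is given in minutes. Oozie's coordinator EL functions can express the same intent more readably; a sketch of alternative attribute values (assuming the coord EL functions are available in this Oozie version):

```xml
frequency="15"                 <!-- every 15 minutes -->
frequency="${coord:hours(1)}"  <!-- every hour -->
frequency="${coord:days(1)}"   <!-- every day -->
```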

Page 15:

Use Case 2: Time and Data Triggers

• Materialize your workflow every hour, but only run it when the input data is ready.

[Timeline: workflow instances materialized at 01:00, 02:00, 03:00, 04:00; each checks Hadoop whether its input data exists before running]

Page 16:

Example 2: Data Triggers

<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <instance>${current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Page 17:

Use Case 3: Rolling Windows

• Access 15-minute datasets and roll them up into hourly datasets

[Timeline: the four 15-minute datasets at 00:15, 00:30, 00:45, 01:00 roll up into the 01:00 hourly dataset; 01:15–02:00 roll up into 02:00]

Page 18:

Example 3: Rolling Windows

<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="15" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-3)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>
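To illustrate, for a coordinator action with nominal time 2009-01-01T01:00Z, the window ${current(-3)} .. ${current(0)} over the 15-minute dataset would resolve to four URIs (times derived from the initial-instance above; illustrative only):

```
${current(-3)} → hdfs://bar:9000/app/logs/2009/01/01/00/15
${current(-2)} → hdfs://bar:9000/app/logs/2009/01/01/00/30
${current(-1)} → hdfs://bar:9000/app/logs/2009/01/01/00/45
${current(0)}  → hdfs://bar:9000/app/logs/2009/01/01/01/00
```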

Page 19:

Use Case 4: Sliding Windows

• Access the last 24 hours of data, and roll them up every hour.

[Timeline: each hourly run consumes the previous 24 hours, e.g. 01:00–24:00 rolls up at 24:00; 02:00–(+1 day) 01:00 rolls up at (+1 day) 01:00; 03:00–(+1 day) 02:00 rolls up at (+1 day) 02:00]

Page 20:

Example 4: Sliding Windows

<coordinator-app name="coord1" frequency="${1*HOURS}" …>
  <datasets>
    <dataset name="logs" frequency="${1*HOURS}" initial-instance="2009-01-01T00:00Z">
      <uri-template>hdfs://bar:9000/app/logs/${YEAR}/${MONTH}/${DAY}/${HOUR}</uri-template>
    </dataset>
  </datasets>
  <input-events>
    <data-in name="inputLogs" dataset="logs">
      <start-instance>${current(-23)}</start-instance>
      <end-instance>${current(0)}</end-instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>hdfs://bar:9000/usr/abc/logsprocessor-wf</app-path>
      <configuration>
        <property>
          <name>inputData</name><value>${dataIn('inputLogs')}</value>
        </property>
      </configuration>
    </workflow>
  </action>
</coordinator-app>

Page 21:

Oozie Coordinator Application

An HDFS directory containing:

– Definition file: coordinator.xml

– Configuration file: coord-config-default.xml

Page 22:

Running an Oozie Coordinator Job

Application Deployment:

$ hadoop fs -put coord_job hdfs://bar.com:9000/usr/abc/coord_job

Coordinator Job Parameters:

$ cat job.properties
oozie.coord.application.path = hdfs://bar.com:9000/usr/abc/coord_job

Job Execution:

$ oozie job -run -config job.properties
job: 1-20090525161321-oozie-xyz-C

Page 23:

Monitoring an Oozie Coordinator Job

Coordinator Job Status:

$ oozie job -info 1-20090525161321-oozie-xyz-C
------------------------------------------------------------------------
Job Name : wordcount-coord
App Path : hdfs://bar.com:9000/usr/abc/coord_job
Status   : RUNNING

Coordinator Job Log:

$ oozie job -log 1-20090525161321-oozie-xyz-C

Coordinator Job Definition:

$ oozie job -definition 1-20090525161321-oozie-xyz-C

Page 24:

Oozie Web Console: List Jobs

Page 25:

Oozie Web Console: Job Details

Page 26:

Oozie Web Console: Failed Action

Page 27:

Oozie Web Console: Error Messages

Page 28:

What’s Next For Oozie?

New Features

– More out-of-the-box actions: distcp, hive, …

– Authentication framework

• Authenticate a client with Oozie

• Authenticate an Oozie workflow with downstream services

– Bundles: manage multiple coordinators together

– Asynchronous data sets and coordinators

Scalability

– Memory footprint

– Data notification instead of polling

Integration with Howl (http://github.com/yahoo/howl)

Page 29:

We Need You!

Oozie is Open Source

• Source: http://github.com/yahoo/oozie

• Docs: http://yahoo.github.com/oozie

• List: http://tech.groups.yahoo.com/group/Oozie-users/

To Contribute:

• https://github.com/yahoo/oozie/wiki/How-To-Contribute

Page 30:

Thank You!

github.com/yahoo/oozie/wiki/How-To-Contribute