Workflow Management
CMSC 491: Hadoop-Based Distributed Computing
Spring 2015 – Adam Shook
TRANSCRIPT
Problem!
• "Okay, Hadoop is great, but how do people actually do this?" – A Real Person
– Package jobs?
– Chain actions together?
– Run these on a schedule?
– Pre- and post-processing?
– Retry failures?
Apache Oozie: Workflow Scheduler for Hadoop
• Scalable, reliable, and extensible workflow scheduler system to manage Apache Hadoop jobs
• Workflow jobs are DAGs of actions
• Coordinator jobs are recurrent Oozie Workflow jobs triggered by time and data availability
• Supports several types of jobs:
– Java MapReduce
– Streaming MapReduce
– Pig
– Hive
– Sqoop
– DistCp
– Java programs
– Shell scripts
Why should I care?
• Retry jobs in the event of a failure
• Execute jobs at a specific time or when data is available
• Correctly order job execution based on dependencies
• Provide a common framework for communication
• Use the workflow to couple resources instead of some home-grown code base
Actions
• Have a type, and each type has a defined set of configuration variables
• Each action must specify what to do based on success or failure
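As a sketch, every action node follows the same skeleton: a type-specific body plus the required success/failure routing. The node and target names here ("my-action", "next-step", "fail") are illustrative, not from the slides:

```xml
<!-- Skeleton of an Oozie action node; names are hypothetical. -->
<action name="my-action">
    <!-- type-specific body goes here, e.g. <map-reduce>, <pig>, <hive> -->
    <ok to="next-step"/>    <!-- transition taken on success -->
    <error to="fail"/>      <!-- transition taken on failure -->
</action>
```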
Workflow DAGs
[Figure: example workflow DAG – a start node leads through a Java Main action and an M/R streaming job to a decision node with MORE and ENOUGH branches; a fork splits execution into parallel Pig and M/R jobs that meet at a join; a Java Main and FS job sit on one branch; all transitions shown are OK paths, terminating at end]
Workflow Language

Flow-control node | Description
----------------- | -----------
Decision          | Expresses "switch-case" logic
Fork              | Splits one path of execution into multiple concurrent paths
Join              | Waits until every concurrent execution path of a previous fork node arrives at it
Kill              | Forces a workflow job to abort execution

Action node  | Description
------------ | -----------
java         | Invokes the main() method of the specified Java class
fs           | Manipulates files and directories in HDFS; supports the move, delete, and mkdir commands
map-reduce   | Starts a Hadoop map/reduce job; can be a Java MR job, streaming job, or pipes job
pig          | Runs a Pig job
sub-workflow | Runs a child workflow job
hive         | Runs a Hive job
shell        | Runs a shell command
ssh          | Starts a shell command on a remote machine as a remote secure shell
sqoop        | Runs a Sqoop job
email        | Sends emails from an Oozie workflow application
distcp       | Runs a Hadoop DistCp MapReduce job
custom       | Does what you program it to do
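To make the decision node's "switch-case" logic concrete, here is a minimal sketch. The node and target names are illustrative; the predicates use Oozie's EL with the HDFS `fs:fileSize` function, which is assumed to be available in your Oozie version:

```xml
<!-- Hypothetical decision node: route on input size. -->
<decision name="check-size">
    <switch>
        <!-- the first case whose predicate evaluates to true wins -->
        <case to="big-path">${fs:fileSize(inputDir) gt 1073741824}</case>
        <!-- the default is taken when no case matches -->
        <default to="small-path"/>
    </switch>
</decision>
```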
Oozie Workflow Application
• An HDFS directory containing:
– Definition file: workflow.xml
– Configuration file: config-default.xml
– App files: lib/ directory with JARs and other dependencies
WordCount Workflow

<workflow-app name='wordcount-wf'>
    <start to='wordcount'/>
    <action name='wordcount'>
        <map-reduce>
            <job-tracker>foo.com:9001</job-tracker>
            <name-node>hdfs://bar.com:9000</name-node>
            <configuration>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to='end'/>
        <error to='kill'/>
    </action>
    <kill name='kill'/>
    <end name='end'/>
</workflow-app>
[Figure: wordcount workflow – Start → wordcount (M-R); OK → End, Error → Kill]
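Parameters such as ${inputDir} and ${outputDir} are typically supplied at submission time through a job.properties file. A minimal sketch, with illustrative host names and paths:

```properties
# Hypothetical job.properties for the wordcount workflow.
# oozie.wf.application.path points at the HDFS app directory.
oozie.wf.application.path=hdfs://bar.com:9000/user/hadoop/wordcount-wf
jobTracker=foo.com:9001
nameNode=hdfs://bar.com:9000
inputDir=/user/hadoop/input
outputDir=/user/hadoop/output
```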
Coordinators
• Oozie executes workflows based on:
– Time dependency
– Data dependency
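A data-dependent coordinator declares a dataset and blocks until the instance it needs exists in HDFS. A minimal sketch – the names, dates, frequency, and paths are illustrative, and the schema version is assumed:

```xml
<!-- Hypothetical data-driven coordinator: run daily, but only once
     that day's log partition has landed in HDFS. -->
<coordinator-app name="data-driven-coord" frequency="${coord:days(1)}"
                 start="2015-01-01T00:00Z" end="2015-12-31T00:00Z" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.4">
    <datasets>
        <dataset name="logs" frequency="${coord:days(1)}"
                 initial-instance="2015-01-01T00:00Z" timezone="UTC">
            <uri-template>hdfs://bar:9000/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
        </dataset>
    </datasets>
    <input-events>
        <!-- the workflow is held until the current day's instance exists -->
        <data-in name="input" dataset="logs">
            <instance>${coord:current(0)}</instance>
        </data-in>
    </input-events>
    <action>
        <workflow>
            <app-path>hdfs://bar:9000/user/hadoop/oozie/app/daily-wf</app-path>
        </workflow>
    </action>
</coordinator-app>
```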
[Figure: Oozie architecture – an Oozie Client submits jobs through the WS API to the Oozie server (running in Tomcat); the Oozie Coordinator checks data availability and triggers Oozie Workflows, which execute on Hadoop]
Bundle
• Bundles are higher-level abstractions that batch a set of coordinators together
• No explicit dependencies between them, but they can be used to define a pipeline
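A bundle definition is essentially a list of coordinator application paths. A minimal sketch, with illustrative names and paths and an assumed schema version:

```xml
<!-- Hypothetical bundle grouping two coordinators into one pipeline. -->
<bundle-app name="daily-pipeline" xmlns="uri:oozie:bundle:0.2">
    <coordinator name="ingest-coord">
        <app-path>hdfs://bar:9000/user/hadoop/oozie/coord/ingest</app-path>
    </coordinator>
    <coordinator name="report-coord">
        <app-path>hdfs://bar:9000/user/hadoop/oozie/coord/report</app-path>
    </coordinator>
</bundle-app>
```

Starting or suspending the bundle applies to every coordinator in it, which is what makes it useful as a pipeline-level handle even without explicit inter-coordinator dependencies.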
Interacting with Oozie
• Read-only web console
• CLI
• Java client
• Web service endpoints
• Directly with the Oozie DB using SQL
<workflow-app name="filestream_wf" xmlns="uri:oozie:workflow:0.1">
    <start to="java-node"/>
    <action name="java-node">
        <java>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <main-class>org.foo.bar.PullFileStream</main-class>
        </java>
        <ok to="mr-node"/>
        <error to="fail"/>
    </action>
    <action name="mr-node">
        <map-reduce>
            <job-tracker>foo:9001</job-tracker>
            <name-node>bar:9000</name-node>
            <configuration>
                ...
            </configuration>
        </map-reduce>
        <ok to="email-node"/>
        <error to="fail"/>
    </action>
    ...
    <action name="email-node">
        <email xmlns="uri:oozie:email-action:0.1">
            <to>[email protected]</to>
            <cc>[email protected]</cc>
            <subject>Email notification</subject>
            <body>The wf completed</body>
        </email>
        <ok to="myotherjob"/>
        <error to="errorcleanup"/>
    </action>
    <end name="end"/>
    <kill name="fail"/>
</workflow-app>
<?xml version="1.0"?>
<coordinator-app name="daily_job_coord" frequency="${coord:days(1)}"
                 start="${COORD_START}" end="${COORD_END}" timezone="UTC"
                 xmlns="uri:oozie:coordinator:0.1" xmlns:sla="uri:oozie:sla:0.1">
    <action>
        <workflow>
            <app-path>hdfs://bar:9000/user/hadoop/oozie/app/test_job</app-path>
        </workflow>
        <sla:info>
            <sla:nominal-time>${coord:nominalTime()}</sla:nominal-time>
            <sla:should-start>${X * MINUTES}</sla:should-start>
            <sla:should-end>${Y * MINUTES}</sla:should-end>
            <sla:alert-contact>[email protected]</sla:alert-contact>
        </sla:info>
    </action>
</coordinator-app>
Review
• Oozie ties together many Hadoop ecosystem components to "productionalize" this stuff
• Advanced control flow and action extensibility let Oozie do whatever you need at any point in the workflow
• XML is gross
References
• http://oozie.apache.org
• https://cwiki.apache.org/confluence/display/OOZIE/Index
• http://www.slideshare.net/mattgoeke/oozie-riot-games
• http://www.slideshare.net/mislam77/oozie-sweet-13451212
• http://www.slideshare.net/ChicagoHUG/everything-you-wanted-to-know-but-were-afraid-to-ask-about-oozie