oozie hugnov11

Oozie Evolution Gateway to Hadoop Eco-System Mohammad Islam

Upload: mislam77

Post on 11-Nov-2014


DESCRIPTION

Oozie is a Scheduler for Apache Hadoop jobs.

TRANSCRIPT

Page 1: Oozie hugnov11

Oozie Evolution Gateway to Hadoop Eco-System

Mohammad Islam

Page 2: Oozie hugnov11

Agenda

•  What is Oozie?
•  What is in the Next Release?
•  Challenges
•  Future Works
•  Q & A

Page 3: Oozie hugnov11

Oozie in Hadoop Eco-System

[Diagram: Oozie shown alongside the Hadoop components HDFS, Map-Reduce, HCatalog, Pig, Sqoop, and Hive]

Page 4: Oozie hugnov11

Oozie : The Conductor

Page 5: Oozie hugnov11

A Workflow Engine

•  Oozie executes workflows defined as a DAG of jobs
•  Job types include: Map-Reduce, Pig, Hive, any script, custom Java code, etc.

[Diagram: example workflow DAG. Nodes: start, M/R job, M/R streaming job, decision (branches labelled MORE and ENOUGH), fork, Pig job, M/R job, join, Java, FS job, end]
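To make the DAG idea concrete, the following is a minimal Python sketch (not Oozie code) that models a fork/join section of a workflow as a dependency graph and derives an order in which the jobs could run. The node names echo the diagram, but the exact edges are an assumption, since the original figure is not recoverable from the transcript.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical edges: each node maps to the set of nodes it depends on.
workflow = {
    "mr_job": {"start"},
    "fork": {"mr_job"},
    "pig_job": {"fork"},
    "mr_streaming_job": {"fork"},
    "join": {"pig_job", "mr_streaming_job"},
    "end": {"join"},
}

def execution_waves(dag):
    """Group nodes into waves; nodes in the same wave can run in parallel."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises CycleError if the graph is not a DAG
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves

waves = execution_waves(workflow)
for wave in waves:
    print(wave)
```

The two jobs between fork and join land in the same wave, which is the parallelism a fork node expresses; the join waits until both have finished.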

Page 6: Oozie hugnov11

A Scheduler

•  Oozie executes workflows based on:
   –  Time dependency (frequency)
   –  Data dependency

[Diagram labels: Oozie Client, WS API, Oozie Server, Oozie Coordinator, Oozie Workflow, Hadoop; annotated step: Check Data Availability]
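The two trigger types can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not the coordinator's actual implementation: `nominal_times` stands for the time dependency, and the hypothetical `available` set stands in for the data-availability check.

```python
from datetime import datetime, timedelta

def nominal_times(start, end, frequency):
    """Time dependency: one workflow run is due every `frequency` from `start`."""
    t = start
    while t < end:
        yield t
        t += frequency

def ready_runs(start, end, frequency, available):
    """Data dependency: a due run may start only once its input exists.

    `available` is a hypothetical stand-in for the data-availability check;
    here it is simply the set of nominal times whose input has landed.
    """
    return [t for t in nominal_times(start, end, frequency) if t in available]

start = datetime(2011, 11, 1)
landed = {start, start + timedelta(days=2)}  # input for day 2 is still missing
runs = ready_runs(start, start + timedelta(days=3), timedelta(days=1), landed)
print(runs)
```

Three daily runs are due in the window, but only the two whose input has landed are started; the third keeps waiting for its data.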

Page 7: Oozie hugnov11

REST-API for Hadoop Components

•  Direct access to Hadoop components
   –  Emulates the command line through a REST API
•  Supported products:
   –  Pig
   –  Map-Reduce

Page 8: Oozie hugnov11

Three Questions … Do you need Oozie?

Q1 : Do you have multiple jobs with dependency?

Q2 : Does your job start based on time or data availability?

Q3 : Do you need monitoring and operational support for your jobs?

If any one of your answers is YES, then you should consider Oozie!

Page 9: Oozie hugnov11

What Oozie is NOT

•  Oozie is not a resource scheduler
•  Oozie is not for off-grid scheduling
   o  Note: Off-grid execution is possible through the SSH action.
•  If you want to submit your job occasionally, Oozie is an option.
   o  Oozie provides REST API based submission.
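As a rough illustration of REST-based submission, the Python sketch below builds (but does not send) an HTTP request that submits a workflow job. The `/v1/jobs` endpoint, the `action=start` parameter, and the property names follow the Oozie Web Services API as commonly documented and should be checked against the version in use; the server address, user name, and HDFS path are placeholders.

```python
import urllib.request
from xml.etree import ElementTree as ET

def job_config_xml(props):
    """Serialize properties in the Hadoop-style <configuration> XML format."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="utf-8")

def build_submit_request(base_url, props):
    """Build (but do not send) a POST that submits and starts a workflow job."""
    return urllib.request.Request(
        url=f"{base_url}/v1/jobs?action=start",  # assumed endpoint, verify per version
        data=job_config_xml(props),
        headers={"Content-Type": "application/xml;charset=UTF-8"},
        method="POST",
    )

# Placeholder server, user, and HDFS path; replace with real values.
request = build_submit_request(
    "http://oozie.example.com:11000/oozie",
    {
        "user.name": "alice",
        "oozie.wf.application.path": "hdfs://namenode.example.com/user/alice/app",
    },
)
# urllib.request.urlopen(request) would send it to a live server.
print(request.get_method(), request.full_url)
```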

Page 10: Oozie hugnov11

Oozie in Apache

Main Contributors

Page 11: Oozie hugnov11

Oozie in Apache

•  Y! internal usage:
   –  Total number of users: 375
   –  Total number of processed jobs ≈ 750K/month
•  External downloads:
   –  2500+ in the last year from GitHub
   –  A large number of downloads maintained by 3rd party packaging.

Page 12: Oozie hugnov11

Oozie Usages Contd.

•  User Community:
   –  Membership
      •  Y! internal – 286
      •  External – 163
   –  Messages (approximate)
      •  Y! internal – 7/day
      •  External – 8/day

Page 13: Oozie hugnov11

Next Release …

•  Integration with Hadoop 0.23
•  HCatalog integration
   –  Non-polling approach

Page 14: Oozie hugnov11

Usability

•  Script Action
•  Distcp Action
•  Suspend Action
•  Mini-Oozie for CI
   –  Like Mini-cluster
•  Support multiple versions
   –  Pig, Distcp, Hive etc.

Page 15: Oozie hugnov11

Reliability

•  Auto-Retry at the WF Action level
•  High-Availability
   –  Hot-Warm through ZooKeeper
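Action-level auto-retry can be pictured with the generic Python sketch below. It is not Oozie's implementation; the function name `run_with_retry` and its parameters are hypothetical and only show the idea of re-running a failed action a bounded number of times with a pause in between.

```python
import time

def run_with_retry(action, max_retries=3, interval_seconds=0.0, sleep=time.sleep):
    """Run `action`; on failure, retry up to `max_retries` more times."""
    attempt = 0
    while True:
        try:
            return action()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the failure
            attempt += 1
            sleep(interval_seconds)

# Hypothetical flaky action: fails twice, then succeeds.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient failure")
    return "OK"

result = run_with_retry(flaky, max_retries=3)
print(result, calls["n"])
```

Bounding the retries matters: a transient failure is absorbed, while a persistent one still surfaces to the workflow so that it can be handled there.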

Page 16: Oozie hugnov11

Manageability

•  Email action
•  Query Pig Stats/Hadoop Counters
   –  Runtime control of Workflow based on stats
   –  Application-level control using the stats

Page 17: Oozie hugnov11

Challenges : Queue Starvation

•  Which Queue?
   –  Not a Hadoop queue issue.
   –  Oozie's internal queue, used to process the Oozie sub-tasks.
   –  Oozie's main execution engine.
•  User Problem:
   –  A job's kill/suspend takes a very long time.

Page 18: Oozie hugnov11

Challenges : Queue Starvation

Technical Problem:
•  Before execution, every task acquires a lock on the job id.
•  Special high-priority tasks (such as Kill or Suspend) cannot get the lock and therefore starve.

[Diagram: queue holding tasks for jobs J1 and J2, including a high-priority task J1(H) waiting behind other J1 tasks. Caption: Starvation for High Priority Task!]

Page 19: Oozie hugnov11

Challenges : Queue Starvation

Resolution:
•  Add the high-priority task to both the interrupt list and the normal queue.
•  Before de-queuing, check whether the interrupt list holds a task for the same job id. If it does, execute that task first.

[Diagram: the normal queue holds tasks for J1 and J2, including J1(H); the interrupt list holds J1(H). De-queuing a J1 task finds a task in the interrupt list.]
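The interrupt-list resolution can be sketched as a toy Python model. This is an illustration under assumptions, not Oozie's code: the class `CommandQueue`, its method names, and the bookkeeping set `executed` are hypothetical, and job-id locking is left out.

```python
from collections import deque
from itertools import count

class CommandQueue:
    """Toy model of a command queue with an interrupt list (not Oozie code)."""

    def __init__(self):
        self._ids = count()
        self.queue = deque()    # normal FIFO queue of (task_id, job_id, command)
        self.interrupts = {}    # job_id -> deque of high-priority tasks
        self.executed = set()   # ids of tasks already run via the interrupt list

    def add(self, job_id, command, high_priority=False):
        task = (next(self._ids), job_id, command)
        self.queue.append(task)   # every task goes into the normal queue
        if high_priority:         # high-priority tasks also go into the interrupt list
            self.interrupts.setdefault(job_id, deque()).append(task)

    def next_command(self):
        """Return the next (job_id, command) to execute, or None if empty."""
        while self.queue:
            task = self.queue.popleft()
            task_id, job_id, command = task
            if task_id in self.executed:
                self.executed.discard(task_id)  # already ran as an interrupt
                continue
            pending = self.interrupts.get(job_id)
            if pending:
                urgent = pending.popleft()
                if urgent is not task:
                    self.queue.appendleft(task)  # the normal task waits its turn
                    self.executed.add(urgent[0])
                return urgent[1], urgent[2]
            return job_id, command
        return None

q = CommandQueue()
q.add("J1", "run-a")
q.add("J2", "run-b")
q.add("J1", "run-c")
q.add("J1", "kill", high_priority=True)
q.add("J2", "run-d")

order = []
while (cmd := q.next_command()) is not None:
    order.append(cmd)
print(order)
```

The kill for J1 runs as soon as any J1 task reaches the head of the queue, instead of waiting behind the earlier J1 tasks. In this toy model the remaining J1 tasks still come out of the queue afterwards; a real engine would presumably check the job state and turn them into no-ops.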

Page 20: Oozie hugnov11

Oozie Futures

•  Easy adoption
   –  Modeling tool
   –  IDE integration
   –  Modular Configurations
•  Allow job notification through JMS
•  Event-based data processing
•  Prioritization
   –  By user, system level.

Page 21: Oozie hugnov11

Take Away ..

•  Oozie is
   –  In Apache!
   –  Reliable and feature-rich.
   –  Growing fast.

Page 22: Oozie hugnov11

Q & A

Mohammad K Islam

http://incubator.apache.org/oozie/

Page 23: Oozie hugnov11

Who needs Oozie?

•  Multiple jobs that have sequential/conditional/parallel dependencies
•  Need to run a job/Workflow periodically.
•  Need to launch a job when data is available.
•  Operational requirements:
   –  Easy monitoring
   –  Reprocessing
   –  Catch-up

Page 24: Oozie hugnov11

Challenges : Queue Starvation

Problem:
•  Consider a queue with tasks of type T1 and T2. Max Concurrency = 2.
•  An over-provisioned task (marked in red) is pushed back to the queue.
•  At high load, it is penalized in favor of later-arriving tasks of the same type.

[Diagram: queue T1 T1 T2 T1 T1 T1 T1 T2 with columns In Queue, Running, C (T1), C (T2). Annotation: T1 cannot execute and is pushed to head of queue. Caption: Starvation!]

Page 25: Oozie hugnov11

Challenges : Queue Starvation

Resolution:
•  Before de-queuing any task, check its concurrency.
•  If violated, skip and get the next task.

[Diagram: queue T1 T2 T1 T1 T1 T1 T2 with columns In Queue, Running, C (T1), C (T2). Annotations: T1 cannot execute, so skip by one node to front; Enqueue T2 now; T1 now executes normally.]
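The concurrency check before de-queuing can be sketched in Python as follows. This is a simplified model, not Oozie's implementation: the function `dequeue_runnable` and the `running` counter are hypothetical names, and tasks are reduced to their type labels.

```python
from collections import Counter, deque

def dequeue_runnable(queue, running, max_concurrency):
    """Pop the first task whose type is below its concurrency limit.

    Tasks that would violate the limit are skipped in place (they keep
    their position) instead of being pushed back to the queue.
    """
    for index, task_type in enumerate(queue):
        if running[task_type] < max_concurrency:
            del queue[index]          # safe: we return before iterating further
            running[task_type] += 1
            return task_type
    return None  # every queued type is at its limit

queue = deque(["T1", "T1", "T2", "T1"])
running = Counter({"T1": 2})  # T1 is already at the limit of 2
picked = dequeue_runnable(queue, running, max_concurrency=2)
print(picked, list(queue))
```

Because the skipped T1 tasks keep their positions, they are the first to run once a T1 slot frees up, so later arrivals of the same type cannot overtake them.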