
Apache Falcon

Data Management Platform for Hadoop

● Ajay Yadava
● Committer, Apache Falcon
● Lead - Apache Falcon @ InMobi


What is Apache Falcon?

Falcon is a data processing and management solution for Hadoop, designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks onto Hadoop clusters.


Core Services

● Process relays
● Late data management
● Retention
● Replication
● Acquisition
● Operability
● Holistic declaration of intent
● Anonymization
● Lineage
● SLA
● Life of a byte


Data Relays


Data Retention as a Service

Late Data Management

Data Replication as a Service

Data Acquisition as a Service


Holistic Declaration of Intent

Operability – Dashboard

Overview


Entity Dependency Graph

[Diagram: a Feed depends on a Cluster; a Process depends on Feeds and a Cluster.]

Cluster specification


<cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- readonly: used by distcp for replication -->
    <interface type="readonly" endpoint="hftp://nn:50070" version="1.1.2"/>
    <!-- write: writing to HDFS -->
    <interface type="write" endpoint="hdfs://nn:8020" version="1.1.2"/>
    <!-- execute: used to submit processes as MapReduce jobs -->
    <interface type="execute" endpoint="rm:8050" version="1.1.2"/>
    <!-- workflow: submit Oozie jobs -->
    <interface type="workflow" endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
    <!-- registry: Hive metastore, used to register/deregister partitions and get
         data availability events (illustrative endpoint) -->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.11.0"/>
    <!-- messaging: used for alerts -->
    <interface type="messaging" endpoint="tcp://:61616?daemon=true" version="5.4.3"/>
  </interfaces>
  <locations>
    <!-- HDFS directories used by Falcon -->
    <location name="staging" path="/projects/falcon/staging"/> <!-- mandatory -->
    <location name="temp" path="/projects/falcon/tmp"/>        <!-- optional -->
    <location name="working" path="/projects/falcon/working"/> <!-- optional -->
  </locations>
</cluster>

Feed specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
  <!-- Frequency -->
  <frequency>minutes(5)</frequency>
  <!-- Late data handling -->
  <late-arrival cut-off="hours(1)"/>
  <!-- SLA monitoring -->
  <sla slaLow="hours(2)" slaHigh="hours(3)"/>
  <!-- Data replication: source -> target -->
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
      <!-- Data retention -->
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="secondary" type="target">
      <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
      <locations>
        <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Location -->
  <locations>
    <location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
  </locations>
  <ACL owner="testuser-ut-user" group="group" permission="0x644"/>
</feed>
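The retention element tells Falcon to evict feed instances older than the stated limit, relative to the current time. Below is a minimal sketch of that arithmetic, assuming instance timestamps have already been parsed out of the ${YEAR}/${MONTH}/${DAY}/... path; RetentionSketch and instancesToEvict are illustrative names, not Falcon's API.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative arithmetic behind <retention limit="days(2)" action="delete"/>.
public class RetentionSketch {

    // Instances whose nominal time falls before (now - limit) are eligible for eviction.
    static List<Instant> instancesToEvict(List<Instant> instanceTimes, Duration limit, Instant now) {
        Instant cutoff = now.minus(limit);
        return instanceTimes.stream()
                .filter(t -> t.isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-01-10T00:00:00Z");
        List<Instant> instances = List.of(
                Instant.parse("2013-01-07T00:00:00Z"),  // older than 2 days -> evicted
                Instant.parse("2013-01-09T12:00:00Z")); // within 2 days -> retained
        System.out.println(instancesToEvict(instances, Duration.ofDays(2), now));
    }
}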

Process specification

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
  <!-- Where should the process run? -->
  <clusters>
    <cluster name="corp">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <!-- How should the process run? -->
  <parallel>1</parallel>
  <order>LIFO</order>
  <frequency>hours(1)</frequency>
  <!-- What to consume? -->
  <inputs>
    <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
  </inputs>
  <!-- What to produce? -->
  <outputs>
    <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
  </outputs>
  <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
  <!-- Retry -->
  <retry policy="periodic" delay="hours(10)" attempts="3"/>
  <!-- Late data handling -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
  </late-process>
</process>
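The input and output instances above are declared with Falcon's expression language: yesterday(0,0) and today(0,0) resolve relative to each run's nominal time, and latest(0) picks the most recent available instance. Below is a minimal sketch of how the date expressions resolve; resolveYesterday and resolveToday are hypothetical helper names, not Falcon's API.

import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Illustrative only: how Falcon-style EL expressions such as yesterday(h, m)
// and today(h, m) resolve against a process instance's nominal run time.
public class ElResolutionSketch {

    // yesterday(h, m): the given hour/minute on the day before the nominal day
    static LocalDateTime resolveYesterday(LocalDateTime nominal, int h, int m) {
        return nominal.truncatedTo(ChronoUnit.DAYS).minusDays(1).plusHours(h).plusMinutes(m);
    }

    // today(h, m): the given hour/minute on the nominal day
    static LocalDateTime resolveToday(LocalDateTime nominal, int h, int m) {
        return nominal.truncatedTo(ChronoUnit.DAYS).plusHours(h).plusMinutes(m);
    }

    public static void main(String[] args) {
        // For the clicks-hourly instance with nominal time 2011-11-02T05:00Z:
        LocalDateTime nominal = LocalDateTime.parse("2011-11-02T05:00");
        System.out.println(resolveYesterday(nominal, 0, 0)); // 2011-11-01T00:00
        System.out.println(resolveToday(nominal, 0, 0));     // 2011-11-02T00:00
    }
}

So for the hourly run at 05:00, the input window opens at midnight of the previous day and the output lands in the current day's instance.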

Architecture


Falcon Unit: A Pipeline Validation Framework


Motivation for Falcon Unit

● User errors were caught only at deploy time:
  ● input/output feeds and paths not resolving
  ● errors in the entity specifications
● Integration tests required environment setup/teardown.
● Deployment scripts were messy.
● Debugging was cumbersome.


Falcon Unit

The same test suite runs unchanged against either environment:

In-process execution environment:
● Local Oozie
● Local file system
● Local job runner
● Local message queue

Actual cluster:
● Oozie
● HDFS
● YARN
● ActiveMQ

Example

Process submission (the same call runs in local mode or against an actual cluster):

submit(EntityType.PROCESS, <path to daily clicks agg XML>); → local mode
submit(EntityType.PROCESS, <path to daily clicks agg XML>); → cluster mode

Process scheduling:

scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

Process verification:

getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
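Put together, a Falcon Unit test might look roughly like the sketch below. The submit/scheduleProcess/getInstanceStatus calls are the ones from the example above; the class name, resource paths, and assertion style are illustrative assumptions rather than Falcon Unit's verbatim API.

import org.apache.falcon.entity.v0.EntityType;    // assumed package
import org.apache.falcon.unit.FalconUnitTestBase; // assumed package
import org.junit.Assert;
import org.junit.Test;

// Rough sketch of a pipeline validation test on top of Falcon Unit.
public class DailyClicksAggIT extends FalconUnitTestBase {

    @Test
    public void validatesDailyClicksAggPipeline() throws Exception {
        // Submit cluster, feed, and process entities (paths are placeholders).
        submit(EntityType.CLUSTER, "src/test/resources/prod-cluster.xml");
        submit(EntityType.FEED, "src/test/resources/clicks-enhanced-feed.xml");
        submit(EntityType.PROCESS, "src/test/resources/daily-clicks-agg-process.xml");

        // Schedule one instance of the process against the local in-process cluster.
        String startTime = "2013-01-01T00:00Z";
        scheduleProcess("daily_clicks_agg", startTime, 1, "prod-cluster");

        // Verify the instance completed; the exact status type returned by
        // getInstanceStatus may differ, so this assertion is indicative only.
        Object status = getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", startTime);
        Assert.assertEquals("SUCCEEDED", String.valueOf(status));
    }
}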


Deployment


Embedded mode: a single Falcon server manages all entities and clusters, suitable for a single data center.

Distributed mode: a Prism gateway fronts multiple Falcon servers, typically one per colo/data center.


Monitoring


SLA Monitoring

● Alerts based on data availability
● Dashboard
● Pluggable alerting system (see the sketch below):
  ● Email
  ● JMS notifications
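As an illustration of what "pluggable" means here, alert delivery can be modeled as a small interface that the monitoring layer calls on an SLA miss, with email and JMS senders as interchangeable implementations. The interface and names below are hypothetical, not Falcon's actual plugin API.

// Hypothetical sketch of a pluggable alert sink; Falcon's real plugin API
// differs. This only illustrates email/JMS/custom handlers plugging in
// behind one interface.
public interface AlertSink {
    // Called when a monitored feed/process instance misses its SLA.
    void onSlaMiss(String entityName, String instanceTime, String clusterName);
}

// Example implementation that just logs; an email or JMS sink would
// implement the same interface and be selected via configuration.
class LoggingAlertSink implements AlertSink {
    @Override
    public void onSlaMiss(String entityName, String instanceTime, String clusterName) {
        System.out.printf("SLA missed: %s instance %s on cluster %s%n",
                entityName, instanceTime, clusterName);
    }
}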


Pipeline view


Triage


Roadmap

● Better authentication and authorization
● Even better UI
● Even better monitoring
● Process SLAs
● Streaming support
● A more powerful scheduler
● Pipeline recovery


Community


Questions?

● Apache Falcon
  ● falcon.apache.org
  ● dev@falcon.apache.org / user@falcon.apache.org
● Ajay Yadava
  ● ajayyadava@apache.org

