
Apache Falcon

Data Management Platform for Hadoop

● Ajay Yadava
● Committer, Apache Falcon
● Lead - Apache Falcon @ InMobi


What is Apache Falcon?

Falcon is a data processing and management solution for Hadoop, designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks onto Hadoop clusters.


Core Services

● Process relays
● Late data management
● Retention
● Replication
● Acquisition
● Operability
● Holistic declaration of intent
● Anonymization
● Lineage
● SLA
● Life of a byte


Data Relays


Data Retention as a Service

Late Data Management

Data Replication as a Service

Data Acquisition as a Service


Holistic Declaration of Intent

Operability – Dashboard

Overview


Entity Dependency Graph

[Diagram: a Feed depends on a Cluster; a Process depends on Feeds and a Cluster.]

Cluster specification


<cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <!-- readonly: used by distcp for replication -->
    <interface type="readonly" endpoint="hftp://nn:50070" version="1.1.2"/>
    <!-- write: writing to HDFS -->
    <interface type="write" endpoint="hdfs://nn:8020" version="1.1.2"/>
    <!-- execute: used to submit processes as MapReduce jobs -->
    <interface type="execute" endpoint="rm:8050" version="1.1.2"/>
    <!-- workflow: submit Oozie jobs -->
    <interface type="workflow" endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
    <!-- registry: Hive metastore, used to register/deregister partitions and get
         data availability events (illustrative endpoint) -->
    <interface type="registry" endpoint="thrift://hms:9083" version="0.11.0"/>
    <!-- messaging: used for alerts -->
    <interface type="messaging" endpoint="tcp://:61616?daemon=true" version="5.4.3"/>
  </interfaces>
  <locations>
    <!-- HDFS directories used by Falcon -->
    <location name="staging" path="/projects/falcon/staging"/> <!-- mandatory -->
    <location name="temp" path="/projects/falcon/tmp"/>        <!-- optional -->
    <location name="working" path="/projects/falcon/working"/> <!-- optional -->
  </locations>
</cluster>

Feed specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
  <!-- Frequency -->
  <frequency>minutes(5)</frequency>
  <!-- Late data handling -->
  <late-arrival cut-off="hours(1)"/>
  <!-- SLA monitoring -->
  <sla slaLow="hours(2)" slaHigh="hours(3)"/>
  <!-- Data replication: source -> target -->
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
      <!-- Data retention -->
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="secondary" type="target">
      <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
      <locations>
        <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
      </locations>
    </cluster>
  </clusters>
  <!-- Location -->
  <locations>
    <location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
  </locations>
  <ACL owner="testuser-ut-user" group="group" permission="0x644"/>
</feed>
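The retention element tells Falcon to evict feed instances older than the stated limit, relative to the current time. Below is a minimal sketch of that arithmetic, assuming instance timestamps have already been parsed out of the ${YEAR}/${MONTH}/${DAY}/... path; RetentionSketch and instancesToEvict are illustrative names, not Falcon's API.

import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative arithmetic behind <retention limit="days(2)" action="delete"/>.
public class RetentionSketch {

    // Instances whose nominal time falls before (now - limit) are eligible for eviction.
    static List<Instant> instancesToEvict(List<Instant> instanceTimes, Duration limit, Instant now) {
        Instant cutoff = now.minus(limit);
        return instanceTimes.stream()
                .filter(t -> t.isBefore(cutoff))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Instant now = Instant.parse("2013-01-10T00:00:00Z");
        List<Instant> instances = List.of(
                Instant.parse("2013-01-07T00:00:00Z"),  // older than 2 days -> evicted
                Instant.parse("2013-01-09T12:00:00Z")); // within 2 days -> retained
        System.out.println(instancesToEvict(instances, Duration.ofDays(2), now));
    }
}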

Process specification

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
  <!-- Where should the process run? -->
  <clusters>
    <cluster name="corp">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <!-- How should the process run? -->
  <parallel>1</parallel>
  <order>LIFO</order>
  <frequency>hours(1)</frequency>
  <!-- What to consume? -->
  <inputs>
    <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
  </inputs>
  <!-- What to produce? -->
  <outputs>
    <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
  </outputs>
  <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
  <!-- Retry -->
  <retry policy="periodic" delay="hours(10)" attempts="3"/>
  <!-- Late data handling -->
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
  </late-process>
</process>
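The input and output instances above are declared with Falcon's expression language: yesterday(0,0) and today(0,0) resolve relative to each run's nominal time, and latest(0) picks the most recent available instance. Below is a minimal sketch of how the date expressions resolve; resolveYesterday and resolveToday are hypothetical helper names, not Falcon's API.

import java.time.LocalDateTime;
import java.time.temporal.ChronoUnit;

// Illustrative only: how Falcon-style EL expressions such as yesterday(h, m)
// and today(h, m) resolve against a process instance's nominal run time.
public class ElResolutionSketch {

    // yesterday(h, m): the given hour/minute on the day before the nominal day
    static LocalDateTime resolveYesterday(LocalDateTime nominal, int h, int m) {
        return nominal.truncatedTo(ChronoUnit.DAYS).minusDays(1).plusHours(h).plusMinutes(m);
    }

    // today(h, m): the given hour/minute on the nominal day
    static LocalDateTime resolveToday(LocalDateTime nominal, int h, int m) {
        return nominal.truncatedTo(ChronoUnit.DAYS).plusHours(h).plusMinutes(m);
    }

    public static void main(String[] args) {
        // For the clicks-hourly instance with nominal time 2011-11-02T05:00Z:
        LocalDateTime nominal = LocalDateTime.parse("2011-11-02T05:00");
        System.out.println(resolveYesterday(nominal, 0, 0)); // 2011-11-01T00:00
        System.out.println(resolveToday(nominal, 0, 0));     // 2011-11-02T00:00
    }
}

So for the hourly run at 05:00, the input window opens at midnight of the previous day and the output lands in the current day's instance.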

Architecture


Falcon Unit: A Pipeline Validation Framework


Motivation for Falcon Unit

● User errors were caught only at deploy time:
  ● input/output feeds and paths not resolving
  ● errors in the entity specifications
● Integration tests required environment setup/teardown.
● Deployment scripts were messy.
● Debugging was cumbersome.


Falcon Unit

The same test suite runs unchanged against either environment:

In-process execution environment:
● Local Oozie
● Local file system
● Local job runner
● Local message queue

Actual cluster:
● Oozie
● HDFS
● YARN
● ActiveMQ

Example

Process submission (the same call runs in local mode or against an actual cluster):

submit(EntityType.PROCESS, <path to daily clicks agg XML>); → local mode
submit(EntityType.PROCESS, <path to daily clicks agg XML>); → cluster mode

Process scheduling:

scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

Process verification:

getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
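Put together, a Falcon Unit test might look roughly like the sketch below. The submit/scheduleProcess/getInstanceStatus calls are the ones from the example above; the class name, resource paths, and assertion style are illustrative assumptions rather than Falcon Unit's verbatim API.

import org.apache.falcon.entity.v0.EntityType;    // assumed package
import org.apache.falcon.unit.FalconUnitTestBase; // assumed package
import org.junit.Assert;
import org.junit.Test;

// Rough sketch of a pipeline validation test on top of Falcon Unit.
public class DailyClicksAggIT extends FalconUnitTestBase {

    @Test
    public void validatesDailyClicksAggPipeline() throws Exception {
        // Submit cluster, feed, and process entities (paths are placeholders).
        submit(EntityType.CLUSTER, "src/test/resources/prod-cluster.xml");
        submit(EntityType.FEED, "src/test/resources/clicks-enhanced-feed.xml");
        submit(EntityType.PROCESS, "src/test/resources/daily-clicks-agg-process.xml");

        // Schedule one instance of the process against the local in-process cluster.
        String startTime = "2013-01-01T00:00Z";
        scheduleProcess("daily_clicks_agg", startTime, 1, "prod-cluster");

        // Verify the instance completed; the exact status type returned by
        // getInstanceStatus may differ, so this assertion is indicative only.
        Object status = getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", startTime);
        Assert.assertEquals("SUCCEEDED", String.valueOf(status));
    }
}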


Deployment


Embedded mode: a single Falcon server manages all entities and clusters, suitable for a single data center.

Distributed mode: a Prism gateway fronts multiple Falcon servers, typically one per colo/data center.


Monitoring


SLA Monitoring

● Alerts based on data availability
● Dashboard
● Pluggable alerting system (see the sketch below):
  ● Email
  ● JMS notifications
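As an illustration of what "pluggable" means here, alert delivery can be modeled as a small interface that the monitoring layer calls on an SLA miss, with email and JMS senders as interchangeable implementations. The interface and names below are hypothetical, not Falcon's actual plugin API.

// Hypothetical sketch of a pluggable alert sink; Falcon's real plugin API
// differs. This only illustrates email/JMS/custom handlers plugging in
// behind one interface.
public interface AlertSink {
    // Called when a monitored feed/process instance misses its SLA.
    void onSlaMiss(String entityName, String instanceTime, String clusterName);
}

// Example implementation that just logs; an email or JMS sink would
// implement the same interface and be selected via configuration.
class LoggingAlertSink implements AlertSink {
    @Override
    public void onSlaMiss(String entityName, String instanceTime, String clusterName) {
        System.out.printf("SLA missed: %s instance %s on cluster %s%n",
                entityName, instanceTime, clusterName);
    }
}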


Pipeline view


Triage


Roadmap

● Better authentication and authorization
● Even better UI
● Even better monitoring
● Process SLAs
● Streaming support
● A more powerful scheduler
● Pipeline recovery


Community


Questions?

● Apache Falcon
  ● falcon.apache.org
  ● dev@falcon.apache.org / user@falcon.apache.org
● Ajay Yadava
  ● ajayyadava@apache.org

