apache falcon - data management platform for hadoop

44
Apache Falcon Data Management Platform for Hadoop

Upload: ajay-yadava

Post on 21-Jan-2017

563 views

Category:

Data & Analytics


2 download

TRANSCRIPT

Page 1: Apache Falcon - Data Management Platform For Hadoop

Apache Falcon

Data Management Platform for Hadoop

Page 2: Apache Falcon - Data Management Platform For Hadoop
Page 3: Apache Falcon - Data Management Platform For Hadoop
Page 4: Apache Falcon - Data Management Platform For Hadoop

●Ajay Yadava

●Committer, Apache Falcon

●Lead - Apache Falcon @ Inmobi

4

Page 5: Apache Falcon - Data Management Platform For Hadoop

What is Apache Falcon? Falcon is a data processing and management solution for Hadoop

designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.

5

Page 6: Apache Falcon - Data Management Platform For Hadoop

6

Page 7: Apache Falcon - Data Management Platform For Hadoop

Core Services

Core Services

Process Relays

Late Data Manage

mentRetentio

n

Replication

AcquisitionOperabili

ty

Holistic declarati

on of intent

Anonymization

Lineage

SLA

Page 8: Apache Falcon - Data Management Platform For Hadoop

Life of Byte

8

Page 9: Apache Falcon - Data Management Platform For Hadoop

Data Relays

9

Page 10: Apache Falcon - Data Management Platform For Hadoop

Data Retention as a service

Page 11: Apache Falcon - Data Management Platform For Hadoop

Late Data Management

Page 12: Apache Falcon - Data Management Platform For Hadoop

Data Replication as a service

Page 13: Apache Falcon - Data Management Platform For Hadoop

Data Acquisition As Service

Page 14: Apache Falcon - Data Management Platform For Hadoop

14

Page 15: Apache Falcon - Data Management Platform For Hadoop

Holistic Declaration of Intent

Page 16: Apache Falcon - Data Management Platform For Hadoop

Operability – Dashboard

Page 17: Apache Falcon - Data Management Platform For Hadoop

Overview

17

Page 18: Apache Falcon - Data Management Platform For Hadoop

Entity Dependency Graph

Cluster

Feed Process

depends

depends

depends

Page 19: Apache Falcon - Data Management Platform For Hadoop

Cluster specification

19

<cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1"> <interfaces> <interface type="readonly" endpoint="hftp://nn:50070" version="1.1.2"/> <interface type="write" endpoint="hdfs://nn:8020" version="1.1.2"/> <interface type="execute" endpoint="rm:8050" version="1.1.2"/> <interface type="workflow" endpoint="http://oozie:41000/oozie/" version="4.0.0"/> <interface type="registry" endpoint="http://oozie:41000/oozie/" version="4.0.0"/> <interface type="messaging" endpoint="tcp://:61616?daemon=true" version="5.4.3"/> </interfaces> <locations> <location name="staging" path="/projects/falcon/staging"/> <!--mandatory--> <location name="temp" path="/projects/falcon/tmp"/> <!--optional--> <location name="working" path="/projects/falcon/working"/> <!--optional--> </locations></cluster>

Used by distcp for replication Writing to HDFS

Used to submit processes as MR

Submit oozie jobs

Used for alerts

HDFS directories used by Falcon

Hive metastore to register/deregister partitions & get data availability events

Page 20: Apache Falcon - Data Management Platform For Hadoop

Feed specification<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1"> <frequency>minutes(5)</frequency> <late-arrival cut-off="hours(1)"/> <sla slaLow="hours(2)" slaHigh="hours(3)"/> <clusters> <cluster name="primary" type="source"> <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/> <retention limit="days(2)" action="delete"/> </cluster> <cluster name="secondary" type="target"> <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/> <retention limit="days(2)" action="delete"/> <locations> <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/> </locations> </cluster> </clusters> <locations> <location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/> </locations> <ACL owner="testuser-ut-user" group="group" permission="0x644"/></feed>

20

Frequency

Location

SLA Monitoring

Data Retention

Data Replication

Page 21: Apache Falcon - Data Management Platform For Hadoop

Process specification

21

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1"> <clusters> <cluster name="corp"> <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/> </cluster> <parallel>1</parallel> <order>LIFO</order> <frequency>hours(1)</frequency> <inputs> <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/> </inputs> <outputs> <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/> </outputs> <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/> <retry policy="periodic" delay="hours(10)" attempts="3"/> <late-process policy="exp-backoff" delay="hours(1)"> <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/> </late-process></process>

Where should the process run?

How should the process run?

What to consume?

What to produce?

Late Data Handling

Retry

Page 22: Apache Falcon - Data Management Platform For Hadoop

Architecture

22

Page 23: Apache Falcon - Data Management Platform For Hadoop

23

Page 24: Apache Falcon - Data Management Platform For Hadoop

24

Page 25: Apache Falcon - Data Management Platform For Hadoop

Falcon UnitA pipeline validation framework

25

Page 26: Apache Falcon - Data Management Platform For Hadoop

Motivation for Falcon Unit

●User errors caught only at deploy time.

● Input/Output feeds and paths not getting resolved.

●Errors in specification.

● Integration Tests require environment setup/tearDown.

●Messy deployment scripts.

●Debugging was cumbersome.

26

Page 27: Apache Falcon - Data Management Platform For Hadoop

Falcon Unit

27

Falcon Unit

In Process execution env.● Local Oozie● Local File System● Local Job Runner● Local Message Queue

Actual cluster● Oozie● HDFS● YARN● Active MQ

Test suite

Page 28: Apache Falcon - Data Management Platform For Hadoop

ExampleProcess Submission:

submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Local

submit(EntityType.Process, <Path to Daily clicks Agg XML>); → Cluster Mode

Process Scheduling:scheduleProcess(“daily_clicks_agg”, startTime, numInstances, clusterName);

Process Verification:

getInstanceStatus(EntityType.Process,“daily_clicks_agg”, scheduleTime);

28

Page 29: Apache Falcon - Data Management Platform For Hadoop

29

Page 30: Apache Falcon - Data Management Platform For Hadoop

30

Page 31: Apache Falcon - Data Management Platform For Hadoop

31

Page 32: Apache Falcon - Data Management Platform For Hadoop

32

Page 33: Apache Falcon - Data Management Platform For Hadoop

Deployment

33

Page 34: Apache Falcon - Data Management Platform For Hadoop

34

Embedded Mode

Page 35: Apache Falcon - Data Management Platform For Hadoop

Distributed Mode

35

Page 36: Apache Falcon - Data Management Platform For Hadoop

Monitoring

36

Page 37: Apache Falcon - Data Management Platform For Hadoop

SLA Monitoring●Alerts based on data Availability●Dashboard●Pluggable Alerting System

●Email●JMS Notifications

37

Page 38: Apache Falcon - Data Management Platform For Hadoop

Pipeline view

38

Page 39: Apache Falcon - Data Management Platform For Hadoop

Triage

39

Page 40: Apache Falcon - Data Management Platform For Hadoop

40

Page 41: Apache Falcon - Data Management Platform For Hadoop

●Better Authentication and Authorization●Even Better UI●Even Better monitoring●Process SLAs●Streaming support●A more powerful scheduler●Pipeline Recovery

41

Page 42: Apache Falcon - Data Management Platform For Hadoop

Community

42

Page 43: Apache Falcon - Data Management Platform For Hadoop

43