Apache Falcon
Data Management Platform for Hadoop
●Ajay Yadava
●Committer, Apache Falcon
●Lead - Apache Falcon @ InMobi
What is Apache Falcon?
Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.
Core Services
● Process Relays
● Late Data Management
● Retention
● Replication
● Acquisition
● Operability
● Holistic declaration of intent
● Anonymization
● Lineage
● SLA
● Life of Byte
Data Relays
Data Retention as a service
Late Data Management
Data Replication as a service
Data Acquisition as a Service
Holistic Declaration of Intent
Operability – Dashboard
Overview
Entity Dependency Graph
A Process depends on its input/output Feeds and on a Cluster; a Feed in turn depends on a Cluster.
Cluster specification
<cluster colo="SF-datacenter" description="" name="prod-cluster" xmlns="uri:falcon:cluster:0.1">
  <interfaces>
    <interface type="readonly"  endpoint="hftp://nn:50070"           version="1.1.2"/>
    <interface type="write"     endpoint="hdfs://nn:8020"            version="1.1.2"/>
    <interface type="execute"   endpoint="rm:8050"                   version="1.1.2"/>
    <interface type="workflow"  endpoint="http://oozie:41000/oozie/" version="4.0.0"/>
    <interface type="registry"  endpoint="thrift://metastore:9083"   version="0.11.0"/> <!-- Hive metastore; endpoint shown here is illustrative -->
    <interface type="messaging" endpoint="tcp://:61616?daemon=true"  version="5.4.3"/>
  </interfaces>
  <locations>
    <location name="staging" path="/projects/falcon/staging"/> <!-- mandatory -->
    <location name="temp"    path="/projects/falcon/tmp"/>     <!-- optional -->
    <location name="working" path="/projects/falcon/working"/> <!-- optional -->
  </locations>
</cluster>
● readonly – used by distcp for replication
● write – writing to HDFS
● execute – used to submit processes as MR
● workflow – submit Oozie jobs
● registry – Hive metastore to register/deregister partitions & get data availability events
● messaging – used for alerts
● locations – HDFS directories used by Falcon
Feed specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
  <frequency>minutes(5)</frequency>
  <late-arrival cut-off="hours(1)"/>
  <sla slaLow="hours(2)" slaHigh="hours(3)"/>
  <clusters>
    <cluster name="primary" type="source">
      <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
    </cluster>
    <cluster name="secondary" type="target">
      <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
      <retention limit="days(2)" action="delete"/>
      <locations>
        <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
      </locations>
    </cluster>
  </clusters>
  <locations>
    <location type="data" path="/data/clicks/enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
  </locations>
  <ACL owner="testuser-ut-user" group="group" permission="0x644"/>
</feed>
Frequency
Location
SLA Monitoring
Data Retention
Data Replication
Process specification
<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
  <clusters>
    <cluster name="corp">
      <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
    </cluster>
  </clusters>
  <parallel>1</parallel>
  <order>LIFO</order>
  <frequency>hours(1)</frequency>
  <inputs>
    <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
  </inputs>
  <outputs>
    <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
  </outputs>
  <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
  <retry policy="periodic" delay="hours(10)" attempts="3"/>
  <late-process policy="exp-backoff" delay="hours(1)">
    <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
  </late-process>
</process>
● clusters – where should the process run?
● parallel / order / frequency – how should the process run?
● inputs – what to consume?
● outputs – what to produce?
● late-process – late data handling
● retry – retry policy
Architecture
Falcon Unit – A pipeline validation framework
Motivation for Falcon Unit
● User errors were caught only at deploy time.
● Input/output feeds and paths were not getting resolved.
● Errors in specifications.
● Integration tests required environment setup/teardown.
● Messy deployment scripts.
● Debugging was cumbersome.
Falcon Unit
In-process execution environment:
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue

Actual cluster:
● Oozie
● HDFS
● YARN
● ActiveMQ
Test suite
Example

Process Submission:
submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Local Mode
submit(EntityType.PROCESS, <Path to Daily clicks Agg XML>); → Cluster Mode

Process Scheduling:
scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

Process Verification:
getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
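Put together, a single Falcon Unit test can submit and schedule a pipeline in-process and then run unchanged against a real cluster. The sketch below is a minimal illustration built around the calls above; the FalconUnitTestBase base class, the submitCluster() helper, the resource paths, and the "daily_clicks_agg" names are assumptions about a typical setup rather than code from this deck.

import org.apache.falcon.entity.v0.EntityType;
import org.apache.falcon.unit.FalconUnitTestBase;
import org.testng.Assert;
import org.testng.annotations.Test;

// Minimal Falcon Unit test sketch (assumed base class and helper names).
public class DailyClicksAggIT extends FalconUnitTestBase {

    @Test
    public void testDailyClicksAgg() throws Exception {
        // Register entities against the in-process services (local Oozie, local FS, ...).
        submitCluster();                                                       // assumed helper for a local cluster entity
        submit(EntityType.FEED, "src/test/resources/clicks-feed.xml");         // input feed (illustrative path)
        submit(EntityType.FEED, "src/test/resources/clicks-agg-feed.xml");     // output feed (illustrative path)
        submit(EntityType.PROCESS, "src/test/resources/daily-clicks-agg.xml"); // process (illustrative path)

        // Schedule one instance and verify it, exactly as on the slide.
        String startTime = "2015-06-20T00:00Z";
        scheduleProcess("daily_clicks_agg", startTime, 1, "local");
        Assert.assertNotNull(getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", startTime));
    }
}

In cluster mode the same test body would run against the actual Oozie/HDFS/YARN endpoints, which is the point of the local-versus-cluster comparison above.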
Deployment
Embedded Mode
Distributed Mode
Monitoring
SLA Monitoring
● Alerts based on data availability
● Dashboard
● Pluggable alerting system
  ● Email
  ● JMS notifications
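The JMS notifications are published to the messaging endpoint configured in the cluster entity (ActiveMQ in this deck), so any JMS client can plug in as an alerting hook. Below is a minimal consumer sketch; the broker URL and the FALCON.ENTITY.TOPIC topic name are assumptions based on Falcon's defaults and should be adjusted for your deployment.

import javax.jms.Connection;
import javax.jms.Message;
import javax.jms.MessageConsumer;
import javax.jms.Session;
import javax.jms.Topic;
import org.apache.activemq.ActiveMQConnectionFactory;

// Minimal sketch of a JMS listener for Falcon notifications.
// Assumptions: the broker URL matches the cluster's "messaging" interface and
// FALCON.ENTITY.TOPIC is the topic Falcon publishes entity/instance events to.
public class FalconJmsListener {
    public static void main(String[] args) throws Exception {
        ActiveMQConnectionFactory factory =
                new ActiveMQConnectionFactory("tcp://activemq-host:61616");
        Connection connection = factory.createConnection();
        connection.start();

        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic("FALCON.ENTITY.TOPIC");
        MessageConsumer consumer = session.createConsumer(topic);

        // Print each notification as it arrives; a real alerting hook would parse
        // the message properties (entity name, instance time, status, ...).
        consumer.setMessageListener((Message message) -> System.out.println(message));
    }
}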
Pipeline view
Triage
● Better authentication and authorization
● Even better UI
● Even better monitoring
● Process SLAs
● Streaming support
● A more powerful scheduler
● Pipeline recovery
Community
Questions?
● Apache Falcon
  ● falcon.apache.org
  ● dev@falcon.apache.org / user@falcon.apache.org
● Ajay Yadava
  ● ajayyadava@apache.org