
Data Pipeline testing now made easy!

With Apache Falcon

$whoami

❖ Pallavi Rao
➢ Architect, InMobi
➢ Committer, Apache Falcon
➢ Contributor, Apache Pig

❖ Pavan Kumar Kolamuri
➢ Sr. Software Engineer, InMobi
➢ Contributor, Apache Falcon and Oozie


What is in store for you?

❖ Some history and introduction to Apache Falcon

❖ Falcon Unit - A new feature in v0.7

❖ Falcon Unit - How it simplifies testing pipelines

❖ Demo, Q&A


Once upon a time...


What kept us up at night?


❏ Failures

❏ Data arriving late

❏ Re-processing

❏ Varied Data Replication

❏ Varied Data Retention

❏ Data Archival

❏ Lineage

❏ SLA monitoring

The pattern


The concoction

The concoction, distributed


Some maladies cured

Data Management
● Data Import/Export
● Retention
● Replication
● Archival

Data Governance
● Lineage
● Audit
● SLA

Process Management
● Relays
● Late Data Handling
● Failure Retries
● Reruns

Sample pipeline in Falcon

[Pipeline diagram]
Click Logs + Metadata → Click Enhancer → Enhanced Clicks → Hourly Aggregation → Hourly Clicks → Daily Aggregation → Daily Clicks

Falcon Feeds and their policies:
● Click Logs: Frequency 5 mins, Retention 2 hours, late data arrival
● Enhanced Clicks: Retention 2 days, replication required
● Hourly Clicks: Retention 1 day
● Daily Clicks: Retention 7 days, replication required

Falcon Processes: Click Enhancer, Hourly Aggregation, Daily Aggregation

Cluster Specification

<cluster colo="default" description="" name="corp" xmlns="uri:falcon:cluster:0.1">
    <tags>consumer=consumer@xyz.com, owner=producer@xyz.com, _department_type=forecasting</tags>
    <interfaces>
        <interface type="readonly" endpoint="webhdfs://localhost:14000" version="1.1.2"/>
        <interface type="write" endpoint="hdfs://localhost:9000" version="1.1.2"/>
        <interface type="execute" endpoint="localhost:8032" version="1.1.2"/>
        <interface type="workflow" endpoint="http://localhost:11000/oozie/" version="4.1.0"/>
        <interface type="registry" endpoint="thrift://localhost:12000" version="0.11.0"/>
    </interfaces>
    <locations>
        <location name="staging" path="/projects/falcon/staging"/>
        <location name="temp" path="/tmp"/>
        <location name="working" path="/projects/falcon/working"/>
    </locations>
    <properties>
        <property name="field1" value="value1"/>
    </properties>
</cluster>

What the spec captures: how to access data (readonly and write interfaces), where to execute (execute interface), HCat (registry interface), the Falcon cache (staging, temp, and working locations), and user-defined properties.

Feed specification

<feed description="enhanced clicks replication feed" name="repl-feed" xmlns="uri:falcon:feed:0.1">
    <frequency>minutes(5)</frequency>
    <late-arrival cut-off="hours(1)"/>
    <sla slaLow="hours(2)" slaHigh="hours(3)"/>
    <clusters>
        <cluster name="corp" type="source">
            <validity start="2013-01-01T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
        </cluster>
        <cluster name="secondary" type="target">
            <validity start="2013-11-15T00:00Z" end="2030-01-01T00:00Z"/>
            <retention limit="days(2)" action="delete"/>
            <locations>
                <location type="data" path="/data/clicks/repl-enhanced/${YEAR}/${MONTH}/${DAY}/${HOUR}/${MINUTE}"/>
            </locations>
        </cluster>
    </clusters>
    …
</feed>

What the spec captures: frequency, data location, SLA monitoring, data retention, and data replication (source and target clusters).

Process specification

<process name="clicks-hourly" xmlns="uri:falcon:process:0.1">
    <clusters>
        <cluster name="corp">
            <validity start="2011-11-02T00:00Z" end="2011-12-30T00:00Z"/>
        </cluster>
    </clusters>
    <parallel>1</parallel>
    <order>LIFO</order>
    <frequency>hours(1)</frequency>
    <inputs>
        <input name="click" feed="clicks-enhanced" start="yesterday(0,0)" end="latest(0)" partition="*/US"/>
    </inputs>
    <outputs>
        <output name="clicksummary" feed="click-hourly" instance="today(0,0)"/>
    </outputs>
    <workflow name="test" version="1.0.0" engine="oozie" path="/user/guest/workflow" lib="/user/guest/workflowlib"/>
    <retry policy="periodic" delay="hours(10)" attempts="3"/>
    <late-process policy="exp-backoff" delay="hours(1)">
        <late-input input="click" workflow-path="hdfs://clicks/late/workflow"/>
    </late-process>
</process>

What the spec captures: where the process should run (cluster), how it should run (parallel, order, frequency), what to consume (inputs), what to produce (outputs), the processing logic (workflow), and late data processing (late-process). For the inputs above, start="yesterday(0,0)" resolves to 00:00 of the previous day relative to the instance time and end="latest(0)" to the most recently available instance of the clicks-enhanced feed, so each hourly run consumes that whole data window.

Why Falcon Unit?


Before Falcon Unit

Unit tests for each module, written with PigUnit, MRUnit, or JUnit.

Integration tests executed by bringing up AWS instances with the entire stack.


[Pipeline diagram: each unit-tested step shows OK in isolation, yet the deployed pipeline surfaces "Spec. Invalid", "Invalid Input", "Invalid Output", and "Improper Replication" on its Falcon feeds and processes]

Motivation for Falcon Unit

❖ User errors caught only at deploy time.
➢ Input/output feeds and paths not getting resolved.
➢ Errors in specification.

❖ Integration tests require environment setup/teardown.
➢ Messy deployment scripts.
➢ Time-consuming.

❖ Debugging is cumbersome.
➢ Logs are scattered.


Falcon Unit

The same test suite runs against either an in-process execution environment or an actual cluster:

In-process execution env.
● Local Oozie
● Local File System
● Local Job Runner
● Local Message Queue

Actual cluster
● Oozie
● HDFS
● YARN
● ActiveMQ

What you can test

Data Management
● Data creation
● Data injection
● Retention
● Replication

Data Governance
● Lineage
● Data availability for verification

Process Management
● Validation of definition
● Entity scheduling and status verification
● Correctness of the data window being picked up
● Reruns
● Missing dependencies/properties

After Falcon Unit

[Pipeline diagram: every Falcon feed and process in the pipeline shows OK, and the pipeline as a whole is TESTED before deployment]

Falcon Unit Illustrated


Capabilities with example


❖ Entity creation and data flow validation.

❖ Data Injection.

❖ Data Retention and Replication.

❖ Seamless API for cluster and local mode.

Example Pipeline

[Diagram: the Daily Aggregation process consumes the Hourly Clicks and Deferred Clicks feeds (among others, elided) and produces the Daily Clicks feed]

Cluster Creation

→ Local Mode

submit(EntityType.CLUSTER, coloName, clusterName, propsMap);

submitCluster(); (uses defaults)

→ Cluster Mode

submit(EntityType.CLUSTER, <Path to Cluster XML>);

Feed Creation

Submit feed:

submit(EntityType.FEED, <Path to Hourly Clicks XML>);

Inject Data:

createData("HourlyClicks", "local", scheduleTime, <test data path>, numInstances);

/projects/falcon/clicks/hourly/2015/09/28/00/<data>
…/01/<data>
…/02/<data>
…/03/<data>
…/04/<data>
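Pulled together, the calls above make a complete local-mode test. The sketch below is illustrative only: it assumes a TestNG test class extending Falcon Unit's FalconUnitTestBase (the base class that exposes submitCluster, submit, and createData), and the class name, feed XML path, and test data path are hypothetical.

import org.apache.falcon.entity.v0.EntityType;
import org.apache.falcon.unit.FalconUnitTestBase;
import org.testng.annotations.Test;

// Minimal local-mode sketch; names and paths below are placeholders.
public class HourlyClicksTest extends FalconUnitTestBase {

    @Test
    public void submitAndInjectHourlyClicks() throws Exception {
        // Local cluster with default interfaces and locations
        submitCluster();

        // Register the HourlyClicks feed from its XML definition
        submit(EntityType.FEED, "src/test/resources/hourly-clicks-feed.xml");

        // Materialize 5 hourly instances of test data starting at the given instance time
        createData("HourlyClicks", "local", "2015-09-28T00:00Z", "src/test/resources/data", 5);
    }
}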

Process Creation

Process submission (the call is identical in local and cluster mode):

submit(EntityType.PROCESS, <Path to Daily Clicks Agg XML>);

Process scheduling:

scheduleProcess("daily_clicks_agg", startTime, numInstances, clusterName);

Process verification:

getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", scheduleTime);
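Continuing the hypothetical test class sketched earlier, scheduling and verification can be combined into one assertion; the start time, instance count, cluster name, and the org.testng.Assert / WorkflowStatus imports are assumptions based on the calls shown on this and the replication slide.

    @Test
    public void scheduleAndVerifyDailyAggregation() throws Exception {
        // Register the aggregation process (same call in local and cluster mode)
        submit(EntityType.PROCESS, "src/test/resources/daily-clicks-agg.xml");

        // Schedule one instance on the local cluster
        scheduleProcess("daily_clicks_agg", "2015-09-28T00:00Z", 1, "local");

        // The instance at the schedule time should run to completion
        Assert.assertEquals(
                getInstanceStatus(EntityType.PROCESS, "daily_clicks_agg", "2015-09-28T00:00Z"),
                WorkflowStatus.SUCCEEDED);
    }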

Data Retention

● Data retention can be validated by scheduling the feed in both cluster mode and local mode:

createData("HourlyClicks", "local", timeStamp, <test data path>);

schedule(EntityType.FEED, "HourlyClicks", "local");

status = getInstanceStatus(EntityType.FEED, "HourlyClicks");

● Falcon Unit provides APIs to validate the existence of paths.
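For the path-existence checks, one option is to go through Hadoop's FileSystem API directly, as in this sketch; the expired and retained paths are hypothetical, and this stands in for, rather than demonstrates, Falcon Unit's own validation helpers.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

    @Test
    public void retentionDeletesExpiredInstances() throws Exception {
        // In local mode the feed data lands on the local file system
        FileSystem fs = FileSystem.get(new Configuration());

        // Instances older than the retention limit should be gone...
        Assert.assertFalse(fs.exists(new Path("/projects/falcon/clicks/hourly/2015/09/26/00")));

        // ...while instances inside the retention window remain
        Assert.assertTrue(fs.exists(new Path("/projects/falcon/clicks/hourly/2015/09/28/00")));
    }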


Data Replication

Data replication can also be tested using Falcon Unit:

submitCluster(coloName, srcCluster, propsMap);

submitCluster(coloName, targetCluster, propsMap);

createData("HourlyClicks", "srcCluster", timeStamp, <test data path>);

schedule(EntityType.FEED, "HourlyClicks", targetCluster);

status = getInstanceStatus(EntityType.FEED, "HourlyClicks", targetCluster);

Assert.assertEquals(status, WorkflowStatus.SUCCEEDED);


Going forward ...

❖ Improved data injection
➢ Generation of test data from a template
➢ Sampling of production data for testing

❖ Support for other data lifecycle operations
➢ Data ingestion, export

❖ Maven plugin for build-time validation of definitions


Demo


Questions?


If you want to ask later - user@falcon.apache.org
If you want to contribute - dev@falcon.apache.org
