Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Upload: hortonworks

Post on 27-Aug-2014

DESCRIPTION

Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, which covers:

•  Why you need Apache Falcon
•  Key new Falcon features
•  Demo: defining data pipelines with replication; policies for retention and late data arrival; managing Falcon server with Ambari

TRANSCRIPT

Page 1: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Page 1 © Hortonworks Inc. 2014

Discover HDP 2.1 Apache Falcon for Data Governance in Hadoop

Hortonworks. We do Hadoop.

Page 2: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Speakers

Justin Sears

Hortonworks Product Marketing Manager

Himanshu Bari

Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform

Venkatesh Seetharam

Foundational Hadoop Architect, Engineer & Committer for Apache Falcon and Apache Knox Gateway projects

Page 3: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Agenda

•  Why You Need Apache Falcon

•  Key New Falcon Features

•  Demo

–  Defining data pipelines

–  Policies for retention

–  Managing Falcon server with Apache Ambari

Page 4: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

A Modern Data Architecture

APPLICATIONS
•  Business Analytics
•  Custom Applications
•  Packaged Applications

DEV & DATA TOOLS: Build & Test

OPERATIONS TOOLS: Provision, Manage & Monitor

DATA SYSTEM
•  REPOSITORIES: RDBMS, EDW, MPP
•  ENTERPRISE HADOOP: Governance & Integration, Data Access, Data Management, Security, Operations

SOURCES
•  OLTP, ERP, CRM Systems
•  Documents, Emails
•  Web Logs, Click Streams
•  Social Networks
•  Machine Generated
•  Sensor Data
•  Geolocation Data

Page 5: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

GOVERNANCE & INTEGRATION
•  Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS

DATA ACCESS
•  Batch: Map Reduce
•  Script: Pig
•  SQL: Hive/Tez, HCatalog
•  NoSQL: HBase, Accumulo
•  Stream: Storm
•  Search: Solr
•  Others: In-Memory Analytics, ISV engines

DATA MANAGEMENT
•  YARN: Data Operating System
•  HDFS (Hadoop Distributed File System), nodes 1 … N

SECURITY
•  Authentication, Authorization, Accounting, Data Protection
•  Storage: HDFS
•  Resources: YARN
•  Access: Hive, …
•  Pipeline: Falcon
•  Cluster: Knox

OPERATIONS
•  Provision, Manage & Monitor: Ambari, Zookeeper
•  Scheduling: Oozie

Page 6: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

HDP 2.1: Enterprise Hadoop

HDP 2.1 Hortonworks Data Platform

(Repeat of the platform diagram from Page 5: Governance & Integration, Data Access, Data Management, Security and Operations components.)

Page 7: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Outline

•  Falcon Overview
•  Features
•  Architecture & Demo

Page 8: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Simple Data Pipeline in Hadoop

Simple data pipeline (in the HDFS data lake):
Data Sources → Raw Data → MR/Pig/Hive → Clean Data → MR/Pig/Hive → Prepped Data → BI Tools

Has a relatively simple Oozie workflow:
Job1 → Job2 → Job3 → … → JobN

Page 9: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Quickly Gets Complicated…

Typical data governance requirements for the Raw → Clean → Prep pipeline:

Data stewards
•  Impact analysis
•  Monitor pipeline
•  Track ownership
•  Late data & failure handling

Compliance teams
•  Audit
•  Retention
•  Eviction

IT admins
•  Monitor infra
•  Replication
•  Archival

Business & data analysts
•  Verify data quality

Manually write & wire:
•  Multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7, … JobN)
•  Other Hadoop tools, e.g. DistCp

Page 10: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Apache Falcon to the Rescue

Data pipeline (Raw → Clean → Prep) defined in Falcon.

Falcon adds the required data governance features:
•  DEFINITION: Replication | Retention | Eviction | Late data
•  MONITORING
•  TRACING: Audit | Lineage | Tagging

Falcon auto-generates & orchestrates:
•  Multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7, … JobN)
•  Other Hadoop ecosystem tools, e.g. DistCp

Page 11: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Outline

•  Falcon Overview
•  Features
•  Architecture & Demo

Page 12: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Falcon Basic Concepts

•  Cluster: Represents the “interfaces” to a Hadoop cluster
•  Feed: Defines a “dataset” (feeds are a.k.a. datasets)
•  Process: Consumes feeds, invokes processing logic & produces feeds

All these put together represent ‘Data Pipelines’ in Hadoop.

Diagram: a FEED (aka DATASET) is INPUT TO a PROCESS; a PROCESS CREATES a FEED; both are tied to a CLUSTER.
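The three concepts can be pictured as a small data model. This is only an illustrative Python sketch, not how Falcon represents entities (Falcon entities are XML specifications); all class names, field names and example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Cluster:
    """Stands in for the 'interfaces' to one Hadoop cluster (hypothetical)."""
    name: str


@dataclass(frozen=True)
class Feed:
    """A dataset, stored on one or more clusters."""
    name: str
    clusters: tuple


@dataclass
class Process:
    """Consumes input feeds, runs processing logic, produces output feeds."""
    name: str
    cluster: Cluster
    inputs: List[Feed] = field(default_factory=list)
    outputs: List[Feed] = field(default_factory=list)


def pipeline_edges(processes):
    """Return (source, relation, target) triples linking feeds and processes."""
    edges = []
    for p in processes:
        edges += [(f.name, "INPUT TO", p.name) for f in p.inputs]
        edges += [(p.name, "CREATES", f.name) for f in p.outputs]
    return edges


# Hypothetical example: one process turning a raw feed into a clean feed.
primary = Cluster("primary")
raw = Feed("raw", (primary,))
clean = Feed("clean", (primary,))
cleanse = Process("cleanse", primary, inputs=[raw], outputs=[clean])

for edge in pipeline_edges([cleanse]):
    print(edge)
```

Keeping clusters, feeds and processes as separate objects that reference each other mirrors the idea that the pieces are defined once and linked into pipelines.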

Page 13: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Data Pipeline Definition

XML based pipeline specification
•  Modular: clusters, feeds & processes are defined separately and then linked together
•  Easy to re-use across multiple pipelines

Out of the box policies
•  Predefined policies for replication, retention & late data handling
•  Easy customization of policies

Extensible
•  Plug in external solutions at any step of the pipeline
•  E.g. invoke third-party data obfuscation components

Page 14: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Replication & Retention

Staged Data: Retain 5 Years

Cleansed Data: Retain 3 Years

Conformed Data: Retain 3 Years

Presented Data: Retain Last Copy Only

•  Sophisticated retention policies expressed in one place

•  Simplify data retention for audit, compliance, or for data re-processing
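One way to picture a retention policy: any dataset instance older than its limit becomes eligible for eviction. The sketch below uses the retention periods from this slide; the function names, the 365-day approximation of a year and the example dates are assumptions, and this is not a description of how Falcon itself implements eviction.

```python
from datetime import datetime, timedelta

# Retention limits taken from the slide; a year is approximated as 365 days.
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
}


def instances_to_evict(dataset, instance_times, now):
    """Return the instances of `dataset` that are older than its retention limit."""
    cutoff = now - RETENTION[dataset]
    return [t for t in instance_times if t < cutoff]


def keep_last_copy_only(instance_times):
    """'Retain last copy only': every instance except the newest is evicted."""
    if not instance_times:
        return []
    newest = max(instance_times)
    return [t for t in instance_times if t != newest]


# Hypothetical instance timestamps, evaluated at a hypothetical "now".
now = datetime(2014, 5, 21)
instances = [datetime(2010, 1, 1), datetime(2012, 1, 1), datetime(2014, 1, 1)]

print(instances_to_evict("cleansed", instances, now))  # only the 2010 instance
print(keep_last_copy_only(instances))                  # the 2010 and 2012 instances
```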

Page 15: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Data Pipeline Monitoring

Centralized monitoring of data pipeline with Falcon + Ambari
•  Pipeline scheduling
•  Pipeline run history
•  Pipeline run alerts

Primary site: Hadoop Cluster-1 (raw → clean → prep)
DR site: Hadoop Cluster-2 (raw → clean → prep)

Page 16: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Data Pipeline Tracing

Data pipeline dependencies
•  View dependencies between clusters, datasets and processes
•  Example feeds: Purchase feed, Customer feed, Product feed, Store feed

Data pipeline tagging
•  Add arbitrary tags to feeds & processes
•  Example: Credit feed tagged “Sensitive”, “encrypted”

Data pipeline audits
•  Know who modified a dataset when and into what

Data pipeline lineage
•  Analyze how a dataset reached a particular state
•  Example: File-1, File-2, File-3

Page 17: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Falcon User Flow

User → Falcon CLI or API → Falcon Server

Stages: Define pipeline, Deploy pipeline, Manage pipeline

Steps:
•  Create cluster entity & process XML specifications
•  Validate and save specifications to HDFS
•  Kick off feeds & processes
•  Schedule “instances” of feeds & processes to run
•  Ensure feeds & processes run as expected
•  Update feeds & processes as needed

Actions: SUBMIT, SCHEDULE, ‘instance’ suspend, resume, kill
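The submit, schedule, suspend, resume and kill actions suggest a simple entity lifecycle. The following is a minimal sketch of that idea; the state names and the set of allowed transitions are assumptions made for illustration and are not taken from Falcon itself.

```python
# Hypothetical lifecycle: state names and transitions are illustrative only.
TRANSITIONS = {
    ("defined", "submit"): "submitted",
    ("submitted", "schedule"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "kill"): "killed",
    ("suspended", "kill"): "killed",
}


def apply_actions(actions, state="defined"):
    """Apply actions in order; raise ValueError on a transition that is not allowed."""
    for action in actions:
        key = (state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action} an entity that is {state}")
        state = TRANSITIONS[key]
    return state


print(apply_actions(["submit", "schedule", "suspend", "resume"]))  # running
```

Modelling the flow as a table of allowed transitions makes it obvious that, in this sketch, an entity must be submitted before it can be scheduled.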

Page 18: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Outline

•  Falcon Overview
•  Features
•  Architecture & Demo

Page 19: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Falcon Architecture

Centralized Falcon Orchestration Framework

•  Users: Data stewards + Hadoop admins
•  Entry points: API & UI, AMBARI
•  Core: Falcon Server, JMS
•  Exchanged with the cluster: Entity Specs, Scheduled Jobs, Process Status
•  Hadoop ecosystem tools: HDFS / Hive, Oozie, MapRed / Pig / Hive / Sqoop / Flume / DistCP

Page 20: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Clickstream enrichment data pipeline

Use case description

•  Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../{date}).
•  Cluster is located in the Oregon data center.
•  Data arrives from all NA-west-coast production servers.
•  The input data feeds are often late by up to 4 hrs.
•  We need to enrich the clickstream data with Ad impression metadata and make it available to our marketing data science team for customer segmentation analysis.
•  Primary Hadoop cluster does not need the raw and enriched click data after 3 months.
•  Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
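The requirements above might map onto Falcon feed specifications roughly as follows. This sketch builds the XML with Python's standard `xml.etree.ElementTree`. The element and attribute names (`frequency`, `late-arrival` with `cut-off`, `retention` with `limit` and `action`, cluster `type` of `source` or `target`) are written from memory of the Falcon feed schema and should be verified against the Falcon entity specification. The entity names, cluster names, validity dates and the enriched-data path are hypothetical placeholders, and required elements such as ACL and schema are omitted for brevity.

```python
import xml.etree.ElementTree as ET


def build_feed(name, frequency, path, clusters, late_cutoff=None):
    """Build a feed entity; `clusters` is a list of (name, type, retention_limit)."""
    feed = ET.Element("feed", {"name": name, "xmlns": "uri:falcon:feed:0.1"})
    ET.SubElement(feed, "frequency").text = frequency
    if late_cutoff:
        ET.SubElement(feed, "late-arrival", {"cut-off": late_cutoff})
    clusters_el = ET.SubElement(feed, "clusters")
    for cluster_name, cluster_type, limit in clusters:
        cluster = ET.SubElement(
            clusters_el, "cluster", {"name": cluster_name, "type": cluster_type}
        )
        # Validity window is a hypothetical placeholder.
        ET.SubElement(
            cluster, "validity",
            {"start": "2014-01-01T00:00Z", "end": "2099-01-01T00:00Z"},
        )
        ET.SubElement(cluster, "retention", {"limit": limit, "action": "delete"})
    locations = ET.SubElement(feed, "locations")
    ET.SubElement(locations, "location", {"type": "data", "path": path})
    return feed


# Raw clicks: land hourly, may be up to 4 hours late, kept 3 months on primary.
clicks = build_feed(
    name="clicks",                      # hypothetical entity name
    frequency="hours(1)",
    path="/../{date}",                  # path is elided in the slide; kept as-is
    clusters=[("oregon-primary", "source", "months(3)")],
    late_cutoff="hours(4)",
)

# Enriched clicks: 3 months on primary, replicated and kept 3 years on backup.
enriched = build_feed(
    name="enriched-clicks",             # hypothetical entity name
    frequency="hours(1)",
    path="/data/enriched-clicks/${YEAR}-${MONTH}-${DAY}-${HOUR}",  # hypothetical
    clusters=[
        ("oregon-primary", "source", "months(3)"),
        ("virginia-backup", "target", "months(36)"),
    ],
)

print(ET.tostring(enriched, encoding="unicode"))
```

Listing the backup cluster as a second cluster on the feed, with its own retention limit, is what lets replication and the two different retention periods be stated in one place.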

Page 21: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Falcon Entity Relationships: CLICKSTREAM ENRICHMENT PIPELINE

•  Clicks ingest (PROCESS) creates Clicks (DATASET)
•  Impressions ingest (PROCESS) creates Impressions (DATASET)
•  Click enrichment (PROCESS) creates Enriched clicks (DATASET)
•  Processes run on the Oregon Hadoop cluster (PRIMARY CLUSTER)
•  Datasets are stored on the primary cluster, with backup to the Virginia Hadoop cluster (BACKUP CLUSTER)
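The relationships in this diagram form a small directed graph, which is what makes impact analysis possible: starting from any entity, everything downstream can be listed. The sketch below is illustrative only. The edge list is one reading of the flattened diagram (the two "input to" edges are inferred from the use case), and the identifiers are hypothetical.

```python
from collections import deque

# Edges read off the diagram; the two "input to" edges are inferred from the use case.
EDGES = [
    ("clicks-ingest", "clicks"),              # process creates dataset
    ("impressions-ingest", "impressions"),    # process creates dataset
    ("clicks", "click-enrichment"),           # dataset is input to process
    ("impressions", "click-enrichment"),      # dataset is input to process
    ("click-enrichment", "enriched-clicks"),  # process creates dataset
    ("enriched-clicks", "virginia-backup"),   # dataset is backed up to cluster
]


def downstream(node, edges=EDGES):
    """Return every entity reachable from `node`, in breadth-first order."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    seen, order, queue = {node}, [], deque([node])
    while queue:
        for nxt in children.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order


# What is affected if the raw clicks dataset is late or wrong?
print(downstream("clicks"))
```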

Page 22: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Learn More About Data Governance in Hadoop

Hortonworks.com/labs/data-management/

Register for the remaining 4 Discover HDP 2.1 Webinars

Hortonworks.com/webinars

Next Webinar: Apache Hadoop 2.4.0, YARN and HDFS

Wednesday, May 28, 9am Pacific

Page 23: Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop

Thank you!