Discover HDP 2.1: Apache Falcon for Data Governance in Hadoop
DESCRIPTION
Beginning with HDP 2.1, Hortonworks Data Platform ships with Apache Falcon for Hadoop data governance. Himanshu Bari, Hortonworks senior product manager, and Venkatesh Seetharam, Hortonworks co-founder and committer to Apache Falcon, lead this 30-minute webinar, which covers:
+ Why you need Apache Falcon
+ Key new Falcon features
+ Demo: defining data pipelines with replication; policies for retention and late data arrival; managing the Falcon server with Ambari
TRANSCRIPT
Page 1 © Hortonworks Inc. 2014
Discover HDP 2.1 Apache Falcon for Data Governance in Hadoop
Hortonworks. We do Hadoop.
Page 2 © Hortonworks Inc. 2014
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Himanshu Bari
Hortonworks Senior Product Manager & PM for Apache Falcon & Apache Storm in Hortonworks Data Platform
Venkatesh Seetharam
Foundational Hadoop Architect, Engineer & Committer for Apache Falcon and Apache Knox Gateway projects
Page 3 © Hortonworks Inc. 2014
Agenda
• Why You Need Apache Falcon
• Key New Falcon Features
• Demo
  – Defining data pipelines
  – Policies for retention
  – Managing Falcon server with Apache Ambari
Page 4 © Hortonworks Inc. 2014
A Modern Data Architecture
APPLICATIONS: Business Analytics, Custom Applications, Packaged Applications
DEV & DATA TOOLS: Build & Test
OPERATIONS TOOLS: Provision, Manage & Monitor
DATA SYSTEM
• REPOSITORIES: RDBMS, EDW, MPP
• ENTERPRISE HADOOP: Governance & Integration, Security, Operations, Data Access, Data Management
SOURCES
• OLTP, ERP, CRM Systems
• Documents, Emails
• Web Logs, Click Streams
• Social Networks
• Machine Generated
• Sensor Data
• Geolocation Data
Page 5 © Hortonworks Inc. 2014
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
DATA ACCESS
• Batch: MapReduce
• Script: Pig
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System), nodes 1 to N
SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS | Resources: YARN | Access: Hive, … | Pipeline: Falcon | Cluster: Knox
OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
Page 6 © Hortonworks Inc. 2014
HDP 2.1: Enterprise Hadoop
HDP 2.1 Hortonworks Data Platform
GOVERNANCE & INTEGRATION
• Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, NFS, WebHDFS
DATA ACCESS
• Batch: MapReduce
• Script: Pig
• SQL: Hive/Tez, HCatalog
• NoSQL: HBase, Accumulo
• Stream: Storm
• Search: Solr
• Others: In-Memory Analytics, ISV engines
DATA MANAGEMENT
• YARN: Data Operating System
• HDFS (Hadoop Distributed File System), nodes 1 to N
SECURITY
• Authentication, Authorization, Accounting, Data Protection
• Storage: HDFS | Resources: YARN | Access: Hive, … | Pipeline: Falcon | Cluster: Knox
OPERATIONS
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
Page 7 © Hortonworks Inc. 2014
Outline
• Falcon Overview
• Features
• Architecture & Demo
Page 8 © Hortonworks Inc. 2014
Simple Data Pipeline in Hadoop
Simple data pipeline
• Data Sources → Raw Data → Clean Data → Prepped Data → BI Tools
• Data is held in the HDFS data lake
• MR/Pig/Hive jobs transform the data between stages
Has a relatively simple Oozie workflow
• Job1, Job2, Job3, … JobN
Page 9 © Hortonworks Inc. 2014
Quickly Gets Complicated…
Data pipeline: Raw → Clean → Prep
Typical data governance requirements
Data stewards
• Impact analysis
• Monitor pipeline
• Track ownership
• Late data & failure handling
Compliance teams
• Audit
• Retention
• Eviction
IT admins
• Monitor infra
• Replication
• Archival
Business & data analysts
• Verify data quality
Manually write & wire
• Multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7, … JobN)
• Other Hadoop tools, e.g. DistCp
Page 10 © Hortonworks Inc. 2014
Apache Falcon to the Rescue
Data pipeline (Raw → Clean → Prep) defined in Falcon
Falcon adds the required data governance features
• DEFINITION: Replication | Retention | Eviction | Late data
• MONITORING
• TRACING: Audit | Lineage | Tagging
Auto generate & orchestrate
• Multiple complex Oozie workflows (Job1, Job2, Job3, Job4, Job6, Job7, … JobN)
• Other Hadoop ecosystem tools, e.g. DistCp
Page 11 © Hortonworks Inc. 2014
Outline
• Falcon Overview
• Features
• Architecture & Demo
Page 12 © Hortonworks Inc. 2014
Falcon Basic Concepts
• Cluster: Represents the “interfaces” to a Hadoop cluster
• Feed: Defines a “dataset” (feeds are also known as datasets)
• Process: Consumes feeds, invokes processing logic & produces feeds
Together, these entities represent “data pipelines” in Hadoop.
Diagram: CLUSTER; a FEED (a.k.a. DATASET) is INPUT TO a PROCESS; a PROCESS CREATES a FEED
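The three entity types and their links can be pictured with a small Python model. This is only an illustrative sketch, not Falcon's own data model: the class fields, the `pipeline_edges` helper and the endpoint value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Cluster:
    name: str
    # Interface type -> endpoint; the values used below are made up.
    interfaces: Dict[str, str]


@dataclass
class Feed:
    name: str
    cluster: Cluster


@dataclass
class Process:
    name: str
    cluster: Cluster
    inputs: List[Feed] = field(default_factory=list)
    outputs: List[Feed] = field(default_factory=list)


def pipeline_edges(processes):
    """Return (source, relation, target) triples linking feeds and processes."""
    edges = []
    for p in processes:
        edges += [(f.name, "input to", p.name) for f in p.inputs]
        edges += [(p.name, "creates", f.name) for f in p.outputs]
    return edges


# Hypothetical example: one process turning a raw feed into a clean feed.
primary = Cluster("primary", {"write": "hdfs://namenode.example:8020"})
raw = Feed("raw", primary)
clean = Feed("clean", primary)
cleanse = Process("cleanse", primary, inputs=[raw], outputs=[clean])
edges = pipeline_edges([cleanse])
```

Defining the entities separately and linking them by reference mirrors the "INPUT TO" and "CREATES" arrows on the slide.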
Page 13 © Hortonworks Inc. 2014
Data Pipeline Definition
XML-based pipeline specification
• Modular: clusters, feeds & processes are defined separately and then linked together
• Easy to re-use across multiple pipelines
Out-of-the-box policies
• Predefined policies for replication, retention & late data handling
• Easy customization of policies
Extensible
• Plug in external solutions at any step of the pipeline
• E.g. invoke third-party data obfuscation components
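To give a feel for what an XML feed specification with a retention policy might look like, the Python sketch below builds one with the standard library. The element and attribute names are modeled loosely on Falcon feed entities and should be treated as assumptions to check against the Falcon entity specification; a real specification also carries a namespace and further elements omitted here. The feed name, cluster name and path are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_feed_xml(name, cluster, frequency, retention_limit, path):
    """Build a simplified, Falcon-style feed definition as an XML string."""
    feed = ET.Element("feed", {"name": name})
    ET.SubElement(feed, "frequency").text = frequency
    clusters = ET.SubElement(feed, "clusters")
    source = ET.SubElement(clusters, "cluster", {"name": cluster, "type": "source"})
    # Retention policy: delete instances older than the given limit.
    ET.SubElement(source, "retention", {"limit": retention_limit, "action": "delete"})
    locations = ET.SubElement(feed, "locations")
    ET.SubElement(locations, "location", {"type": "data", "path": path})
    return ET.tostring(feed, encoding="unicode")


# Hypothetical hourly feed kept for three months on one cluster.
feed_xml = build_feed_xml(
    "rawClicks",
    "primaryCluster",
    "hours(1)",
    "months(3)",
    "/data/clicks/${YEAR}/${MONTH}/${DAY}/${HOUR}",
)
```

Because the feed only names its cluster, the same cluster definition can be shared by many feeds, which is the modularity the slide describes.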
Page 14 © Hortonworks Inc. 2014
Replication & Retention
Retention by pipeline stage
• Staged Data: retain 5 years
• Cleansed Data: retain 3 years
• Conformed Data: retain 3 years
• Presented Data: retain last copy only
Benefits
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
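As a rough illustration of "retention policies expressed in one place", the sketch below keeps the per-stage limits from the slide in a single table and computes which dataset instances have outlived them. It is not how Falcon implements eviction; the function name is hypothetical and a year is approximated as 365 days.

```python
from datetime import datetime, timedelta

# Retention limits per stage, taken from the slide (1 year ~ 365 days here).
RETENTION = {
    "staged": timedelta(days=5 * 365),
    "cleansed": timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
}


def instances_to_evict(stage, instance_times, now):
    """Return the instance timestamps that have outlived the stage's retention."""
    if stage == "presented":
        # "Retain last copy only": everything except the newest instance goes.
        return sorted(instance_times)[:-1]
    cutoff = now - RETENTION[stage]
    return sorted(t for t in instance_times if t < cutoff)


now = datetime(2014, 5, 1)
old, recent = datetime(2010, 1, 1), datetime(2013, 1, 1)
evict_cleansed = instances_to_evict("cleansed", [recent, old], now)
evict_presented = instances_to_evict("presented", [recent, old], now)
```

Changing a limit means editing one entry in the table rather than touching every cleanup job.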
Page 15 © Hortonworks Inc. 2014
Data Pipeline Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
• Pipeline scheduling
• Pipeline run history
• Pipeline run alerts
Diagram: DATA; primary site (Hadoop Cluster-1) and DR site (Hadoop Cluster-2), each with a raw, clean, prep pipeline
Page 16 © Hortonworks Inc. 2014
Data Pipeline Tracing
• Data pipeline dependencies: View dependencies between clusters, datasets and processes (example feeds: Purchase, Customer, Product, Store)
• Data pipeline tagging: Add arbitrary tags to feeds & processes (example: Credit feed tagged “Sensitive”, “encrypted”)
• Data pipeline audits: Know who modified a dataset, when, and into what
• Data pipeline lineage: Analyze how a dataset reached a particular state
Diagram labels: File-1, File-2, File-3
Page 17 © Hortonworks Inc. 2014
Falcon User Flow
User → Falcon CLI or API → Falcon Server
Define pipeline
• Create cluster entity & process XML specifications
Deploy pipeline
• SUBMIT: Validate and save specifications to HDFS
• SCHEDULE: Kick off feeds & processes; schedule “instances” of feeds & processes to run
Manage pipeline
• Ensure feeds & processes run as expected
• Update feeds & processes as needed
• “Instance” suspend, resume, kill
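The submit, schedule, suspend, resume and kill steps can be read as a small state machine. The Python sketch below is one possible interpretation; the state names and allowed transitions are assumptions for illustration and do not come from Falcon itself.

```python
# (current state, action) -> next state; names are illustrative only.
TRANSITIONS = {
    ("new", "submit"): "submitted",
    ("submitted", "schedule"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume"): "running",
    ("running", "kill"): "killed",
    ("suspended", "kill"): "killed",
}


def apply(state, action):
    """Return the next state, or raise ValueError for a disallowed action."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} an entity in state {state!r}") from None


def run(actions, state="new"):
    """Apply a sequence of actions starting from the given state."""
    for action in actions:
        state = apply(state, action)
    return state


final_state = run(["submit", "schedule", "suspend", "resume"])
```

In this sketch an entity has to be submitted before it can be scheduled, which matches the order of the deploy steps on the slide.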
Page 18 © Hortonworks Inc. 2014
Outline
• Falcon Overview
• Features
• Architecture & Demo
Page 19 © Hortonworks Inc. 2014
Falcon Architecture
Centralized Falcon Orchestration Framework
• Users: Data stewards + Hadoop admins
• Access: API & UI, Ambari
• Components: Falcon Server, JMS, Oozie, HDFS / Hive
• Flow labels: Entity Specs, Scheduled Jobs, Process Status
• Hadoop ecosystem tools: MapRed / Pig / Hive / Sqoop / Flume / DistCp
Page 20 © Hortonworks Inc. 2014
Clickstream enrichment data pipeline
Use case description
• Clicks & impressions data lands hourly in my primary cluster (under HDFS location /../{date}).
• Cluster is located in the Oregon data center.
• Data arrives from all NA-west-coast production servers.
• The input data feeds are often late by up to 4 hrs.
• We need to enrich the clickstream data with Ad impression metadata and make it available to our marketing data science team for customer segmentation analysis.
• Primary Hadoop cluster does not need the raw and enriched click data after 3 months.
• Our IT policy requires us to back up all enriched click data and store it for 3 years in our secondary Hadoop cluster in the Virginia data center.
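The "late by up to 4 hrs" requirement can be sketched as a cut-off check on arrival time. The Python below is a simplified illustration, not Falcon's late-data logic: it assumes lateness is measured from the end of the hourly instance window, and the function name and labels are hypothetical.

```python
from datetime import datetime, timedelta

LATE_CUTOFF = timedelta(hours=4)  # from the use case: feeds can be up to 4 hrs late


def classify_arrival(instance_time, arrival_time, frequency=timedelta(hours=1)):
    """Classify data for an hourly instance as on time, late, or past the cut-off."""
    window_end = instance_time + frequency
    if arrival_time <= window_end:
        return "on-time"
    if arrival_time <= window_end + LATE_CUTOFF:
        return "late: reprocess"
    return "past cut-off: ignore"


instance = datetime(2014, 5, 1, 10, 0)
on_time = classify_arrival(instance, datetime(2014, 5, 1, 10, 30))
late = classify_arrival(instance, datetime(2014, 5, 1, 13, 0))
too_late = classify_arrival(instance, datetime(2014, 5, 1, 16, 0))
```

Data that arrives inside the cut-off would trigger reprocessing of the affected hour; anything later is left out under this sketch's assumptions.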
Page 21 © Hortonworks Inc. 2014
Falcon Entity Relationships
CLICKSTREAM ENRICHMENT PIPELINE
• Clicks ingest (PROCESS) creates Clicks (DATASET)
• Impressions ingest (PROCESS) creates Impressions (DATASET)
• Click enrichment (PROCESS) creates Enriched clicks (DATASET)
• Processes run on the Oregon Hadoop cluster (PRIMARY CLUSTER), where the datasets are stored
• Enriched clicks is backed up to the Virginia Hadoop cluster (BACKUP CLUSTER)
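The entity relationships for this pipeline can be written down as a small directed graph and walked to answer impact-analysis questions such as "what is affected if the clicks ingest fails?". This Python sketch is illustrative only; the edges from Clicks and Impressions into Click enrichment are an assumption drawn from the use case description, and the `downstream` helper is hypothetical.

```python
from collections import deque

# Entity -> entities that directly depend on it.
EDGES = {
    "Clicks ingest": ["Clicks"],
    "Impressions ingest": ["Impressions"],
    "Clicks": ["Click enrichment"],
    "Impressions": ["Click enrichment"],
    "Click enrichment": ["Enriched clicks"],
    "Enriched clicks": [],
}


def downstream(entity):
    """Return every entity reachable from the given one, in breadth-first order."""
    seen = []
    queue = deque(EDGES.get(entity, []))
    while queue:
        node = queue.popleft()
        if node not in seen:
            seen.append(node)
            queue.extend(EDGES.get(node, []))
    return seen


impacted = downstream("Clicks ingest")
```

Walking the same graph in the opposite direction would give the lineage of a dataset such as Enriched clicks.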
Page 22 © Hortonworks Inc. 2014
Learn More About Data Governance in Hadoop
Hortonworks.com/labs/data-management/
Register for the remaining 4 Discover HDP 2.1 Webinars
Hortonworks.com/webinars
Next Webinar: Apache Hadoop 2.4.0, YARN and HDFS
Wednesday, May 28, 9am Pacific
Page 23 © Hortonworks Inc. 2014
Thank you!