Page 1 © Hortonworks Inc. 2014
Discover HDP 2.2: Apache Falcon for Hadoop Data Governance
Hortonworks. We do Hadoop.
Speakers
Justin Sears
Hortonworks Product Marketing Manager
Andrew Ahn
Hortonworks Director of Product Management for Data Governance in Hortonworks Data Platform
Venkatesh Seetharam
Foundational Hadoop Architect, Committer and PMC Member for Apache Falcon
Agenda
• Introduction to Apache Falcon
• New Innovation in Apache Falcon 0.6.0
– HDFS Mirroring
– Cloud Replication
• A Look Ahead
• Q & A
We’ll move quickly:
• Attendee phone lines are muted
• Text any questions to Andrew Ahn using Webex chat
• Questions answered at the end
• Unanswered questions and answers in upcoming blog post
Big Data, Hadoop & Data Center Re-platforming
Business Drivers
• From reactive analytics to proactive interactions
• Insights that drive competitive advantage & optimal returns
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source software
Technical Drivers
• Data is growing exponentially & existing systems overwhelmed
• Predominantly driven by NEW types of data that can inform analytics
There is an inequitable balance between vendor and customer in the market
Hadoop Value: New Types of Data
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products – right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
A Shift from Reactive to Proactive Interactions
HDP and Hadoop allow organizations to use data to shift interactions from…
…Reactive (Post Transaction) to Proactive (Pre Decision)
• A shift in Advertising: from mass branding to 1x1 Targeting
• A shift in Financial Services: from Educated Investing to Automated Algorithms
• A shift in Healthcare: from mass treatment to Designer Medicine
• A shift in Retail: from static branding to Real-time Personalization
• A shift in Telco: from break then fix to repair before break
Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
Diagram: Modern Data Architecture
• Applications: Business Analytics, Custom Applications, Packaged Applications
• Data System: RDBMS, EDW, MPP, and YARN: Data Operating System (Interactive, Real-Time, Batch) on HDFS (Hadoop Distributed File System)
• Sources: Existing Systems (CRM, ERP, Other), Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs, Unstructured
YARN Transformed Hadoop & Opened a New Era
YARN The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
Diagram: Batch, Interactive & Real-Time Data Access on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System)
• Script: Pig
• SQL: Hive
• Java, Scala: Cascading
• Stream: Storm
• Search: Solr
• NoSQL: HBase, Accumulo
• In-Memory: Spark
• Others: ISV Engines
• Engine layers shown beneath: Tez, Slider
YARN Extends Hadoop to Other Data Center Leaders
YARN The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian)
YARN Ready Applications
• Facilitates ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
Governance: Load data and manage according to policy
Operations: Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, Zookeeper
• Scheduling: Oozie
Security: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
Diagram: the Security, Governance and Operations services sit alongside Batch, Interactive & Real-Time Data Access (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System).
Hortonworks Development Investment for the Enterprise
Vertical Integration with YARN and HDFS
• Ensure engines can run reliably and respectfully in a YARN based cluster
• Implement features throughout the stack to accommodate
Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
• Ensure consistent enterprise services are applied across the entire Hadoop stack
• Integrate with and extend existing data center solutions for these key requirements
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
Diagram: HDP 2.2 components
• Data access (Batch, Interactive & Real-Time) on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System): Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider
• Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• Security (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox
• Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, Cloud, On-Premises
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
Falcon Overview
Centrally Manage Data Lifecycle
– Centralized definition & management of pipelines for data ingest, process & export
Business Continuity & Disaster Recovery
– Out of the box policies for data replication & retention
– End to end monitoring of data pipelines
Address audit & compliance requirements
– Visualize data pipeline lineage
– Track data pipeline audit logs
– Tag data with business metadata
The data traffic cop
Falcon Architecture
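Centralized Falcon Orchestration Framework
Diagram: Data stewards and Hadoop admins work with the Falcon Server through its API & UI and through Ambari. The Falcon Server hands entity specs to Oozie as scheduled jobs, and process status flows back to the Falcon Server over JMS. Oozie runs the Hadoop ecosystem tools (MapRed / Pig / Hive / Sqoop / Flume / DistCP) against HDFS / Hive.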
Data Pipeline: Definition
• XML based pipeline specification
– Modular: clusters, feeds & processes defined separately and then linked together
– Easy to re-use across multiple pipelines
• Out of the box policies
– Predefined policies for replication, late data handling & eviction
– Easy customization of policies
• Extensible
– Plug in external solutions at any step of the pipeline
– E.g. invoke third party data obfuscation components
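The modular idea above (clusters, feeds and processes defined separately, then linked by name) can be sketched in a few lines of Python. This is only an illustration of the linking step, not Falcon's XML schema or API: the class names, fields and the `unresolved_references` helper are all hypothetical.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the three entity types; not Falcon's schema.
@dataclass(frozen=True)
class Cluster:
    name: str


@dataclass(frozen=True)
class Feed:
    name: str
    clusters: tuple  # names of the clusters that hold this feed


@dataclass(frozen=True)
class Process:
    name: str
    cluster: str    # name of the cluster the process runs on
    inputs: tuple   # names of input feeds
    outputs: tuple  # names of output feeds


def unresolved_references(clusters, feeds, processes):
    """Return a sorted list of names that are referenced but never defined."""
    cluster_names = {c.name for c in clusters}
    feed_names = {f.name for f in feeds}
    missing = set()
    for feed in feeds:
        missing.update(n for n in feed.clusters if n not in cluster_names)
    for proc in processes:
        if proc.cluster not in cluster_names:
            missing.add(proc.cluster)
        missing.update(n for n in proc.inputs + proc.outputs if n not in feed_names)
    return sorted(missing)


# Entities are defined independently and linked only by name.
primary = Cluster("primary")
raw = Feed("raw", clusters=("primary",))
clean = Feed("clean", clusters=("primary",))
cleanse = Process("cleanse", cluster="primary", inputs=("raw",), outputs=("clean",))

print(unresolved_references([primary], [raw, clean], [cleanse]))  # → []
```

Because entities refer to each other only by name, the same cluster or feed definition can be reused by any number of pipelines, which is the re-use point made on the slide.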
Data Pipeline: Monitoring
Centralized monitoring of data pipeline with Falcon + Ambari
• Pipeline Scheduling
• Pipeline run alerts
• Pipeline run history
Diagram: data moves through raw, clean and prep stages on Hadoop Cluster-1 (Primary site) and Hadoop Cluster-2 (DR site).
Data Pipeline: Tracing
Data pipeline dependencies
• View dependencies between clusters, datasets and processes
• Shown: Purchase feed, Customer feed, Product feed, Store feed
Data pipeline tagging
• Add arbitrary tags to feeds & processes
• Shown: Credit feed tagged Sensitive, Encrypted
Coming Soon
Data pipeline audits
• Know who modified a dataset, when, and into what
Data pipeline lineage
• Analyze how a dataset reached a particular state
• Shown: File-1, File-2, File-3
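Dependency and lineage questions of the kind listed on this slide reduce to walking a graph of datasets and the processes that produce them. The sketch below is a minimal illustration under assumed data: the `produced_by` mapping, the process names and the derived dataset names are hypothetical, and only the feed names (purchase, customer, product, store, credit) and the tag names come from the slide.

```python
def upstream(dataset, produced_by):
    """Return every dataset that `dataset` transitively depends on."""
    seen = set()
    stack = [dataset]
    while stack:
        current = stack.pop()
        for parent in produced_by.get(current, {}).get("inputs", ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def tagged_upstream(dataset, produced_by, tags, tag):
    """Return the upstream datasets that carry the given tag, sorted by name."""
    return sorted(d for d in upstream(dataset, produced_by) if tag in tags.get(d, ()))


# Hypothetical pipeline: which process produced each derived dataset, and from what.
produced_by = {
    "enriched-purchase": {"process": "join-customers", "inputs": ("purchase", "customer", "credit")},
    "store-report": {"process": "aggregate-by-store", "inputs": ("enriched-purchase", "store", "product")},
}

# Arbitrary tags attached to feeds, as in the tagging panel.
tags = {"credit": ("Sensitive", "Encrypted")}

print(sorted(upstream("store-report", produced_by)))
print(tagged_upstream("store-report", produced_by, tags, "Sensitive"))
```

Combining the two views answers a typical compliance question: whether a report was derived, directly or indirectly, from a feed tagged as sensitive.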
Replication with Falcon
Diagram:
• Primary Hadoop Cluster: Staged Data, Cleansed Data, Conformed Data, Presented Data
• Failover Hadoop Cluster: Staged Data, Presented Data (both populated by replication)
• BI / Analytics: BusinessObjects BI
• Falcon manages workflow and replication
• Enables business continuity without requiring full data reprocessing
• Failover clusters can be smaller than primary clusters
Data Retention with Falcon
Diagram: Retention Policy
• Datasets: Staged Data, Cleansed Data, Conformed Data, Presented Data
• Policies shown: Retain 5 Years, Retain Last Copy Only, Retain 3 Years, Retain 3 Years
• Sophisticated retention policies expressed in one place
• Simplify data retention for audit, compliance, or for data re-processing
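A retention policy of the kind shown here boils down to deciding which dataset instances fall outside the retention window. The following is a rough sketch of that decision only, not Falcon's eviction logic: the function name and parameters are hypothetical, and a year is approximated as 365 days.

```python
from datetime import date, timedelta


def instances_to_evict(instance_dates, today, retain_days=None, keep_last_only=False):
    """Return the instance dates a policy would evict, oldest first.

    Hypothetical helper: either keep only the newest instance, or keep
    everything newer than `retain_days` before `today`.
    """
    ordered = sorted(instance_dates)
    if keep_last_only:
        return ordered[:-1]
    if retain_days is None:
        raise ValueError("retain_days is required unless keep_last_only is set")
    cutoff = today - timedelta(days=retain_days)
    return [d for d in ordered if d < cutoff]


today = date(2014, 12, 1)
instances = [date(2009, 1, 1), date(2010, 6, 1), date(2014, 11, 30)]

# "Retain 5 Years" and "Retain 3 Years", approximated as 365-day years.
print(instances_to_evict(instances, today, retain_days=5 * 365))
print(instances_to_evict(instances, today, retain_days=3 * 365))
# "Retain Last Copy Only"
print(instances_to_evict(instances, today, keep_last_only=True))
```

Expressing the policy as data (a number of days or a keep-last flag) rather than as per-application cleanup scripts is what lets it be declared in one place.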
Late Data Handling with Falcon
Diagram: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) land as Staged Data and are processed into Combined Data. Wait up to 4 hours for FTP data to arrive.
• Processing waits until all required input data is available
• Checks for late data arrivals and retriggers processing as necessary
• Eliminates writing complex data handling rules within applications
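The behaviour described above can be pictured as a small decision function evaluated for each pipeline instance. This is a sketch under assumptions, not Falcon's implementation: the function, its parameters and the state names are hypothetical, the 4-hour window is taken from the slide, and the "expired" outcome for inputs that never arrive is an assumption the slide does not cover.

```python
from datetime import datetime, timedelta


def next_action(now, nominal_time, inputs_ready, already_processed,
                late_data_arrived, cutoff=timedelta(hours=4)):
    """Decide what to do with one pipeline instance (hypothetical states)."""
    within_window = now <= nominal_time + cutoff
    if not already_processed:
        if inputs_ready:
            return "run"
        # Assumption: once the window closes with inputs still missing, give up.
        return "wait" if within_window else "expired"
    if late_data_arrived and within_window:
        return "rerun"
    return "done"


nominal = datetime(2014, 11, 1, 0, 0)

# FTP data has not arrived one hour in: keep waiting.
print(next_action(nominal + timedelta(hours=1), nominal, False, False, False))
# Data showed up late after a first run, still inside the window: retrigger.
print(next_action(nominal + timedelta(hours=2), nominal, True, True, True))
```

Keeping this logic in the pipeline layer is what spares each application from carrying its own late-arrival rules.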
Falcon Investment Plans
November 2014
• Authentication & Authorization Integration
• Pipeline, (HDFS file & Hive) table Lineage GA
• HDFS DR Replication with Recipes
• UI for Lineage management
• Replicate to Cloud - Azure & S3
Post-HDP 2.2 Tech Preview
• Hive/HCat metastore Replication
• Expanded UI Entity creation and management
Future Release
• Hive/HCat metastore Replication GA
• Pipeline Run Notification via SNMP, e-mail, etc.
• Hive ACID support
• HDFS Snapshot Integration
• File import SSH & SCP
• Visual Pipeline Designer
• Resource Metrics
• Automated migration of data through HDFS storage tiers
DATES AND FEATURES SUBJECT TO CHANGE
DR Mirroring of HDFS with Recipes
• Mirroring for Disaster Recovery and Business Continuity use cases
• Customizable for multiple targets and frequency of synchronization
• Recipes: template model for re-use of complex workflows
Diagram: three Recipes, each pairing Properties with a Workflow Template made up of Reduce, Cleanse and Replicate steps.
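The recipe idea, one workflow template reused with different properties, can be illustrated with Python's standard `string.Template`. This is a sketch of the pattern only: the template text, the property names and `render_recipe` are hypothetical and do not reflect Falcon's actual recipe file format.

```python
from string import Template

# Hypothetical workflow template; the placeholders are filled from properties.
WORKFLOW_TEMPLATE = Template(
    "replicate ${source_dir} on ${source_cluster} "
    "to ${target_dir} on ${target_cluster} every ${frequency}"
)


def render_recipe(properties):
    """Fill the shared template with one recipe's properties.

    Template.substitute raises KeyError if a required property is missing.
    """
    return WORKFLOW_TEMPLATE.substitute(properties)


# Two recipes reuse the same template with different targets and frequencies.
dr_mirror = {
    "source_dir": "/data/raw", "source_cluster": "primary",
    "target_dir": "/data/raw", "target_cluster": "dr",
    "frequency": "60 minutes",
}
archive_mirror = dict(dr_mirror, target_cluster="archive", frequency="1 day")

print(render_recipe(dr_mirror))
print(render_recipe(archive_mirror))
```

Changing a target or a synchronization frequency then means editing a properties entry, while the workflow itself stays untouched.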
Replication to Cloud
• Seamlessly replicate to Cloud targets
• Replicate from Cloud as a source
• Support for Amazon S3 and Microsoft Azure
Diagram: On Prem Cluster replicating to and from Azure and Amazon S3