Discover HDP 2.2: Apache Falcon for Hadoop Data Governance


Upload: hortonworks

Post on 20-Aug-2015


Page 1 © Hortonworks Inc. 2014

Discover HDP 2.2: Apache Falcon for Hadoop Data Governance

Hortonworks. We do Hadoop.


Speakers

Justin Sears

Hortonworks Product Marketing Manager

Andrew Ahn

Hortonworks Director of Product Management for Data Governance in Hortonworks Data Platform

Venkatesh Seetharam

Foundational Hadoop Architect, Committer and PMC Member for Apache Falcon


Agenda

•  Introduction to Apache Falcon

•  New Innovation in Apache Falcon 0.6.0

§  HDFS Mirroring

§  Cloud Replication

•  A Look Ahead

•  Q & A

We’ll move quickly:

•  Attendee phone lines are muted

•  Text any questions to Andrew Ahn using Webex chat

•  Questions will be answered at the end

•  Unanswered questions will be answered in an upcoming blog post


Big Data, Hadoop & Data Center Re-platforming

Business Drivers

•  From reactive analytics to proactive interactions

•  Insights that drive competitive advantage & optimal returns

Financial Drivers

•  Cost of data systems, as % of IT spend, continues to grow

•  Cost advantages of commodity hardware & open source software

Technical Drivers

•  Data is growing exponentially & existing systems overwhelmed

•  Predominantly driven by NEW types of data that can inform analytics

There is an inequitable balance between vendor and customer in the market


Hadoop Value: New Types of Data

Clickstream: Capture and analyze website visitors’ data trails and optimize your website

Sensors: Discover patterns in data streaming automatically from remote sensors and machines

Server Logs: Research logs to diagnose process failures and prevent security breaches

Sentiment: Understand how your customers feel about your brand and products – right now

Geographic: Analyze location-based data to manage operations where they occur

Unstructured: Understand patterns in files across millions of web pages, emails, and documents


A Shift from Reactive to Proactive Interactions

HDP and Hadoop allow organizations to use data to shift interactions from…

Reactive (Post Transaction) …to Proactive (Pre Decision)

•  A shift in Advertising: From mass branding …to 1x1 Targeting

•  A shift in Financial Services: From Educated Investing …to Automated Algorithms

•  A shift in Healthcare: From mass treatment …to Designer Medicine

•  A shift in Retail: From static branding …to Real-time Personalization

•  A shift in Telco: From break then fix …to repair before break


Enterprise Goals for the Modern Data Architecture

•  Consolidate siloed data sets, structured and unstructured

•  Central data set on a single cluster

•  Multiple workloads across batch, interactive and real time

•  Central services for security, governance and operation

•  Preserve existing investment in current tools and platforms

•  Single view of the customer, product, supply chain

Diagram: Modern Data Architecture. Applications (Business Analytics, Custom Applications, Packaged Applications) sit on a data system made up of RDBMS, EDW, MPP and Hadoop (YARN: Data Operating System over HDFS, the Hadoop Distributed File System, running Interactive, Real-Time and Batch workloads). Sources are existing systems (CRM, ERP, Other) plus Clickstream, Web & Social, Geolocation, Sensor & Machine, Server Logs and Unstructured data.


YARN Transformed Hadoop & Opened a New Era

YARN: The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

Diagram: YARN: Data Operating System (Cluster Resource Management) on top of HDFS (Hadoop Distributed File System), providing batch, interactive & real-time data access. Engines shown: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark) and Others (ISV Engines), with Tez and Slider as intermediate layers.


YARN Extends Hadoop to Other Data Center Leaders

YARN: The Architectural Center of Hadoop

•  Common data platform, many applications

•  Support multi-tenant access & processing

•  Batch, interactive & real-time use cases

•  Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian)

YARN Ready Applications: Facilitate ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions


Enterprise Hadoop: Central Set of Services


Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:

•  Governance

•  Operations

•  Security

Everything that plugs into Hadoop inherits these services

GOVERNANCE: Load data and manage according to policy

OPERATIONS: Deploy and effectively manage the platform (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie)

SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection

Diagram: these services sit alongside batch, interactive & real-time data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV Engines, with Tez and Slider) on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System).


Hortonworks Development Investment for the Enterprise

Vertical Integration with YARN and HDFS


•  Ensure engines can run reliably and respectfully in a YARN based cluster

•  Implement features throughout the stack to accommodate


Hortonworks Development Investment for the Enterprise

Horizontal Integration for Enterprise Services


•  Ensure consistent enterprise services are applied across the entire Hadoop stack

•  Integrate with and extend existing data center solutions for these key requirements


Hortonworks Data Platform 2.2

HDP Delivers Enterprise Hadoop

Diagram: HDP 2.2 components on YARN: Data Operating System (Cluster Resource Management) and HDFS (Hadoop Distributed File System).

•  Batch, interactive & real-time data access: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), with Tez and Slider

•  Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS

•  Security (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox

•  Operations: Provision, Manage & Monitor (Ambari, Zookeeper); Scheduling (Oozie)

•  Deployment Choice: Linux, Windows, Cloud

YARN is the architectural center of HDP

•  Common data set across all applications

•  Batch, interactive & real-time workloads

•  Multi-tenant access & processing

Provides comprehensive enterprise capabilities

•  Governance

•  Security

•  Operations

Enables broad ecosystem adoption

•  ISVs can plug directly into Hadoop

The widest range of deployment options

•  Linux & Windows

•  On premises & cloud


Introduction to Apache Falcon


Falcon Overview

Centrally Manage Data Lifecycle

– Centralized definition & management of pipelines for data ingest, process & export

Business Continuity & Disaster Recovery

– Out of the box policies for data replication & retention

– End to end monitoring of data pipelines

Address audit & compliance requirements

– Visualize data pipeline lineage

– Track data pipeline audit logs

– Tag data with business metadata

The data traffic cop


Falcon Architecture

Centralized Falcon Orchestration Framework

Diagram: Data stewards and Hadoop admins work through the API & UI and Ambari. The Falcon Server passes entity specs to Oozie as scheduled jobs, and process status comes back over JMS. Oozie drives the Hadoop ecosystem tools (MapRed / Pig / Hive / Sqoop / Flume / DistCP) against HDFS / Hive.


Data Pipeline: Definition

• XML based pipeline specification

– Modular: clusters, feeds & processes defined separately and then linked together

– Easy to re-use across multiple pipelines

• Out of the box policies

– Predefined policies for replication, late data handling & eviction

– Easy customization of policies

• Extensible

– Plug in external solutions at any step of the pipeline

– E.g. invoke third party data obfuscation components
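The modular idea on this slide (clusters, feeds and processes defined separately, then linked by name) can be illustrated with a small Python sketch. This is a toy model, not Falcon's XML schema or API: the `Cluster`, `Feed` and `Process` classes, their fields, and the `unresolved_references` helper are hypothetical names chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    name: str


@dataclass(frozen=True)
class Feed:
    name: str
    clusters: tuple  # names of the clusters that hold this feed


@dataclass(frozen=True)
class Process:
    name: str
    inputs: tuple   # names of the feeds this process reads
    outputs: tuple  # names of the feeds this process writes


def unresolved_references(clusters, feeds, processes):
    """Return sorted (entity, missing_name) pairs for links that point nowhere."""
    cluster_names = {c.name for c in clusters}
    feed_names = {f.name for f in feeds}
    missing = []
    for feed in feeds:
        missing += [(feed.name, c) for c in feed.clusters if c not in cluster_names]
    for proc in processes:
        missing += [(proc.name, f) for f in proc.inputs + proc.outputs
                    if f not in feed_names]
    return sorted(missing)


# Entities are declared independently and only refer to each other by name,
# so the same feed can be re-used by several processes.
clusters = [Cluster("primary"), Cluster("dr")]
feeds = [Feed("raw", ("primary",)), Feed("clean", ("primary", "dr"))]
processes = [
    Process("cleanse", inputs=("raw",), outputs=("clean",)),
    Process("prep", inputs=("clean",), outputs=("prepared",)),  # "prepared" is undefined
]

problems = unresolved_references(clusters, feeds, processes)
print(problems)  # [('prep', 'prepared')]
```

Because the entities only reference each other by name, a check like this can run before anything is scheduled, which is one reason separate definitions are easy to re-use across pipelines.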


Data Pipeline: Monitoring

Centralized monitoring of data pipeline with Falcon + Ambari

•  Pipeline Scheduling

•  Pipeline run history

•  Pipeline run alerts

Diagram: data moves through raw, clean and prep stages on Hadoop Cluster-1 (Primary site) and Hadoop Cluster-2 (DR site).


Data Pipeline: Tracing

Data pipeline dependencies: View dependencies between clusters, datasets and processes (diagram: Purchase feed, Customer feed, Product feed, Store feed)

Data pipeline tagging: Add arbitrary tags to feeds & processes (diagram: Credit feed tagged Sensitive, Encrypted)

Coming Soon

Data pipeline audits: Know who modified a dataset, when, and into what

Data pipeline lineage: Analyze how a dataset reached a particular state (diagram: File-1, File-2, File-3)


Replication with Falcon

Diagram: The Primary Hadoop Cluster holds Staged Data, Cleansed Data, Conformed Data and Presented Data. Staged Data and Presented Data are replicated to the Failover Hadoop Cluster. BI / Analytics tools (BusinessObjects BI) consume the data.

•  Falcon manages workflow and replication

•  Enables business continuity without requiring full data reprocessing

•  Failover clusters can be smaller than primary clusters


Data Retention with Falcon

Diagram: a Retention Policy is attached to each dataset (Staged Data, Cleansed Data, Conformed Data, Presented Data). Policies shown: Retain 5 Years, Retain Last Copy Only, Retain 3 Years, Retain 3 Years.

•  Sophisticated retention policies expressed in one place

•  Simplify data retention for audit, compliance, or for data re-processing
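To make "retention policies expressed in one place" concrete, here is a minimal Python sketch of a single policy table driving eviction decisions. It is not how Falcon implements retention. The `POLICIES` table, the `instances_to_evict` function and the pairing of each dataset with a particular limit are assumptions made for illustration: the slide lists the limits but this transcript does not preserve which dataset each one belongs to.

```python
from datetime import date, timedelta

YEAR = timedelta(days=365)

# Assumed pairing of dataset stage to policy (illustrative only).
POLICIES = {
    "staged": {"max_age": 5 * YEAR},
    "cleansed": {"keep_latest_only": True},
    "conformed": {"max_age": 3 * YEAR},
    "presented": {"max_age": 3 * YEAR},
}


def instances_to_evict(stage, instance_dates, today):
    """Return the dataset instances (by date) that the stage's policy would evict."""
    if not instance_dates:
        return []
    policy = POLICIES[stage]
    if policy.get("keep_latest_only"):
        latest = max(instance_dates)
        return sorted(d for d in instance_dates if d != latest)
    cutoff = today - policy["max_age"]
    return sorted(d for d in instance_dates if d < cutoff)


today = date(2014, 11, 1)
dates = [date(2009, 1, 1), date(2012, 6, 1), date(2014, 10, 1)]

staged_evicted = instances_to_evict("staged", dates, today)
cleansed_evicted = instances_to_evict("cleansed", dates, today)
conformed_evicted = instances_to_evict("conformed", dates, today)

print(staged_evicted)    # only the 2009 instance is older than five years
print(cleansed_evicted)  # everything except the latest copy
print(conformed_evicted) # only the 2009 instance is older than three years
```

Keeping the limits in one table means a change to a retention rule touches one entry rather than every job that cleans up old data.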


Late Data Handling with Falcon

Diagram: Online Transaction Data (via Sqoop) and Web Log Data (via FTP) arrive as Staged Data and are processed into Combined Data. Wait up to 4 hours for FTP data to arrive.

•  Processing waits until all required input data is available

•  Checks for late data arrivals and retriggers processing as necessary

•  Eliminates writing complex data handling rules within applications
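The two behaviours on this slide (wait for required inputs, then retrigger when late data shows up) can be sketched in a few lines of Python. This is an illustrative model under stated assumptions, not Falcon's implementation: `ready_to_run`, `needs_rerun` and the input names are hypothetical, and the 4-hour cut-off is taken from the slide's example and assumed to be measured from the nominal start time.

```python
from datetime import datetime, timedelta

CUT_OFF = timedelta(hours=4)  # "wait up to 4 hours" from the slide


def ready_to_run(required_inputs, arrived_inputs):
    """Processing may start only when every required input has arrived."""
    return set(required_inputs) <= set(arrived_inputs)


def needs_rerun(nominal_time, last_run_time, late_arrival_time, cut_off=CUT_OFF):
    """Retrigger if data arrived after the last run but still inside the cut-off."""
    arrived_after_run = late_arrival_time > last_run_time
    within_cut_off = late_arrival_time - nominal_time <= cut_off
    return arrived_after_run and within_cut_off


nominal = datetime(2014, 11, 1, 0, 0)
last_run = datetime(2014, 11, 1, 1, 0)

waiting = ready_to_run(["transactions", "web_logs"], ["transactions"])
rerun_in_window = needs_rerun(nominal, last_run, datetime(2014, 11, 1, 3, 30))
rerun_too_late = needs_rerun(nominal, last_run, datetime(2014, 11, 1, 5, 0))

print(waiting)          # False: web_logs has not arrived yet
print(rerun_in_window)  # True: late data landed 3.5 hours after nominal time
print(rerun_too_late)   # False: 5 hours is past the 4-hour cut-off
```

With rules like these handled by the platform, the applications themselves do not need their own logic for late-arriving files.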


Falcon Investment Plans

November 2014

•  Authentication & Authorization Integration

•  Pipeline, (HDFS file & Hive) table Lineage GA

•  HDFS DR Replication with Recipes

•  UI for Lineage management

•  Replicate to Cloud - Azure & S3

Post-HDP 2.2 Tech Preview

•  Hive/HCat metastore Replication

•  Expanded UI Entity creation and management

Future Release

•  Hive/HCat metastore Replication GA

•  Pipeline Run Notification via SNMP, e-mail, etc.

•  Hive ACID support

•  HDFS Snapshot Integration

•  File import SSH & SCP

•  Visual Pipeline Designer

•  Resource Metrics

•  Automated migration of data through HDFS storage tiers

DATES AND FEATURES SUBJECT TO CHANGE


New in Apache Falcon 0.6.0: HDFS Mirroring


DR Mirroring of HDFS with Recipes

• Mirroring for disaster recovery and business continuity use cases

• Customizable for multiple targets and frequency of synchronization

• Recipes: a template model for re-use of complex workflows

Diagram: three Recipes, each made up of a Workflow Template (with Reduce, Cleanse and Replicate steps) and a set of Properties.
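The recipe idea (one workflow template re-used with different properties) can be sketched with Python's standard `string.Template`. This is only an analogy for the template-plus-properties model. The template text, the property names (`source_dir`, `target_dir`, `frequency`), the paths and the `instantiate` helper are all hypothetical and do not reflect Falcon's actual recipe file format.

```python
from string import Template

# A hypothetical one-line "workflow template"; real recipes describe full workflows.
WORKFLOW_TEMPLATE = Template(
    "replicate ${source_dir} -> ${target_dir} every ${frequency}"
)


def instantiate(template, properties):
    """Fill a template from a properties dict; a missing property raises KeyError."""
    return template.substitute(properties)


# The same template re-used for two targets with different sync frequencies.
dr_mirror = instantiate(WORKFLOW_TEMPLATE, {
    "source_dir": "hdfs://primary/data/sales",
    "target_dir": "hdfs://dr/data/sales",
    "frequency": "hours(1)",
})
archive_mirror = instantiate(WORKFLOW_TEMPLATE, {
    "source_dir": "hdfs://primary/data/sales",
    "target_dir": "hdfs://archive/data/sales",
    "frequency": "days(1)",
})

print(dr_mirror)
print(archive_mirror)
```

Separating the template from its properties is what lets one complex workflow be customized for multiple targets and synchronization frequencies without being rewritten.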


New in Apache Falcon 0.6.0: Cloud Replication


Replication to Cloud

• Seamlessly replicate to Cloud targets

• Replicate from Cloud as a source

• Support for Amazon S3 and Microsoft Azure

Diagram: On Prem Cluster replicating to and from Azure and Amazon S3.


A Look Ahead


Q & A


Thank you! Learn more at: hortonworks.com/hadoop/falcon/

Register for the remaining 5 Discover HDP 2.2 Webinars

Hortonworks.com/webinars