Page 1 © Hortonworks Inc. 2014
Discover HDP 2.2 Data Storage Innovations in Hadoop Distributed File System (HDFS)
Hortonworks. We do Hadoop.
Speakers
Rohit Bakhshi
Hortonworks Senior Product Manager & PM for Apache Hadoop & Apache Solr in Hortonworks Data Platform
Jitendra Pandey
Hortonworks Senior Architect for HDFS
Agenda
• Overview of HDFS
• New HDFS Innovations in HDP 2.2
– Heterogeneous storage
– Encryption
– Operational security enhancements
• Q & A
We’ll move quickly:
• Attendee phone lines are muted
• Text any questions to Jitendra using Webex chat
• Questions will be answered at the end of the call
• Unanswered questions will be answered in an upcoming FAQ/blog post
Big Data, Hadoop & Data Center Re-platforming
Business Drivers
• From reactive analytics to proactive interactions
• Insights that drive competitive advantage & optimal returns
Financial Drivers
• Cost of data systems, as % of IT spend, continues to grow
• Cost advantages of commodity hardware & open source software
Technical Drivers
• Data is growing exponentially & existing systems are overwhelmed
• Predominantly driven by NEW types of data that can inform analytics
There is an inequitable balance between vendor and customer in the market
New Types of Data
Hadoop Value:
• Clickstream: Capture and analyze website visitors’ data trails and optimize your website
• Sensors: Discover patterns in data streaming automatically from remote sensors and machines
• Server Logs: Research logs to diagnose process failures and prevent security breaches
• Sentiment: Understand how your customers feel about your brand and products, right now
• Geographic: Analyze location-based data to manage operations where they occur
• Unstructured: Understand patterns in files across millions of web pages, emails, and documents
A Shift from Reactive to Proactive Interactions
HDP and Hadoop allow organizations to use data to shift interactions from…
From Reactive (Post Transaction) to Proactive (Pre Decision)
• A shift in Advertising: from mass branding to 1x1 targeting
• A shift in Financial Services: from educated investing to automated algorithms
• A shift in Healthcare: from mass treatment to designer medicine
• A shift in Retail: from static branding to real-time personalization
• A shift in Telco: from break then fix to repair before break
Enterprise Goals for the Modern Data Architecture
• Consolidate siloed data sets, structured and unstructured
• Central data set on a single cluster
• Multiple workloads across batch, interactive and real time
• Central services for security, governance and operation
• Preserve existing investment in current tools and platforms
• Single view of the customer, product, supply chain
Diagram: Modern Data Architecture. SOURCES: existing systems (CRM, ERP, other) plus clickstream, web & social, geolocation, sensor & machine, server logs and unstructured data. DATA SYSTEM: RDBMS, EDW and MPP alongside Hadoop (YARN: Data Operating System running interactive, real-time and batch workloads on HDFS, the Hadoop Distributed File System). APPLICATIONS: business analytics, custom applications and packaged applications.
YARN Transformed Hadoop & Opened a New Era
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
Diagram: Batch, interactive & real-time data access engines running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System). Engines shown: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark) and Others (ISV engines), with Tez and Slider as intermediate layers.
YARN Extends Hadoop to Other Data Center Leaders
YARN: The Architectural Center of Hadoop
• Common data platform, many applications
• Support multi-tenant access & processing
• Batch, interactive & real-time use cases
• Supports 3rd-party ISV tools (e.g., SAS, Syncsort, Actian)
YARN Ready Applications: Facilitate ongoing innovation and enterprise adoption via an ecosystem of new and existing “YARN Ready” solutions
Enterprise Hadoop: Central Set of Services
Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for:
• Governance
• Operations
• Security
Everything that plugs into Hadoop inherits these services
SECURITY: Provide layered approach to security through Authentication, Authorization, Accounting, and Data Protection
GOVERNANCE: Load data and manage according to policy
OPERATIONS: Deploy and effectively manage the platform
• Provision, Manage & Monitor: Ambari, ZooKeeper
• Scheduling: Oozie
Diagram: These three services surround the batch, interactive & real-time data access engines (Script: Pig; SQL: Hive; Java/Scala: Cascading; Stream: Storm; Search: Solr; NoSQL: HBase, Accumulo; In-Memory: Spark; Others: ISV engines; with Tez and Slider layers) running on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System).
Hortonworks Development Investment for the Enterprise
Vertical Integration with YARN and HDFS
• Ensure engines can run reliably and respectfully in a YARN-based cluster
• Implement features throughout the stack to accommodate
Hortonworks Development Investment for the Enterprise
Horizontal Integration for Enterprise Services
• Ensure consistent enterprise services are applied across the entire Hadoop stack
• Integrate with and extend existing data center solutions for these key requirements
Hortonworks Data Platform 2.2
HDP Delivers Enterprise Hadoop
Diagram: HDP 2.2 stack
• Batch, interactive & real-time data access on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System): Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), with Tez and Slider layers
• Governance (Data Workflow, Lifecycle & Governance): Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
• Security (Authentication, Authorization, Audit, Data Protection): Storage: HDFS; Resources: YARN; Access: Hive; Pipeline: Falcon; Cluster: Ranger; Cluster: Knox
• Operations: Provision, Manage & Monitor (Ambari, ZooKeeper); Scheduling (Oozie)
• Deployment Choice: Linux, Windows, Cloud
YARN is the architectural center of HDP
• Common data set across all applications
• Batch, interactive & real-time workloads
• Multi-tenant access & processing
Provides comprehensive enterprise capabilities
• Governance
• Security
• Operations
Enables broad ecosystem adoption
• ISVs can plug directly into Hadoop
The widest range of deployment options
• Linux & Windows
• On premises & cloud
Overview of HDFS
HDFS enables the Common Data Platform
HDFS: Storage Platform for Modern Data Architecture
• Common data platform across multiple application workloads
• Reliable
• Scalable
• Cost Efficient
HDFS Innovations on HDP 2.2
HDFS in HDP 2.2: What’s New
Theme: Heterogeneous Storage
• Archive and SSD tiers
• Tech Preview: Enable intermediate data to be stored in memory
Theme: Security
• Encryption: Tech Preview of Transparent Data Encryption
• DataNode does not require root to start: HDFS services in a Kerberized cluster no longer need root to start
New in HDP 2.2: Heterogeneous Storage
Heterogeneous Storage
Before
• DataNode is a single storage
• Storage is uniform: the only storage type is Disk
• Storage types are hidden from the file system
Diagram: All disks as a single storage
New Architecture
• DataNode is a collection of storages
• Support different types of storages
– Disk, SSDs, Memory
Diagram: Collection of tiered storages (also labeled: S3, Swift, SAN, Filers)
HDFS Storage Architecture - Now
Storage Policies: Archival
• Hot: All replicas on DISK
• Warm: 1 replica on DISK, others on ARCHIVE
• Cold: All replicas on ARCHIVE
Diagram: HDP Cluster with nodes carrying DISK storage and nodes carrying ARCHIVE storage
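The three archival policies above amount to a rule that maps a policy and a replication factor to one storage type per replica. A minimal Python sketch of that rule follows. It is an illustration of the slide's definitions only, not HDFS code; the function name `replica_storage_types` is hypothetical.

```python
from typing import List


def replica_storage_types(policy: str, replication: int) -> List[str]:
    """Return the storage type of each replica under the given policy.

    Rules follow the slide: Hot = all DISK, Warm = 1 DISK + rest ARCHIVE,
    Cold = all ARCHIVE. This is a model, not the HDFS implementation.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    name = policy.upper()
    if name == "HOT":
        return ["DISK"] * replication
    if name == "WARM":
        return ["DISK"] + ["ARCHIVE"] * (replication - 1)
    if name == "COLD":
        return ["ARCHIVE"] * replication
    raise ValueError(f"unknown policy: {policy}")


if __name__ == "__main__":
    for policy_name in ("Hot", "Warm", "Cold"):
        print(policy_name, replica_storage_types(policy_name, 3))
```

With a replication factor of 3, a Warm file keeps one replica on DISK for reads and moves the other two to the cheaper ARCHIVE tier.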
Storage Policy: SSD
• SSD: All replicas on SSD
Diagram: HDP Cluster in which each node carries one SSD storage and two DISK storages; the replicas of DataSet A are placed on SSD
Store Intermediate Data in Memory
Tech Preview feature
Application process writes to the Memory Tier (RAM_DISK):
• Write block to memory
• Lazy persist block to disk
For data writes that:
- Need low latency
- Produce data that can be regenerated
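The write path on this slide (block lands in memory first, disk copy made later) can be modeled in a few lines of Python. This is a toy sketch under stated assumptions, not DataNode code: the class `LazyPersistStore` and its methods are hypothetical, and persistence is triggered by an explicit call here where a real system would do it in the background.

```python
import os
import tempfile
from collections import deque


class LazyPersistStore:
    """Toy model: writes land in memory first, disk copies are made later."""

    def __init__(self, disk_dir: str) -> None:
        self.disk_dir = disk_dir
        self.memory = {}        # block_id -> bytes (stands in for RAM_DISK)
        self.pending = deque()  # block ids not yet copied to disk

    def write(self, block_id: str, data: bytes) -> None:
        # The write returns as soon as the block is held in memory.
        self.memory[block_id] = data
        self.pending.append(block_id)

    def persist_pending(self) -> int:
        # Copy queued blocks to disk; returns how many copies were written.
        count = 0
        while self.pending:
            block_id = self.pending.popleft()
            with open(os.path.join(self.disk_dir, block_id), "wb") as f:
                f.write(self.memory[block_id])
            count += 1
        return count

    def is_persisted(self, block_id: str) -> bool:
        return os.path.exists(os.path.join(self.disk_dir, block_id))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as disk_dir:
        store = LazyPersistStore(disk_dir)
        store.write("blk_1", b"intermediate data")
        print(store.is_persisted("blk_1"))  # False: only in memory so far
        store.persist_pending()
        print(store.is_persisted("blk_1"))  # True
```

The window between `write` and `persist_pending` is where a node failure would lose the block, which is why the slide limits the feature to data that can be regenerated.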
New in HDP 2.2: Encryption
HDFS Transparent Data Encryption
• HDFS Encryption: Transparent Encryption in HDFS (HDFS-6134)
– Designate a directory as an encryption zone; all files in the zone are encrypted
– Depends on the Key Management Server
• Key Management Server (HADOOP-10433)
– The custodian for all encryption keys in Hadoop
– REST API for key CRUD operations
• Key Provider API (HADOOP-10141)
– API that allows Hadoop code (NN, DN, DFS clients) to perform CRUD operations on key material
HDFS Transparent Data Encryption
Diagram: An HDFS client in the data access layer reads and writes file data through a crypto stream using the DEK. In HDFS, an encryption zone carries the attributes EZ key ID and version (HDFS-6134), and each encrypted file carries the attributes EDEK and IV. The Name Node and the HDFS client each reach the Key Management System (KMS, HADOOP-10433) through the KeyProvider API (HADOOP-10141). EDEKs and DEKs are exchanged with the KMS, which is labeled with DEKs and EZKs.
Acronym: Description
EZ: Encryption Zone (an HDFS directory)
EZK: Encryption Zone Key; master key associated with all files in an EZ
DEK: Data Encryption Key; unique key associated with each file. The EZ key is used to generate the DEK
EDEK: Encrypted DEK; the Name Node only has access to the encrypted DEK
IV: Initialization Vector
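The key hierarchy in the acronym list (EZK held by the KMS, a per-file DEK, and an EDEK that is all the Name Node sees) can be illustrated with a small Python model. This is a sketch under loud assumptions: `ToyKMS` and its methods are hypothetical stand-ins, not the KMS REST API or the KeyProvider API, and the cipher is a toy SHA-256 keystream used only because the standard library has no AES. It is not secure and not what HDFS uses.

```python
import hashlib
import os
from typing import Dict, Tuple


def keystream_xor(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy stream cipher (SHA-256 in counter mode). NOT secure; demo only."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        block = hashlib.sha256(key + iv + counter.to_bytes(8, "big")).digest()
        stream.extend(block)
        counter += 1
    # XOR is its own inverse, so the same call encrypts and decrypts.
    return bytes(a ^ b for a, b in zip(data, stream))


class ToyKMS:
    """Hypothetical stand-in for the KMS: only this object ever holds EZKs."""

    def __init__(self) -> None:
        self._ezks: Dict[str, bytes] = {}

    def create_zone_key(self, name: str) -> None:
        self._ezks[name] = os.urandom(32)

    def generate_edek(self, ezk_name: str) -> Tuple[bytes, bytes]:
        # Make a fresh per-file DEK and hand back only its encrypted form.
        dek = os.urandom(32)
        iv = os.urandom(16)
        return keystream_xor(self._ezks[ezk_name], iv, dek), iv

    def decrypt_edek(self, ezk_name: str, edek: bytes, iv: bytes) -> bytes:
        return keystream_xor(self._ezks[ezk_name], iv, edek)


if __name__ == "__main__":
    kms = ToyKMS()
    kms.create_zone_key("zone1-key")
    # Name Node side: stores the EDEK and IV as file attributes, never the DEK.
    edek, edek_iv = kms.generate_edek("zone1-key")
    # Client side: asks the KMS for the DEK, then encrypts the data stream.
    dek = kms.decrypt_edek("zone1-key", edek, edek_iv)
    file_iv = os.urandom(16)  # separate IV for file data (a simplification)
    ciphertext = keystream_xor(dek, file_iv, b"sensitive record")
    print(keystream_xor(dek, file_iv, ciphertext))
```

The point of the design is separation of duties: whoever can read Name Node metadata or DataNode blocks still holds only ciphertext and an EDEK, and needs the KMS to turn either into something usable.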
New in HDP 2.2: Operational Security Enhancements
DataNode does not require root
Enables organizations to run services without utilizing root privilege
For Kerberized clusters:
• DataNode no longer needs to run as the Linux root user when starting
• DataNode no longer needs to bind to privileged ports
• DataNode uses SASL to secure block transfers between HDFS clients and DataNodes
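As a rough aid for the points above, the following Python sketch checks a dictionary of HDFS settings for a non-root, SASL-based DataNode setup. The property names (`dfs.data.transfer.protection`, `dfs.datanode.address`, `dfs.http.policy`) are not from the slides; they are taken from the Apache Hadoop secure mode documentation as best recalled and should be verified against the HDP 2.2 documentation. The function `check_sasl_datanode_config` is hypothetical.

```python
from typing import Dict, List

VALID_PROTECTION = {"authentication", "integrity", "privacy"}


def check_sasl_datanode_config(conf: Dict[str, str]) -> List[str]:
    """Return a list of problems; an empty list means none were found."""
    problems = []

    # SASL on the data transfer protocol needs a protection level.
    raw = conf.get("dfs.data.transfer.protection", "")
    levels = {p.strip() for p in raw.split(",") if p.strip()}
    if not levels or not levels <= VALID_PROTECTION:
        problems.append("dfs.data.transfer.protection must list "
                        "authentication, integrity and/or privacy")

    # A non-root DataNode cannot bind to a privileged port (below 1024).
    address = conf.get("dfs.datanode.address", "")
    try:
        port = int(address.rsplit(":", 1)[1])
    except (IndexError, ValueError):
        problems.append("dfs.datanode.address must be host:port")
    else:
        if port < 1024:
            problems.append("dfs.datanode.address uses a privileged port")

    # Assumption: the SASL setup is paired with HTTPS-only web endpoints.
    if conf.get("dfs.http.policy") != "HTTPS_ONLY":
        problems.append("dfs.http.policy should be HTTPS_ONLY")

    return problems


if __name__ == "__main__":
    example = {
        "dfs.data.transfer.protection": "integrity",
        "dfs.datanode.address": "0.0.0.0:10019",
        "dfs.http.policy": "HTTPS_ONLY",
    }
    print(check_sasl_datanode_config(example))  # [] means no problems found
```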
Q & A
Thank you! Learn more at: hortonworks.com/hadoop/hdfs/
Register for the remaining 4 Discover HDP 2.2 Webinars: Hortonworks.com/webinars