up-armoring the elephant: adding kerberos-based security to hadoop
DESCRIPTION
Presented at HadoopDay, Seattle, 2010.
TRANSCRIPT
Jakob Homan
• HDFS full-time @Y!
• Apache Hadoop committer
• Past six months – security!
• @blueboxtraveler
Who I am
8/14/20102
Using Hadoop at Yahoo!
38,000+ nodes
170 PB worth of storage
More than 1,000,000 MR jobs monthly
Almost every product uses Hadoop in some way
As of 2009, 72% of patches going into the Hadoop source code were coming from Yahoo!
72%
Developing Hadoop at Yahoo!
Yahoo! provides extensive QE and QA resources to test Hadoop releases at scale.
Q{A,E}
Developing Hadoop at Yahoo!
Developing Hadoop at Yahoo!
The Yahoo! distribution of Hadoop, available on GitHub, is the same code we run internally on our servers.
Patches important to stability and performance are applied here, as well as to Apache.
Developing Hadoop at Yahoo!
The rest of the family
Hadoop at Yahoo! Sunnyvale
Why do we need a secure Hadoop?
• Different clusters for different data not a workable solution
• Costs of operating clusters and moving data too high
Silos don’t cut it anymore
• Personally identifiable information
• Financial data
• Regulatory requirements
More sensitive data
Current state of security in Hadoop
Less than ideal
File system
• POSIX-style permissions
• Audit logging available
Authentication
• Do we really know who we’re talking to?
• Both users and services
Authorization
• Who can see files, launch jobs?
• File system permissions help with this
Current state of security in Hadoop
Bowser copyright Nintendo
The elephant is too trusting
Which can let bad people do bad things
Why is securing Hadoop hard?
Industry-standard network authentication protocol.
Open-source project from MIT.
Acts as a trusted third party to identify and authenticate components in a Hadoop cluster.
It’s out there: Microsoft’s Active Directory can act as a KDC.
Enter Kerberos!
Kerberos workflow
User or service authenticates to KDC
• Users use kinit, can be automatic upon login
• Services use keytabs
KDC provides a ticket-granting ticket (TGT)
• This verifies identity to other actors in system
• TGTs last for 10 hours, renewable for up to 7 days
User or service presents this ticket to NN or JT
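The lifetime rule on this slide (10 hours per ticket, renewable for up to 7 days) can be sketched as a small model. This is only an illustration of the arithmetic, assuming the usual Kerberos behaviour that a ticket can be renewed only while it is still valid; the `Tgt` class and the function names are hypothetical and are not part of Kerberos or Hadoop, and a real KDC makes both durations configurable.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Durations quoted on the slide; treated here as fixed constants.
TICKET_LIFETIME = timedelta(hours=10)
RENEW_WINDOW = timedelta(days=7)


@dataclass
class Tgt:
    """Hypothetical, simplified stand-in for a ticket-granting ticket."""
    issued_at: datetime
    expires_at: datetime
    renew_until: datetime


def issue(now):
    """Model the KDC handing out a fresh TGT after kinit or a keytab login."""
    return Tgt(now, now + TICKET_LIFETIME, now + RENEW_WINDOW)


def is_valid(tgt, now):
    """A ticket is usable from issue time until its current expiry."""
    return tgt.issued_at <= now < tgt.expires_at


def renew(tgt, now):
    """Extend a still-valid ticket, never past the end of the renew window."""
    if not is_valid(tgt, now):
        raise ValueError("ticket expired; authenticate to the KDC again")
    new_expiry = min(now + TICKET_LIFETIME, tgt.renew_until)
    return Tgt(tgt.issued_at, new_expiry, tgt.renew_until)
```

In this model a long-running service keeps calling `renew` before each expiry, and after 7 days it has to go back to its keytab for a new ticket.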
RPC upgraded to use SASL/GSSAPI
Hadoop RPC
• Hadoop has own RPC framework
• Lots of players:
• Namenode
• Datanodes
• Clients
• JobTracker
• TaskTrackers
Simple Authentication and Security Layer (SASL)
• RFC 2222: standard for lightweight authentication between clients and servers
• Works with GSSAPI to support Kerberos as an authentication method
• Delegation tokens are also supported
Delegation Tokens
• DIGEST-MD5-based identifiers generated by Namenode
• Alleviate load on Kerberos server when 10,000s of tasks launch simultaneously
• Used to support cross-cluster authentication
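The idea behind delegation tokens can be sketched as follows: a server that already trusts a user (after Kerberos authentication) hands out a signed identifier, and later checks that identifier locally without contacting the KDC. The sketch below assumes an HMAC-SHA1 over the identifier with a server-side secret; the identifier layout, `MASTER_KEY`, and both function names are hypothetical and do not reproduce Hadoop's actual token format or its DIGEST-MD5 exchange.

```python
import hashlib
import hmac
import os
import time

# Hypothetical secret known only to the issuing server (e.g. the Namenode).
MASTER_KEY = os.urandom(32)


def issue_token(owner, renewer, lifetime_seconds, now=None):
    """Return (identifier, password) for an already-authenticated user."""
    now = time.time() if now is None else now
    expiry = int(now + lifetime_seconds)
    identifier = f"{owner}|{renewer}|{expiry}".encode()
    # The password is derived from the identifier, so the server needs no
    # per-token state to check it later.
    password = hmac.new(MASTER_KEY, identifier, hashlib.sha1).digest()
    return identifier, password


def verify_token(identifier, password, now=None):
    """Check signature and expiry locally, with no call to the KDC."""
    now = time.time() if now is None else now
    expected = hmac.new(MASTER_KEY, identifier, hashlib.sha1).digest()
    if not hmac.compare_digest(expected, password):
        return False
    expiry = int(identifier.decode().rsplit("|", 1)[1])
    return now < expiry
```

Because verification is a local computation, thousands of tasks presenting tokens at once put no load on the Kerberos server, which is the point the slide makes.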
What does a secure Hadoop look like?
Like this
Everyone now authenticated
Users browsing filesystem on command line
Users submitting jobs on command line
Servers within system
• Datanodes ↔ Namenode
• Tasktrackers ↔ JobTracker
• SNN ↔ NameNode
Oozie
• Submits jobs on behalf of users
• Configurable proxy user
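A proxy-user check of the kind described for Oozie can be sketched as a simple policy lookup: the service authenticates as itself, then asks to act as an end user, and the cluster decides whether that impersonation is allowed. Everything below is a hypothetical illustration; the rule table, host names, group names, and `may_impersonate` are assumptions for the sketch, not Hadoop's configuration format or API.

```python
# Hypothetical policy: which authenticated services may act for which users,
# and from which hosts.
PROXY_RULES = {
    "oozie": {
        "hosts": {"oozie-host.example.com"},
        "groups": {"analysts"},
    },
}

# Hypothetical group membership of end users.
USER_GROUPS = {
    "alice": {"analysts"},
    "bob": {"interns"},
}


def may_impersonate(real_user, effective_user, remote_host):
    """Return True if real_user may submit work on behalf of effective_user."""
    rule = PROXY_RULES.get(real_user)
    if rule is None:
        return False  # not configured as a proxy user at all
    if remote_host not in rule["hosts"]:
        return False  # request came from an unexpected machine
    groups = USER_GROUPS.get(effective_user, set())
    return bool(groups & rule["groups"])
```

With these sample rules, `may_impersonate("oozie", "alice", "oozie-host.example.com")` is allowed, while the same request for `bob`, or from any other host, is refused.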
Additional security throughout system
• MapReduce system directory
• Task directory
• On-node HDFS directories
On-disk directory permissions are 700
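Mode 700 means the owning user can read, write, and traverse the directory while group and others get nothing. A minimal sketch of creating and checking such a directory on a POSIX system follows; the helper names are hypothetical, and this is not the code Hadoop itself uses.

```python
import os
import stat
import tempfile


def make_private_dir(parent, name):
    """Create parent/name with mode 700 (owner rwx, nothing for anyone else)."""
    path = os.path.join(parent, name)
    os.mkdir(path, 0o700)
    # mkdir's mode is filtered through the process umask, so set it explicitly.
    os.chmod(path, 0o700)
    return path


def is_private(path):
    """True if the permission bits are exactly 700."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o700


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as parent:
        task_dir = make_private_dir(parent, "taskdir")
        print(task_dir, is_private(task_dir))
```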
• Linux Task Controller now runs as user who owns job
Tasks run as user who launched them
• Use privileged ports for non-RPC calls
• Working on making this pluggable for other types of solutions
DataNodes’ ports secured
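The reasoning behind privileged ports: on Unix-like systems only root may bind a port below 1024, so a client connecting to such a port has some assurance that the listener was started by an administrator rather than by an ordinary user's process. The check itself is trivial, as this sketch shows; the function name is hypothetical and the port numbers in the usage note are only examples.

```python
# On Unix-like systems, binding to ports 1-1023 requires root privileges.
PRIVILEGED_PORT_LIMIT = 1024


def is_privileged_port(port):
    """Return True if binding to this TCP port normally requires root."""
    if not 0 < port < 65536:
        raise ValueError(f"not a valid TCP port: {port}")
    return port < PRIVILEGED_PORT_LIMIT
```

For example, `is_privileged_port(1004)` is true, while `is_privileged_port(50010)` is false, so a listener on the latter could have been started by any local user.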
• Streaming tasks can verify the identity of the TaskTracker and vice versa
Streaming secured
How do I write a secure MapReduce job?
Word count pre-security
Word count post-security
This is how
No changes!
UserGroupInformation.java
• Completely re-written – nexus of authentication code
• Really should never have been public
New type of DistributedCache
• Public is available to all users
• Private is secured for only submitting user
Significant user-facing changes
Authenticating users for web access is pluggable
• Yahoo! has an internal web authentication system; other organizations do as well.
Would really like to have a SPNEGO implementation
• Any volunteers?
Until then, the Doctor has returned
• Simple plugin returns DrWho for web access
Secure web access is pluggable
DistCP works… in 3 out of 4 cases
                      Destination cluster
Source cluster        Unsecure 20     Secure 20
Unsecure 20           ✔               ✗
Secure 20             ✔               ✔
Out of scope
On-disk encryption
Datanode directories’ permissions more locked down
Actual block files and metadata not encrypted
On-the-wire encryption
RPC and HTTP data transfer sent in the clear
Assumption that network is secure
Impact on performance
4%
Maximum performance degradation allowed by our performance team. We met or bested this requirement.
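The 4% budget is a relative slowdown against the unsecured baseline. A minimal sketch of that arithmetic follows; the function names are hypothetical, and the timings in the usage note are made-up placeholders, not measurements from the talk.

```python
# Budget quoted on the slide: at most 4% slower than the baseline.
MAX_DEGRADATION = 0.04


def degradation(baseline_seconds, secured_seconds):
    """Relative slowdown of the secured run versus the baseline run."""
    if baseline_seconds <= 0:
        raise ValueError("baseline must be positive")
    return (secured_seconds - baseline_seconds) / baseline_seconds


def within_budget(baseline_seconds, secured_seconds):
    """True if the secured run stays inside the 4% budget."""
    return degradation(baseline_seconds, secured_seconds) <= MAX_DEGRADATION
```

With placeholder timings, a job that takes 100 s unsecured and 103 s secured is a 3% degradation and passes; 105 s would be 5% and fail.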
Download from http://yhoo.it/aVAke1
Take security for a test drive
Gory details at http://bit.ly/aze3Ba
Or build a secure cluster at home
Other projects and security
Pig
• We’ve worked with the Pig team.
• Pig 6 and 7 support security
HBase
• Work in progress
• JIRA: HBASE-2016
Oozie
• Extensive collaboration.
• Oozie 2 supports security
Hive
• Early work in progress
• JIRA: HIVE-1264
Yahoo!’s distribution
• Security deployed to all clusters.
• The rest soon.
• All patches in Yahoo!’s git repository at: http://github.com/yahoo/hadoop-common
• Committed to open-sourcing all improvements and bug fixes to Y20S.
Current state
Apache Distribution
• All of the security work has been forward-ported to trunk
• Still working on securing new-to-trunk features
• 22 will be the first fully secured Apache release
Current state
Security list
Send security holes to this email list
Two security issues have already been identified; fixes are in flight
Questions?