Making Apache Hadoop Secure
Devaraj Das
[email protected]
Yahoo’s Hadoop Team

Post on 25-Feb-2016



TRANSCRIPT

Hadoop Security


Berlin Buzzwords 2011

Introductions
- Who I am
  - Principal Engineer at Yahoo! Sunnyvale
  - Working on Apache Hadoop and related projects
    - MapReduce, Hadoop Security, HCatalog
  - Apache Hadoop Committer/PMC member
  - Apache HCatalog Committer

Problem
- Different yahoos need different data.
  - PII versus financial
- Need assurance that only the right people can see data.
- Need to log who looked at the data.
- Yahoo! has more yahoos than clusters.
  - Requires isolation or trust.
- Security improves ability to share clusters between groups.

History
- Originally, Hadoop had no security.
  - Only used by small teams who trusted each other
  - On data all of them had access to
- Users and groups were added in 0.16.
  - Prevented accidents, but easy to bypass
  - hadoop fs -Dhadoop.job.ugi=joe -rmr /user/joe
- We needed more.

Why is Security Hard?
- Hadoop is distributed
  - Runs on a cluster of computers.
- Trust must be mutual between Hadoop servers and the clients.

Need Delegation
- Not just client-server: the servers access other services on behalf of others.
- MapReduce needs to have the user's permissions
  - Even if the user logs out
- MapReduce jobs need to:
  - Get and keep the necessary credentials
  - Renew them while the job is running
  - Destroy them when the job finishes

Solution
- Prevent unauthorized HDFS access
  - All HDFS clients must be authenticated.
  - Including tasks running as part of MapReduce jobs
  - And jobs submitted through Oozie.
- Users must also authenticate servers
  - Otherwise fraudulent servers could steal credentials
- Integrate Hadoop with Kerberos
  - Proven open source distributed authentication system.
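The credential lifecycle listed under Need Delegation (get and keep, renew while the job runs, destroy when it finishes) can be illustrated with a small sketch. This is not Hadoop code: `JobCredential`, `run_job`, and the `interval` and `max_lifetime` parameters are hypothetical names chosen for illustration, and time is modeled as plain numbers.

```python
from dataclasses import dataclass


@dataclass
class JobCredential:
    """Hypothetical stand-in for a credential held on behalf of a job."""
    owner: str
    issued_at: float
    expires_at: float
    max_lifetime: float
    cancelled: bool = False

    def is_valid(self, now: float) -> bool:
        # A credential is usable only if it is neither cancelled nor expired.
        return not self.cancelled and now < self.expires_at

    def renew(self, now: float, interval: float) -> None:
        # Renewal works only on a still-valid credential and never extends
        # it past issued_at + max_lifetime.
        if not self.is_valid(now):
            raise ValueError("cannot renew an expired or cancelled credential")
        self.expires_at = min(now + interval, self.issued_at + self.max_lifetime)

    def cancel(self) -> None:
        # Destroy the credential so it cannot be used after the job ends.
        self.cancelled = True


def run_job(owner, renewal_times, interval, max_lifetime):
    """Get a credential, renew it at each given time, destroy it at the end."""
    cred = JobCredential(owner, 0.0, float(interval), float(max_lifetime))
    try:
        for now in renewal_times:
            cred.renew(now, interval)
    finally:
        # Cancellation happens even if a renewal fails.
        cred.cancel()
    return cred
```

The `finally` clause reflects the last bullet: the credential is destroyed whether the job finishes normally or fails partway through.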

Requirements
- Security must be optional.
  - Not all clusters are shared between users.
- Hadoop must not prompt for passwords.
  - Makes it easy to make trojan horse versions.
- Must have single sign-on.
- Must handle the launch of a MapReduce job on 4,000 nodes.
- Performance / reliability must not be compromised.

Security Definitions
- Authentication: Who is the user?
  - Hadoop 0.20 completely trusted the user
    - Sent user and groups over wire
  - We need it on both RPC and Web UI.
- Authorization: What can that user do?
  - HDFS had owners and permissions since 0.16.
- Auditing: Who did that?

Authentication
- RPC authentication using Java SASL (Simple Authentication and Security Layer)
  - Changes low-level transport
  - GSSAPI (supports Kerberos v5)
  - Digest-MD5 (needed for authentication using various Hadoop Tokens)
  - Simple
- Web UI authentication done via plugin
  - Yahoo! uses internal plugin, SPNEGO, etc.

Authorization
- HDFS
  - Command line and semantics unchanged
- MapReduce added Access Control Lists
  - Lists of users and groups that have access.
  - mapreduce.job.acl-view-job: view job
  - mapreduce.job.acl-modify-job: kill or modify job
- Code for determining group membership is pluggable.
  - Checked on the masters.
- All servlets enforce permissions.

Auditing
- HDFS can track access to files
- MapReduce can track who ran each job
- Provides fine-grained logs of who did what
- With strong authentication, logs provide audit trails

Kerberos and Single Sign-on
- Kerberos allows user to sign in once
  - Obtains Ticket Granting Ticket (TGT)
- kinit: get a new Kerberos ticket
- klist: list your Kerberos tickets
- kdestroy: destroy your Kerberos ticket
- TGTs last for 10 hours, renewable for 7 days by default
- Once you have a TGT, Hadoop commands just work
  - hadoop fs -ls /
  - hadoop jar wordcount.jar in-dir out-dir
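To make the job ACLs from the Authorization slide concrete, here is a minimal sketch of how a master might check a user against mapreduce.job.acl-view-job or mapreduce.job.acl-modify-job. The value format assumed here (comma-separated users, one space, comma-separated groups, with `*` meaning everyone) is an assumption about the ACL syntax, and `parse_acl` and `is_allowed` are hypothetical helpers, not Hadoop APIs. Group membership is passed in directly, standing in for the pluggable group lookup.

```python
VIEW_ACL = "mapreduce.job.acl-view-job"
MODIFY_ACL = "mapreduce.job.acl-modify-job"


def parse_acl(spec):
    """Parse 'user1,user2 group1,group2' into (allow_all, users, groups).

    Assumed syntax: users before the first space, groups after it.
    A leading space therefore means 'no users, only groups'.
    """
    if spec.strip() == "*":
        return True, set(), set()
    users_part, _, groups_part = spec.partition(" ")
    users = {u.strip() for u in users_part.split(",") if u.strip()}
    groups = {g.strip() for g in groups_part.split(",") if g.strip()}
    return False, users, groups


def is_allowed(job_conf, acl_key, user, user_groups):
    """Return True if the user, or any of the user's groups, is on the ACL."""
    # A missing ACL is treated as an empty list, so nobody is allowed.
    allow_all, users, groups = parse_acl(job_conf.get(acl_key, ""))
    return allow_all or user in users or bool(groups & set(user_groups))


# Hypothetical job configuration: joe, ann, and the devs group may view
# the job; only members of the admins group may kill or modify it.
example_conf = {
    VIEW_ACL: "joe,ann devs",
    MODIFY_ACL: " admins",
}
```

With `example_conf`, `is_allowed(example_conf, VIEW_ACL, "bob", ["devs"])` is true because bob belongs to an allowed group, while the same call with `MODIFY_ACL` is false. The sketch deliberately ignores any special treatment of the job owner or cluster administrators.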

Kerberos Dataflow

HDFS Delegation Tokens
- To prevent an authentication flood at the start of a job, the NameNode creates delegation tokens.
  - Kerberos credentials are not passed to the JobTracker.
- Allows user to authenticate once and pass credentials to all tasks of a job.
- JobTracker automatically renews tokens while job is running.
  - Max lifetime of delegation tokens is 7 days.
- Cancels tokens when job finishes.

Other Tokens
- Block Access Token
  - Short-lived tokens for securely accessing the DataNodes from HDFS clients doing I/O
  - Generated by NameNode
- Job Token
  - For Task to TaskTracker shuffle (HTTP) of intermediate data
  - For Task to TaskTracker RPC
  - Generated by JobTracker
- MapReduce Delegation Token
  - For accessing the JobTracker from tasks
  - Generated by JobTracker

Proxy-Users
- Oozie (and other trusted services) run operations on Hadoop clusters on behalf of other users.
- Configure HDFS and MapReduce with the oozie user as a proxy:
  - Group of users that the proxy can impersonate
  - Which hosts they can impersonate from

Primary Communication Paths
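The two proxy-user restrictions from the Proxy-Users slide (which groups of users may be impersonated, and from which hosts) can be sketched as a single check. The property names `hadoop.proxyuser.<user>.groups` and `hadoop.proxyuser.<user>.hosts` follow Hadoop's proxy-user configuration as best I recall and should be treated as an assumption here; `can_impersonate` and the host and group names are hypothetical, and the real check lives inside the Hadoop servers.

```python
def _csv(value):
    """Split a comma-separated config value into a set of trimmed items."""
    return {item.strip() for item in value.split(",") if item.strip()}


def can_impersonate(conf, proxy_user, request_host, target_user_groups):
    """Decide whether proxy_user may act for a user in target_user_groups.

    Both conditions must hold: the impersonated user belongs to an allowed
    group, and the request comes from an allowed host. '*' allows any value.
    A proxy user with no configuration is allowed nothing.
    """
    prefix = "hadoop.proxyuser." + proxy_user
    allowed_groups = _csv(conf.get(prefix + ".groups", ""))
    allowed_hosts = _csv(conf.get(prefix + ".hosts", ""))

    group_ok = "*" in allowed_groups or bool(allowed_groups & set(target_user_groups))
    host_ok = "*" in allowed_hosts or request_host in allowed_hosts
    return group_ok and host_ok


# Hypothetical configuration: oozie may impersonate members of the
# analysts or etl groups, but only when connecting from oozie1.example.com.
example_conf = {
    "hadoop.proxyuser.oozie.groups": "analysts,etl",
    "hadoop.proxyuser.oozie.hosts": "oozie1.example.com",
}
```

Requiring both conditions limits the damage if the proxy account is compromised: a stolen oozie credential used from an unlisted host, or aimed at a user outside the listed groups, is rejected.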

Task Isolation
- Tasks now run as the user.
  - Via a small setuid program
  - Can't signal other users' tasks or the TaskTracker
  - Can't read other tasks' jobconf, files, outputs, or logs
- Distributed cache
  - Public files shared between jobs and users
  - Private files shared between jobs

Questions?
- Questions should be sent to:
  - common/hdfs/[email protected]
- Security holes should be sent to:
  - [email protected]
- From the 0.20.203 release of Apache Hadoop:
  - http://svn.apache.org/repos/asf/hadoop/common/branches/branch-0.20-security/
- Thanks! (also thanks to Owen O'Malley for the slides)

If time permits

Upgrading to Security
- Need a KDC with all of the user accounts.
- Need service principals for all of the servers.
- Need user accounts on all of the slaves.
  - If you use the default group mapping, you need user accounts on the masters too.
- Need to install policy files for stronger encryption for Java
  - http://bit.ly/dhM6qW

Mapping to Usernames
- Kerberos principals need to be mapped to usernames on servers. Examples:
  - [email protected] -> ddas
  - jt/[email protected] -> mapred
- Operator can define translation.
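A rough sketch of the principal-to-username mapping described above follows. It is a simplified stand-in, not Hadoop's actual translation mechanism: `map_principal`, the `service_users` table, and the realm `EXAMPLE.COM` are all hypothetical. The rule it assumes is that a user principal in the default realm maps to its first component, while a service principal (one with a host part) maps through an operator-defined table.

```python
import re

# primary[/instance]@REALM, e.g. "ddas@EXAMPLE.COM" or
# "jt/host1.example.com@EXAMPLE.COM" (hypothetical realm and host).
PRINCIPAL_RE = re.compile(
    r"^(?P<primary>[^/@]+)(?:/(?P<instance>[^@]+))?@(?P<realm>.+)$"
)


def map_principal(principal, service_users, default_realm):
    """Translate a Kerberos principal into a local username.

    service_users is the operator-defined table from a service's primary
    name (such as 'jt') to the local account it runs as.
    """
    match = PRINCIPAL_RE.match(principal)
    if match is None:
        raise ValueError("not a Kerberos principal: " + principal)
    if match.group("realm") != default_realm:
        raise ValueError("no mapping for realm: " + match.group("realm"))

    primary = match.group("primary")
    if match.group("instance") is None:
        # Plain user principal: the local username is the primary component.
        return primary
    # Service principal: look up the account chosen by the operator.
    if primary not in service_users:
        raise ValueError("no mapping for service: " + primary)
    return service_users[primary]
```

For example, with `service_users = {"jt": "mapred"}` and a default realm of `EXAMPLE.COM`, `ddas@EXAMPLE.COM` maps to `ddas` and `jt/host1.example.com@EXAMPLE.COM` maps to `mapred`, mirroring the shape of the two examples on the slide.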