secure search - using apache sentry to add authentication and authorization support to solr:...

Upload: lucidworks

Post on 18-Jul-2015

TRANSCRIPT

Secure Solr With Apache Sentry
Gregory Chanan, Engineer @ Cloudera
gchanan AT cloudera.com

Who Am I?
•  Software Engineer at Cloudera
•  Apache Solr Committer
•  Apache Sentry Committer (incubating)
•  Apache HBase Committer

Overview
•  Motivation
   •  Why security for Solr / SolrCloud?
   •  Why Apache Sentry?
•  Authentication
•  Authorization
   •  Collection-level
   •  Document-level
•  Secure Impersonation
•  Performance
•  Future Work

Why Security?
•  Apache Solr only provides minimal security features
   “Solr allows any client with access to it to add, update, and delete documents (and of course search/read too), including access to the Solr configuration and schema files and the administrative user interface.” [1]
•  In the past, deployed as a single server
   “It is strongly recommended that the application server containing Solr be firewalled such the only clients with access to Solr are your own.” [1]

Why Security?
•  SolrCloud driving adoption in Big Data space
•  Now, a component of a multi-tenant Hadoop cluster
   •  Non-Solr users on cluster
   •  Solr communicates across machines and services

Why Apache Sentry?
•  Sentry already established in Hadoop ecosystem
   •  Has a well-understood authentication model (Kerberos)
   •  Has a well-understood privilege/action model
•  Security-focused project
   •  Solr focuses on search
   •  Sentry focuses on security

Authentication
•  Authentication: verifying the identity of a user or service
•  Solr supports authenticating with dependent services (i.e. HDFS and ZooKeeper*)
•  Sentry goal: support other services / users authenticating with Solr
•  Consistent with other HTTP-level Hadoop services (e.g. Oozie and HttpFs), Apache Sentry uses:
   •  Kerberos: a mutual authentication protocol that works on the basis of “tickets”
   •  SPNego: a negotiation mechanism for selecting an underlying authentication protocol

SPNego advantages
•  HTTP tools have built-in support for SPNego/Kerberos
   •  Web browsers
   •  curl (with --negotiate)
   •  HTTP libraries, including Apache HttpClient (used by SolrJ)
•  Although an authentication (not authorization) protocol, can be used for cluster-level access control
   •  Only grant Kerberos credentials to users who should have access to the cluster

Authentication Setup
•  Server side: use the Sentry-provided web.xml, which has a Kerberos/SPNego-aware filter
   •  Have to set up keytabs/principals/JAAS configurations
•  Client side: Sentry provides HttpClient / HttpSolrServer configuration for communicating with Kerberos/SPNego-aware Solr servers
   •  Have to set up keytabs/principals/JAAS configurations
•  Cloudera Manager can do the setup for you

Authorization
•  Authorization: controlling access to resources
•  Solr does not provide collection/document authorization support
   •  Does support “hooks” via solr.xml and solrconfig.xml to override request handler implementations
•  Sentry uses these “hooks” to implement collection- and document-level authorization

Collection-level Authorization
•  Sentry supports role-based granting of privileges
   •  Each role can be granted QUERY, UPDATE, and/or administrative privileges on a collection
•  Privileges stored in a “policy file” on HDFS:

   [groups]
   # Assigns each Hadoop group to its set of roles
   dev_ops = engineer_role, ops_role
   [roles]
   # Assigns each role to its set of privileges
   engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
   ops_role = collection = hbase_logs->action=Query
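The group → role → privilege lookup that the policy file describes can be sketched in a few lines of Python. This is only an illustration of the model, not Sentry's implementation: the function names `parse_policy`, `parse_privilege`, and `is_allowed` are made up for this example, and wildcards or other policy-file features are deliberately ignored.

```python
def parse_policy(text):
    """Parse a Sentry-style policy file into {section: {key: [values]}}."""
    sections, current = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            current = sections.setdefault(line[1:-1], {})
            continue
        key, _, value = line.partition("=")  # split on the first '=' only
        current[key.strip()] = [v.strip() for v in value.split(",") if v.strip()]
    return sections


def parse_privilege(spec):
    """Turn 'collection = source_code->action=Query' into ('source_code', 'query')."""
    resource, _, action = spec.partition("->")
    collection = resource.partition("=")[2].strip()
    return collection, action.partition("=")[2].strip().lower()


def is_allowed(policy, groups, collection, action):
    """True if any role held by any of the user's groups grants the action."""
    wanted = (collection, action.lower())
    for group in groups:
        for role in policy.get("groups", {}).get(group, []):
            for spec in policy.get("roles", {}).get(role, []):
                if parse_privilege(spec) == wanted:
                    return True
    return False


# The policy file from the slide, unchanged.
POLICY = """
[groups]
# Assigns each Hadoop group to its set of roles
dev_ops = engineer_role, ops_role
[roles]
# Assigns each role to its set of privileges
engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
ops_role = collection = hbase_logs->action=Query
"""

policy = parse_policy(POLICY)
```

With this policy, a member of `dev_ops` may query and update `source_code` and query `hbase_logs`, but may not update `hbase_logs`; a user in no configured group gets nothing.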

Integrating Sentry and Solr
•  Sentry integrated via “hooks” in request handlers:
•  Specified per collection in solrconfig.xml:
•  Sentry ships with its own version of solrconfig.xml with secure handlers, called solrconfig.xml.secure

Administrative requests
•  That covers queries/updates of collections, but what about administrative actions such as getting the status of the cores?
•  In SolrCloud, admin looks like a collection:
   http://localhost:8983/solr/admin/cores?action=STATUS
•  Can just follow this structure in Sentry:
   sample_role = collection = admin->action=Query,
•  Secure Admin Handlers controlled via cluster-wide “solr.xml” in ZooKeeper. By default, you get Secure Admin Handlers if Sentry is enabled

Administrative requests
•  Full privilege model documented here
•  Examples (collection1 = arbitrary collection name):

   Action               Required Privilege   Collection
   select               QUERY                collection1
   update/json          UPDATE               collection1
   ThreadDumpHandler    QUERY                admin
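To make the "admin looks like a collection" idea concrete, here is a small Python sketch that maps a request URL to the (collection, privilege) pair it would need. The `RULES` table contains only the examples from these slides and is not the full privilege model; `required_privilege` is a hypothetical helper, not a Sentry or Solr API.

```python
from urllib.parse import urlsplit, parse_qs

# (handler path, admin action or None) -> required privilege.
# Only the examples from the slides; anything else is rejected.
RULES = {
    ("select", None): "QUERY",
    ("update/json", None): "UPDATE",
    ("cores", "STATUS"): "QUERY",  # http://.../solr/admin/cores?action=STATUS
}


def required_privilege(url):
    """Return (collection, privilege) for a URL of the form /solr/<collection>/<handler>."""
    split = urlsplit(url)
    parts = split.path.strip("/").split("/")
    if len(parts) < 3 or parts[0] != "solr":
        raise ValueError("not a Solr collection request: " + url)
    collection, handler = parts[1], "/".join(parts[2:])
    # The 'action' parameter only matters for the admin pseudo-collection.
    action = parse_qs(split.query).get("action", [None])[0]
    key = (handler, action if collection == "admin" else None)
    if key not in RULES:
        raise KeyError("no rule for " + url)
    return collection, RULES[key]
```

The design point is that the admin path needs no special case in the policy file: `admin` is simply the collection name that a role such as `sample_role` is granted `Query` on.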

Document-level authorization motivation
•  Collection-level authorization useful when access control requirements for documents are homogeneous
•  Security requirements may require restricting access to a subset of documents
•  Consider “Confidential” and “Secret” documents. How to store with only collection-level authorization?
   •  Pushes complexity to application

Document-level authorization model
•  Instead of Policy File in HDFS:

   [groups]
   # Assigns each Hadoop group to its set of roles
   dev_ops = engineer_role, ops_role
   [roles]
   # Assigns each role to its set of privileges
   engineer_role = collection = source_code->action=Query, collection = source_code->action=Update
   ops_role = collection = hbase_logs->action=Query

•  Store authorization tokens in each document
   •  Many more documents than collections; doesn’t scale to store document-level info in Policy File
   •  Can use Solr’s built-in filtering capabilities to restrict access

Document-level authorization model
•  A configurable field stores the authorization tokens
•  The authorization tokens are Sentry roles, e.g. “ops_role”

   [roles]
   ops_role = collection = hbase_logs->action=Query

•  The field represents the roles that are allowed to view the document. To view a document, the querying user must belong to at least one role whose token is stored in the token field
•  Can modify document permissions without restarting Solr
•  Can modify role memberships without reindexing

Document-level authorization impl
•  Intercepts the request via a SearchComponent
•  SearchComponent adds an “fq” (filter query)
   •  Filters out all documents that don’t have “role1” or “role2” in authField
•  Filters are cached, so the construction expense is paid only once
•  Note: does not supersede collection-level authorization
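The filter-query step can be sketched as follows. This is a minimal Python illustration of the idea, not the query string Sentry actually generates: `build_auth_filter`, `add_filter`, and `escape_term` are hypothetical helpers, and the field name is whatever the collection is configured to use.

```python
def escape_term(term):
    """Backslash-escape characters that are special in Solr's query syntax."""
    special = set('+-&|!(){}[]^"~*?:\\/ ')
    return "".join("\\" + ch if ch in special else ch for ch in term)


def build_auth_filter(auth_field, roles):
    """Build an fq that keeps only documents tagged with one of the user's roles."""
    if not roles:
        # A user with no roles sees nothing: a pure negative query matches no documents.
        return "-*:*"
    # Sorting gives the same string for the same set of roles, which lets a
    # filter cache keyed on the query string be reused across requests.
    clauses = " OR ".join(escape_term(r) for r in sorted(roles))
    return f"{auth_field}:({clauses})"


def add_filter(params, fq):
    """Return a copy of the request parameters with the filter appended to 'fq'."""
    out = {k: list(v) for k, v in params.items()}
    out.setdefault("fq", []).append(fq)
    return out
```

For a user holding `role1` and `role2`, `build_auth_filter("sentry_auth", ["role2", "role1"])` yields `sentry_auth:(role1 OR role2)`, which is appended to any filters the client already sent rather than replacing them.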

Document-level authorization config
•  Configuration via solrconfig.xml.secure (per collection):

   <!-- Set to true to enable document-level authorization -->
   <bool name="enabled">false</bool>
   <!-- Field where the auth tokens are stored in the document -->
   <str name="sentryAuthField">sentry_auth</str>
   <!-- Auth token defined to allow any role to access the document.
        Uncomment to enable. -->
   <!--<str name="allRolesToken">*</str>-->

•  No tokens = no access. To allow all users to access a document, use the allRolesToken. Useful for getting started
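The visibility rule these settings imply can be written as a single predicate. The Python sketch below is an assumption-laden restatement of the slide, not Sentry code: `can_view` is a hypothetical name, and treating the allRolesToken as granting access even to a user with no roles is this example's reading of "allow all users".

```python
def can_view(doc_tokens, user_roles, all_roles_token=None):
    """Decide whether a user may see a document, given its auth-field tokens."""
    tokens = set(doc_tokens)
    if not tokens:
        return False  # no tokens = no access
    if all_roles_token is not None and all_roles_token in tokens:
        return True  # document is explicitly open to everyone (assumed semantics)
    # Otherwise the user needs at least one role listed on the document.
    return bool(tokens & set(user_roles))
```

With the default configuration (allRolesToken commented out), a document tagged `*` is visible only to a user who literally holds a role named `*`, which is why the token must be enabled explicitly.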

Secure Impersonation
•  But wait! My users don’t interact with Solr directly
   •  Custom web UI, load balancer, etc.
•  Authorization won’t work!
   •  “user” is forgotten, request to Solr comes from “UI”

Secure Impersonation
•  Secure impersonation: the ability of a “super-user” to submit requests on behalf of another user
   •  Conceptually similar to “sudo” on Unix
   •  Limited to only groups/hosts that are explicitly configured to support it
   •  Identical to functionality provided by HDFS, Oozie
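The host and group restriction can be sketched as a check performed before the request is re-attributed to the end user. This Python example is illustrative only: the `proxy_rules` structure, the `effective_user` function, the `do_as` argument, and the host name are assumptions for the sketch and do not reproduce the configuration keys of Solr, Sentry, or Hadoop.

```python
def effective_user(authenticated_user, do_as, remote_host, proxy_rules, groups_of):
    """Return the user a request should be authorized as, or raise PermissionError.

    proxy_rules maps a super-user name to {"hosts": set, "groups": set};
    "*" in either set means "any". groups_of(user) returns the user's groups.
    """
    if do_as is None or do_as == authenticated_user:
        return authenticated_user  # no impersonation requested
    rule = proxy_rules.get(authenticated_user)
    if rule is None:
        raise PermissionError(f"{authenticated_user} may not impersonate anyone")
    if "*" not in rule["hosts"] and remote_host not in rule["hosts"]:
        raise PermissionError(f"impersonation not allowed from host {remote_host}")
    if "*" not in rule["groups"] and not rule["groups"] & set(groups_of(do_as)):
        raise PermissionError(f"{authenticated_user} may not impersonate {do_as}")
    return do_as


# Hypothetical setup: only the 'hue' service user may impersonate, only from
# one host, and only users who belong to the 'analysts' group.
RULES = {"hue": {"hosts": {"hue-host.example.com"}, "groups": {"analysts"}}}
GROUPS = {"alice": ["analysts"], "bob": ["interns"]}


def lookup_groups(user):
    return GROUPS.get(user, [])
```

Once the check passes, collection-level and document-level authorization run against the returned user (here `alice`), not against the service account that opened the connection.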

Hue Search App UI
•  Uses Secure Impersonation to integrate with its own security mechanisms
   •  Users can log in to Hue via LDAP or other auth mechanism
   •  Hue makes requests on behalf of the logged-in user
   •  Only the Hue user requires a Kerberos keytab
•  Seamlessly integrates with the collection- and document-level access control mechanisms

Performance Testing
•  Goal is to measure overhead of:
   •  Kerberos Authentication
   •  Sentry Collection-Level Authorization
•  Measure index, query overhead separately

Index Test Setup
•  20-node cluster: 12 cores, 96 GB RAM, 12x 2TB disks, 10G Ethernet
•  Cloudera Search-1.2.0, CDH 4.6, MR1, CentOS 6.4
•  260M tweets/docs, indexed across 17 fields
•  116 GB, ~800 JSON .gz files, ~130MB per file, 3-fold HDFS replication
•  1 Solr server and 1 shard per node (44M docs per shard), no Solr replication
•  Uses MapReduceIndexerTool contrib. Mapper/reducer slots = 2x/1x number of cores
•  Solr heap size = 20GB
•  Record end-to-end indexing time, i.e., indexing + mtree merge + go live
•  Record average from 3 repeats

Index Performance Testing
•  Left column is unsecured baseline.
•  Center column is ~20% lower → HDFS security introduces ~20% performance overhead.
•  Right column is ~same as center column → Solr security introduces no additional overhead.

Query Test Setup
•  Same setup as MapReduce batch indexing
•  Uses the output of MapReduce batch indexing
•  1 client, 30 threads per client
•  Uses internal tool - QueryRunner
   •  Similar to SolrMeter and JMeter
•  Query randomly sampled from fixed set of 10,000 strings
•  Record per thread query throughput for 5 runs of 30 min each

Query Performance Testing
•  Left column is unsecured baseline.
•  Center column is ~13% lower → HDFS security introduces ~13% performance overhead.
•  Right column is same as center column → Solr security introduces no additional overhead.

Future Work
•  Support for Sentry service with improved APIs / performance / integration
   •  Already supported for Hive/Impala
   •  Currently in development upstream
•  “Lineage” security: data flows from one system to another and retains security criteria
   •  Example: Index HBase data for full-text queries in Solr. HBase Table and Cell-level security tags automatically applied to Solr Collections, Documents, and Fields

Questions?
•  Thanks for listening!
•  More information / Want to contribute? http://sentry.incubator.apache.org/
•  Questions?