![Page 1: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/1.jpg)
Securing Spark Applications
Kostas Sakellis, Marcelo Vanzin
![Page 2: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/2.jpg)
What is Security?
• Security has many facets
• This talk will focus on three areas:
  – Encryption
  – Authentication
  – Authorization
![Page 3: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/3.jpg)
Why do I need security?
• Multi-tenancy
• Application isolation
• User identification
• Access control enforcement
• Compliance with government regulations
![Page 4: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/4.jpg)
Before we go further...
• Set up Kerberos
• Use HDFS (or another secure filesystem)
• Use YARN!
• Configure them for security (enable auth, encryption).
Kerberos, HDFS, and YARN provide the security backbone for Spark.
![Page 5: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/5.jpg)
Encryption
• In a secure cluster, data should not be visible in the clear
• Very important to financial / government institutions
![Page 6: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/6.jpg)
What a Spark app looks like
[Diagram. Components: RM, NM, NM, SparkSubmit, AM / Driver, Executors, Shuffle Services, UI. Connections: Control RPC, File Download, Shuffle / Cached Blocks, Shuffle Blocks, Shuffle Blocks / Metadata.]
![Page 7: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/7.jpg)
Data Flow in Spark
Every connection in the previous slide can transmit sensitive data!
• Input data transmitted via broadcast variables
• Computed data during shuffles
• Data in serialized tasks, files uploaded with the job
How to prevent other users from seeing this data?
![Page 8: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/8.jpg)
Encryption in Spark
• Almost all channels support encryption.
  – Exception 1: UI (SPARK-2750)
  – Exception 2: local shuffle / cache files (SPARK-5682)
For local files, set up YARN local dirs to point at local encrypted disk(s) if desired. (SPARK-5682)
![Page 9: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/9.jpg)
Encryption: Current State
Different channel, different method.
• Shuffle protocol uses SASL
• RPC / File download use SSL
SSL can be hard to set up.
• Need certificates readable on every node
• Sharing certificates not as secure
• Hard to have per-user certificate
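A minimal sketch of the "different channel, different method" situation, grouping the settings each mechanism needs. The property names are believed to match the Spark 1.4-era security configuration and should be checked against the version in use. The `EncryptionConf` object, its helpers, and the store paths are hypothetical and not part of Spark; passwords are deliberately left out.

```scala
// Illustrative helper only: EncryptionConf is NOT a Spark API.
object EncryptionConf {
  // SASL-based encryption for the shuffle / block transfer protocol.
  val sasl: Map[String, String] = Map(
    "spark.authenticate" -> "true",
    "spark.authenticate.enableSaslEncryption" -> "true"
  )

  // SSL for RPC and file download. The store locations are placeholders
  // and must be readable on every node, which is part of why SSL is
  // harder to set up.
  def ssl(keyStore: String, trustStore: String): Map[String, String] = Map(
    "spark.ssl.enabled" -> "true",
    "spark.ssl.keyStore" -> keyStore,
    "spark.ssl.trustStore" -> trustStore
  )

  // Render settings as spark-submit "--conf key=value" arguments,
  // sorted by key so the output is stable.
  def toSubmitArgs(conf: Map[String, String]): Seq[String] =
    conf.toSeq.sortBy(_._1).flatMap { case (k, v) => Seq("--conf", s"$k=$v") }
}
```

For example, `EncryptionConf.toSubmitArgs(EncryptionConf.sasl ++ EncryptionConf.ssl("/path/to/keystore", "/path/to/truststore"))` yields the flags for both mechanisms at once, which shows how much more the SSL side asks of the operator.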
![Page 10: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/10.jpg)
Encryption: The Goal
SASL everywhere for wire encryption (except UI).
• Minimum configuration (one boolean config)
• Uses built-in JVM libraries
• SPARK-6017
For UI:
• Support for SSL
• Or audit UI to remove sensitive info (e.g. information on environment page).
![Page 11: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/11.jpg)
Authentication
Who is reading my data?
• Spark uses Kerberos – the necessary evil
• Ubiquitous among other services
  – YARN, HDFS, Hive, HBase etc.
![Page 12: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/12.jpg)
Who’s reading my data?
Kerberos provides secure authentication.
[Diagram: exchange between the user's application and the KDC]
• Application: Hi I’m Bob.
• KDC: Hello Bob. Here’s your TGT.
• Application: Here’s my TGT. I want to talk to HDFS.
• KDC: Here’s your HDFS ticket.
![Page 13: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/13.jpg)
Now with a distributed app...
[Diagram: eight Executors each say "Hi I’m Bob." to the KDC at the same time.]
Something is wrong.
![Page 14: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/14.jpg)
Kerberos in Hadoop / Spark
KDCs do not allow multiple concurrent logins at the scale distributed applications need. Hadoop services use delegation tokens instead.
[Diagram: Driver, NameNode, Executor, DataNode]
![Page 15: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/15.jpg)
Delegation Tokens
Like Kerberos tickets, they have a TTL.
• OK for most batch applications.
• Not OK for long running applications
  – Streaming
  – Spark SQL Thrift Server
![Page 16: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/16.jpg)
Delegation Tokens
Since 1.4, Spark can manage delegation tokens!
• Restricted to HDFS currently
• Requires user’s keytab to be deployed with application
• Still some remaining issues in client deploy mode
![Page 17: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/17.jpg)
Authorization
How can I share my data?
Simplest form of authorization: file permissions.
• Use Unix-style permissions or ACLs to let others read from and / or write to files and directories
• Simple, but high maintenance. Set permissions / ownership for new files, mess with umask, etc.
![Page 18: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/18.jpg)
More than just FS semantics...
Authorization becomes more complicated as abstractions are created.
• Tables, columns, partitions instead of files and directories
• Semantic gap
• Need a trusted entity to enforce access control
![Page 19: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/19.jpg)
Trusted Service: Hive
Hive has a trusted service (“HiveServer2”) for enforcing authorization.
• HS2 parses queries and makes sure users have access to the data they’re requesting / modifying.
HS2 runs as a trusted user with access to the whole warehouse. Users don’t run code directly in HS2*, so there’s no danger of code escaping access checks.
![Page 20: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/20.jpg)
Untrusted Apps: Spark
Each Spark app runs as the requesting user, and needs access to the underlying files.
• Spark itself cannot enforce access control, since it’s running as the user and is thus untrusted.
• Restricted to file system permission semantics.
How to bridge the two worlds?
![Page 21: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/21.jpg)
Apache Sentry
• Role-based access control to resources
• Integrates with Hive / HS2 to control access to data
• Fine-grained (down to column level) controls
Hive data and HDFS data have different semantics. How to bridge that?
![Page 22: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/22.jpg)
The Sentry HDFS Plugin
Synchronize HDFS file permissions with higher-level abstractions.
• Permission to read table = permission to read table’s files
• Permission to create table = permission to write to database’s directory
Uses HDFS ACLs for fine-grained user permissions.
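The synchronization idea can be sketched as a translation from table-level grants to ACL entries on the table's directory. Everything in this block is a toy model: `SentryAclSketch`, its types, the privilege-to-permission mapping, and the example group and path are hypothetical and do not reflect the plugin's real code or API.

```scala
// Toy model of "table privilege -> HDFS ACL entry"; not the Sentry plugin's API.
object SentryAclSketch {
  sealed trait Privilege
  case object Select extends Privilege // read the table
  case object All extends Privilege    // read and write the table

  // A grant of one privilege on one table to one group.
  final case class Grant(group: String, table: String, privilege: Privilege)

  // Assumed mapping: reading a table needs read + traverse on its files,
  // full access needs write as well.
  def aclEntry(g: Grant): String = {
    val perms = g.privilege match {
      case Select => "r-x"
      case All    => "rwx"
    }
    s"group:${g.group}:$perms"
  }

  // Collect ACL entries per table directory. Grants on tables with no
  // known location (for example views, which have no files) are skipped.
  def aclsByPath(grants: Seq[Grant],
                 locations: Map[String, String]): Map[String, Seq[String]] =
    grants
      .flatMap(g => locations.get(g.table).map(path => path -> aclEntry(g)))
      .groupBy(_._1)
      .map { case (path, entries) => path -> entries.map(_._2).distinct }
}
```

Note that the result is expressed purely in terms of directories: a grant that names a table becomes an entry on that table's whole directory, with nothing finer than that.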
![Page 23: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/23.jpg)
Still restricted to FS view of the world!
• Files, directories, etc…
• Cannot provide column-level and row-level access control.
• Whole table or nothing.
Still, it goes a long way in allowing Spark applications to work well with Hive data in a shared, secure environment.
But...
![Page 24: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/24.jpg)
Future: RecordService
A distributed, scalable, data access service for unified authorization in Hadoop.
![Page 25: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/25.jpg)
RecordService
![Page 26: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/26.jpg)
RecordService
• Drop-in replacement for InputFormats
• SparkSQL: Integration with Data Sources API
  – Predicate pushdown, projection
![Page 27: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/27.jpg)
RecordService
• Assume we have a table tpch.nation:
column_name    column_type
n_nationkey    smallint
n_name         string
n_regionkey    smallint
n_comment      string
![Page 28: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/28.jpg)
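RecordService
import com.cloudera.recordservice.spark._

// sc is the existing SparkContext
val context = new org.apache.spark.sql.SQLContext(sc)
// Load the table through the RecordService data source
val df = context.load("tpch.nation",
                      "com.cloudera.recordservice.spark")
// Count the nations in each region
val results = df.groupBy("n_regionkey")
                .count()
                .collect()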
![Page 29: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/29.jpg)
RecordService
• Users can enforce Sentry permissions using views
• Allows column and row level security
> CREATE ROLE restrictedrole;
> GRANT ROLE restrictedrole TO GROUP restrictedgroup;
> USE tpch;
> CREATE VIEW nation_names AS
    SELECT n_nationkey, n_name
    FROM tpch.nation;
> GRANT SELECT ON TABLE tpch.nation_names TO ROLE restrictedrole;
![Page 30: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/30.jpg)
RecordService
...
val df = context.load("tpch.nation",
                      "com.cloudera.recordservice.spark")
val results = df.collect()
>> TRecordServiceException(code:INVALID_REQUEST, message:Could not plan
   request., detail:AuthorizationException: User 'kostas' does not have
   privileges to execute 'SELECT' on: tpch.nation)
![Page 31: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/31.jpg)
RecordService
...
val df = context.load("tpch.nation_names",
                      "com.cloudera.recordservice.spark")
val results = df.collect()
![Page 32: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/32.jpg)
RecordService
• Documentation: http://cloudera.github.io/RecordServiceClient/
• Beta Download: http://www.cloudera.com/content/cloudera/en/downloads/betas/recordservice/0-1-0.html
![Page 33: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/33.jpg)
Takeaways
• Spark can be made secure today!
• Benefits from a lot of existing Hadoop platform work
• Still work to be done
  – Ease of use
  – Better integration with Sentry / RecordService
![Page 34: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/34.jpg)
References
• Encryption: SPARK-6017, SPARK-5682
• Delegation tokens: SPARK-5342
• Sentry: http://sentry.apache.org/
  – HDFS synchronization: SENTRY-432
• RecordService: http://cloudera.github.io/RecordServiceClient/
![Page 35: Securing Spark Applications by Kostas Sakellis and Marcelo Vanzin](https://reader036.vdocument.in/reader036/viewer/2022081604/58ac4e0c1a28ab99028b63db/html5/thumbnails/35.jpg)
Thanks!
Questions?