
© Cloudera, Inc. All rights reserved.

Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks

Marcell Szabó, 2015-12-10


RecordService [public beta since Sept 2015]


Hi, we’re looking for a data protection solution to mask sensitive customer data during queries using Hive, Impala, MapReduce, Spark and HBase. Does Cloudera offer something appropriate?

Regards, John Doe

Motivation


HDFS
• rw-rw-r--

Sentry
• Access control rules on Hive MetaStore objects
  • Privileges: INSERT / SELECT / ALL
  • Objects: TABLE / VIEW / URI
  • A view allows filtering, projection, masking
• Understood by Impala and HiveServer
  • Others (MapReduce, Spark) fall back to HDFS

Before RecordService

RecordService to the rescue!

- Want to mask passwords?
- Create a new file!


Filtering, Projection, Masking

CREATE VIEW eu_clients_for_marketing AS
SELECT name, date_of_birth, mask(credit_card_number) AS ccn, rating, region
FROM clients
WHERE region = 'Europe'
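To make the three operations concrete, here is a minimal in-memory sketch of what the view does to each row. It is not how Sentry, Hive or RecordService implement views. The `Client` and `MarketingRow` classes and the `password` column are hypothetical, and the masking rule (hide everything except the last four characters) is an assumption, since the slide does not define `mask`.

```java
import java.util.ArrayList;
import java.util.List;

class ViewSketch {
    // Hypothetical source row; "password" stands in for any column the view leaves out.
    static final class Client {
        final String name, dateOfBirth, creditCardNumber, region, password;
        final int rating;

        Client(String name, String dateOfBirth, String creditCardNumber,
               int rating, String region, String password) {
            this.name = name;
            this.dateOfBirth = dateOfBirth;
            this.creditCardNumber = creditCardNumber;
            this.rating = rating;
            this.region = region;
            this.password = password;
        }
    }

    // Hypothetical output row with only the columns listed in the view.
    static final class MarketingRow {
        final String name, dateOfBirth, ccn, region;
        final int rating;

        MarketingRow(String name, String dateOfBirth, String ccn,
                     int rating, String region) {
            this.name = name;
            this.dateOfBirth = dateOfBirth;
            this.ccn = ccn;
            this.rating = rating;
            this.region = region;
        }
    }

    // Assumed masking rule: replace all but the last four characters with '*'.
    static String mask(String value) {
        int hidden = Math.max(0, value.length() - 4);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < hidden; i++) {
            out.append('*');
        }
        return out.append(value.substring(hidden)).toString();
    }

    static List<MarketingRow> euClientsForMarketing(List<Client> clients) {
        List<MarketingRow> rows = new ArrayList<>();
        for (Client c : clients) {
            if (!"Europe".equals(c.region)) {
                continue;                                  // filtering (WHERE)
            }
            rows.add(new MarketingRow(                     // projection (SELECT list)
                    c.name, c.dateOfBirth,
                    mask(c.creditCardNumber),              // masking
                    c.rating, c.region));
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Client> clients = new ArrayList<>();
        clients.add(new Client("Alice", "1980-01-01", "4111111111111111", 5, "Europe", "secret"));
        clients.add(new Client("Bob", "1975-06-30", "5500000000000004", 3, "Asia", "hunter2"));
        for (MarketingRow r : euClientsForMarketing(clients)) {
            System.out.println(r.name + " " + r.ccn + " " + r.region);
        }
    }
}
```

A reader of the view never sees the raw card number or the omitted column, which is the property the view-based rules rely on.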


RecordService [public beta since Sept 2015]

[Architecture diagram; labeled components: Sentry, MetaStore]


Expectations for a protective layer

• Durable and complete protection
• Doesn’t disrupt the interface
• Doesn’t impair performance


Durable and complete protection

• Single access path
• Kerberos
• ZooKeeper
• Signed tasks, no user code


Doesn’t disrupt the interface


Spark Example

//val file = sc.textFile(path)
val file = sc.recordServiceTextFile(path)


MR Example

//FileInputFormat.setInputPaths(job, new Path(args[0]));
//job.setInputFormatClass(AvroKeyInputFormat.class);

RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);


Client Integration APIs

• Drop-in replacements for common existing InputFormats
  • Text, Avro
• Can be used with Spark as well
  • SparkSQL: integration with the Data Sources API
  • Predicate pushdown, projection
• Migration should be easy
• Client APIs make things simpler; clients don’t need to:
  • Interact with HMS
  • Care about the underlying storage format: the worker always returns records in a canonical format
  • Care about storage engine details (e.g. S3)


Doesn’t impair performance


Terasort

• ~Worst case scenario: a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78-node cluster (12 cores / 24 with Hyper-Threading, 12 disks)
• Ran at 1 billion, 50 billion and 1 trillion (~100 TB) record scales

[Chart: TeraChecksum. Y axis: normalized job time. Series: without RecordService, with RecordService. Categories: 1B, 50B and 1T records, each on MapReduce and on Spark. Data labels in the source, bar assignment not recoverable: 0.85, 0.8, 1.03, 0.23, 0.48, 1]

• See the GitHub repo for more details and runnable examples.


Spark SQL

• Represents a more expected use case: the data has a full schema
• TPC-DS: 500 GB scale factor, on Parquet
• Cluster: 5 nodes

[Chart: SparkSQL on TPC-DS, per-query results (axis range 0 to 400) for Q3, Q7, Q8, Q19, Q27, Q34, Q42, Q43, Q52, Q53, Q55, Q61, Q68, Q73, Q88, Q96 and GeoMean. Series: SparkSQL, SparkSQL with RecordService]

~15% improvement in query times; queries are not scan bound
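The chart's last column is a geometric mean over the per-query results. As a minimal sketch of how such a summary and a relative improvement can be computed: the timings below are illustrative placeholders, not measurements from this talk, and the assumption that the ~15% figure is derived from the geometric mean is not confirmed by the slides.

```java
class GeoMeanSketch {
    // Geometric mean computed through logarithms, which avoids overflow
    // when many values are multiplied together. Values must be positive.
    static double geoMean(double[] values) {
        double logSum = 0.0;
        for (double v : values) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.length);
    }

    // Relative improvement of candidate over baseline (0.15 means 15% faster).
    static double improvement(double baseline, double candidate) {
        return 1.0 - candidate / baseline;
    }

    public static void main(String[] args) {
        // Placeholder per-query times; NOT taken from the benchmark above.
        double[] sparkSql = {100.0, 200.0, 400.0};
        double[] withRecordService = {80.0, 180.0, 340.0};

        double base = geoMean(sparkSql);
        double rs = geoMean(withRecordService);
        System.out.printf("geomean baseline=%.1f, with RecordService=%.1f, improvement=%.1f%%%n",
                base, rs, 100.0 * improvement(base, rs));
    }
}
```

A geometric mean is commonly preferred for this kind of summary because a single long-running query does not dominate the result the way it would with an arithmetic mean.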

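[Chart: SparkSQL microbenchmarks (axis range 0 to 32). Categories: 2% Selective Scan, Sum(col). Series: SparkSQL, SparkSQL with RecordService. Data labels in source order, bar assignment not recoverable: 23.5, 14, 31, 29.5]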


Performance

• Shares some core components with Impala
  • IO management, optimized C++ code, runtime code generation, low-level storage APIs
  • Highly efficient implementation of the scan functionality
• Optimized columnar on-wire format
  • Inspired by Apache Parquet
• Accelerates performance for many workloads


Conclusion

• RecordService => schema-based data access for Hadoop
  • security ++
  • performance ++
  • data format abstracted away
  • uniform access across Hadoop
• http://cloudera.github.io/RecordServiceClient/
  • read … try … report bugs … contribute!


Thank you

Marcell Szabó szama at cloudera.com

