Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
TRANSCRIPT
‹#›© Cloudera, Inc. All rights reserved.
Simplifying Hadoop: A Secure and Unified Data Access Path for Compute Frameworks
Marcell Szabó 2015-12-10
Hi,
We’re looking for a data protection solution to mask sensitive customer data during queries using Hive, Impala, MR, Spark and HBase. Does Cloudera offer something appropriate?
Regards, John Doe
Motivation
HDFS
• rw-rw-r--
Sentry
• Access Control Rules on Hive MetaStore objects
  • INSERT / SELECT / ALL
  • TABLE / VIEW / URI
  • view allows: filtering, projection, masking
• Understood by
  • Impala, HiveServer
  • but others (MapReduce, Spark): fallback to HDFS
Before RecordService
RecordService to the rescue!
- Want to mask passwords? - Create a new file!
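The gap on this slide can be sketched as a small decision function. This is an illustrative model only: `AccessPathSketch`, `Rule`, `canSelect` and the engine-name strings are hypothetical and are not Sentry or HDFS APIs. It assumes the caller falls under the "other" permission class when an engine falls back to HDFS.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative model only; none of these types are Sentry or HDFS APIs.
class AccessPathSketch {
    enum Privilege { INSERT, SELECT, ALL }

    // One Sentry-style rule: a role holds a privilege on a MetaStore object.
    static final class Rule {
        final String role;
        final Privilege privilege;
        final String object;

        Rule(String role, Privilege privilege, String object) {
            this.role = role;
            this.privilege = privilege;
            this.object = object;
        }
    }

    // Engines that, per the slide, understand Sentry rules.
    static final Set<String> SENTRY_AWARE =
        new HashSet<>(Arrays.asList("Impala", "HiveServer"));

    // True if a mode string such as "rw-rw-r--" lets "other" users read.
    static boolean othersCanRead(String mode) {
        return mode.length() == 9 && mode.charAt(6) == 'r';
    }

    static boolean canSelect(String engine, String role, String object,
                             List<Rule> rules, String hdfsMode) {
        if (SENTRY_AWARE.contains(engine)) {
            // Fine-grained path: the decision is made per table or view.
            return rules.stream().anyMatch(r ->
                r.role.equals(role) && r.object.equals(object)
                    && (r.privilege == Privilege.SELECT || r.privilege == Privilege.ALL));
        }
        // Fallback path: the rules (and any view) are ignored; only the
        // permission bits of the underlying file decide, all or nothing.
        return othersCanRead(hdfsMode);
    }

    public static void main(String[] args) {
        List<Rule> rules = Arrays.asList(
            new Rule("marketing", Privilege.SELECT, "eu_clients_for_marketing"));
        System.out.println(canSelect("Impala", "marketing", "eu_clients_for_marketing", rules, "rw-rw----"));
        System.out.println(canSelect("Spark", "marketing", "eu_clients_for_marketing", rules, "rw-rw----"));
    }
}
```

In this model a Spark or MapReduce job either reads the whole file or nothing, which is why masking a column means writing a new file.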
Filtering, Projection, Masking
CREATE VIEW eu_clients_for_marketing AS
SELECT name, date_of_birth, mask(credit_card_number) AS ccn, rating, region
FROM clients
WHERE region = 'Europe'
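What the view does to each row can be sketched in plain Java over an in-memory list. The masking rule here (keep only the last four characters) is an assumption, since the slide does not say what `mask` does; the `Client` and `MarketingRow` classes and the extra `email` column are hypothetical and only there to make the projection visible.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the view's semantics: filter rows, project columns, mask one column.
class MaskingViewSketch {
    // Hypothetical base-table row; "email" stands for any column the view omits.
    static final class Client {
        final String name, dateOfBirth, creditCardNumber, region, email;
        final int rating;

        Client(String name, String dateOfBirth, String creditCardNumber,
               int rating, String region, String email) {
            this.name = name;
            this.dateOfBirth = dateOfBirth;
            this.creditCardNumber = creditCardNumber;
            this.rating = rating;
            this.region = region;
            this.email = email;
        }
    }

    // Row shape exposed by the view.
    static final class MarketingRow {
        final String name, dateOfBirth, ccn, region;
        final int rating;

        MarketingRow(String name, String dateOfBirth, String ccn, int rating, String region) {
            this.name = name;
            this.dateOfBirth = dateOfBirth;
            this.ccn = ccn;
            this.rating = rating;
            this.region = region;
        }
    }

    // Assumed masking rule: replace everything but the last four characters with '*'.
    static String mask(String value) {
        int keep = Math.min(4, value.length());
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < value.length() - keep; i++) {
            sb.append('*');
        }
        return sb.append(value.substring(value.length() - keep)).toString();
    }

    static List<MarketingRow> euClientsForMarketing(List<Client> clients) {
        List<MarketingRow> out = new ArrayList<>();
        for (Client c : clients) {
            if ("Europe".equals(c.region)) {                 // filtering
                out.add(new MarketingRow(                     // projection (email dropped)
                    c.name, c.dateOfBirth,
                    mask(c.creditCardNumber),                 // masking
                    c.rating, c.region));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Client> clients = new ArrayList<>();
        clients.add(new Client("Ann", "1980-01-01", "4111111111111111", 5, "Europe", "ann@example.com"));
        clients.add(new Client("Bob", "1975-05-05", "5500000000000004", 3, "Asia", "bob@example.com"));
        for (MarketingRow r : euClientsForMarketing(clients)) {
            System.out.println(r.name + " " + r.ccn + " " + r.region);
        }
    }
}
```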
RecordService [public beta since Sept 2015]
Sentry
MetaStore
Expectations for a protective layer
• Durable and complete protection
• Doesn’t disrupt the interface
• Doesn’t impair performance
Durable and complete protection
• Single access path
• Kerberos
• Zookeeper
• Signed tasks, no user code
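The slide does not say how tasks are signed. As a generic illustration of the idea (a worker only accepts task descriptions it can verify were issued by a trusted planner), here is a sketch using an HMAC from the standard Java crypto API. The shared key, the task descriptor string and the `SignedTaskSketch` class are assumptions for illustration and do not describe RecordService's actual mechanism.

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Generic HMAC signing sketch; not RecordService's actual scheme.
class SignedTaskSketch {
    private static final String ALGORITHM = "HmacSHA256";

    // Planner side: compute a signature over the task descriptor.
    static byte[] sign(byte[] key, String taskDescriptor) {
        try {
            Mac mac = Mac.getInstance(ALGORITHM);
            mac.init(new SecretKeySpec(key, ALGORITHM));
            return mac.doFinal(taskDescriptor.getBytes(StandardCharsets.UTF_8));
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException("HMAC not available", e);
        }
    }

    // Worker side: recompute and compare in constant time before executing.
    static boolean verify(byte[] key, String taskDescriptor, byte[] signature) {
        return MessageDigest.isEqual(sign(key, taskDescriptor), signature);
    }

    public static void main(String[] args) {
        byte[] key = "demo-key".getBytes(StandardCharsets.UTF_8); // hypothetical shared secret
        String task = "scan db.tbl split=3";                      // hypothetical descriptor
        byte[] sig = sign(key, task);
        System.out.println(verify(key, task, sig));
        System.out.println(verify(key, "scan db.tbl split=4", sig));
    }
}
```

A task altered after signing fails verification, so a client cannot widen its own scan.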
Spark Example
//val file = sc.textFile(path)
val file = sc.recordServiceTextFile(path)
Spark SQL Example
ctx.sql(s"""
  |CREATE TEMPORARY TABLE $tbl
  |USING com.cloudera.recordservice.spark.DefaultSource
  |OPTIONS (
  |  RecordServiceTable '$db.$tbl',
  |  RecordServiceTableSize '$size'
  |)
""".stripMargin)
MR Example
//FileInputFormat.setInputPaths(job, new Path(args[0]));
//job.setInputFormatClass(AvroKeyInputFormat.class);
RecordServiceConfig.setInputTable(configuration, null, args[0]);
job.setInputFormatClass(
    com.cloudera.recordservice.avro.mapreduce.AvroKeyInputFormat.class);
Client Integration APIs
• Drop-in replacements for common existing InputFormats
  • Text, Avro
• Can be used with Spark as well
  • SparkSQL: integration with the Data Sources API
  • Predicate pushdown, projection
• Migration should be easy
• Client APIs make things simpler
  • Don’t need to interact with HMS
  • Don’t need to care about the underlying storage format: the worker always returns records in a canonical format
  • Don’t need to care about storage engine details (e.g. S3)
Terasort
• ~Worst case scenario: a single STRING column
• Custom RecordServiceTeraInputFormat (similar to TeraInputFormat)
• 78-node cluster (12 cores / 24 hyper-threads, 12 disks)
• Ran at 1 billion, 50 billion and 1 trillion (~100 TB) scales
[Figure: TeraChecksum. Normalized job time, without vs. with RecordService, at 1B, 50B and 1T scales on MapReduce and Spark. Data labels in extracted order: 0.85, 0.8, 1.03, 0.23, 0.48, 1]
• See GitHub repo for more details and runnable examples.
Spark SQL
• Represents a more expected use case: data is fully schemed
• TPC-DS: 500 GB scale factor, on Parquet
• Cluster: 5 nodes
[Figure: TPC-DS query times (axis 0 to 400) for Q3, Q7, Q8, Q19, Q27, Q34, Q42, Q43, Q52, Q53, Q55, Q61, Q68, Q73, Q88, Q96 and GeoMean; SparkSQL vs. SparkSQL with RecordService]
~15% improvement in query times; queries are not scan bound
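The chart includes a GeoMean bar, and a geometric mean is a common way to summarize per-query times so that no single long query dominates. A minimal sketch of that arithmetic follows; the timings in `main` are made up for illustration and are not the measured results, and the slide does not state that its ~15% figure was derived this way.

```java
// Geometric mean of per-query times and the relative improvement between two runs.
class GeoMeanSketch {
    // exp(mean(log(x))) avoids overflow from multiplying many values.
    static double geoMean(double[] values) {
        double logSum = 0.0;
        for (double v : values) {
            logSum += Math.log(v);
        }
        return Math.exp(logSum / values.length);
    }

    // Fraction by which the candidate's geometric mean is lower than the baseline's.
    static double improvement(double[] baseline, double[] candidate) {
        return 1.0 - geoMean(candidate) / geoMean(baseline);
    }

    public static void main(String[] args) {
        double[] sparkSql = {120.0, 200.0, 90.0};          // hypothetical seconds per query
        double[] withRecordService = {100.0, 175.0, 80.0}; // hypothetical seconds per query
        System.out.printf("improvement: %.1f%%%n", 100 * improvement(sparkSql, withRecordService));
    }
}
```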
[Figure: 2% selective scan and Sum(col); SparkSQL vs. SparkSQL with RecordService. Data labels in extracted order: 23.5, 14, 31, 29.5]
Performance
• Shares some core components with Impala
  • IO management, optimized C++ code, runtime code generation, uses low level storage APIs
  • Highly efficient implementation of the scan functionality
• Optimized columnar on-wire format
  • Inspired by Apache Parquet
• Accelerates performance for many workloads
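To illustrate why a columnar layout helps scans, the sketch below turns a row-oriented batch into one array per column and then aggregates a single column without touching the others. It is a toy model of the general idea only; the `Row` and `Batch` types are hypothetical and say nothing about RecordService's actual wire encoding.

```java
// Toy row-to-column conversion; not RecordService's wire format.
class ColumnarBatchSketch {
    static final class Row {
        final long id;
        final int rating;

        Row(long id, int rating) {
            this.id = id;
            this.rating = rating;
        }
    }

    // Column-major batch: one contiguous array per column.
    static final class Batch {
        final long[] ids;
        final int[] ratings;

        Batch(int size) {
            ids = new long[size];
            ratings = new int[size];
        }
    }

    static Batch toColumns(Row[] rows) {
        Batch batch = new Batch(rows.length);
        for (int i = 0; i < rows.length; i++) {
            batch.ids[i] = rows[i].id;
            batch.ratings[i] = rows[i].rating;
        }
        return batch;
    }

    // A projection of one column reads a single dense array.
    static long sumRatings(Batch batch) {
        long sum = 0;
        for (int rating : batch.ratings) {
            sum += rating;
        }
        return sum;
    }

    public static void main(String[] args) {
        Row[] rows = {new Row(1L, 5), new Row(2L, 3), new Row(3L, 4)};
        System.out.println(sumRatings(toColumns(rows)));
    }
}
```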
Conclusion
• RecordService => schemed data access for Hadoop
  • security ++
  • performance ++
  • data format abstracted away
  • uniform access across Hadoop
• http://cloudera.github.io/RecordServiceClient/
  • read … try … report bugs … contribute!