#AnalyticsX
Copyright © 2016, SAS Institute Inc. All rights reserved.
Securely Analyze Data With SAS® and Cloudera
Scott Armstrong, Director, Business Development, Cloudera
© Cloudera, Inc. All rights reserved.
Agenda
• SAS & Cloudera
• How we work together
• Security in Hadoop
• Q&A
SAS & Cloudera: Leaders Coming Together
CEO commitment from both companies
Formal Alliance forged in January 2013
Master Reciprocal Services Agreement in place to provide service flexibility
A Data Science course leveraging joint content and instructors
Cloudera is the leading commercial Hadoop distribution for SAS product testing & internal use
Cloudera onsite dedicated resource to work with SAS R&D to ensure tight technical alignment & roadmap
A joint QuickStart service bundle, featuring SAS Visual Analytics / Visual Statistics, SAS Data Loader for Hadoop and the Cloudera Enterprise Data Hub starter service package
SAS & Cloudera enable organizations to achieve competitive advantage by gaining value from all their data, through a proven combination of enterprise-ready storage, processing, analytics, and data management.
Expected Benefits from an Integrated SAS & Cloudera Platform
Improved Business Outcomes
• Better decisions by analyzing more data
• Solve the hard problems with interactive and iterative analytics
• Unlimited variables for analysis, i.e., no column restrictions
Accelerated Time-to-Value
• In-memory data and analytics processing for faster performance
• Joint ‘Starter Service’ bundle is available and can offer a fast start
• SAS simplifies working with Hadoop; Cloudera Manager simplifies system administration
Reduced Costs & Risk
• SAS & Cloudera integration minimizes data movement & improves governance
• Cloudera & SAS are stable market leaders aligned across R&D (dedicated Cloudera engineer), product management, services, education, and tech support
More Innovation
• Hadoop’s cost-effective scalability allows for more analytic exploration of data that previously was too costly to store or troublesome to format
• Cloudera & SAS integrated technologies make ‘Big Data Analytics’ approachable and can support innovative use cases
What to Expect from SAS & Cloudera
Cloudera is the Preferred Hadoop Vendor for SAS Solutions on Demand
Business-specific solutions such as:
• Anti-Money Laundering
• Tax Fraud
• Drug Development
• Clinical Trial Data Transparency
• Intelligent Advertising for Publishers
• Claims Fraud
• Customer Experience Analytics
• Customer Experience Targeting
• Customer Experience Personalization
• Marketing Operations Management
• Suspect Claims Detection
…and many more
SAS & Cloudera
SAS & Cloudera intersect in many ways:
SAS can pull data FROM Cloudera, when that is most convenient;
SAS can work WITH Cloudera, lifting data into a purpose-built advanced analytics in-memory environment;
SAS can work directly IN Cloudera, leveraging the distributed processing capabilities of Hadoop.
SAS Analytics Hadoop Deployment Patterns
• These approaches are complementary & can be combined for maximum effect
• SAS In-Memory environment can be deployed as part of Hadoop cluster or separate footprint
[Diagram: three patterns, Traditional SAS, In-Database, and In-Memory. The In-Memory pattern is shown as either a Co-located Deployment or an Asymmetric deployment.]
The Benefits of Hadoop...
One place for unlimited data
• All types
• More sources
• Faster, larger ingestion
Unified, multi-framework data access
• More users
• More tools
• Faster changes
…Can Create Information Security Challenges
Business Manager
• Run high value workloads in cluster
• Quickly adopt new innovations
Information Security
• Follow established policies and procedures
• Maintain compliance
IT/Operations
• Integrate with existing IT investments
• Minimize end-user support
• Automate configuration
Secure without Compromise
Security and Compliance are Not “Opt-In” Activities
• Enterprise Encryption: Protects everything transparently
• Access Policy Enforcement: Full-stack row/column-based RBAC and dynamic masking
• Automated Data Management: Full-stack audit, lineage, discovery, and lifecycle
• Secure Operations: Separation of duties, log data redaction
[Diagram: platform components]
OPERATIONS: Cloudera Manager, Cloudera Director
DATA MANAGEMENT: Cloudera Navigator, Encrypt and KeyTrustee, Optimizer
PROCESS, ANALYZE, SERVE
• BATCH: Spark, Hive, Pig, MapReduce
• STREAM: Spark
• SQL: Impala
• SEARCH: Solr
• OTHER: Kite
UNIFIED SERVICES
• RESOURCE MANAGEMENT: YARN
• SECURITY: Sentry, RecordService
STORE
• FILESYSTEM: HDFS
• RELATIONAL: Kudu
• NoSQL: HBase
• OTHER: Object Store
INTEGRATE
• STRUCTURED: Sqoop
• UNSTRUCTURED: Kafka, Flume
Comprehensive, Compliance-Ready Security
Authentication, Authorization, Audit, and Compliance
Perimeter: Guarding access to the cluster itself
• Technical Concepts: Authentication, Network isolation
• Cloudera Manager
Access: Defining what users and applications can do with data
• Technical Concepts: Permissions, Authorization
• Apache Sentry & RecordService
Visibility: Reporting on where data came from and how it’s being used
• Technical Concepts: Auditing, Lineage
• Cloudera Navigator
Data: Protecting data in the cluster from unauthorized visibility
• Technical Concepts: Encryption, Tokenization, Data masking
• Navigator Encrypt & Key Trustee | Partners
Perimeter Security – Isolation, Authentication
• Preserve user choice of the right Hadoop service (e.g. Impala, Spark)
• Conform to centrally managed authentication policies
• Implement with existing standard systems: Active Directory (LDAP) and Kerberos
Perimeter: Guarding access to the cluster itself
Technical Concepts: Authentication, Network isolation
Cloudera Manager
Active Directory and Kerberos
Active Directory
• Manages Users, Groups, and Services
• Provides username / password authentication
• Group membership determines Service access
Kerberos
• Trusted and standard third-party
• Authenticated users receive “Tickets”
• “Tickets” gain access to Services
[Diagram: User (ssmith) enters a password -> user authenticates to AD -> authenticated user gets a Kerberos Ticket -> Ticket grants access to Services, e.g. Impala]
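The three steps on this slide (authenticate, receive a ticket, present the ticket to a service) can be sketched as a toy model. This is not the Kerberos protocol and uses no AD or Kerberos API: the directory contents, the group name `fraud_analysts`, the signing key, and all function names are hypothetical, and a signed string stands in for a real ticket.

```python
import hashlib
import hmac
import time

# Hypothetical in-memory stand-in for Active Directory: users, password hashes, groups.
DIRECTORY = {
    "ssmith": {
        "password_sha256": hashlib.sha256(b"example-password").hexdigest(),
        "groups": {"fraud_analysts"},
    },
}
# Hypothetical mapping from a service to the group allowed to use it.
SERVICE_GROUPS = {"impala": "fraud_analysts"}
SIGNING_KEY = b"demo-only-key"  # stands in for the ticket issuer's secret


def _sign(payload):
    return hmac.new(SIGNING_KEY, payload.encode(), hashlib.sha256).hexdigest()


def authenticate(user, password):
    """Step 1: check the username and password against the directory."""
    entry = DIRECTORY.get(user)
    if entry is None:
        return False
    digest = hashlib.sha256(password.encode()).hexdigest()
    return hmac.compare_digest(digest, entry["password_sha256"])


def issue_ticket(user, lifetime_s=3600, now=None):
    """Step 2: give an authenticated user a signed, time-limited ticket."""
    issued = time.time() if now is None else now
    payload = f"{user}|{int(issued + lifetime_s)}"
    return f"{payload}|{_sign(payload)}"


def access_service(ticket, service, now=None):
    """Step 3: accept the ticket if it is genuine, unexpired, and the
    user belongs to the group that the service requires."""
    try:
        user, expires, sig = ticket.split("|")
    except ValueError:
        return False
    if not hmac.compare_digest(sig, _sign(f"{user}|{expires}")):
        return False
    if int(expires) < (time.time() if now is None else now):
        return False
    required = SERVICE_GROUPS.get(service)
    groups = DIRECTORY.get(user, {}).get("groups", set())
    return required is not None and required in groups
```

In this sketch the service never sees the password; it only checks the ticket and the user's group membership, which mirrors the division of labor the slide describes.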
Automated Authentication with Cloudera Manager
Direct to AD Kerberos Integration
• Users authenticate directly against AD
• Hadoop Services defined directly in AD Kerberos
• User access to Hadoop services controlled via AD Groups
Kerberos Configuration Wizard
• Automates Kerberos configuration for existing Hadoop clusters, simplifying a tedious and error-prone process
Added Tuning and Monitoring
• Tune interrelated configuration for dual KDCs
• Service monitoring through CM when Kerberos enabled
Access Security Requirements
• Provide users access to data needed to do their job
• Centrally manage access policies
• Leverage a role-based access control model built on AD
Access: Defining what users and applications can do with data
InfoSec Concept: Authorization
Apache Sentry & RecordService
RBAC and Centralized Authorization
Manage data access by role, instead of by individual user
• Customer Support Rep has read access to US Customers
• Broker Analyst has read access to US Transactions
• Relationships between users and roles are established via groups
An RBAC policy is then uniformly enforced for all Hadoop services
• Provides unified authorization controls
• As opposed to tools for managing numerous, service specific policies
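The user -> group -> role -> permission chain described above can be sketched in a few lines. This is an illustrative model only, not Sentry's API: the user names, group names, and the `(action, object)` privilege format are hypothetical, and the two roles mirror the examples on the slide.

```python
# Hypothetical data: users belong to groups, groups are granted roles,
# and roles carry (action, object) privileges.
USER_GROUPS = {
    "ssmith": {"support_reps"},
    "jdoe": {"broker_analysts"},
}
GROUP_ROLES = {
    "support_reps": {"customer_support_rep"},
    "broker_analysts": {"broker_analyst"},
}
ROLE_PRIVILEGES = {
    "customer_support_rep": {("read", "us_customers")},
    "broker_analyst": {("read", "us_transactions")},
}


def privileges_for(user):
    """Collect every privilege a user holds through group -> role links."""
    privileges = set()
    for group in USER_GROUPS.get(user, set()):
        for role in GROUP_ROLES.get(group, set()):
            privileges |= ROLE_PRIVILEGES.get(role, set())
    return privileges


def is_allowed(user, action, obj):
    """A single check that every service could call, so that one policy
    is enforced uniformly instead of one policy per service."""
    return (action, obj) in privileges_for(user)
```

Because no privilege is attached to a user directly, moving someone to a different group changes what they can do without touching the role definitions.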
Unified Authorization with Apache Sentry
Sentry provides unified authorization via:
• Fine-grained RBAC for Impala, Hive, and Search
• Impala/Hive permissions synced in HDFS for all other components (Spark, MapReduce, etc)
Goal: Unified authorization for all Hadoop services and applications
[Diagram: Sam Smith -> Group: Fraud Analysts -> Sentry Role: Fraud Analyst Role -> Sentry Permission: Read Access to ALL Transaction Data]
The Need for Fine-Grained Access Control Across all access paths
Columns: Sensitive column visibility varies; Example: credit card numbers
• Managers: 1234 5678 1234 5678
• Call Center: XXXX XXXX XXXX 5678
• Analysts: XXXX XXXX XXXX XXXX
• Others: Does not see credit card column
Rows: Different groups of users need access to different records
• European privacy laws
• Government security clearance
• Financial information restrictions
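The column example above (one credit card number, four different views) can be sketched as a masking function. The role identifiers and function names are hypothetical; only the four outcomes come from the slide.

```python
def mask_card(number, role):
    """Return the card number as a role should see it, or None if the
    column is hidden from that role. Role names are hypothetical."""
    groups = number.split(" ")
    if role == "manager":
        return number
    if role == "call_center":
        # Mask every digit group except the last one.
        hidden = ["X" * len(g) for g in groups[:-1]]
        return " ".join(hidden + [groups[-1]])
    if role == "analyst":
        return " ".join("X" * len(g) for g in groups)
    return None  # any other role does not see the column at all


def project_row(row, role):
    """Apply the column policy to one record (a dict)."""
    visible = {k: v for k, v in row.items() if k != "credit_card"}
    masked = mask_card(row["credit_card"], role)
    if masked is not None:
        visible["credit_card"] = masked
    return visible
```

For example, `project_row({"name": "A", "credit_card": "1234 5678 1234 5678"}, "call_center")` keeps the `name` field and returns the card as `XXXX XXXX XXXX 5678`.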
Permission Enforcement today with Sentry
[Diagram: Admins specify permissions as Sentry Permissions rules, e.g. Rule: “Allow fraud analysts read access to the transaction table”. Components shown: Sentry Service; Sentry Enforcement at Hive Server 2, Impala, HDFS (MR, Pig, Spark, ...), and Search (Solr); SAS products. Enforcement is labeled coarse grained (table).]
RecordService: Unified Access Control Enforcement
• New high performance security layer that centrally enforces access control policies across Hadoop
• Complements Apache Sentry’s unified policy definition
• Row- and column-based security
• Dynamic data masking
• Apache-licensed open source
• Beta now available
Fine-Grained HDFS Access without RecordService
Original file:
Date/time             Accnt #     SSN          Asset  Trade  Country
09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    EU
14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   UK
09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    UK
13:45:24 16-Feb-2015  3456789012  412-22-8765  AMZN   Sell   EU
09:03:44 16-Feb-2015  4857389329  123-44-5678  TMV    Buy    US

Split file, UK records:
Date/time             Accnt #     SSN          Asset  Trade  Country
14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   UK
10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    UK
15:55:55 16-Feb-2015  4756983234  234-76-9274  MA     Buy    UK

Split file, EU records:
Date/time             Accnt #     SSN          Asset  Trade  Country
11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    EU
13:45:24 16-Feb-2015  3456789012  412-22-8765  AMZN   Sell   EU

Split file, US records:
Date/time             Accnt #     SSN          Asset  Trade  Country
09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
09:03:44 16-Feb-2015  4857389329  123-44-5678  TMV    Buy    US

• Split the original file
• Use HDFS permissions to limit access
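The split-file approach can be sketched as follows, using the eight sample trades from the original file. The group names and the `FILE_GROUP` mapping are hypothetical stand-ins for HDFS group ownership and mode bits on each split file; no HDFS call is made here.

```python
from collections import defaultdict

# The eight sample trades from the original file on the slide.
RECORDS = [
    ("09:33:11 16-Feb-2015", "0234837823", "238-23-9876", "AAPL", "Sell", "US"),
    ("11:33:01 16-Feb-2015", "3947848494", "329-44-9847", "TBT", "Buy", "EU"),
    ("14:12:34 16-Feb-2015", "4848367383", "123-56-2345", "IBM", "Sell", "UK"),
    ("09:22:03 16-Feb-2015", "3485739384", "585-11-2345", "INTC", "Buy", "US"),
    ("11:55:33 16-Feb-2015", "3847598390", "234-11-8765", "F", "Buy", "US"),
    ("10:22:55 16-Feb-2015", "8765432176", "344-22-9876", "UA", "Buy", "UK"),
    ("13:45:24 16-Feb-2015", "3456789012", "412-22-8765", "AMZN", "Sell", "EU"),
    ("09:03:44 16-Feb-2015", "4857389329", "123-44-5678", "TMV", "Buy", "US"),
]

# Hypothetical: the one group allowed to read each split file.
FILE_GROUP = {"US": "us_brokers", "EU": "eu_brokers", "UK": "uk_brokers"}


def split_by_country(records):
    """Partition records into one list per country (the last field)."""
    parts = defaultdict(list)
    for record in records:
        parts[record[-1]].append(record)
    return dict(parts)


def readable_records(user_group, parts):
    """Return the records in every split file the group may read."""
    return [
        record
        for country, records in parts.items()
        if FILE_GROUP.get(country) == user_group
        for record in records
    ]
```

Access in this sketch is all-or-nothing per file: a group that can read the US file sees every column of every US record, including the SSN.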
Fine-Grained HDFS Access Control with RecordService
• Apply controls to the master data file
• Row, column, and sub-column (masking) controls
• Enforce these across all access paths
Master data file (first view):
Date/time             Accnt #     SSN          Asset  Trade  Country
09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    EU
14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   EU
09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    EU

Master data file (second view; the Country column shows group labels for the non-US rows):
Date/time             Accnt #     SSN          Asset  Trade  Country
09:33:11 16-Feb-2015  0234837823  238-23-9876  AAPL   Sell   US
11:33:01 16-Feb-2015  3947848494  329-44-9847  TBT    Buy    group2
14:12:34 16-Feb-2015  4848367383  123-56-2345  IBM    Sell   group3
09:22:03 16-Feb-2015  3485739384  585-11-2345  INTC   Buy    US
11:55:33 16-Feb-2015  3847598390  234-11-8765  F      Buy    US
10:22:55 16-Feb-2015  8765432176  344-22-9876  UA     Buy    group3

[Diagram annotations on both views: Column-Level Controls across the columns; Row-Level Controls down the rows. On the second view, three SSN values are masked as XXX-XX, with the caption: What U.S. Brokers See]
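The idea of applying row, column, and masking controls to a single master file can be sketched as a per-group policy evaluated at read time. This is not RecordService's API: the policy format, the group name `us_brokers`, and the function names are hypothetical. The mask is also an assumption: the slide shows only "XXX-XX", and the sketch takes that to mean the last four digits stay visible. The date/time column is omitted for brevity.

```python
# Six sample rows from the master data file on the slide (date/time omitted).
MASTER = [
    {"account": "0234837823", "ssn": "238-23-9876", "asset": "AAPL", "trade": "Sell", "country": "US"},
    {"account": "3947848494", "ssn": "329-44-9847", "asset": "TBT", "trade": "Buy", "country": "EU"},
    {"account": "4848367383", "ssn": "123-56-2345", "asset": "IBM", "trade": "Sell", "country": "EU"},
    {"account": "3485739384", "ssn": "585-11-2345", "asset": "INTC", "trade": "Buy", "country": "US"},
    {"account": "3847598390", "ssn": "234-11-8765", "asset": "F", "trade": "Buy", "country": "US"},
    {"account": "8765432176", "ssn": "344-22-9876", "asset": "UA", "trade": "Buy", "country": "EU"},
]

# Hypothetical policy format: a row predicate plus per-column mask functions.
POLICIES = {
    "us_brokers": {
        "row_filter": lambda row: row["country"] == "US",
        # Assumption: the first five digits are masked, the last four stay visible.
        "masks": {"ssn": lambda value: "XXX-XX-" + value[-4:]},
    },
}


def _identity(value):
    return value


def view_for(group, rows):
    """Build the view one group sees without copying or splitting the
    master data: filter rows first, then mask columns."""
    policy = POLICIES.get(group)
    if policy is None:
        return []  # no policy means no access
    view = []
    for row in rows:
        if not policy["row_filter"](row):
            continue
        view.append({
            column: policy["masks"].get(column, _identity)(value)
            for column, value in row.items()
        })
    return view
```

Calling `view_for("us_brokers", MASTER)` returns the three US rows with masked SSNs and leaves `MASTER` itself unchanged, so adding a new group means adding a policy entry rather than producing another copy of the file.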
PROBLEM
Customer data was spread across sources and channels, limiting loyalty marketing
• Existing targeting segments not generating enough return
• Limited ability to analyze multi-structured data
• Need accelerated processing to act on data but existing system running at capacity
SOLUTION
Implemented new system to maximize marketing ROI, while meeting compliance
• Improved segmentation with reduced processing time (6hrs to 45min)
• Analyzing 3M records per hour, incl. mobile, sentiment, & non-gaming spend
• EDW optimization equals millions saved
• Achieved PCI compliance and met governance needs