Red Hat - Presentation at Hortonworks Booth - Strata 2014
DESCRIPTION
As the enterprise’s big data program matures and Apache Hadoop becomes more deeply embedded in critical operations, the ability to support and operate it efficiently and reliably becomes increasingly important. To help enterprises operate a modern data architecture at scale, Red Hat and Hortonworks have collaborated to integrate Hortonworks Data Platform with Red Hat’s proven platform technologies. Join us in this interactive series as we demonstrate how Red Hat JBoss Data Virtualization can integrate with Hadoop through Hive and provide users easy access to data.
TRANSCRIPT
1 RED HAT JBOSS MIDDLEWARE
Discover Red Hat and Hortonworks for the Modern Data Architecture
Kimberly Palko
Product Manager
Red Hat
2 RED HAT JBOSS MIDDLEWARE
Agenda
● Red Hat and JBoss Middleware Overview
● Combining data in Hadoop with traditional data sources
● Federating two geographically distributed Hadoop clusters
● Virtual data marts for Hadoop Lake
3 RED HAT JBOSS MIDDLEWARE
RED HAT & JBOSS MIDDLEWARE OVERVIEW
4 RED HAT JBOSS MIDDLEWARE
Engineering Collaboration Benefits
Integration with JBoss Data Virtualization
Enable agile Big Data Hadoop integration with existing enterprise assets and maximize universal data utilization to enable self-service analytics
Integration with multiple products in the Red Hat JBoss Middleware family
Enables millions of JBoss developers to quickly build applications with Hadoop
Integration with Red Hat Storage
Enables Hadoop to use the secure, resilient Red Hat Storage pool for data applications
Integration with Red Hat Enterprise Linux OpenStack Platform
Simplifies automated deployment of Hadoop on OpenStack
Integrated with Red Hat Enterprise Linux and OpenJDK
Develop and deploy Apache Hadoop as an integrated component for multiple deployment scenarios
5 RED HAT JBOSS MIDDLEWARE
Big Data Integration: Turn Data into Actionable Information
Hadoop & NoSQL
Data Integration & Data Services: JBoss Data Virtualization
In-memory data management: JBoss Data Grid
BI Analytics (diagnostic, descriptive, predictive, prescriptive)
Speed of Iteration leads to Success
SOA Applications
Event Processing & Messaging: JBoss BRMS & JBoss A-MQ
Structured Data DW, OLAP, OLTP
Red Hat Enterprise Linux Red Hat Storage
Semi / Unstructured Data SOCIAL, LOGS
Streaming Data EVENTS, IOT
Analyze
Integrate
Enrich
Ingest
6 RED HAT JBOSS MIDDLEWARE
Data Challenges Getting Bigger…
NoSQL
Hive
MapReduce
HDFS
Storm
HBase Spark
7 RED HAT JBOSS MIDDLEWARE
Make Big Data Accessible for Everyone
8 RED HAT JBOSS MIDDLEWARE
Data Supply and Integration Solution
Data Virtualization sits in front of multiple data sources and
● allows them to be treated as a single source
● delivering the desired data
● in the required form
● at the right time
● to any application and/or user.
THINK VIRTUAL MACHINE FOR DATA
9 RED HAT JBOSS MIDDLEWARE
Easy Access to Big Data
● Reporting tool accesses the data virtualization server via rich SQL dialect
● The data virtualization server translates rich SQL dialect to HiveQL
● Hive translates HiveQL to MapReduce
● MapReduce runs MR job on big data
MapReduce
HDFS
Hive
Analytical Reporting Tool
Data Virtualization Server
Hadoop
Big Data
10 RED HAT JBOSS MIDDLEWARE
Different Users Different Views of Big Data
● Logical tables with different forms of aggregation
● Logical tables containing extra derived data
● Logical tables with filtered data
● All reports/users share the same specifications
MapReduce
HDFS
Hive
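The three kinds of logical table listed above (aggregated, derived, filtered) can be sketched as SQL views over one base table. This is only an illustration under stated assumptions: SQLite from the Python standard library stands in for the data virtualization server, and the `clicks` table, its columns, and the view names are hypothetical, not taken from the presentation.

```python
import sqlite3

# SQLite is a stand-in for the virtualization layer (assumption);
# all table, column, and view names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE clicks (region TEXT, page TEXT, visits INTEGER, seconds INTEGER);
INSERT INTO clicks VALUES
  ('EU', '/home', 10, 300),
  ('EU', '/cart', 4, 200),
  ('US', '/home', 20, 400);

-- Aggregated logical table: one row per region.
CREATE VIEW visits_by_region AS
  SELECT region, SUM(visits) AS total_visits FROM clicks GROUP BY region;

-- Logical table with an extra derived column.
CREATE VIEW clicks_with_avg AS
  SELECT region, page, seconds * 1.0 / visits AS avg_seconds FROM clicks;

-- Filtered logical table: rows for a single region only.
CREATE VIEW eu_clicks AS
  SELECT * FROM clicks WHERE region = 'EU';
""")


def rows(sql):
    """Run a query against the shared connection and return all rows."""
    return conn.execute(sql).fetchall()


by_region = rows("SELECT region, total_visits FROM visits_by_region ORDER BY region")
with_avg = rows("SELECT page, avg_seconds FROM clicks_with_avg ORDER BY region, page")
eu_only = rows("SELECT page FROM eu_clicks ORDER BY page")
```

Every report queries the same view definitions, which is the sense in which reports and users share one specification.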
11 RED HAT JBOSS MIDDLEWARE
USE CASE 1: COMBINING DATA FROM HADOOP WITH TRADITIONAL SOURCES - USING JBOSS DATA VIRTUALIZATION
12 RED HAT JBOSS MIDDLEWARE
Integration of Big Data with “Small Data”
• Integrating small data with big data is easy
• Integration specifications can be shared or be developed for individual reports
MapReduce
HDFS
Hive
Application Database Server
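A minimal sketch of the idea on this slide: one query joins a "big data" source with a "small data" source without copying either. Two separate in-memory SQLite databases stand in for Hive and the application database; that substitution, and every table and column name, is an assumption made for illustration only.

```python
import sqlite3

# Two independent databases behind one connection (assumption: SQLite
# stands in for both the Hive source and the application database).
conn = sqlite3.connect(":memory:")                  # plays the Hive source
conn.execute("ATTACH DATABASE ':memory:' AS crm")   # plays the application database

conn.executescript("""
CREATE TABLE main.sentiment (product_id INTEGER, score INTEGER);
INSERT INTO main.sentiment VALUES (1, 5), (1, 3), (2, 1);
CREATE TABLE crm.products (product_id INTEGER, name TEXT);
INSERT INTO crm.products VALUES (1, 'Widget'), (2, 'Gadget');
""")


def average_sentiment_by_product(connection):
    """Join rows from both sources in a single query; nothing is copied."""
    sql = """
        SELECT p.name, AVG(s.score)
        FROM main.sentiment AS s
        JOIN crm.products AS p ON p.product_id = s.product_id
        GROUP BY p.name
        ORDER BY p.name
    """
    return connection.execute(sql).fetchall()


result = average_sentiment_by_product(conn)
```

The join specification lives in one place (the function here, a shared view in a real deployment), so it can be reused across reports or written per report.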
13 RED HAT JBOSS MIDDLEWARE
Caching the Big Data
• Caches to speed up interactive reporting
• Caches to create a consistent view of big data
• Different caches for different reports
MapReduce
HDFS
Hive
14 RED HAT JBOSS MIDDLEWARE
USE CASE 2: GEOGRAPHICALLY DISTRIBUTED HADOOP CLUSTERS WITH DATA VIRTUALIZATION - SECURING DATA BY USER ROLE
15 RED HAT JBOSS MIDDLEWARE
Role based access control
Roles • Define roles based on organization hierarchy
Users • External authentication via Kerberos, LDAP, etc.
VDB • Assign users and groups to a virtual database
16 RED HAT JBOSS MIDDLEWARE
Authentication
Kerberos: From client to the virtual database
Login Modules: LDAP (MS Active Directory, OpenLDAP, etc.), any JAAS based security domain
REST and Web Services: WS-UsernameToken, HTTP Basic authentication
SAML: SAML authentication for web client applications
17 RED HAT JBOSS MIDDLEWARE
Audit Logging via Dashboard
18 RED HAT JBOSS MIDDLEWARE
Row and Column Masking
- Row based masking
  Ex: keyed off geographic marker
- Column masking to a constant, null, or a SQL statement
  Example: change all but the last 4 digits in a credit card number to stars
  concat('****', substring(column, length(column)-3))
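The two masking styles on this slide can be sketched in plain Python. This is not how JBoss Data Virtualization implements masking; it only illustrates the effect, and the function names, the `region` key, and the sample records are all hypothetical.

```python
def mask_card_number(number: str) -> str:
    """Column masking: four stars followed by the last four characters."""
    return "****" + number[-4:]


def visible_rows(rows, user_region):
    """Row based masking keyed off a geographic marker (hypothetical 'region' key)."""
    return [row for row in rows if row["region"] == user_region]


def apply_masking(rows, user_region):
    """Filter rows by region, then mask the card column in what remains."""
    return [
        {**row, "card": mask_card_number(row["card"])}
        for row in visible_rows(rows, user_region)
    ]


# Hypothetical sample data.
records = [
    {"region": "EU", "card": "4111111111111111"},
    {"region": "US", "card": "5500005555555559"},
]
masked = apply_masking(records, "EU")
```

With the sample data, a user in the EU region gets only the EU row, with the card shown as `****1111`.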
19 RED HAT JBOSS MIDDLEWARE
Summary of Security Capabilities
● Authentication
– Kerberos, LDAP, WS-UsernameToken, HTTP Basic, SAML
● Authorization
– Virtual data views, Role based access control
● Administration
– Centralized management of VDB privileges
● Audit
– Centralized audit logging and dashboard
● Protection
– Row and column masking
– SSL encryption (ODBC and JDBC)
20 RED HAT JBOSS MIDDLEWARE
Demonstration
Geographically Distributed Hadoop Clusters with Data Virtualization - Securing Data by User Role
21 RED HAT JBOSS MIDDLEWARE
Use Case 2: Federating across Geographically Distributed Hadoop Clusters
Problem: Geographically distributed Hadoop clusters contain sensitive data, such as patient records or customer identification, that cannot be accessed by other regions due to regulatory policy. IT needs access to all data, but users can only access the data in their region.
Solution: Leverage JBoss Data Virtualization to provide row-level security and masking of columns while federating across Hadoop clusters.
Consume Compose Connect
Data can be accessed by multiple tools and methods already in-house
JBoss Data Virtualization
Hive
Hadoop cluster in one geographic region
Hive
Hadoop cluster in a second geographic region
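A rough sketch of this use case: one federated query spans two regional sources, and the rows and columns returned depend on the caller's role. Two in-memory SQLite databases stand in for the two Hive/Hadoop clusters, and the `patients` table, the role names, and the choice to mask `ssn` for regional users are assumptions for illustration, not details from the demo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")                      # stand-in for the first region's cluster
conn.execute("ATTACH DATABASE ':memory:' AS cluster2")  # stand-in for the second region's cluster
conn.executescript("""
CREATE TABLE main.patients (id INTEGER, region TEXT, ssn TEXT);
INSERT INTO main.patients VALUES (1, 'US', '123-45-6789');
CREATE TABLE cluster2.patients (id INTEGER, region TEXT, ssn TEXT);
INSERT INTO cluster2.patients VALUES (2, 'EU', '987-65-4321');
""")

# Federated source: rows from both clusters presented as one relation.
FEDERATED = """
    SELECT id, region, ssn FROM main.patients
    UNION ALL
    SELECT id, region, ssn FROM cluster2.patients
"""


def query_patients(role, region=None):
    """Hypothetical policy: 'it' sees everything; other roles see only
    their own region, with the ssn column masked to its last four digits."""
    if role == "it":
        sql = f"SELECT id, region, ssn FROM ({FEDERATED}) ORDER BY id"
        return conn.execute(sql).fetchall()
    sql = (
        "SELECT id, region, '***-**-' || substr(ssn, -4) "
        f"FROM ({FEDERATED}) WHERE region = ? ORDER BY id"
    )
    return conn.execute(sql, (region,)).fetchall()
```

In this sketch the policy is applied in the query layer, so neither source needs to know about roles; that mirrors the slide's point that security is enforced while federating.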
22 RED HAT JBOSS MIDDLEWARE
Use Case 2 - Architecture
DATA SYSTEM
APPLICATIONS
Business Analytics
Custom Applications
Packaged Applications
VIRTUAL DATA MART
23 RED HAT JBOSS MIDDLEWARE
Use Case 2 - Resources
• GUIDE
How to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase2
Tutorial: Available soon
• VIDEOS:
http://vimeo.com/user16928011/hortonworksusecase2short
• SOURCE:
https://github.com/DataVirtualizationByExample/HortonworksUseCase2
24 RED HAT JBOSS MIDDLEWARE
USE CASE 3: VIRTUAL DATA MARTS FOR HADOOP DATA LAKE - WITH JBOSS DATA VIRTUALIZATION
25 RED HAT JBOSS MIDDLEWARE
Data for entire organization in Hadoop Data Lake
Problem: How does IT control access and give business users just the data they need?
- Does every line of business have access to everyone’s data?
- How do business users get access to the data they need in a simple (even self-service) way?
Marketing Clickstream Data
Finance Expense Reports
HR Employee Files
Server Logs
Sales Transactions
Customer Accounts
Twitter Sentiment Data
Hadoop Data Lake
26 RED HAT JBOSS MIDDLEWARE
Marketing Clickstream Data
Marketing IT Finance
Customer Accounts
Twitter Sentiment Data
Sales
Server Logs
Sales Transactions
HR Employee Files
Finance Expense Reports
Secure, Self-Service Virtual Data Marts for Hadoop
Solution: Use JBoss Data Virtualization to create virtual data marts on top of a Hadoop cluster
- Lines of Business get access to the data they need in a simple manner
- IT maintains the process and control it needs
- All data remains in the data lake, nothing is copied or moved
Hadoop Data Lake
27 RED HAT JBOSS MIDDLEWARE
Optional hierarchical data architectures with virtual data mart
Can be combined with security features like user role access and row and column masking
Dept Base Virtual Database (VDB)
Team 1 VDB
Team2 VDB
View2 View1
28 RED HAT JBOSS MIDDLEWARE
Virtual Data Marts for Operational Data
Problem: All the legacy and archived data is in the Hadoop data lake. We want to access the most recent, up-to-the-minute operational data often and quickly.
Marketing Clickstream Data
Finance Expense Reports
HR Employee Files
Server Logs
Sales Transactions
Customer Accounts
Twitter Sentiment Data
Hadoop Data Lake Historical Data
29 RED HAT JBOSS MIDDLEWARE
Caching For Faster Performance – Materialized View
Cached or Materialized View 1
View 1
Query 2 Query 1
Virtual Database (VDB)
• Same cached view for multiple queries
• Refreshed automatically or manually
• Cache repository can be any supported data source
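The caching behaviour described above can be sketched as a cache table that several queries share and that is rebuilt on demand. SQLite has no native materialized views, so an ordinary table plays the cache here; that substitution and all table and function names are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (product TEXT, amount INTEGER);
INSERT INTO sales VALUES ('Widget', 10), ('Widget', 5), ('Gadget', 7);
-- Plain table acting as the materialized (cached) view.
CREATE TABLE sales_totals_cache (product TEXT PRIMARY KEY, total INTEGER);
""")


def refresh_cache():
    """Manual refresh: rebuild the cached totals from the source table."""
    with conn:  # commit both statements together
        conn.execute("DELETE FROM sales_totals_cache")
        conn.execute(
            "INSERT INTO sales_totals_cache "
            "SELECT product, SUM(amount) FROM sales GROUP BY product"
        )


def total_for(product):
    """Query 1: reads the cache, not the source."""
    row = conn.execute(
        "SELECT total FROM sales_totals_cache WHERE product = ?", (product,)
    ).fetchone()
    return row[0] if row else None


def top_product():
    """Query 2: a different question answered from the same cache."""
    return conn.execute(
        "SELECT product FROM sales_totals_cache ORDER BY total DESC LIMIT 1"
    ).fetchone()[0]


refresh_cache()
```

Until `refresh_cache()` runs again, both queries keep returning the cached totals even if `sales` changes, which is the trade-off between speed and freshness that a refresh schedule controls.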
30 RED HAT JBOSS MIDDLEWARE
Virtual operational data store
Solution: Use JBoss Data Virtualization to integrate up-to-the-minute data from multiple diverse data sources so that it can be queried quickly.
- Use HDP for older data
- Use JDV to materialize the data in HDP for faster access and to combine it with the operational VDB
Marketing Clickstream Data
Finance Expense Reports
HR Employee Files
Server Logs
Sales Transactions
Customer Accounts
Twitter Sentiment Data
Hadoop Data Lake Historical Data
Operational VDB with up-to-the-minute data
Periodic Transfer from Data Sources
Materialized View
31 RED HAT JBOSS MIDDLEWARE
Demonstration
Virtual Data Marts with Hadoop Data Lake
32 RED HAT JBOSS MIDDLEWARE
Use Case 3 - Overview
Objective:
– Purpose-oriented data views for functional teams over a rich variety of semi-structured and structured data
Problem:
– Data lakes have large volumes of consolidated clickstream data, product and customer data that need to be constrained for multi-departmental use.
Solution:
– Leverage HDP to mash up clickstream analysis data with product and customer data on HDP to answer
– Leverage JBoss Data Virtualization to provide virtual data marts for Marketing and Product teams
33 RED HAT JBOSS MIDDLEWARE
Use Case 3 - Architecture
APPLICATIONS
Business Analytics
Custom Applications
Packaged Applications
DATA SYSTEM
SOURCES
Emerging Sources (Sensor, Sentiment, Geo, Unstructured)
Existing Sources (CRM, ERP, Clickstream, Logs)
HDP 2.1
Governance & Integration
Security
Operations
Data Access
Data Management
VIRTUAL DATA MART
34 RED HAT JBOSS MIDDLEWARE
Use Case 3 - Resources
• GUIDE
How to guide: https://github.com/DataVirtualizationByExample/HortonworksUseCase3
Tutorial: Available soon
• VIDEOS:
http://vimeo.com/user16928011/hwxuc3configuration
http://vimeo.com/user16928011/hwxuc3run
http://vimeo.com/user16928011/hwxuc3overview
• SOURCE:
https://github.com/DataVirtualizationByExample/HortonworksUseCase3
35 RED HAT JBOSS MIDDLEWARE
Demonstration
Combining Sentiment Data with Sales Data
36 RED HAT JBOSS MIDDLEWARE
Use Case 1: Combine data from Hadoop with traditional data sources
Problem: Data from new data sources like social media, clickstream and sensors needs to be combined with data from traditional sources to get the full value.
Solution: Leverage JBoss Data Virtualization to mash up new data in Hadoop with data in traditional data sources without moving or copying any data, and access it through a variety of BI tools and SOA technologies.
Consume Compose Connect
Data can be accessed by multiple tools and methods already in-house
JBoss Data Virtualization
Hive
SOURCE 1: Hive/Hadoop contains data from new data sources like social media, clickstream and sensor data
SOURCE 2: Traditional relational databases in the enterprise
37 RED HAT JBOSS MIDDLEWARE
Use Case 1 - Architecture
DATA SYSTEM
TRADITIONAL REPOSITORIES
RDBMS EDW MPP
APPLICATIONS
Business Analytics
Custom Applications
Packaged Applications
VIRTUAL DATA MART
38 RED HAT JBOSS MIDDLEWARE
Use Case 1 – Demo
39 RED HAT JBOSS MIDDLEWARE
Use Case 1 - Resources
http://hortonworks.com/hadoop-tutorial/evolving-data-stratagic-asset-using-hdp-red-hat-jboss-data-virtualization/
40 RED HAT JBOSS MIDDLEWARE
Benefits of Data Virtualization on Big Data
● Enterprise democratization of big data
● Any reporting or analytical tool can be used
● Easy access to big data
● Seamless integration of big data and small data
● Sharing of integration specifications
● Collaborative development on big data
● Fine-grained security of big data
● Speedy delivery of reports on big data
41 RED HAT JBOSS MIDDLEWARE
QUESTIONS