Denodo Data Virtualization (DV)
Single View of Truth
“By 2018, organizations with data virtualization capabilities will spend 40% less on building & managing data integration processes for connecting distributed data assets.”
Source: Gartner Strategic Planning Assumption (SPA), from its published 2017 predictions research.
What is Data Virtualization?
• A method of data integration that does not physically move data or create new copies of data
• A method of data integration that isolates users from the format, location, technologies, and protocols for storing and accessing data
• Real-time data access to any data type: structured, unstructured, semi-structured
• Many-to-one approach: virtually join any number of data sources and source types into a single view
Problem: IT Architecture is Unmanageable
Data sources:
• Log files (.txt/.log files)
• CRM (MySQL)
• Billing System (Web Service - REST)
• Big Data, Cloud (Hadoop, Web)
• Inventory System (MS SQL Server)
• Product Catalog (Web Service - SOAP)
• Customer Voice (Internet, Unstructured)
• Product Data (CSV)
ETL
Traditional Issues: High Data Growth, IT Complexity, Data Silos, High Latency
New Trends: Real Time, Big Data, Unstructured Data, External Data, Move to Cloud
Solution: Virtual Data Layer
[Diagram: the same data sources, ETL, traditional issues, and new trends as on the previous slide, now with a Data Virtualization layer added.]
Denodo DV: Connectivity to Any Data Type
Relational DBs: Oracle, DB2, Sybase, MS SQL Server, MySQL, PostgreSQL, Informix, MS Access…
Parallel DBs & Appliances: Teradata, Netezza, Oracle Exadata, Sybase IQ, Greenplum, ParAccel…
Multidimensional OLAP Engines: SAP BW, MS SQL Server Analysis Services, Mondrian, Essbase…
SOAP / REST Web Services and Data Feeds: XML, RSS, ATOM, JSON, OData, Delimited Files – CSV, log files, device feeds, ...
Enterprise Applications: SAP R3 / ECC, Oracle E-Business Suite, Siebel, PeopleSoft, SAS...
Content Management Sys (CMS): MS SharePoint, IBM FileNet, Documentum…
Modeling Tools: Erwin, Rochade, ER Studio…
MDM & Mapping: IBM Initiate, ontologies, taxonomies…
Mainframe / Legacy Connectivity: Adabas, IMS, DB2, TN5250 / TN3270.
Plug-in architecture: third party Mainframe / Legacy Adapters...
Semantic repositories in Triple Stores / RDF accessed via SPARQL endpoints
LDAP and Active Directory: as source data & security access
Big Data / NoSQL: Hadoop, Hive, HCatalog, Impala, Sqoop, HBase, Pig, HDFS, MapReduce, Avro, MongoDB, CouchDB, Neo4j, Cassandra, MarkLogic…
Cloud, SaaS: Salesforce, Google, Amazon, LinkedIn, Facebook, Twitter via APIs; Any Website, Form, any Web based Apps…
Enterprise Service Bus: JMS message queues, WebSphere MQ, Sonic, ActiveMQ…
Custom Connector SDK: access any application via API and procedural interfaces.
Semi-Structured Data: Web sites, Forms, applications, PDF, MS Word, MS Excel
Unstructured Data: websites, file systems, Email servers, databases, knowledgebase, indexes (Lucene, MS FAST, HP Autonomy…), RSS Feeds …
Denodo Platform Architecture
[Architecture diagram, top to bottom]
Business Solutions: Access Information-as-a-Service
Denodo Platform: Right Information at the Right Time
• Publish: Real-time (Right-time) Data Services
• Combine: Transform, Improve Quality, Integrate
• Connect: Normalized Views of Disparate Data
• Library of Wrappers, Web Automation, Any Data or Content, Read and Write
• Platform components: Data Virtualization Design Tools, Optimizer, Cache, Scheduler, Monitoring, Governance, Metadata, Security
Disparate Data: Any Source, Any Format
Common Data Virtualization Use Cases
BIG DATA, CLOUD INTEGRATION
Advanced Analytics
Data Warehouse Offloading
Big Data for Enterprise
Cloud / SaaS Integration
AGILE BUSINESS INTELLIGENCE
Logical Data Warehouse
Virtual Data Marts
Self-Service BI
Operational BI / Analytics
SINGLE VIEW APPLICATIONS
Single Customer View - Call Centers, Portals
Single Product View - Catalogs
Single Inventory View - Inventory Reconciliation
Vertical Specific - Single View of Wells
DATA SERVICES
Unified Data Services Layer
Logical Data Abstraction
Agile Application Development
Linked Data Services
[Diagram axes: Analytical / Operational; Business Use Cases / IT Use Cases]
What are analysts saying?
“By 2018, organizations with data virtualization capabilities will spend 40% less on building & managing data integration processes for connecting distributed data assets.”
Source: Gartner Research, Predicts 2017: Data Distribution and Complexity Drive Information Infrastructure Modernization.
“Through 2020, 35% of enterprises will implement some form of data virtualization as one enterprise production option for data integration.”
Source: Gartner Research, Market Guide for Data Virtualization, 2016.
Performance & Security
Performance: Core Concepts
Architecture designed for both informational and operational scenarios
Focused on 3 core concepts:
1. Dynamic Multi-Source Query Execution Plans
   • Leverages the processing power and architecture of the data sources
   • Dynamic, to support ad hoc queries
2. Selective Materialization
   • Intelligent caching of only the most relevant and frequently used information
3. Optimized Resource Management
   • Smart allocation of resources to handle high concurrency
   • Throttling to control and mitigate impact on the sources
Performance: Optimization Techniques
Ad hoc querying requires an architecture that generates efficient plans at execution time.
Denodo borrows many techniques from traditional RDBMSs, such as:
• Cost-based execution plans
   • Based on statistics, indexes, transfer rates, etc.
• Multiple JOIN strategies
   • Merge, Hash, Nested, Parallel Nested, Sorted-Merge
• Query rewriting
   • Redundant filter detection, unnecessary JOIN pruning, etc.
But since data is stored in multiple heterogeneous sources, DV has to apply other techniques to minimize network traffic and minimize processing in the virtual layer:
• Maximize query push-down: process at the source
   • Query rewriting to maximize delegation to sources
   • Data transformations push-up to maximize delegation
   • On-the-fly data movement (shipping)
• Abstract source capabilities
   • Emulate in the virtual layer the operations that cannot be pushed down (e.g., a GROUP BY on a flat file)
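The split between push-down and emulation in the virtual layer can be sketched as follows. This is a minimal illustration, not Denodo's planner: the `Source` class, the capability names, and the `plan` function are all hypothetical. It assumes a query is a simple pipeline of operations and delegates the longest leading run of operations that the source supports.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    """Hypothetical description of a data source and what it can run natively."""
    name: str
    capabilities: frozenset


def plan(pipeline, source):
    """Delegate the longest supported prefix of the pipeline to the source.

    Once one operation cannot be pushed down, it and everything after it
    is emulated in the virtual layer.
    """
    pushed = []
    for op in pipeline:
        if op not in source.capabilities:
            break
        pushed.append(op)
    return {"push_down": pushed, "virtual_layer": pipeline[len(pushed):]}


# Illustrative sources (assumed, not taken from the slides).
rdbms = Source("warehouse", frozenset({"FILTER", "JOIN", "GROUP BY"}))
filter_only = Source("rest_api", frozenset({"FILTER"}))
flat_file = Source("csv_export", frozenset())  # can only be scanned

query_ops = ["FILTER", "GROUP BY"]
for src in (rdbms, filter_only, flat_file):
    print(src.name, plan(query_ops, src))
```

For the flat file, both the filter and the GROUP BY end up in the virtual layer, which matches the flat-file example on the slide. A real optimizer would also weigh cost and statistics, which this sketch ignores.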
Proven Performance in IBM Labs
Queries to single source
■ Denodo only adds 3-5% overhead
Source: Denodo testing in IBM labs – TPCDS Benchmark and DataShip Performance Tests
Join across multiple sources
■ Denodo optimization engine faster than in-house solution
When “Data Lakes” Become “Data Swamps”
Uncontrolled dumping of data into Hadoop leads to poor performance.
Scenarios compared:
• Denodo DV query across Impala and Exadata
   • MDM data in Exadata (Oracle)
   • Large data sets in Hadoop - Impala
vs.
• MDM and large data sets in Hadoop - Impala
   • ETL all data into Impala and run the full query there
Big Data queries run faster using DV because:
• DV automatically collects statistics and source capabilities, then
• Rewrites optimized queries and pushes processing down to the sources
• Thus, heavy processing is performed in the systems designed to do so:
   • Impala (Hadoop) performs heavy aggregations on top of very large data sets
   • Oracle Exadata is faster than Impala at processing dimensional queries
Big Data Queries Faster using DV
Performance comparison of 5 different queries:
Query   | Impala Hadoop-only Runtime (s) | Denodo Runtime (s) | Denodo Runtime w/ Cache (s)
Query 1 | 199                            | 120                | 68
Query 2 | 187                            | 96                 | 88
Query 3 | 120                            | 212                | 115
Query 4 | timeout                        | 328                | 69
Query 5 | 46                             | 91                 | 56
Data Volumes
• Queries 1, 2, 3, 5: Exadata row count ~5M; Impala row count ~500k
• Query 4: Exadata row count ~5M; Impala row count ~2M
• DV delivers better performance and saves replicating data into Hadoop
• DV leverages data source architectures for what they are good at.
Performance: Benchmarks for a Logical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmark TPC-DS* in the following scenario:
Compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL
• Customer Dim.: 2 M rows
• Sales Facts: 290 M rows
• Items Dim.: 400 K rows
vs.
• Sales Facts: 290 M rows
• Items Dim.: 400 K rows
• Customer Dim.: 2 M rows
* TPC-DS is the de facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance: Benchmarks for a Logical Data Warehouse - Results
Query Description | Returned Rows | Netezza Time | Denodo Time (Federated Oracle, Netezza & SQL Server) | Denodo Optimization Technique (automatically selected)
Total sales by customer | 2 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.5 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement
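To make the comparison concrete, the relative overhead of the federated run over the Netezza-only run can be computed from the timings in the results table above. The snippet is only arithmetic on the published numbers; the `overhead_pct` helper and the dictionary layout are illustrative names, not part of the original deck.

```python
# Timings in seconds, copied from the results table: (Netezza, Denodo federated).
results = {
    "Total sales by customer": (20.9, 21.4),
    "Total sales by customer and year between 2000 and 2004": (52.3, 59.0),
    "Total sales by item brand": (4.7, 5.0),
    "Total sales by item where sale price less than current list price": (3.5, 5.2),
}


def overhead_pct(baseline_s, federated_s):
    """Extra time of the federated run, as a percentage of the baseline."""
    return (federated_s - baseline_s) / baseline_s * 100.0


for query, (netezza_s, denodo_s) in results.items():
    print(f"{query}: +{overhead_pct(netezza_s, denodo_s):.1f}% vs. Netezza")
```

The absolute differences range from 0.3 to 6.7 seconds, so the percentage is largest for the shortest query.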
Example: Total sales by Customer
Typical Reporting Tool Process (No Query Rewriting)
SELECT c.id, SUM(s.amount) AS total
FROM customer c JOIN sales s
  ON c.id = s.customer_id
GROUP BY c.id
[Diagram labels: Reporting Tool, Join, Group By, Sales (300 M), Customer (2 M), 1K]
Problem
• Join cannot be pushed down
• Group By is not pushed down
• All sales rows sent to the integration layer
Un-optimized Result
• All rows transferred: 300M + 2M
• Slow execution, and Netezza is underutilized
Example: Total sales by Customer
After Denodo’s Rewriting – Full Aggregation Pushdown
SELECT c.id, s_agg.amount AS total
FROM
  (SELECT s.customer_id, SUM(s.amount) AS amount
   FROM sales s
   GROUP BY s.customer_id) s_agg
JOIN customer c
  ON (c.id = s_agg.customer_id)
[Diagram labels: Reporting Tool, Denodo, Join, Group By, Sales (2M), Customer (2M), 1K]
Denodo Benefit
• Group By automatically moved below the JOIN without affecting the results (PK-FK join)
• Group By pushed down to Netezza
Optimized Result
• Rows transferred: 2M + 2M
Leverage star-schema features:
• Size of Group By output determined by cardinality of dimensions (small)
• Star-schema joins allow Group By push-down
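The equivalence of the two query shapes can be checked on a toy data set. The sketch below uses Python's built-in `sqlite3` module with a single in-memory database, so it only demonstrates that moving the Group By below the PK-FK join returns the same rows; it does not model separate sources or network transfer. Table contents are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (id INTEGER PRIMARY KEY);
    CREATE TABLE sales (
        customer_id INTEGER REFERENCES customer(id),
        amount REAL
    );
    INSERT INTO customer (id) VALUES (1), (2), (3);
    INSERT INTO sales (customer_id, amount) VALUES
        (1, 10.0), (1, 5.0), (2, 7.5), (3, 1.0), (3, 2.0), (3, 3.0);
""")

# Original shape: join first, then aggregate.
original = """
    SELECT c.id, SUM(s.amount) AS total
    FROM customer c JOIN sales s ON c.id = s.customer_id
    GROUP BY c.id
    ORDER BY c.id
"""

# Rewritten shape: aggregate sales first, then join the smaller result.
rewritten = """
    SELECT c.id, s_agg.amount AS total
    FROM (SELECT s.customer_id, SUM(s.amount) AS amount
          FROM sales s
          GROUP BY s.customer_id) s_agg
    JOIN customer c ON c.id = s_agg.customer_id
    ORDER BY c.id
"""

original_rows = conn.execute(original).fetchall()
rewritten_rows = conn.execute(rewritten).fetchall()

# Rows that would feed the join in each shape.
sales_rows = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
aggregated_rows = conn.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM sales"
).fetchone()[0]

print(original_rows == rewritten_rows, sales_rows, aggregated_rows)
```

In the toy data the join input from sales shrinks from 6 rows to 3, which is the same effect the slide reports at scale (300M rows reduced to 2M).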
Caching: Real Time vs. Caching
Sometimes real-time access and federation are not a good fit:
• Sources are slow (e.g., text files, cloud apps like Salesforce.com)
• A lot of data processing is needed (e.g., complex combinations, transformations, matching, cleansing, etc.)
• Access is limited, or the impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache
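The idea of caching only the relevant data while leaving everything else real time can be sketched as below. This is a generic illustration, not Denodo's cache API: `SelectiveCache`, the view names, and the time-to-live policy are all assumptions made for the example.

```python
import time


class SelectiveCache:
    """Cache results only for views flagged as cacheable (hypothetical sketch)."""

    def __init__(self, fetch, cached_views, ttl_seconds, clock=time.monotonic):
        self._fetch = fetch                # function that queries the real source
        self._cached_views = set(cached_views)
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}                 # view name -> (load time, rows)

    def query(self, view):
        if view not in self._cached_views:
            return self._fetch(view)       # real-time access, never cached
        now = self._clock()
        entry = self._entries.get(view)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]                # fresh cached copy, source untouched
        rows = self._fetch(view)
        self._entries[view] = (now, rows)
        return rows


# Demo with a fake clock and a source that records every call it receives.
source_calls = []


def slow_source(view):
    source_calls.append(view)
    return [f"{view}-row"]


fake_now = [0.0]
cache = SelectiveCache(
    slow_source,
    cached_views={"salesforce_accounts"},
    ttl_seconds=60,
    clock=lambda: fake_now[0],
)

cache.query("salesforce_accounts")  # first access loads the cache
cache.query("salesforce_accounts")  # served from the cache
cache.query("inventory")            # not flagged, goes to the source
cache.query("inventory")            # goes to the source again
fake_now[0] = 61.0
cache.query("salesforce_accounts")  # entry expired, reloaded from the source
print(source_calls)
```

Only the slow view is cached, so the fast source keeps returning live data, while the slow one is hit once per expiry window.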
Security in Denodo: Overview
Authentication
• Pass-through authentication
• Kerberos and Windows SSO
• OAuth, SPNEGO
Authentication
• Standard JDBC/ODBC security
• Kerberos and Windows SSO
• Web Service security
LDAP, Active Directory
Role-based Authentication: guest, employee, corporate
Schema-wide Permissions
Data-Specific Permissions (Row, Column level, Masking)
Policy-Based Security
Data in motion
• SSL/TLS
Encrypted data at rest
• Cache
• Swap
Enterprise Governance
Data Lineage
Find source of ‘truth’ – top down – shows where data comes from and/or how it is derived
Source Refresh
Detect changes in underlying data sources and propagate to the affected data services
Impact Analysis
Analyze impact of metadata changes in workflows where the modified view is used
Catalog Search
Have a complete understanding of each of the views and data services created in Denodo
Enterprise Governance
Data lineage is available from the Admin Tool and from the web-based Information
Self-Service
Data lineage example
Management - Monitoring and Diagnosing Tool
Graphical Monitoring of Servers and Clusters; Graphical Problem Diagnosing
■ See current sessions, queries, connections, cache load processes…
■ See resource usage on each server (CPU, memory, connections, …)
■ Inspect data source and cache statistics (connection pools, response times, active requests…)
■ Go “back in time” to the moment when a problem happened
■ Graphically inspect and browse all the information provided by the Denodo Monitor and server logs:
   ■ Graphical analysis of incidents
   ■ Active requests and sessions
   ■ Resource usage
   ■ Data source statistics
Thanks!
www.denodo.com [email protected]
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.