TRANSCRIPT
Performance Considerations in the Logical Data Warehouse
Mark Pritchard
Sales Engineering Director UK, Denodo
Challenging the Myths
Logical Data Warehouse Performance
What is a Logical Data Warehouse?
A logical data warehouse is a data system that follows the ideas
of traditional EDW (star or snowflake schemas) and includes, in
addition to one (or more) core DWs, data from external sources.
The main objectives are to improve decision making and/or cost
reduction.
C. Assumption, Acme Corp
Data Virtualization solutions will be much slower than a
persisted approach via ETL
1. There is a large amount of data moved through the network for each query
2. Network transfer is slow
…but is this really true?
Challenging the Myths of Virtual Performance
Not as much data is moved as you may think!
▪ Complex queries can be solved by transferring only moderate data volumes when the
right techniques are applied
▪ Operational queries
▪ Predicate delegation produces small result sets
▪ Logical Data Warehouse and Big Data
▪ Denodo uses characteristics of underlying star schemas to apply query rewriting rules
that maximize delegation to specialized sources (especially heavy GROUP BY) and
minimize data movement
▪ Current networks are almost as fast as reading from disk
▪ 10 Gb and 100 Gb Ethernet (10/100 GbE) are a commodity
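The predicate delegation point above can be sketched with an in-memory SQLite source; the table name, schema, and data are invented for illustration, and a real deployment would push the filter over the network rather than within one process:

```python
import sqlite3

# Hypothetical single source holding a sales table (illustrative data only).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE sales (customer_id INTEGER, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [(i % 1000, float(i % 7)) for i in range(10_000)])

# Naive federation: pull every row, then filter in the virtualization layer.
all_rows = source.execute("SELECT customer_id, amount FROM sales").fetchall()
filtered_locally = [r for r in all_rows if r[0] == 42]

# Predicate delegation: ship the WHERE clause to the source instead,
# so only the matching rows cross the network.
delegated = source.execute(
    "SELECT customer_id, amount FROM sales WHERE customer_id = ?", (42,)
).fetchall()

print(len(all_rows), len(delegated))  # 10000 rows moved vs. 10 rows moved
```

The result set is identical either way; only the amount of data that leaves the source changes.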
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* and the
following scenario.
• Compares the performance of a federated approach in Denodo with an MPP system where all the
data has been replicated via ETL.
The scenario uses a Sales Facts table (290 M rows), a Customer Dim. table (2 M rows), and an Items Dim. table (400 K rows): federated across separate sources in the Denodo case vs. the same three tables replicated together into the MPP system.
* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.
Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse
Query Description | Returned Rows | Time Netezza | Time Denodo (Federated Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down
Total sales by item where sale price is less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement
Performance Optimization
Logical Data Warehouse Performance
Performance and Optimizations in Denodo
Comparing optimizations in DV vs. ETL
Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
▪ Uses relational logic
▪ Metadata is equivalent to that of a database
▪ Enables ad hoc querying
Key difference between ETL engines and DV:
▪ ETL engines are optimized for static bulk movements (fixed data flows)
▪ Data virtualization is optimized for queries (a dynamic execution plan per query)
Denodo's performance architecture resembles that of an RDBMS
Performance and Optimizations in Denodo
Focused on 3 core concepts
Dynamic Multi-Source Query Execution Plans
▪ Leverages processing power & architecture of data sources
▪ Dynamic to support ad hoc queries
▪ Uses statistics for cost-based query plans
Selective Materialization
▪ Intelligent caching of only the most relevant and often-used information
Optimized Resource Management
▪ Smart allocation of resources to handle high concurrency
▪ Throttling to control and mitigate source impact
▪ Resource plans based on rules
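The throttling idea above, in its simplest form, is a cap on concurrent queries per source. A minimal sketch, assuming an invented `SourceThrottle` class (not a Denodo API), using a semaphore:

```python
import threading

# Toy source throttle (invented name/API): cap concurrent queries against
# one source with a semaphore; callers block when the source is saturated.
class SourceThrottle:
    def __init__(self, max_concurrent):
        self._slots = threading.Semaphore(max_concurrent)

    def run(self, query_fn):
        with self._slots:          # acquire a slot, release when done
            return query_fn()

throttle = SourceThrottle(max_concurrent=2)
results = []
threads = [threading.Thread(target=lambda i=i: results.append(throttle.run(lambda: i)))
           for i in range(5)]
for t in threads: t.start()
for t in threads: t.join()
print(sorted(results))  # [0, 1, 2, 3, 4]
```

All five queries still complete; the semaphore only limits how many hit the source at once, which is the "mitigate source impact" goal.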
Query Optimizer
Performance Optimization
Step by Step
Metadata Query Tree
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for views involved in the query
Static Optimizer
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Query delegation
• Data movement query plans
Cost Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.
Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)
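The static-rewrite and delegation steps above can be sketched as a toy plan rewriter; the nested-dict plan representation, rule set, and source names here are invented for illustration and are not Denodo's actual internals:

```python
# Toy static optimizer over a nested-dict plan tree (illustrative only).

def push_filter(plan):
    """Push a filter below a join when its predicate mentions only one side."""
    if plan["op"] == "filter" and plan["child"]["op"] == "join":
        join = plan["child"]
        side = "left" if plan["table"] == join["left"]["table"] else "right"
        join[side] = {"op": "filter", "pred": plan["pred"],
                      "table": plan["table"], "child": join[side]}
        return join
    return plan

def delegable(plan, source):
    """A subtree can be shipped to a source if every scan in it lives there."""
    if plan["op"] == "scan":
        return plan["source"] == source
    kids = [plan[k] for k in ("left", "right", "child") if k in plan]
    return all(delegable(k, source) for k in kids)

plan = {"op": "filter", "pred": "year = 2004", "table": "sales",
        "child": {"op": "join",
                  "left": {"op": "scan", "table": "sales", "source": "netezza"},
                  "right": {"op": "scan", "table": "customer", "source": "oracle"}}}

optimized = push_filter(plan)
# After pushdown the filter sits on the Netezza branch, so that whole branch
# can be delegated to Netezza while the join runs in the DV layer.
print(optimized["op"], delegable(optimized["left"], "netezza"))
```

The physical-plan step would then translate each delegable subtree into the dialect of its source; that translation is omitted here.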
How the Dynamic Query Optimizer Works
How the Dynamic Query Optimizer Works
Key Optimizations for Logical Data Warehouse Scenarios
Automatic JOIN reordering
▪ Groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer
▪ End users don’t need to worry about the optimal “pairing” of the tables
The partial aggregation push-down optimization is key in LDW scenarios: based on PK-FK constraints,
it pushes the aggregation (grouped on the key columns) down to the DW
▪ Leverages the processing power of the DW, optimized for these aggregations
▪ Significantly reduces the data transferred through the network (e.g. from 1 B rows to 10 K)
The Cost-based Optimizer picks the right JOIN strategies based on estimates of data volumes,
existence of indexes, transfer rates, etc.
▪ Denodo estimates costs in a different way for parallel databases (Vertica, Netezza, Teradata) than for regular
databases to take into consideration the different way those systems operate (distributed data, parallel processing,
different aggregation techniques, etc.)
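The partial aggregation push-down described above can be sketched with two in-memory SQLite databases standing in for the DW source and the DV layer; all schemas and row counts are invented for illustration:

```python
import sqlite3

# "DW" source holding the large fact table (illustrative data).
dw = sqlite3.connect(":memory:")
dw.execute("CREATE TABLE sales (item_id INTEGER, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?)",
               [(i % 50, 1.0) for i in range(5_000)])

# "DV layer" holding the small dimension.
dv = sqlite3.connect(":memory:")
dv.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, brand TEXT)")
dv.executemany("INSERT INTO item VALUES (?, ?)",
               [(i, f"brand_{i % 5}") for i in range(50)])

# Pushed-down aggregation: 5,000 fact rows collapse to 50 rows at the source.
partial = dw.execute(
    "SELECT item_id, SUM(amount) AS s FROM sales GROUP BY item_id").fetchall()

# Final re-aggregation by brand runs over the tiny transferred result set.
dv.execute("CREATE TEMP TABLE partial (item_id INTEGER, s REAL)")
dv.executemany("INSERT INTO partial VALUES (?, ?)", partial)
by_brand = dv.execute(
    "SELECT i.brand, SUM(p.s) FROM partial p JOIN item i USING (item_id) "
    "GROUP BY i.brand ORDER BY i.brand").fetchall()
print(len(partial), by_brand[0])  # 50 rows moved instead of 5,000
```

The PK-FK relationship is what guarantees the two-phase aggregation gives the same totals as grouping after the join.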
How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Automatic Data Movement
▪ Creation of temp tables in one of the systems to enable complete delegation
▪ Only considered as an option if the target source has the “data movement” option enabled
▪ Use of native bulk load APIs for better performance
Execution Alternatives
▪ If a view exists in more than one system, Denodo can decide at execution time which one to use
▪ The goal is to maximize query delegation depending on the other tables involved in the query
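Automatic data movement can be sketched the same way: copy the small table into the system that holds the facts, then delegate the whole query. Table names and data are invented, and SQLite's temp table stands in for the bulk-loaded staging table:

```python
import sqlite3

# System holding the large fact table (illustrative data).
facts = sqlite3.connect(":memory:")
facts.execute("CREATE TABLE sales (item_id INTEGER, amount REAL)")
facts.executemany("INSERT INTO sales VALUES (?, ?)",
                  [(i % 4, 2.0) for i in range(1_000)])

# Separate system holding a tiny dimension.
dims = sqlite3.connect(":memory:")
dims.execute("CREATE TABLE item (item_id INTEGER PRIMARY KEY, brand TEXT)")
dims.executemany("INSERT INTO item VALUES (?, ?)",
                 [(0, "acme"), (1, "acme"), (2, "zen"), (3, "zen")])

# "Move" the 4-row dimension into a temp table next to the facts...
facts.execute("CREATE TEMP TABLE item (item_id INTEGER, brand TEXT)")
facts.executemany("INSERT INTO item VALUES (?, ?)",
                  dims.execute("SELECT item_id, brand FROM item").fetchall())

# ...so the join + aggregation are fully delegated to a single source.
result = facts.execute(
    "SELECT i.brand, SUM(s.amount) FROM sales s JOIN item i "
    "ON s.item_id = i.item_id GROUP BY i.brand ORDER BY i.brand").fetchall()
print(result)
```

Moving the 4-row table is far cheaper than pulling 1,000 fact rows out, which is why the optimizer only considers this when the movable side is small and the option is enabled.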
How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data
Optimizations for Virtual Partitioning
Eliminates unnecessary queries and processing based on a pre-execution analysis of
the views and the queries
▪ Pruning of unnecessary JOIN branches
▪ Pruning of unnecessary UNION branches
▪ Push down of JOIN under UNION views
▪ Automatic Data movement for partition scenarios
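UNION-branch pruning for virtual partitioning can be sketched as a pre-execution overlap check; the branch metadata and source names below are invented for illustration:

```python
# Each UNION branch of a virtually partitioned view carries a partition
# constraint (here: a year range, invented for the example).
branches = [
    {"source": "dw_2019", "year_min": 2019, "year_max": 2019},
    {"source": "dw_2020", "year_min": 2020, "year_max": 2020},
    {"source": "dw_2021", "year_min": 2021, "year_max": 2021},
]

def prune(branches, year_lo, year_hi):
    """Keep only branches whose partition range overlaps the query range."""
    return [b for b in branches
            if b["year_min"] <= year_hi and b["year_max"] >= year_lo]

# A query filtered to 2021 touches one source instead of all three.
print([b["source"] for b in prune(branches, 2021, 2021)])  # ['dw_2021']
```

Because the contradiction is detected before execution, the pruned sources never receive a query at all.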
Caching
Performance Optimization
Caching
Real time vs. caching
Sometimes real-time access & federation are not a good fit:
▪ Sources are slow (e.g. text files, cloud apps like Salesforce.com)
▪ A lot of data processing is needed (e.g. complex combinations, transformations, matching,
cleansing, etc.)
▪ Access is limited, or impact on the sources has to be mitigated
For these scenarios, Denodo can replicate just the relevant data in the cache
Caching
Overview
Denodo’s cache system is based on an external relational database
▪ Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
▪ MPP (Teradata, Netezza, Vertica, Redshift, etc.)
▪ In-memory storage (Oracle TimesTen, SAP HANA)
Works at the view level
▪ Allows hybrid access (real-time / cached) within an execution tree
Cache Control (population / maintenance)
▪ Manually – user initiated at any time
▪ Time based - using the TTL or the Denodo Scheduler
▪ Event based - e.g. using JMS messages triggered in the DB
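The TTL-based control above can be sketched as a small in-process cache; the class and names are invented for illustration, and real Denodo caching stores results in an external relational database rather than in memory:

```python
import time

# Toy view-level cache with a TTL (invented API, illustrative only).
class ViewCache:
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}            # view name -> (timestamp, rows)

    def get(self, view, fetch):
        """Return cached rows while fresh; otherwise re-run the view and cache it."""
        now = time.monotonic()
        hit = self.entries.get(view)
        if hit and now - hit[0] < self.ttl:
            return hit[1]            # source access avoided
        rows = fetch()               # population is on-demand here; it could
        self.entries[view] = (now, rows)  # equally be scheduler- or event-driven
        return rows

calls = []
def slow_source():
    calls.append(1)                  # stands in for a slow file/cloud source
    return [("acme", 1000.0)]

cache = ViewCache(ttl_seconds=60)
first = cache.get("sales_by_brand", slow_source)
second = cache.get("sales_by_brand", slow_source)
print(first == second, len(calls))  # True 1  (second call served from cache)
```

Because caching works at the view level, a query plan can mix cached views like this one with views still fetched in real time.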
Thank you!
© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.
#DenodoDataFest