
Page 1: Performance Considerations in Logical Data Warehouse

Performance Considerations in the Logical Data Warehouse

Mark Pritchard

Sales Engineering Director UK, Denodo

Page 2: Performance Considerations in Logical Data Warehouse

Challenging the Myths
Logical Data Warehouse Performance

Page 3: Performance Considerations in Logical Data Warehouse


What is a Logical Data Warehouse?

A logical data warehouse is a data system that follows the ideas of a traditional EDW (star or snowflake schemas) and, in addition to one or more core DWs, includes data from external sources. The main objectives are to improve decision making and/or reduce cost.

Page 4: Performance Considerations in Logical Data Warehouse


"Data Virtualization solutions will be much slower than a persisted approach via ETL."
(C. Assumption, Acme Corp)

1. There is a large amount of data moved through the network for each query
2. Network transfer is slow

…but is this really true?

Page 5: Performance Considerations in Logical Data Warehouse


Challenging the Myths of Virtual Performance
Not as much data is moved as you may think!

▪ Complex queries can be solved by transferring moderate data volumes when the right techniques are applied
  ▪ Operational queries: predicate delegation produces small result sets (see the sketch after this list)
  ▪ Logical Data Warehouse and Big Data: Denodo uses the characteristics of the underlying star schemas to apply query rewriting rules that maximize delegation to specialized sources (especially heavy GROUP BY operations) and minimize data movement

▪ Current networks are almost as fast as reading from disk
  ▪ 10 Gb and 100 Gb Ethernet are commodity technologies
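As an illustration of predicate delegation, here is a minimal sketch with hypothetical view and table names (v_orders over a source table orders); the point is that only the rows matching the delegated filter ever cross the network:

    -- User query against a virtual view:
    SELECT customer_id, order_total
    FROM   v_orders
    WHERE  order_date = DATE '2017-12-01'
      AND  region = 'UK';

    -- With predicate delegation, the WHERE clause is pushed to the source,
    -- which executes something like:
    --   SELECT customer_id, order_total
    --   FROM   orders
    --   WHERE  order_date = DATE '2017-12-01' AND region = 'UK'
    -- Only the small matching result set is transferred to the DV layer.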

Page 6: Performance Considerations in Logical Data Warehouse


Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse

Denodo has done extensive testing using queries from the standard benchmark TPC-DS* in the following scenario:

• It compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.

[Diagram: both setups contain the same tables, Sales Facts (290 M rows), Customer Dim. (2 M rows), and Items Dim. (400 K rows): federated across sources in the logical DW vs. replicated into the MPP system in the physical DW.]

* TPC-DS is the de facto industry-standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.

Page 7: Performance Considerations in Logical Data Warehouse


Performance Comparison
Logical Data Warehouse vs. Physical Data Warehouse

| Query Description | Returned Rows | Time (Netezza) | Time (Denodo, federating Oracle, Netezza & SQL Server) | Optimization Technique (automatically selected) |
|---|---|---|---|---|
| Total sales by customer | 1.99 M | 20.9 sec. | 21.4 sec. | Full aggregation push-down |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec. | 59.0 sec. | Full aggregation push-down |
| Total sales by item brand | 31.35 K | 4.7 sec. | 5.0 sec. | Partial aggregation push-down |
| Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec. | 5.2 sec. | On-the-fly data movement |
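To make the query shapes concrete, here is a minimal sketch of the first query ("total sales by customer") using TPC-DS-style names (store_sales, customer); these names are assumptions, and this is not the exact benchmark SQL:

    -- Approximate shape of "total sales by customer":
    SELECT c.c_customer_id,
           SUM(ss.ss_net_paid) AS total_sales
    FROM   store_sales ss
    JOIN   customer c ON ss.ss_customer_sk = c.c_customer_sk
    GROUP  BY c.c_customer_id;   -- ~1.99 M groups returned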

Page 8: Performance Considerations in Logical Data Warehouse

Performance Optimization
Logical Data Warehouse Performance

Page 9: Performance Considerations in Logical Data Warehouse


Performance and Optimizations in Denodo
Comparing optimizations in DV vs. ETL

Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
▪ Uses relational logic
▪ Metadata is equivalent to that of a database
▪ Enables ad hoc querying

Key difference between ETL engines and DV:
▪ ETL engines are optimized for static bulk movements: fixed data flows
▪ Data virtualization is optimized for queries: a dynamic execution plan per query

Denodo's performance architecture resembles that of an RDBMS.

Page 10: Performance Considerations in Logical Data Warehouse


Performance and Optimizations in Denodo
Focused on 3 core concepts

Dynamic Multi-Source Query Execution Plans
▪ Leverages the processing power and architecture of the data sources
▪ Dynamic, to support ad hoc queries
▪ Uses statistics for cost-based query plans

Selective Materialization
▪ Intelligent caching of only the most relevant and most often used information

Optimized Resource Management
▪ Smart allocation of resources to handle high concurrency
▪ Throttling to control and mitigate impact on the sources
▪ Rule-based resource plans

Page 11: Performance Considerations in Logical Data Warehouse

Query Optimizer
Performance Optimization

Page 12: Performance Considerations in Logical Data Warehouse


How the Dynamic Query Optimizer Works
Step by Step

Metadata Query Tree

• Maps query entities (tables, fields) to actual metadata

• Retrieves execution capabilities and restrictions for views involved in the query

Static Optimizer

• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)

• Query delegation

• Data movement query plans

Cost Based Optimizer

• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.

Physical Execution Plan

• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)

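As a small illustration of the Static Optimizer's SQL rewriting rules, consider redundant-filter removal; the view and column names below (v_sales, sale_year) are hypothetical:

    -- Incoming ad hoc query with a redundant predicate:
    SELECT customer_id, amount
    FROM   v_sales
    WHERE  sale_year >= 2000
      AND  sale_year >= 1990;   -- redundant: implied by the first predicate

    -- After static rewriting, only the tighter filter is delegated:
    --   SELECT customer_id, amount FROM sales WHERE sale_year >= 2000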

Page 13: Performance Considerations in Logical Data Warehouse


How the Dynamic Query Optimizer Works
Key Optimizations for Logical Data Warehouse Scenarios

Automatic JOIN reordering
▪ Groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer
▪ End users don't need to worry about the optimal "pairing" of the tables

Partial Aggregation push-down is a key optimization in LDW scenarios. Based on PK-FK restrictions, it pushes the aggregation (for the PKs) down to the DW (see the sketch below):
▪ Leverages the processing power of the DW, which is optimized for these aggregations
▪ Significantly reduces the data transferred through the network (e.g., from 1 B rows to 10 K)

The Cost-based Optimizer picks the right JOIN strategies based on estimates of data volumes, the existence of indexes, transfer rates, etc.
▪ Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different ways those systems operate (distributed data, parallel processing, different aggregation techniques, etc.)
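A minimal sketch of the partial aggregation push-down rewrite, assuming hypothetical names (a sales fact table in the DW, a customer dimension in another source); the row counts in the comments are illustrative:

    -- Federated query as written by the user:
    SELECT c.country, SUM(s.amount) AS total
    FROM   dw.sales s
    JOIN   crm.customer c ON s.customer_id = c.id
    GROUP  BY c.country;

    -- Step 1: aggregate by the join key inside the DW:
    --   SELECT customer_id, SUM(amount) AS amt
    --   FROM   sales
    --   GROUP  BY customer_id           -- e.g., 1 B rows -> ~2 M groups
    -- Step 2: the DV layer joins the much smaller partial result with the
    -- dimension and re-aggregates (a SUM of partial SUMs is still correct):
    --   SELECT c.country, SUM(p.amt)
    --   FROM   partial p JOIN customer c ON p.customer_id = c.id
    --   GROUP  BY c.country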

Page 14: Performance Considerations in Logical Data Warehouse


How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data

Automatic Data Movement (see the sketch after this list)
▪ Creates temp tables in one of the systems to enable complete delegation
▪ Only considered as an option if the target source has the "data movement" option enabled
▪ Uses native bulk load APIs for better performance

Execution Alternatives
▪ If a view exists in more than one system, Denodo can decide at execution time which one to use
▪ The goal is to maximize query delegation, depending on the other tables involved in the query
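A sketch of the data movement idea in generic SQL (not Denodo's actual internal commands); all table names are hypothetical:

    -- Ship the small dimension into the system that holds the facts:
    CREATE TEMPORARY TABLE tmp_items AS
    SELECT item_id, brand
    FROM   items_source;             -- small table, moved via bulk load

    -- The whole join can now be delegated to a single source:
    SELECT t.brand, SUM(s.amount) AS total_sales
    FROM   sales s
    JOIN   tmp_items t ON s.item_id = t.item_id
    GROUP  BY t.brand;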

Page 15: Performance Considerations in Logical Data Warehouse


How the Dynamic Query Optimizer Works
Other relevant optimization techniques for LDW and Big Data

Optimizations for Virtual Partitioning

Eliminates unnecessary queries and processing based on a pre-execution analysis of the views and the queries (see the sketch after this list):
▪ Pruning of unnecessary JOIN branches
▪ Pruning of unnecessary UNION branches
▪ Push-down of JOINs under UNION views
▪ Automatic data movement for partition scenarios
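A sketch of UNION-branch pruning for virtual partitioning, with hypothetical yearly partition tables that each contain a sale_year column:

    -- Partitioned view defined as a UNION ALL of yearly tables:
    CREATE VIEW all_sales AS
      SELECT * FROM sales_2016     -- holds only 2016 rows
      UNION ALL
      SELECT * FROM sales_2017;    -- holds only 2017 rows

    -- A query restricted to one partition...
    SELECT SUM(amount) FROM all_sales WHERE sale_year = 2017;
    -- ...is pruned before execution: the sales_2016 branch is never
    -- queried, and only
    --   SELECT SUM(amount) FROM sales_2017 WHERE sale_year = 2017
    -- is sent to the source.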

Page 16: Performance Considerations in Logical Data Warehouse

Caching
Performance Optimization


Page 17: Performance Considerations in Logical Data Warehouse


Caching
Real time vs. caching

Sometimes real-time access and federation are not a good fit:
▪ Sources are slow (e.g., text files, cloud apps like Salesforce.com)
▪ A lot of data processing is needed (e.g., complex combinations, transformations, matching, cleansing, etc.)
▪ Access is limited, or the impact on the sources has to be mitigated

For these scenarios, Denodo can replicate just the relevant data in the cache, as sketched below.
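A generic SQL sketch of the idea (not Denodo's VQL): replicate only the relevant slice of a slow source into the cache database; all names are hypothetical:

    -- Populate the cache with just the relevant rows and columns:
    CREATE TABLE cache_opportunities AS
    SELECT opportunity_id, account_id, stage, amount,
           CURRENT_TIMESTAMP AS cached_at      -- when the row was cached
    FROM   salesforce_opportunities            -- slow cloud source
    WHERE  last_modified >= DATE '2017-01-01'; -- only the relevant data

    -- Queries over this view are now served from the cache database, while
    -- other views in the same execution tree can still be resolved against
    -- their sources in real time.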

Page 18: Performance Considerations in Logical Data Warehouse


Caching
Overview

Denodo's cache system is based on an external relational database:
▪ Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
▪ MPP (Teradata, Netezza, Vertica, Redshift, etc.)
▪ In-memory storage (Oracle TimesTen, SAP HANA)

Works at the view level
▪ Allows hybrid access (real-time / cached) within a single execution tree

Cache control (population / maintenance), illustrated after this list:
▪ Manual: user-initiated at any time
▪ Time-based: using the TTL or the Denodo Scheduler
▪ Event-based: e.g., using JMS messages triggered in the DB
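As a rough illustration of time-based (TTL) maintenance, in generic SQL rather than Denodo syntax, reusing the hypothetical cache_opportunities table from the previous sketch:

    -- Invalidate rows older than the TTL (here assumed to be 8 hours):
    DELETE FROM cache_opportunities
    WHERE  cached_at < CURRENT_TIMESTAMP - INTERVAL '8' HOUR;

    -- Re-populate the cache from the source:
    INSERT INTO cache_opportunities
    SELECT opportunity_id, account_id, stage, amount,
           CURRENT_TIMESTAMP AS cached_at
    FROM   salesforce_opportunities
    WHERE  last_modified >= DATE '2017-01-01';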

Page 19: Performance Considerations in Logical Data Warehouse

Thank you!

© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.

#DenodoDataFest