Logical Data Warehouse, Data Lakes, and Data Services Marketplaces
New York City, 9th June, 2016
Uploaded by Denodo, 09-Jan-2017

TRANSCRIPT

Page 1: Logical Data Warehouse and Data Lakes

New York City

9th June, 2016

Logical Data Warehouse, Data Lakes, and Data Services Marketplaces

Page 2: Logical Data Warehouse and Data Lakes

Agenda

1. Introductions
2. Logical Data Warehouse and Data Lakes
3. Coffee Break
4. Data Services Marketplaces
5. Q&A

Page 3: Logical Data Warehouse and Data Lakes

HEADQUARTERS
Palo Alto, CA.

DENODO OFFICES, CUSTOMERS, PARTNERS
Global presence throughout North America, EMEA, APAC, and Latin America.

CUSTOMERS
250+ customers, including many F500 and G2000 companies across every major industry, have gained significant business agility and ROI.

LEADERSHIP
Longest continuous focus on data virtualization and data services. Product leadership. Solutions expertise.

THE LEADER IN DATA VIRTUALIZATION
Denodo provides agile, high-performance data integration and data abstraction across the broadest range of enterprise, cloud, big data and unstructured data sources, and real-time data services at half the cost of traditional approaches.

Page 4: Logical Data Warehouse and Data Lakes

Speakers

Paul Moxon

Senior Director of Product Management, Denodo

Pablo Álvarez

Principal Technical Account Manager, Denodo

Rubén Fernández

Technical Account Manager, Denodo

Page 5: Logical Data Warehouse and Data Lakes

Logical Data Warehouse and Data Lakes
New York City, June 2016

Page 6: Logical Data Warehouse and Data Lakes

Agenda

1. The Logical Data Warehouse
2. Different Types, Different Needs
3. Performance in a LDW
4. Customer Success Stories
5. Q&A

Page 7: Logical Data Warehouse and Data Lakes

What is a Logical Data Warehouse?

A logical data warehouse is a data system that follows the ideas of a traditional EDW (star or snowflake schemas) and includes, in addition to one (or more) core DWs, data from external sources.

The main motivations are improved decision making and/or cost reduction.

Page 8: Logical Data Warehouse and Data Lakes

Logical Data Warehouse

Description:

“The Logical Data Warehouse (LDW) is a new data management architecture for analytics combining the strengths of traditional repository warehouses with alternative data management and access strategy. The LDW will form a new best practice by the end of 2015.”

“The LDW is an evolution and augmentation of DW practices, not a replacement”

“A repository-only style DW contains a single ontology/taxonomy, whereas in the LDW a semantic layer can contain many combination of use cases, many business definitions of the same information”

“The LDW permits an IT organization to make a large number of datasets available for analysis via query tools and applications.”


Gartner Definition

Gartner Hype Cycle for Enterprise Information Management, 2012


Page 10: Logical Data Warehouse and Data Lakes

Logical Data Warehouse

Description:

A semantic layer on top of the data warehouse that keeps the business data definition.

Allows the integration of multiple data sources including enterprise systems, the data warehouse, additional processing nodes (analytical appliances, Big Data, …), Web, Cloud and unstructured data.

Publishes data to multiple applications and reporting tools.


Page 11: Logical Data Warehouse and Data Lakes

Gartner’s View of Data Integration: Three Integration/Semantic Layer Alternatives

Application/BI Tool as Data Integration/Semantic Layer

EDW as Data Integration/Semantic Layer

Data Virtualization as Data Integration/Semantic Layer

[Diagram: the three alternatives, each layered over EDW and ODS sources]

Page 12: Logical Data Warehouse and Data Lakes

Application/BI Tool as the Data Integration Layer

[Diagram: Application/BI Tool layered over EDW and ODS]

• Integration is delegated to end user tools and applications
• e.g. BI Tools with ‘data blending’
• Results in duplication of effort – integration defined many times in different tools
• Impact of change in data schema?
• End user tools are not intended to be integration middleware
• Not their primary purpose or expertise

Page 13: Logical Data Warehouse and Data Lakes

EDW as the Data Integration Layer

[Diagram: EDW layered over ODS]

• Access to ‘other’ data (query federation) via EDW
• Teradata QueryGrid, IBM Fluid Query, SAP Smart Data Access, etc.
• Often coupled with traditional ETL replication of data into EDW
• EDW as ‘center of the data universe’
• Provides data integration and semantic layer
• Appears attractive to organizations heavily invested in EDW
• More than one EDW? EDW costs?

Page 14: Logical Data Warehouse and Data Lakes

Data Virtualization as the Data Integration Layer

[Diagram: Data Virtualization layered over EDW and ODS]

• Move data integration and semantic layer to an independent Data Virtualization platform
• Purpose-built for supporting data access across multiple heterogeneous data sources
• Separate layer provides semantic models for underlying data
• Physical to logical mapping
• Enforces common and consistent security and governance policies
• Gartner’s recommended approach

Page 15: Logical Data Warehouse and Data Lakes

Logical Data Warehouse

[Diagram: a virtual layer spanning an EDW, a Hadoop cluster (HDFS files), document collections, a NoSQL database, an ERP system, a relational database, and Excel files, exposed as unified Sales data]

Page 16: Logical Data Warehouse and Data Lakes

Logical Data Warehouse
Reference Architecture by Denodo

Page 17: Logical Data Warehouse and Data Lakes

The State and Future of Data Integration. Gartner, 25 May 2016:

Physical data movement architectures that aren’t designed to support the dynamic nature of business change, volatile requirements and massive data volume are increasingly being replaced by data virtualization.

Evolving approaches (such as the use of LDW architectures) include implementations beyond repository-centric techniques.

Page 18: Logical Data Warehouse and Data Lakes

What about the Logical Data Lake?

A Data Lake will not have a star or snowflake schema, but rather a more heterogeneous collection of views with raw data from heterogeneous sources.

The virtual layer will act as a common umbrella under which these different sources are presented to the end user as a single system.

However, from the virtualization perspective, a Virtual Data Lake shares many technical aspects with a LDW, and most of this content also applies to a Logical Data Lake.

Page 19: Logical Data Warehouse and Data Lakes

Different Types, Different Needs
Common Patterns for a Logical Data Warehouse

Page 20: Logical Data Warehouse and Data Lakes

Common Patterns for a Logical Data Warehouse

1. The Virtual Data Mart
2. DW + MDM
   Data Warehouse extended with master data
3. DW + Cloud
   Data Warehouse extended with cloud data
4. DW + DW
   Integration of multiple Data Warehouses
5. DW historical offloading
   DW horizontal partitioning with historical data in cheaper storage
6. Slim DW extension
   DW vertical partitioning with rarely used data in cheaper storage

Page 21: Logical Data Warehouse and Data Lakes

Virtual Data Marts

Business-friendly models defined on top of one or multiple systems, often “flavored” for a particular division.

Motivation
Hide the complexity of star schemas from business users
Simplify the model for a particular vertical
Reuse semantic models and security across multiple reporting engines

Typical queries
Simple projections, filters and aggregations on top of curated “fat tables” that merge data from facts and many dimensions.

Simplified semantic models for business users.
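The “fat table” idea above can be sketched with SQLite standing in for the virtual layer: a single denormalized view hides the star-schema joins, so business users issue simple projections and aggregations. The schema and data are hypothetical, for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE retailer (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (product_fk INTEGER, retailer_fk INTEGER, amount REAL);

INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO retailer VALUES (1, 'Acme Stores');
INSERT INTO sales VALUES (1, 1, 10.0), (1, 1, 5.0), (2, 1, 7.5);

-- The virtual data mart: one business-friendly view over the star schema.
CREATE VIEW sales_mart AS
SELECT p.name AS product, r.name AS retailer, s.amount
FROM sales s
JOIN product p ON s.product_fk = p.id
JOIN retailer r ON s.retailer_fk = r.id;
""")

# Business users query the flat view; the joins stay hidden behind it.
rows = conn.execute(
    "SELECT product, SUM(amount) FROM sales_mart GROUP BY product ORDER BY product"
).fetchall()
print(rows)  # [('Gadget', 7.5), ('Widget', 15.0)]
```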

Page 22: Logical Data Warehouse and Data Lakes

Virtual Data Marts

[Diagram: a virtual “Sales” view over the EDW star schema (Time, Product and Retailer dimensions, sales fact table), combined with product details from other sources]

Page 23: Logical Data Warehouse and Data Lakes

DW + MDM

Slim dimensions with extended information maintained in an external MDM system.

Motivation
Keep a single copy of golden records in the MDM that can be reused across systems and managed in a single place.

Typical queries
Join a large fact table (DW) with several MDM dimensions, with aggregations on top.

Example
Revenue by customer, projecting the address from the MDM.
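The example query can be sketched with two separate SQLite databases standing in for the DW and the MDM, and `ATTACH` standing in for the federation layer. The tables and data are hypothetical; in an LDW the join would be planned by the virtualization engine, not written by hand.

```python
import sqlite3, tempfile, os

tmp = tempfile.mkdtemp()
dw_path, mdm_path = os.path.join(tmp, "dw.db"), os.path.join(tmp, "mdm.db")

# The data warehouse holds the large fact table.
dw = sqlite3.connect(dw_path)
dw.execute("CREATE TABLE sales (customer_fk INTEGER, amount REAL)")
dw.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 100.0), (1, 50.0), (2, 30.0)])
dw.commit(); dw.close()

# The MDM holds the golden customer records (slim dimension, extended info).
mdm = sqlite3.connect(mdm_path)
mdm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, address TEXT)")
mdm.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, 'Alice', '1 Main St'), (2, 'Bob', '2 Oak Ave')])
mdm.commit(); mdm.close()

# "Federated" query: revenue by customer, projecting the address from the MDM.
conn = sqlite3.connect(dw_path)
conn.execute(f"ATTACH DATABASE '{mdm_path}' AS mdm")
rows = conn.execute("""
    SELECT c.name, c.address, SUM(s.amount)
    FROM sales s JOIN mdm.customer c ON s.customer_fk = c.id
    GROUP BY c.id ORDER BY c.name
""").fetchall()
print(rows)  # [('Alice', '1 Main St', 150.0), ('Bob', '2 Oak Ave', 30.0)]
```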

Page 24: Logical Data Warehouse and Data Lakes

DW + MDM dimensions

[Diagram: the EDW sales fact table joined with Time, Product and Retailer dimensions, some of which are maintained in the external MDM]

Page 25: Logical Data Warehouse and Data Lakes

DW + Cloud dimensional data

Fresh data from cloud systems (e.g. SFDC) is mixed with the EDW, usually on the dimensions. The DW is sometimes also in the cloud.

Motivation
Take advantage of “fresh” data coming straight from SaaS systems
Avoid local replication of cloud systems

Typical queries
Dimensions are joined with cloud data to filter based on some external attribute not available (or not current) in the EDW.

Example
Report on current revenue for accounts where the potential for an expansion is higher than 80%.

Page 26: Logical Data Warehouse and Data Lakes

DW + Cloud dimensional data

[Diagram: EDW star schema (Time and Product dimensions, sales fact table) with the Customer dimension joined to CRM customer data from SFDC]

Page 27: Logical Data Warehouse and Data Lakes

Multiple DW integration

Use of multiple DWs as if they were only one.

Motivation
Mergers and acquisitions
Different DWs by department
Transition to new EDW deployments (migration to Spark, Redshift, etc.)

Typical queries
Joins across fact tables in different DWs, with aggregations before or after the JOIN.

Example
Get customers with purchases higher than 100 USD that do not have a fidelity card (purchases and fidelity card data in different DWs).
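The example above can be sketched as a single query, written as if the finance DW (purchases) and the marketing DW (fidelity cards) were one system, which is exactly what the virtual layer presents to the user. Tables and data are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE purchases (customer_id INTEGER, amount REAL);   -- finance EDW
CREATE TABLE fidelity_cards (customer_id INTEGER);           -- marketing EDW
INSERT INTO purchases VALUES (1, 120.0), (2, 80.0), (2, 90.0), (3, 150.0);
INSERT INTO fidelity_cards VALUES (3);
""")

# Customers whose total purchases exceed 100 USD and who hold no fidelity card.
rows = conn.execute("""
    SELECT customer_id, SUM(amount) AS total
    FROM purchases
    WHERE customer_id NOT IN (SELECT customer_id FROM fidelity_cards)
    GROUP BY customer_id
    HAVING total > 100
    ORDER BY customer_id
""").fetchall()
print(rows)  # [(1, 120.0), (2, 170.0)]
```

In a real deployment the two tables live in different warehouses, so the virtualization engine decides where the anti-join and the aggregation run.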

Page 28: Logical Data Warehouse and Data Lakes

Multiple DW integration

[Diagram: Finance EDW (Time, Product, Region and City dimensions, sales facts, Store) joined with Marketing EDW (Customer and Product dimensions, fidelity facts)]

*Real examples: Nationwide POC, IBM tests

Page 29: Logical Data Warehouse and Data Lakes

DW Historical Partitioning (horizontal partitioning)

Only the most current data (e.g. the last year) is in the EDW. Historical data is offloaded to a Hadoop cluster.

Motivations
Reduce storage cost
Transparently use the two datasets as if they were all together

Typical queries
Facts are defined as a partitioned UNION based on date.
Queries join the “virtual fact” with dimensions and aggregate on top.

Example
Queries on current data only need to go to the DW, but longer timespans need to merge with Hadoop.
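The partitioned UNION can be sketched in SQLite: the “virtual fact” is a UNION ALL of the current partition (standing in for the EDW) and the offloaded historical partition (standing in for Hadoop). Dates and figures are illustrative; in an LDW a date predicate lets the optimizer prune the partition whose range cannot match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE current_sales (sale_date TEXT, amount REAL);    -- recent data, in the EDW
CREATE TABLE historical_sales (sale_date TEXT, amount REAL); -- older data, offloaded
INSERT INTO current_sales VALUES ('2016-05-01', 10.0), ('2016-05-02', 20.0);
INSERT INTO historical_sales VALUES ('2012-03-01', 5.0), ('2013-07-01', 7.0);

-- The "virtual fact": a partitioned UNION based on date.
CREATE VIEW sales AS
SELECT * FROM current_sales
UNION ALL
SELECT * FROM historical_sales;
""")

# Recent timespan: logically only the current partition qualifies.
recent = conn.execute(
    "SELECT SUM(amount) FROM sales WHERE sale_date >= '2016-01-01'"
).fetchone()[0]
# Long timespan: merges both partitions.
all_time = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(recent, all_time)  # 30.0 42.0
```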

Page 30: Logical Data Warehouse and Data Lakes

DW Historical offloading (horizontal partitioning)

[Diagram: the virtual sales fact is a union of Current Sales in the EDW and Historical Sales in cheaper storage, joined with Time, Product and Retailer dimensions]

Page 31: Logical Data Warehouse and Data Lakes

Slim DW extension (vertical partitioning)

Minimal DW, with more complete raw data in a Hadoop cluster.

Motivation
Reduce cost
Transparently use the two datasets as if they were all together

Typical queries
Tables are defined virtually as 1-to-1 joins between the two systems.
Queries join the facts with dimensions and aggregate on top.

Example
Common queries only need to go to the DW, but some queries need attributes or measures from Hadoop.

Page 32: Logical Data Warehouse and Data Lakes

Slim DW extension (vertical partitioning)

[Diagram: the virtual sales fact is a 1-to-1 join of Slim Sales in the EDW and Extended Sales in cheaper storage, joined with Time, Product and Retailer dimensions]

Page 33: Logical Data Warehouse and Data Lakes

Performance in a LDW

Page 34: Logical Data Warehouse and Data Lakes

It is a common assumption that a virtualized solution will be much slower than a persisted approach via ETL:

1. There is a large amount of data moved through the network for each query
2. Network transfer is slow

But is this really true?

Page 35: Logical Data Warehouse and Data Lakes

Debunking the myths of virtual performance

1. Complex queries can be solved transferring moderate data volumes when the right techniques are applied
   Operational queries: predicate delegation produces small result sets
   Logical Data Warehouse and Big Data: Denodo uses characteristics of underlying star schemas to apply query rewriting rules that maximize delegation to specialized sources (especially heavy GROUP BYs) and minimize data movement

2. Current networks are almost as fast as reading from disk
   10Gb and 100Gb Ethernet are a commodity

Page 36: Logical Data Warehouse and Data Lakes

Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse

Denodo has done extensive testing using queries from the standard benchmarking test TPC-DS* in the following scenario: it compares the performance of a federated approach in Denodo with an MPP system where all the data has been replicated via ETL.

[Diagram: federated setup (Customer dimension, 2 M rows; Sales facts, 290 M rows; Items dimension, 400 K rows, in separate sources) vs. the same tables replicated into a single MPP system]

* TPC-DS is the de-facto industry standard benchmark for measuring the performance of decision support solutions including, but not limited to, Big Data systems.

Page 37: Logical Data Warehouse and Data Lakes

Performance Comparison: Logical Data Warehouse vs. Physical Data Warehouse

Query description | Returned rows | Time Netezza | Time Denodo (federated Oracle, Netezza & SQL Server) | Optimization technique (automatically selected)
Total sales by customer | 1.99 M | 20.9 sec | 21.4 sec | Full aggregation push-down
Total sales by customer and year between 2000 and 2004 | 5.51 M | 52.3 sec | 59.0 sec | Full aggregation push-down
Total sales by item brand | 31.35 K | 4.7 sec | 5.0 sec | Partial aggregation push-down
Total sales by item where sale price less than current list price | 17.05 K | 3.5 sec | 5.2 sec | On-the-fly data movement

Page 38: Logical Data Warehouse and Data Lakes

Performance and optimizations in Denodo: focused on 3 core concepts

Dynamic Multi-Source Query Execution Plans
Leverages the processing power & architecture of the data sources
Dynamic, to support ad hoc queries
Uses statistics for cost-based query plans

Selective Materialization
Intelligent caching of only the most relevant and often-used information

Optimized Resource Management
Smart allocation of resources to handle high concurrency
Throttling to control and mitigate source impact
Resource plans based on rules

Page 39: Logical Data Warehouse and Data Lakes

Performance and optimizations in Denodo: comparing optimizations in DV vs. ETL

Although Data Virtualization is a data integration platform, architecturally speaking it is more similar to an RDBMS:
Uses relational logic
Metadata is equivalent to that of a database
Enables ad hoc querying

Key difference between ETL engines and DV:
ETL engines are optimized for static bulk movements (fixed data flows)
Data virtualization is optimized for queries (a dynamic execution plan per query)

Therefore, the performance architecture presented here resembles that of an RDBMS.

Page 40: Logical Data Warehouse and Data Lakes

Query Optimizer

Page 41: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works: Step by Step

Metadata Query Tree
• Maps query entities (tables, fields) to actual metadata
• Retrieves execution capabilities and restrictions for the views involved in the query

Static Optimizer
• Query delegation
• SQL rewriting rules (removal of redundant filters, tree pruning, join reordering, transformation push-up, star-schema rewritings, etc.)
• Data movement query plans

Cost-Based Optimizer
• Picks optimal JOIN methods and orders based on data distribution statistics, indexes, transfer rates, etc.

Physical Execution Plan
• Creates the calls to the underlying systems in their corresponding protocols and dialects (SQL, MDX, WS calls, etc.)

Page 42: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works

Example: total sales by retailer and product during the last month for the brand ACME

[Diagram: EDW star schema (Time dimension, sales fact table, Product dimension) with the Retailer dimension in the MDM]

SELECT retailer.name, product.name, SUM(sales.amount)
FROM sales
  JOIN retailer ON sales.retailer_fk = retailer.id
  JOIN product ON sales.product_fk = product.id
  JOIN time ON sales.time_fk = time.id
WHERE time.date < ADDMONTH(NOW(), -1)
  AND product.brand = 'ACME'
GROUP BY product.name, retailer.name

Page 43: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works

Example: non-optimized plan

[Query plan: GROUP BY product.name, retailer.name over a tree of three JOINs; row counts from the tree: 1,000,000,000 rows transferred from the fact source, 100 / 10 / 30 rows from the dimension queries, 10,000,000 rows after the joins]

Delegated source queries:

SELECT sales.retailer_fk, sales.product_fk, sales.time_fk, sales.amount
FROM sales

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'

SELECT time.date, time.id
FROM time
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)

Page 44: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works

Step 1: applies JOIN reordering to maximize delegation

[Query plan: GROUP BY product.name, retailer.name over two JOINs; the time filter is now delegated along with the fact scan, reducing the transferred fact rows to 100,000,000 (plus 100 and 10 dimension rows); 10,000,000 rows after the joins]

Delegated source queries:

SELECT sales.retailer_fk, sales.product_fk, sales.amount
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'

Page 45: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works

Step 2: since the JOIN is on foreign keys (1-to-many), and the GROUP BY is on attributes from the dimensions, it applies the partial aggregation push-down optimization

[Query plan: final GROUP BY product.name, retailer.name over two JOINs; the pre-aggregated fact query now returns only 10,000 rows (plus 100 and 10 dimension rows); 1,000 rows after the joins]

Delegated source queries:

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
GROUP BY sales.retailer_fk, sales.product_fk

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'

Page 46: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works

Step 3: selects the right JOIN strategy based on costs from data volume estimations

[Query plan: GROUP BY product.name, retailer.name (1,000 rows) over a HASH JOIN with the retailer dimension (100 rows) and a NESTED JOIN with the product dimension (10 rows); the fact query returns 1,000 rows]

Delegated source queries (the nested join pushes the ACME product IDs into the fact query):

SELECT sales.retailer_fk, sales.product_fk, SUM(sales.amount)
FROM sales JOIN time ON sales.time_fk = time.id
WHERE time.date < add_months(CURRENT_TIMESTAMP, -1)
  AND sales.product_fk IN (1, 2, …)
GROUP BY sales.retailer_fk, sales.product_fk

SELECT retailer.name, retailer.id
FROM retailer

SELECT product.name, product.id
FROM product
WHERE product.brand = 'ACME'

Page 47: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works: Summary

1. Automatic JOIN reordering
   Groups branches that go to the same source to maximize query delegation and reduce processing in the DV layer.
   End users don’t need to worry about the optimal “pairing” of the tables.

2. The partial aggregation push-down optimization is key in these scenarios. Based on PK-FK restrictions, it pushes the aggregation (grouping by the dimension keys) down to the DW.
   Leverages the processing power of the DW, which is optimized for these aggregations.
   Significantly reduces the data transferred through the network (from 1 billion rows to 10 thousand).

3. The cost-based optimizer picks the right JOIN strategies based on estimations of data volumes, existence of indexes, transfer rates, etc.
   Denodo estimates costs differently for parallel databases (Vertica, Netezza, Teradata) than for regular databases, to take into account the different way those systems operate (distributed data, parallel processing, different aggregation techniques, etc.)
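The key claim behind partial aggregation push-down is that pre-aggregating per foreign key at the source and joining the small result afterwards yields exactly the same answer as joining the raw facts first. A minimal SQLite sketch (hypothetical schema, tiny data) checks that equivalence:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (product_fk INTEGER, amount REAL);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO sales VALUES (1, 10.0), (1, 5.0), (2, 7.5), (2, 2.5);
""")

# Naive plan: ship the raw facts, join with the dimension, then aggregate.
naive = conn.execute("""
    SELECT p.name, SUM(s.amount) FROM sales s
    JOIN product p ON s.product_fk = p.id
    GROUP BY p.name ORDER BY p.name
""").fetchall()

# Pushed-down plan: aggregate per FK "inside the source", join the small result.
pushed = conn.execute("""
    SELECT p.name, t.total FROM
      (SELECT product_fk, SUM(amount) AS total FROM sales GROUP BY product_fk) t
    JOIN product p ON t.product_fk = p.id
    ORDER BY p.name
""").fetchall()

print(naive == pushed, naive)  # True [('Gadget', 10.0), ('Widget', 15.0)]
```

The 1-to-many PK-FK relationship is what makes the rewrite safe: each fact row matches exactly one dimension row, so summing before or after the join cannot change the totals.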

Page 48: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works: other relevant optimization techniques for LDW and Big Data

Automatic data movement
Creation of temp tables in one of the systems to enable complete delegation.
Only considered as an option if the target source has the “data movement” option enabled.
Uses native bulk load APIs for better performance.

Execution alternatives
If a view exists in more than one system, Denodo can decide at execution time which one to use.
The goal is to maximize query delegation depending on the other tables involved in the query.
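The automatic data movement idea can be sketched with two in-memory SQLite databases standing in for the two sources: the small dimension is copied into a temp table next to the large fact table, so the whole join and aggregation run in one system. Schema and data are hypothetical; a real engine would use native bulk load APIs rather than row-by-row inserts.

```python
import sqlite3

# Source 1: holds the (conceptually huge) fact table.
fact_db = sqlite3.connect(":memory:")
fact_db.execute("CREATE TABLE sales (item_fk INTEGER, amount REAL)")
fact_db.executemany("INSERT INTO sales VALUES (?, ?)", [(1, 9.0), (2, 4.0), (2, 6.0)])

# Source 2: holds the small dimension.
other_db = sqlite3.connect(":memory:")
other_db.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, brand TEXT)")
other_db.executemany("INSERT INTO item VALUES (?, ?)", [(1, 'ACME'), (2, 'Other')])

# "Move" the small dimension into the fact source as a temp table.
fact_db.execute("CREATE TEMP TABLE item_moved (id INTEGER PRIMARY KEY, brand TEXT)")
fact_db.executemany("INSERT INTO item_moved VALUES (?, ?)",
                    other_db.execute("SELECT id, brand FROM item").fetchall())

# Now the join + aggregation can be delegated entirely to the fact source.
rows = fact_db.execute("""
    SELECT i.brand, SUM(s.amount) FROM sales s
    JOIN item_moved i ON s.item_fk = i.id
    GROUP BY i.brand ORDER BY i.brand
""").fetchall()
print(rows)  # [('ACME', 9.0), ('Other', 10.0)]
```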

Page 49: Logical Data Warehouse and Data Lakes

How the Dynamic Query Optimizer Works: other relevant optimization techniques for LDW and Big Data

Optimizations for virtual partitioning
Eliminates unnecessary queries and processing based on a pre-execution analysis of the views and the queries.

Pruning of unnecessary JOIN branches
Relevant for horizontal partitioning and “fat” semantic models, when queries do not need attributes from all the tables.

Pruning of unnecessary UNION branches
Enables detection of unnecessary UNION branches in vertical partitioning scenarios.

Push-down of JOINs under UNION views
Enables the delegation of JOINs with dimensions.

Automatic data movement for partition scenarios
Enables the delegation of JOINs with dimensions.
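JOIN-branch pruning rests on an equivalence worth making concrete: a query against a “fat” view that only touches fact columns gives the same answer without the dimension joins, provided each FK matches exactly one dimension row (the PK-FK constraint the optimizer relies on). A small SQLite sketch with a hypothetical schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE sales (product_fk INTEGER NOT NULL, amount REAL);
INSERT INTO product VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO sales VALUES (1, 10.0), (2, 7.5);

-- "Fat" semantic view joining the fact with a dimension.
CREATE VIEW fat_sales AS
SELECT s.amount, p.name AS product
FROM sales s JOIN product p ON s.product_fk = p.id;
""")

# Query needs only fact columns, so the join branch can be pruned.
full = conn.execute("SELECT SUM(amount) FROM fat_sales").fetchone()[0]
pruned = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]  # join skipped
print(full == pruned, full)  # True 17.5
```

Here the pruning decision is made by hand to show the equivalence; in the LDW the pre-execution analysis detects it automatically.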

Page 50: Logical Data Warehouse and Data Lakes

Caching


Page 51: Logical Data Warehouse and Data Lakes

Caching: real time vs. caching

Sometimes, real-time access & federation are not a good fit:
Sources are slow (e.g. text files, cloud apps like Salesforce.com)
A lot of data processing is needed (e.g. complex combinations, transformations, matching, cleansing, etc.)
Access is limited, or the impact on the sources has to be mitigated

For these scenarios, Denodo can replicate just the relevant data in the cache.

Page 52: Logical Data Warehouse and Data Lakes

Caching: overview

Denodo’s cache system is based on an external relational database:
Traditional (Oracle, SQL Server, DB2, MySQL, etc.)
MPP (Teradata, Netezza, Vertica, Redshift, etc.)
In-memory storage (Oracle TimesTen, SAP HANA)

Works at the view level.
Allows hybrid access (real-time / cached) within an execution tree.

Cache control (population / maintenance):
Manually – user initiated at any time
Time based – using the TTL or the Denodo Scheduler
Event based – e.g. using JMS messages triggered in the DB
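The time-based control above can be sketched as a minimal view-level cache with a TTL: results from a slow source are stored once and served from the cache until the TTL expires, after which the next query repopulates it. This is an illustrative toy, not Denodo's implementation (which persists the cache in an external relational database).

```python
import time

class CachedView:
    """Caches one view's result set; refetches when the TTL has expired."""
    def __init__(self, fetch, ttl_seconds):
        self.fetch, self.ttl = fetch, ttl_seconds
        self.data, self.loaded_at = None, None

    def query(self):
        now = time.monotonic()
        if self.data is None or now - self.loaded_at > self.ttl:
            self.data = self.fetch()       # hit the slow source
            self.loaded_at = now
        return self.data                   # otherwise served from the cache

calls = []
def slow_source():
    calls.append(1)                        # count real source accesses
    return [('row', 1)]

view = CachedView(slow_source, ttl_seconds=60)
view.query(); view.query(); view.query()
print(len(calls))  # 1  (two of the three queries were cache hits)
```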

Page 53: Logical Data Warehouse and Data Lakes

References


Page 54: Logical Data Warehouse and Data Lakes


Further Reading

Data Virtualization Blog (http://www.datavirtualizationblog.com)

Check the following articles written by our CTO Alberto Pan in our blog:

• Myths in data virtualization performance

• Performance of Data Virtualization in Logical Data Warehouse scenarios

• Physical vs Logical Data Warehouse: the numbers

• Cost Based Optimization in Data Virtualization

Denodo Cookbook

• Data Warehouse Offloading

Page 55: Logical Data Warehouse and Data Lakes

Success Stories
Customer Case Studies

Page 56: Logical Data Warehouse and Data Lakes

Autodesk Overview

• Founded 1982 (NASDAQ: ADSK)
• Annual revenues (FY 2015) $2.5B
  Over 8,800 employees
• 3D modeling and animation software
  Flagship product is AutoCAD
• Market sectors:
  Architecture, Engineering, and Construction
  Manufacturing
  Media and Entertainment
  Recently started 3D printing offerings

Page 57: Logical Data Warehouse and Data Lakes

Business Drivers for Change

• The software consumption model is changing
  Perpetual licenses to subscriptions
  Users want more flexibility in how they use software
• Autodesk needed to transition to subscription pricing
  2016 – some products will be subscription only
• Lifetime revenue is higher with subscriptions
  Over 3-5 years, subscriptions = more revenue
• Changing a licensing model is disruptive

Page 58: Logical Data Warehouse and Data Lakes

Technology Challenges

• The current ‘traditional’ BI/EDW architecture was not designed for data streams from online apps
  Weblogs, clickstreams, cloud/desktop apps, etc.
• Existing infrastructure can’t simply ‘go away’
  Regulatory reporting (e.g. SEC)
  Existing ‘perpetual’ customers
• The ‘subscription’ infrastructure must work in parallel
  Extend and enhance existing systems
  With a single access point to all data
• Solution – a ‘Logical Data Warehouse’

Page 59: Logical Data Warehouse and Data Lakes

Logical Data Warehouse at Autodesk


Page 60: Logical Data Warehouse and Data Lakes

Logical Data Warehouse at Autodesk: Traditional BI/Reporting

Page 61: Logical Data Warehouse and Data Lakes

Logical Data Warehouse at Autodesk: ‘New Data’ Ingestion

Page 62: Logical Data Warehouse and Data Lakes

Logical Data Warehouse Example: Reporting on Combined Data

Page 63: Logical Data Warehouse and Data Lakes

Case Study: Autodesk Successfully Changes Their Revenue Model and Transforms Business

Autodesk, Inc. is an American multinational software corporation that makes software for the architecture, engineering, construction, manufacturing, media, and entertainment industries.

Problem
Autodesk was changing their business revenue model from a conventional perpetual license model to a subscription-based license model.
Inability to deliver high-quality data in a timely manner to business stakeholders.
Evolution from a traditional operational data warehouse to a contemporary logical data warehouse was deemed necessary for faster speed.

Solution
General-purpose platform to deliver data through a logical data warehouse.
The Denodo abstraction layer helps live invoicing with SAP.
Data virtualization enabled a culture of “see before you build”.

Results
Successfully transitioned to subscription-based licensing.
For the first time, Autodesk can do single-point security enforcement and has a uniform data environment for access.

Page 64: Logical Data Warehouse and Data Lakes

Q&A

Page 65: Logical Data Warehouse and Data Lakes

Thanks!

www.denodo.com [email protected]

© Copyright Denodo Technologies. All rights reserved. Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without prior written authorization from Denodo Technologies.