Big Data: Architecture and Performance Considerations in Logical Data Lakes

Architecture and Performance Considerations in the Logical Data Lake

Dr. Alberto Pan, Chief Technical Officer

Upload: denodo

Post on 09-Jan-2017

TRANSCRIPT

Page 1: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Architecture and Performance Considerations in the Logical Data Lake

Dr. Alberto Pan, Chief Technical Officer

Page 3: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Agenda

1. Data Lake Architecture

2. Data Virtualization in the Logical Data Lake

3. Performance: ‘Move Processing to the Data’

4. Performance: Choosing the Best Execution Plan

5. Example Scenario: The Numbers

Page 4: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Data Lake Architecture

Page 5: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Architecture of the Data Lake

[Architecture diagram. The labels recoverable from the slide are listed below, grouped by apparent diagram region.]

Consumers: Real-Time Decision Management, Alerts, Scorecards, Dashboards, Reporting, Data Discovery, Self-Service Search, Predictive Analytics, Statistical Analytics (R), Text Analytics, Data Mining

Data sources: Big Data (Sensor Data, Machine Data (Logs), Social Data, Clickstream Data, Internet Data, Image and Video, Enterprise Content (Unstructured)); Traditional Enterprise Data (Enterprise Applications); Cloud (Cloud Applications)

Ingestion and access: Big Data ETL, CDC, Sqoop, (Flume, Kafka, …), Real-Time Data Access (On-Demand / Streaming), Batch

Data Warehouse: EDW, In-Memory (SAP Hana, …), Analytical Appliances, Cloud DW (Redshift, ...), ODS

NoSQL

Hadoop: HDFS, YARN / Workload Management, MapRed., Tez, Hive, Spark, Drill, Impala, Storm, HBase, Solr, Hunk (categories: DW, Streams, NoSQL, Search, SQL)

Across all layers: Metadata Management, Data Governance, Data Security

Page 6: Big Data: Architecture and Performance Considerations in Logical Data Lakes

The Logical Data Lake Architecture

Integrated view of a plurality of systems: Hadoop, EDW, Streaming, In-memory, ...

Questions for the Logical Data Lake:

How can I combine data from several systems while ensuring good performance?

How can I abstract consuming applications from technology change and requirements evolution?

How can I enforce consistent security and governance policies across the Data Lake?

Page 7: Big Data: Architecture and Performance Considerations in Logical Data Lakes

DV in the Logical Data Lake

Page 8: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Architecture of the Data Lake

[Architecture diagram repeated from Page 5.]

Page 9: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Architecture of the Logical Data Lake

[Architecture diagram: the same consumers, data sources, ingestion paths, Data Warehouse, NoSQL and Hadoop components as in "Architecture of the Data Lake", with a Data Virtualization layer added.]

Data Virtualization layer: Data Services, Data Search & Discovery, Governance, Security, Optimization, Data Abstraction, Data Transformation, Data Federation, Data Caching

Page 10: Big Data: Architecture and Performance Considerations in Logical Data Lakes

What is Needed?

Requirements for the Integration Component in the Logical Data Lake

Denodo Data Virtualization is the only option that meets all of the following:

Ability to answer ad-hoc queries combining data from several systems

Performance comparable to physical approaches

Ability to expose different logical views over the same data

Single entry point to apply security and governance policies, with comprehensive, granular security support

Page 11: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Performance: Move Processing to the Data

Page 12: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Move Processing to the Data

Process the data where it resides

Process the data locally where it resides

DV system combines partial results

Minimizes network traffic

Leverages specialized data sources

Page 13: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Move Processing to the Data: Example 1

Obtain Total Sales By Product (Naive Strategy)

Naive Strategy: 350M rows moved through the network

Page 14: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Move Processing to the Data: Example 1

Obtain Total Sales By Product (Move Processing to the Data)

Denodo Strategy: 30K rows moved through the network
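
The difference between the two strategies in Example 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: two in-memory SQLite databases stand in for the remote sources, the table and column names (`sales`, `products`, `amount`, `name`) are hypothetical, and the row counts are tiny compared with the figures on the slides. It does not show how Denodo rewrites queries internally.

```python
import sqlite3

# Two independent in-memory databases stand in for two remote sources.
# Table and column names are hypothetical, chosen for illustration only.
dw = sqlite3.connect(":memory:")           # "data warehouse" holding sales facts
products_db = sqlite3.connect(":memory:")  # separate "products" database

dw.execute("CREATE TABLE sales (product_id INTEGER, amount REAL)")
dw.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(i % 5, float(i)) for i in range(1000)],  # 1000 sales over 5 products
)
products_db.execute("CREATE TABLE products (product_id INTEGER, name TEXT)")
products_db.executemany(
    "INSERT INTO products VALUES (?, ?)",
    [(p, f"product-{p}") for p in range(5)],
)


def fetch(conn, sql):
    """Run a query on one source; return its rows and how many rows came back."""
    rows = conn.execute(sql).fetchall()
    return rows, len(rows)


def naive_total_sales():
    # Pull every sales row and every product row, then join and aggregate centrally.
    sales, n_sales = fetch(dw, "SELECT product_id, amount FROM sales")
    products, n_products = fetch(products_db, "SELECT product_id, name FROM products")
    names = dict(products)
    totals = {}
    for product_id, amount in sales:
        name = names[product_id]
        totals[name] = totals.get(name, 0.0) + amount
    return totals, n_sales + n_products


def pushdown_total_sales():
    # Ask the warehouse for one pre-aggregated row per product, then join centrally.
    partial, n_partial = fetch(
        dw, "SELECT product_id, SUM(amount) FROM sales GROUP BY product_id"
    )
    products, n_products = fetch(products_db, "SELECT product_id, name FROM products")
    names = dict(products)
    totals = {names[product_id]: total for product_id, total in partial}
    return totals, n_partial + n_products


naive_totals, naive_rows = naive_total_sales()
pushdown_totals, pushdown_rows = pushdown_total_sales()
print(f"naive: {naive_rows} rows moved, pushdown: {pushdown_rows} rows moved")
```

Both functions return the same totals; only the number of rows fetched from the sources differs, which is the effect the two slides contrast.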

Page 15: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Move Processing to the Data: Example 2

Maximum Sales Discount By Product in the last year: On-the-fly Data Movement

Move Products data to a temp table in the DW: 20K rows moved through the network + 10K rows inserted in the DW

Execute full query on the DW: 10K rows through the network
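
A minimal sketch of on-the-fly data movement follows, assuming the discount is the difference between a product's list price (held in the Products database) and the price of each sale (held in the warehouse). The schemas, the column names and the `tmp_products` table are hypothetical, and in-memory SQLite databases stand in for the real systems.

```python
import sqlite3

# Hypothetical schemas: list prices live in a small products database,
# individual sales (with the price actually paid) live in the warehouse.
products_db = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

products_db.execute("CREATE TABLE products (product_id INTEGER, list_price REAL)")
products_db.executemany(
    "INSERT INTO products VALUES (?, ?)", [(1, 100.0), (2, 50.0), (3, 20.0)]
)
dw.execute(
    "CREATE TABLE sales (product_id INTEGER, sale_price REAL, sale_year INTEGER)"
)
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 90.0, 2016), (1, 80.0, 2016), (1, 60.0, 2015),
     (2, 45.0, 2016), (2, 50.0, 2016), (3, 20.0, 2015)],
)


def max_discount_with_data_movement(year):
    # Step 1: read the small table from its source.
    product_rows = products_db.execute(
        "SELECT product_id, list_price FROM products"
    ).fetchall()
    # Step 2: insert those rows into a temporary table inside the warehouse.
    dw.execute("CREATE TEMP TABLE tmp_products (product_id INTEGER, list_price REAL)")
    dw.executemany("INSERT INTO tmp_products VALUES (?, ?)", product_rows)
    try:
        # Step 3: the warehouse runs the whole join and aggregation locally;
        # only one row per product comes back.
        result = dw.execute(
            """
            SELECT s.product_id, MAX(p.list_price - s.sale_price)
            FROM sales AS s
            JOIN tmp_products AS p ON p.product_id = s.product_id
            WHERE s.sale_year = ?
            GROUP BY s.product_id
            ORDER BY s.product_id
            """,
            (year,),
        ).fetchall()
    finally:
        # Clean up so the temporary copy does not outlive the query.
        dw.execute("DROP TABLE tmp_products")
    return result


print(max_discount_with_data_movement(2016))
```

The trade-off is the one the slide states: the small table is copied once into the warehouse so that the large table never has to leave it.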

Page 16: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Move Processing to the Data: Example 2

Maximum Sales Discount By Product in the last year: Partial aggregation Pushdown

Products DB: 10K rows through the network

Data Warehouse: #rows through the network = 10K * average #sale_prices_per_product
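
The row-count formula above suggests that the warehouse is asked to group sales by product and sale price, leaving the final maximum to the virtualization layer. The sketch below illustrates that reading; it is an assumption, as are the schemas and column names, and in-memory SQLite databases again stand in for the real sources.

```python
import sqlite3

# Hypothetical schemas, as in a scenario where list prices and sales
# are stored in different systems.
products_db = sqlite3.connect(":memory:")
dw = sqlite3.connect(":memory:")

products_db.execute("CREATE TABLE products (product_id INTEGER, list_price REAL)")
products_db.executemany(
    "INSERT INTO products VALUES (?, ?)", [(1, 100.0), (2, 50.0)]
)
dw.execute(
    "CREATE TABLE sales (product_id INTEGER, sale_price REAL, sale_year INTEGER)"
)
dw.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [(1, 90.0, 2016), (1, 90.0, 2016), (1, 80.0, 2016),
     (2, 45.0, 2016), (2, 45.0, 2016), (2, 45.0, 2016),
     (1, 60.0, 2015)],
)


def max_discount_with_partial_pushdown(year):
    # Pushed down to the warehouse: collapse the sales into distinct
    # (product_id, sale_price) pairs, so the rows returned are roughly
    # #products * average number of distinct sale prices per product.
    price_rows = dw.execute(
        "SELECT product_id, sale_price FROM sales "
        "WHERE sale_year = ? GROUP BY product_id, sale_price",
        (year,),
    ).fetchall()
    # Fetched from the products source: one row per product.
    list_prices = dict(
        products_db.execute("SELECT product_id, list_price FROM products").fetchall()
    )
    # Finished centrally: join on product_id and keep the largest discount.
    best = {}
    for product_id, sale_price in price_rows:
        discount = list_prices[product_id] - sale_price
        if product_id not in best or discount > best[product_id]:
            best[product_id] = discount
    return sorted(best.items()), len(price_rows)


result, rows_from_dw = max_discount_with_partial_pushdown(2016)
print(result, rows_from_dw)
```

Six sales rows for 2016 shrink to three distinct pairs before leaving the warehouse, while the result matches what a full join would produce.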

Page 17: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Performance: Choosing the Best Execution Plan

Page 18: Big Data: Architecture and Performance Considerations in Logical Data Lakes

How to Choose the Best Execution Plan?

Cost-Based Optimization in Data Virtualization

Must take into account:

Data statistics to estimate the size of intermediate result sets

Data source indexes (and other physical structures)

Execution model of data sources, e.g. parallel databases vs. Hadoop clusters vs. relational databases

Features of data sources (e.g. number of processing cores in a parallel database or Hadoop cluster)

Data transfer rate
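
To make the idea concrete, here is a deliberately simplified cost model in Python. It is not Denodo's optimizer: the `Plan` fields, the formula and every number are invented for illustration, and it covers only source parallelism and data transfer rate, leaving out statistics-based size estimation and indexes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    """A candidate execution plan with hypothetical, pre-estimated figures."""
    name: str
    rows_processed_at_source: int  # rows the source has to scan or aggregate
    source_rows_per_sec: float     # assumed per-core throughput of the source
    source_cores: int              # degree of parallelism available at the source
    rows_transferred: int          # rows sent over the network
    network_rows_per_sec: float    # assumed data transfer rate


def estimated_seconds(plan: Plan) -> float:
    # Source work is assumed to parallelise perfectly across cores.
    source_time = plan.rows_processed_at_source / (
        plan.source_rows_per_sec * plan.source_cores
    )
    transfer_time = plan.rows_transferred / plan.network_rows_per_sec
    return source_time + transfer_time


def choose_plan(plans):
    # Pick the plan with the lowest estimated total time.
    return min(plans, key=estimated_seconds)


candidates = [
    Plan("naive", 1_000_000, 1_000_000.0, 8, 1_000_000, 100_000.0),
    Plan("group-by pushdown", 1_000_000, 500_000.0, 8, 10_000, 100_000.0),
]

for plan in candidates:
    print(f"{plan.name}: {estimated_seconds(plan):.3f} s")
print("chosen:", choose_plan(candidates).name)
```

With these invented figures the pushdown plan wins because transfer time dominates; a source with few cores or a very fast network could reverse the ranking, which is why the factors listed above need to be weighed together.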

Page 19: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Example Scenario: The Numbers

Page 20: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Example Scenario: The Numbers

Best Performance Even When Processing Billions of Rows

Performance Comparison of Physical vs Logical Scenario

Big Data volumes

TPC-DS benchmark

Sales (Netezza): 290M

Customers (Oracle): 2M

Items (SQL Server): 400K

Page 21: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Example Scenario: The Numbers

Physical vs Logical DW Performance

| Query Description | Rows Returned | AVG Time Physical (all data in Netezza) | AVG Time Logical | Optimization Technique (automatically chosen by Denodo 6.0) |
| --- | --- | --- | --- | --- |
| Total sales by customer | 1.99 M | 20975 ms | 21457 ms | Full group by pushdown |
| Total sales by customer and year between 2000 and 2004 | 5.51 M | 52313 ms | 59060 ms | Full group by pushdown |
| Total sales by item brand | 31.35 K | 4697 ms | 5330 ms | Partial group by pushdown |
| Total sales by item where sale price less than current list price | 17.05 K | 3509 ms | 5229 ms | On the fly data movement |
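
The relative overhead of the logical scenario can be derived directly from the average times reported on this slide. The short Python calculation below uses only those figures; the helper name `overhead_percent` is introduced here for illustration.

```python
# (query, average physical time in ms, average logical time in ms),
# copied from the results on this slide.
timings = [
    ("Total sales by customer", 20975, 21457),
    ("Total sales by customer and year between 2000 and 2004", 52313, 59060),
    ("Total sales by item brand", 4697, 5330),
    ("Total sales by item where sale price less than current list price", 3509, 5229),
]


def overhead_percent(physical_ms, logical_ms):
    """Extra time of the logical scenario relative to the physical one, in percent."""
    return (logical_ms - physical_ms) / physical_ms * 100


overheads = {
    query: round(overhead_percent(physical, logical), 1)
    for query, physical, logical in timings
}

for query, pct in overheads.items():
    print(f"{query}: +{pct}%")
```

The overhead comes out at roughly 2.3%, 12.9%, 13.5% and 49.0% for the four queries, which in absolute terms is between about 0.5 and 6.7 seconds.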

Page 22: Big Data: Architecture and Performance Considerations in Logical Data Lakes

Thanks!

www.denodo.com [email protected]

© Copyright Denodo Technologies. All rights reserved.

Unless otherwise specified, no part of this PDF file may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying and microfilm, without the prior written authorization from Denodo Technologies.

Find more details at: datavirtualization.blog

http://www.datavirtualizationblog.com/myths-in-data-virtualization-performance/