Optimization with DMExpress Steven Haddad – Senior Software Architect [email protected]

Upload: steven-haddad

Post on 16-May-2015


TRANSCRIPT

Page 1: Optimization

Optimization with DMExpress

Steven Haddad – Senior Software Architect [email protected]

Page 2: Optimization

Introducing DMExpress™ - Fast. Efficient. Simple. Cost Effective.

Syncsort Confidential and Proprietary - do not copy or distribute 3

A Family of High-Performance, Purpose-Built Data Integration Tools

Migrate

Integrate

Optimize

→ Hadoop Optimization: For Apache, HortonWorks, Cloudera, and others

→ ETL Optimization: For Informatica, DataStage, and others

→ High-Performance Sort: For z/OS, z/VSE, and Windows/UNIX/Linux

→ Sort Optimization: For SAS, DFSORT, Trillium, and others

→ High-Performance ETL: For core ETL processing & database transformation offload (Oracle PL/SQL, Teradata, and others)

→ Rehosting Optimization: For Clerity, MicroFocus, Oracle, and others

Page 3: Optimization

Do You Need Data Integration Optimization/Acceleration?

ETL is taking longer and longer

Large budgets to purchase additional hardware and database

A shift in data integration processing to database or hand-coded solutions

Data integration environment can’t easily be governed, maintained or expanded

Inability to launch or staff initiatives due to lack of resources

Long time-to-value

Users may lose confidence in data

Page 4: Optimization

What is Optimization with DMExpress™ ?

Better Performance – No Tuning

Lower Costs for:

Hardware

Licenses

IT Staff

Improves your Capabilities to deliver

Reduces usage of resources

More work in less time

Protect the investment you have already made

Page 5: Optimization

Examples for Optimization with DMExpress™

AbInitio

IBM DataStage

Informatica

PL/SQL

→ 10x faster than DataStage Parallel

Major Logistics Company

→ 26x faster than DataStage Server

Major Logistics Company

→ 27 days down to 15 hours
→ 6 weeks to production

Information Service Provider

→ 1/20 of the disk space
→ Significantly less memory

Major Insurance Provider

→ Costs reduced by $2.9M
→ 2.35 h down to 3 min

Global Payments

→ 4:42 h down to 1:12 h
→ 360 GB down to 4 GB workspace

Financial Service Provider

→ Cost per TB down from US$1,538 to US$46

ComScore

Page 6: Optimization

DMExpress Delivers Significantly Faster Performance Even Without Any Tuning

[Chart: elapsed time in minutes, Informatica (INFA) vs. DMExpress, for three tests: 1. Copy / Filter, 2. Sort, 3. Aggregate / Rollup]

Up to 5x faster
→ DMExpress: no tuning
→ Informatica: tuned

[Chart: elapsed time in minutes, Ab Initio vs. DMExpress]

Up to 4x faster
→ DMExpress: no tuning
→ Ab Initio: tuned

Page 7: Optimization

DMExpress Seamlessly Scales to Support Growing Requirements

[Chart: business requirements (volume & complexity) over time, comparing Conventional ETL with DMExpress, with the point of problem awareness marked]

Conventional ETL: continuously implement performance stop-gap measures:
• Manual tuning
• Add/upgrade hardware
• Push-down (ELT)

DMExpress: seamlessly scale:
• No tuning
• No ELT
• Defer hardware purchases

Page 8: Optimization

Fast: Intelligent Sort Algorithms

High Frequency and Impact

Syncsort has been the market leading sort technology since 1968

Sort impacts every aspect of ETL

[Diagram: ETL steps affected by sort]
• Source Extract, Compress & FTP (compression ratio increases)
• Partition Data
• Joining Records
• Merge & Transformation
• Aggregation
• Database Load & Index

Speed-up labels on the diagram: 6x, and up to 40%, 40%, 60%, 50%, and 70% faster
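Why sort affects steps such as aggregation can be sketched in a few lines. The following is a minimal Python illustration, not DMExpress code: once records are sorted by key, a rollup can stream through the data while holding only one group at a time. The `(customer, amount)` record layout and the name `rollup_sorted` are assumptions made for the example.

```python
from itertools import groupby
from operator import itemgetter


def rollup_sorted(records):
    """Sum amounts per key for records that are already sorted by key.

    Equal keys are adjacent in sorted input, so each group can be
    totalled and emitted before the next one is read.
    """
    for key, group in groupby(records, key=itemgetter(0)):
        yield key, sum(amount for _, amount in group)


# Hypothetical (customer, amount) records in arrival order.
unsorted = [("b", 5), ("a", 1), ("b", 7), ("a", 2), ("c", 4)]

# Sorting first makes the streaming rollup produce one total per key.
totals = list(rollup_sorted(sorted(unsorted, key=itemgetter(0))))

# Without the sort, equal keys are not adjacent and groups are split.
split = list(rollup_sorted(unsorted))
```

Here `totals` holds one entry per customer, while `split` still contains five entries because the unsorted input never places equal keys next to each other.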

Page 9: Optimization

Maximizing Performance with Optimum Resource Utilization

The Performance Triangle

[Diagram: triangle with corners CPU, I/O, and Memory, and regions labeled CPU & Memory Bound, CPU & I/O Bound, and Disk & I/O Bound. Callout: most ETL tools are stuck here.]

DMExpress Is Different

• Patented Algorithms: dynamically respond to CPU, memory & disk availability
• Direct I/O: bypasses the file system buffer, accessing data directly at block level for higher performance
• Compression: used for read/write & crucially active workspace (minimizes disk touches & transfer volume)

ETL Process Optimizer
• Buffer Management
• Memory Cache Optimization
• Algorithm Selection
• I/O Optimization
• Instruction Cache Optimization
• Partition & Pipeline Parallelism
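The idea of compressing the active workspace can be illustrated with a small external merge sort. This is a hedged sketch in Python, not the patented DMExpress algorithms: it spills sorted runs to gzip-compressed temporary files once an in-memory limit is reached, then merges the runs. The function name `external_sort`, the line-count limit, and the use of gzip are all assumptions chosen for the example.

```python
import gzip
import heapq
import os
import tempfile


def external_sort(lines, max_lines_in_memory, workdir):
    """Sort text lines with a bounded buffer and compressed spill files.

    Lines must not contain newline characters. The merged result is
    returned as a list to keep the example short.
    """
    run_paths = []
    buffer = []

    def spill():
        # Sort the in-memory buffer and write it as one compressed run.
        buffer.sort()
        fd, path = tempfile.mkstemp(suffix=".gz", dir=workdir)
        os.close(fd)
        with gzip.open(path, "wt", encoding="utf-8") as run:
            run.writelines(line + "\n" for line in buffer)
        run_paths.append(path)
        buffer.clear()

    def read_run(path):
        # Stream one run back, stripping the newline added when spilling.
        with gzip.open(path, "rt", encoding="utf-8") as run:
            for line in run:
                yield line.rstrip("\n")

    for line in lines:
        buffer.append(line)
        if len(buffer) >= max_lines_in_memory:
            spill()
    if buffer:
        spill()

    try:
        # heapq.merge pulls lazily from each sorted run.
        return list(heapq.merge(*(read_run(p) for p in run_paths)))
    finally:
        for p in run_paths:
            os.remove(p)


with tempfile.TemporaryDirectory() as workdir:
    result = external_sort(["pear", "apple", "fig", "kiwi", "date"], 2, workdir)
```

With a limit of two lines, the five input lines are written as three compressed runs before being merged, so less data touches the disk than with uncompressed spill files.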

Page 10: Optimization

DMExpress Dynamically Maximizes Throughput at Run Time

Conventional Data Integration (algorithms: manual and static)

■ Scaling requires expensive hardware
■ I/O operations well below disk speed
■ Requires exhaustive tuning
■ Sub-optimal consumption of resources
■ Uses all memory, overflows to disk

Data Integration with DMExpress (algorithms: automatic and dynamic)

■ Extremely efficient on commodity hardware
■ I/O operations at near disk speed
■ Automatic parallelism and pipelining
■ Automatic, efficient caching and hashing
■ Minimizes disk caching

[Charts: algorithms vs. processing time for each approach]

Page 11: Optimization

Efficient: Dynamic ETL Optimizer

Fully automatic, continuously self-tuning optimizer maximizes throughput and resource efficiencies

– Evaluates hardware, software, and data environment
– Determines optimal algorithmic flow at start-up
– Begins execution with auto-generated optimizer plan
– Continuously adjusts algorithms, memory use, parallelism based on application and run time environment

Resource Analysis: Memory, CPU, I/O, File System

Data Analysis: Data Type, Record Format, # Records / Columns

ETL Process Optimizer
• Buffer Management
• Memory Cache Optimization
• Algorithm Selection
• I/O Optimization
• Instruction Cache Optimization
• Partition & Pipeline Parallelism
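The start-up step, where an optimizer looks at resources and data and then picks an algorithmic flow, might look roughly like the sketch below. This is an illustrative assumption, not the DMExpress optimizer: the `SortPlan` type, the `choose_sort_plan` function, and the simple "fits in memory or not" rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortPlan:
    strategy: str  # "in_memory" or "external_merge"
    runs: int      # number of sorted runs spilled to workspace
    workers: int   # degree of parallelism used to build the runs


def choose_sort_plan(data_bytes, free_memory_bytes, cpu_count):
    """Pick a sort strategy from coarse resource and data statistics."""
    if free_memory_bytes <= 0 or cpu_count <= 0:
        raise ValueError("free memory and CPU count must be positive")
    if data_bytes <= free_memory_bytes:
        # Everything fits: a single in-memory sort needs no workspace.
        return SortPlan("in_memory", runs=0, workers=1)
    # Ceiling division: how many memory-sized runs cover the input.
    runs = -(-data_bytes // free_memory_bytes)
    return SortPlan("external_merge", runs=runs, workers=min(cpu_count, runs))


GIB = 2 ** 30
small_plan = choose_sort_plan(1 * GIB, 4 * GIB, cpu_count=8)
large_plan = choose_sort_plan(10 * GIB, 4 * GIB, cpu_count=8)
```

A continuously self-tuning optimizer would re-evaluate such a decision while the job runs; this sketch only covers the initial plan.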

Page 12: Optimization

Design Once Inherit Performance

[Diagram: an ETL job in DMExpress feeding the EDW. Tasks: Sources → Read → Join → Aggregate → Write → Targets, with Thread Management and Dynamic Optimizations applied to each task]

• Each ETL task runs in a separate process
• Automatic, dynamic thread management for each task
• Automatic parallelism and pipelining
• Automatic, dynamic algorithm selection
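Pipelining between tasks can be pictured with a small sketch. The slide describes one process per task; the Python example below uses one thread per stage connected by bounded queues purely to keep the illustration short, so it is an assumption-laden analogy rather than how DMExpress is implemented. The names `stage`, `run_pipeline`, and `_DONE` are hypothetical, and error handling is omitted.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream


def stage(func, inbox, outbox):
    """Start a thread that applies func to every item from inbox."""
    def worker():
        while True:
            item = inbox.get()
            if item is _DONE:
                outbox.put(_DONE)
                return
            outbox.put(func(item))

    thread = threading.Thread(target=worker)
    thread.start()
    return thread


def run_pipeline(source, funcs, queue_size=4):
    """Push source items through funcs, one thread per stage.

    Bounded queues let a downstream stage start working while the
    upstream stage is still producing, which is the pipelining effect.
    """
    queues = [queue.Queue(maxsize=queue_size) for _ in range(len(funcs) + 1)]
    threads = [stage(f, queues[i], queues[i + 1]) for i, f in enumerate(funcs)]

    def feed():
        for item in source:
            queues[0].put(item)
        queues[0].put(_DONE)

    feeder = threading.Thread(target=feed)
    feeder.start()

    results = []
    while (item := queues[-1].get()) is not _DONE:
        results.append(item)
    for thread in [feeder, *threads]:
        thread.join()
    return results


# Two toy stages standing in for tasks such as "reformat" and "aggregate".
output = run_pipeline(range(5), [lambda x: x + 1, lambda x: x * 10])
```

Each stage has a single worker, so records leave the pipeline in the order they entered.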

Page 13: Optimization

DMExpress – White Boarding the Data Acceleration Sales

Architecture

Page 14: Optimization

DMExpress Architecture Delivers Maximum Performance and Data Scalability with Automatic Dynamic Optimizations

[Architecture diagram]

Graphical Development Environment

DMExpress Engine

High Performance Transformations:
• Sort
• Merge
• Aggregate
• Join / Lookup
• Copy
• Load Presort
• Filter
• Reformat
• Partition

User Defined Functions

Built-in Functions:
• Numeric
• Text
• Date and Time
• Logical
• Advanced Text Processing
• Data Partitioning

Automatic Continuous Optimization

[Chart: algorithms vs. processing time]

Source/Target Connectivity

Deployment

Metadata

Integration / Customization (SDK, Open APIs)

Page 15: Optimization

Five Simple Steps to Deploy. Tuning Is NOT One of Them.

1. Install DMExpress
• Single install
• Takes less than 5 minutes

2. Choose “Task” Template
• Primary Tasks: Sort, Merge, Aggregate, Join / Lookup, Copy
• Secondary Tasks: Filter, Reformat, Partition

3. Fill in the blanks
• Connectivity
• Standard Functions: Numeric, Text, Date/Time, Logical
• User-defined Functions

4. Integrate
• Create Complete ETL “Jobs” by Combining Multiple “Tasks”
• Define Flows – from files to direct flows

5. Deploy
• Schedule
• Parameterize
• Monitor

Page 16: Optimization

Syncsort DMExpress Is Simple but Powerful

Intuitive Graphical Interface Enables Development and Maintenance

→ No coding required
→ No tuning required
→ Easily build/edit jobs and tasks
→ Detect differences between development, test, and production environments
→ Users are fully functional within a few days

• Graphical Development Environment

• Expression Builder

• Job/Task Diff

Page 17: Optimization

DMExpress Architecture

DMExpress Engine

DMExpress Clients

Services (Windows / Unix / Linux)

Flat File Based Metadata Repository

Command Line

Data Sources / Targets

Design Time View

Data

Local Server

Remote Server

Job Editor

Task Editor

3rd party version control tool: Check-in / Check-out

Page 18: Optimization

DMExpress – White Boarding the Data Acceleration Sales

Use Cases

Page 19: Optimization

Acceleration POC – Scenario A

[Chart: Processing Time in Minutes of ‘High Load Jobs’]

DataStage Parallel: 32 minutes on 4/6 cores (physical/virtual), Linux
DMExpress: 19 minutes on 1 core (virtual), Linux

1/2 the time

1/6 the hardware

Page 20: Optimization

Acceleration POC – Scenario B

[Chart: Processing Time in Minutes of ‘Scenario B’]

DataStage Server: 40.00 minutes on 14 cores (physical), HP-UX
DMExpress: 21.30 minutes on 1 core (virtual), Linux

1/2 the time

1/14 the hardware
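The "half the time on a fraction of the hardware" results of Scenarios A and B can be combined into a single core-minutes figure. The slides do not state how the earlier "10x" and "26x" DataStage claims were derived; the calculation below is one possible reading, assuming the 6 virtual cores for Scenario A and treating 21.30 as decimal minutes.

```python
def core_minutes(minutes, cores):
    """Elapsed time multiplied by the number of cores used."""
    return minutes * cores


# Scenario A: DataStage Parallel (32 min, 6 cores) vs. DMExpress (19 min, 1 core).
ratio_a = core_minutes(32, 6) / core_minutes(19, 1)

# Scenario B: DataStage Server (40 min, 14 cores) vs. DMExpress (21.3 min, 1 core).
ratio_b = core_minutes(40.0, 14) / core_minutes(21.3, 1)

print(f"Scenario A: {ratio_a:.1f}x fewer core-minutes")
print(f"Scenario B: {ratio_b:.1f}x fewer core-minutes")
```

Under these assumptions the ratios come out at roughly 10 and 26, which is in line with the figures quoted for DataStage Parallel and DataStage Server.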

Page 21: Optimization

Use Case 1: Global Information Service Provider

Business Challenge
• Severe competitive pressure from Google Finance, Yahoo! Finance, Morningstar, and others forced development of strategic new offerings

Environment
• Informatica 8.11 SP3, Oracle 10.2 RAC 6 nodes, DMExpress 5.2.15
• 16 core Linux machine

Technical Challenge
• Weekly reporting application on 8 million DUNS numbers
• Data sizes: 5 tables of ~1 TB each
• Bottleneck step was to join 5 tables and aggregate the output

Prior Attempts to Increase Performance
• Manual tuning of ETL routines: lots of consultants spent many months and dollars
• Converted the ETL mapping to ELT. No success: the process would abort with ORA-01555: Snapshot too old error
• Broke up the ELT process into 100,000 record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)
• Problem existed since February 2009, many attempts and touch points, production in October

Solution
• DMExpress extracted five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours. Total run time for this step was 15 hours in DMExpress vs. 27 days
• DMExpress invoked at the command line prior to Informatica

Benefits
• New offering launched on time
• Able to meet SLAs
• 2 weeks to finish POC
• In production in 6 weeks

Page 22: Optimization

Use Case 2: Major Insurance Provider

Business Challenge
• Unable to complete processing within the weekend window to deliver new, highly personalized offers and pricing to agents via the agent marketing portal, which impacts conversion rates for promotions to policyholders
• Processing has to start Friday night at 6 pm, and the data load is done only by Wednesday 6 pm

Environment
• Informatica version 7.x, 8.6.1, Trillium, Teradata, reporting with MicroStrategy and Hyperion/Brio, DMExpress 6.9, Maestro, Sun Solaris

Technical Challenge
• 500 GB of data, including joins and aggregations, need to be completed during the weekend window
• Certain jobs would not even run and had to be aborted (30+ hour runs). No alternative: no tuning worked
• Very slow I/O when joins spill to disk. All of the memory on the system is grabbed. Virtual memory errors
• No capacity in Teradata to push down transformations

Prior Attempts to Increase Performance
• Tuning did not solve the problem
• Dynamically adjusting cache did not solve the bottleneck

Solution
• Output from Trillium is sent to DMExpress and Informatica to integrate and aggregate the data (joins and aggregations)
• Started out with 10 critical DMExpress jobs and now expanded to 700+ DMExpress tasks, 200 DMExpress jobs
• Orchestrated within PowerCenter Workflow Manager as a command task, and also called separately from Maestro

Benefits
• DMExpress completes within the weekend batch window
• Extremely simple and scalable approach, very short learning curve: 1 month to deploy DMExpress
• Significantly less memory used by DMX, allowing more parallel jobs due to efficiency
• DMExpress takes 1/20th of the disk space

Page 23: Optimization

Case Study: Enabling Up to $3M in Data Integration Cost Savings

Before: PL/SQL Scripts (ELT)

1. Read files
2. Load into staging area, dedupe, and summarize using PL/SQL scripts and iWay Data Migrator
3. Load into the Oracle production data warehouse for analysis & reporting

• Est. TCO over 3 years: $4.4M
• Total processing time: 2.35 hrs
• Complex architecture with PL/SQL, iWay Data Migrator and lots of Oracle staging
• Manual coding. Manual tuning. No reusability
• No scalability to support business goals

After: DMExpress (ETL)

1. Read files
2. Dedupe, summarize and load into Oracle data warehouse
3. Analysis & reporting

• Est. TCO over 3 years: $1.5M
• Total processing time: 3 min
• One tool. One ETL engine. No staging
• No coding. No tuning. Reusable objects
• Scalable architecture supports business growth and profitability objectives

[Diagram: files averaging 13.5M rows per file/table flow through Data Migrator and Oracle staging into Oracle and on to analytics (before), and through DMExpress into Oracle / Vertica and on to analytics (after)]

Page 24: Optimization

POC Results – Informatica

                    Elapsed   Memory peak  Approx.     I/O read (MB/s)  I/O write (MB/s)
                    time      (MB)         CPU time    max     avg      max     avg
PowerCenter         0:28:10   11,875       1:06:29.2   53      12       82      39
DMExpress           0:13:26    9,438       0:16:53.9   154     33       101     66
DMExpress (Linux)   0:05:43    9,957       0:16:21     N/A     83       N/A     142

[Charts: Elapsed Time, Memory (GB), and CPU Time for PowerCenter (PC), DMExpress (DMX), and DMX (Linux)]

Reduction vs. PowerCenter for DMX and DMX (Linux): elapsed time 52% and 80%, memory 21% and 16%, CPU time 75% and 75%

Page 25: Optimization

Benchmark Details DMExpress vs. Informatica

5 GB file, 45 M records

Task               Current         DMX            Saving
Copy               4 min 09 s      0 min 50 s     80%
Sort               7 min 26 s      1 min 19 s     82%
Aggregate          9 min 37 s      1 min 09 s     88%
Sort & Aggregate   3 min 43 s      1 min 37 s     57%

Max 88% reduction, min 57% reduction, avg 80% reduction

25 GB file, 225 M records

Task               Current         DMX            Saving
Copy               20 min 53 s     4 min 12 s     80%
Sort               31 min 48 s     6 min 17 s     80%
Aggregate          20 min 45 s     4 min 30 s     78%
Sort & Aggregate   14 min 53 s     6 min 38 s     55%

Max 80% reduction, min 55% reduction, avg 75% reduction
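The saving percentages for the 5 GB file can be reproduced from the task times. The short Python check below uses only the times from the benchmark; the helper names are hypothetical. It also suggests, as an observation rather than something the slide states, that the "Avg 80%" figure matches the reduction in total time rather than the simple mean of the four percentages.

```python
def seconds(minutes, secs):
    """Convert a minutes/seconds pair to seconds."""
    return minutes * 60 + secs


def reduction_pct(before_s, after_s):
    """Percentage reduction in elapsed time, rounded to a whole percent."""
    return round(100 * (1 - after_s / before_s))


# (current tool, DMExpress) times for the 5 GB / 45 M record file.
five_gb = {
    "Copy": (seconds(4, 9), seconds(0, 50)),
    "Sort": (seconds(7, 26), seconds(1, 19)),
    "Aggregate": (seconds(9, 37), seconds(1, 9)),
    "Sort & Aggregate": (seconds(3, 43), seconds(1, 37)),
}

savings = {task: reduction_pct(b, a) for task, (b, a) in five_gb.items()}

# Reduction in the summed run time of all four tasks.
total_before = sum(b for b, _ in five_gb.values())
total_after = sum(a for _, a in five_gb.values())
overall = reduction_pct(total_before, total_after)
```

The per-task values agree with the Saving column (80%, 82%, 88%, 57%), and the total-time reduction rounds to 80%.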

Page 26: Optimization

Ab Initio Benchmark

Scenario 1 (Copy/Filter)

Both tools: 2,926,155,265 records read, 452,375,411 records written, 383,326,339,715 bytes read, 59,261,178,841 bytes written

             Elapsed time    CPU time          Temp workspace
DMExpress    47 min          3 h 44 min        0 GB
Ab Initio    66 min          4 h 38 min        0 GB

Scenario 2 (Sort)

Both tools: 2,926,155,265 records read, 2,926,155,265 records written, 383,326,339,715 bytes read, 383,326,339,715 bytes written

             Elapsed time    CPU time          Temp workspace
DMExpress    1 h 12 min      7 h 26 min        60 GB
Ab Initio    4 h 42 min      9 h 48 min        360 GB

Scenario 3 (Aggregation/Rollup)

Both tools: 2,926,155,265 records read, 27,179,924 records written, 383,326,339,715 bytes read, 4,022,628,752 bytes written

             Elapsed time    CPU time          Temp workspace
DMExpress    1 h 21 min      7 h 10 min        4 GB
Ab Initio    2 h             10 h 14 min       360 GB

Ab Initio tuned 8 ways
DMExpress with no tuning
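To put the three scenarios on a common scale, elapsed times can be converted into read throughput and speed-up ratios. The Python sketch below is a back-of-the-envelope calculation from the figures in the tables; using decimal megabytes (10^6 bytes) is an assumption, and the variable names are hypothetical.

```python
BYTES_READ = 383_326_339_715  # input size, identical in all three scenarios

# Elapsed minutes per scenario: (DMExpress, Ab Initio).
elapsed_min = {
    "Copy/Filter": (47, 66),
    "Sort": (72, 282),
    "Aggregation/Rollup": (81, 120),
}


def read_mb_per_s(minutes):
    """Average read throughput in decimal MB/s for a given elapsed time."""
    return BYTES_READ / (minutes * 60) / 1e6


speedup = {name: ab / dmx for name, (dmx, ab) in elapsed_min.items()}
throughput = {
    name: (read_mb_per_s(dmx), read_mb_per_s(ab))
    for name, (dmx, ab) in elapsed_min.items()
}

for name in elapsed_min:
    dmx_rate, ab_rate = throughput[name]
    print(f"{name}: {speedup[name]:.2f}x, {dmx_rate:.0f} vs {ab_rate:.0f} MB/s")
```

The largest ratio appears in the Sort scenario, at a little under 4x, which is consistent with the "up to 4x faster" claim made for Ab Initio earlier in the deck.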

Page 27: Optimization

DMExpress – White Boarding the Data Acceleration Sales

Metadata with Miti

Page 28: Optimization

ETL to DMExpress acceleration / conversion

Conversion Utility

Parsing
• Informatica
• IBM DataStage
• PL/SQL
• Etc.

Processing
• Flow analysis
• Expression & type analysis
• Optimization

Output Generation
• DMExpress
• Documentation

• UNIX shell scripts
• Informatica workflows
• Informatica mappings
• Spreadsheets identifying the production workflows and mappings
• Timing information of the job executions over a two month period
• Resource data points for the workflows

Automatic Conversion Utility

Cognizant Migration / Optimization COE

Page 29: Optimization

DMExpress – White Boarding the Data Acceleration Sales

DMX Live Demo