optimization
TRANSCRIPT
![Page 2: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/2.jpg)
Introducing DMExpress™ - Fast. Efficient. Simple. Cost Effective.
Syncsort Confidential and Proprietary - do not copy or distribute 3
A Family of High-Performance, Purpose-Built Data Integration Tools
Migrate
Integrate
Optimize
→ Hadoop Optimization: for Apache, HortonWorks, Cloudera, and others
→ ETL Optimization: for Informatica, DataStage, and others
→ High-Performance Sort: for z/OS, z/VSE, and Windows/UNIX/Linux
→ Sort Optimization: for SAS, DFSORT, Trillium, and others
→ High-Performance ETL: for core ETL processing & database transformation offload (Oracle PL/SQL, Teradata, and others)
→ Rehosting Optimization: for Clerity, MicroFocus, Oracle, and others
![Page 3: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/3.jpg)
Do You Need Data Integration Optimization/Acceleration?
ETL is taking longer and longer
Large budgets to purchase additional hardware and database
A shift in data integration processing to database or hand-coded solutions
Data integration environment can’t easily be governed, maintained, or expanded
Inability to launch or staff initiatives due to lack of resources
Long time-to-value
Users may lose confidence in data
![Page 4: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/4.jpg)
What is Optimization with DMExpress™ ?
Better Performance – No Tuning
Lower Costs for:
Hardware
Licenses
IT Staff
Improves your Capabilities to deliver
Reduces usage of resources
More work in less time
Protects the investment you have already made
![Page 5: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/5.jpg)
Examples for Optimization with DMExpress™
AbInitio
IBM DataStage
Informatica
PL/SQL
→ 10× faster than DataStage Parallel
Major Logistics Company
→ 26× faster than DataStage Server
Major Logistics Company
→ 27 days down to 15 hours → 6 weeks to production
Information Service Provider
→ 1/20 of the disk space → significantly less memory
Major Insurance Provider
→ Costs reduced by $2.9M → 2.35 h down to 3 min
Global Payments
→ 4:42 h down to 1:12 h → 360 GB down to 4 GB workspace
Financial Service Provider
→ Cost per TB down from US$1,538 to US$46
ComScore
![Page 6: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/6.jpg)
DMExpress Delivers Significantly Faster Performance Even Without Any Tuning
[Two bar charts of elapsed time in minutes: INFA vs. DMExpress and Ab Initio vs. DMExpress. Task labels: 1. Copy / Filter, 2. Sort, 3. Aggregate / Rollup]
Up to 5x Faster → DMExpress: No Tuning → Informatica: Tuned
Up to 4x Faster → DMExpress: No Tuning → Ab Initio: Tuned
![Page 7: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/7.jpg)
DMExpress Seamlessly Scales to Support Growing Requirements
[Chart: volume & complexity over time, with curves for Business Requirements, Conventional ETL, and DMExpress; annotation: "Point of problem awareness"]
Seamlessly scale:
• No tuning
• No ELT
• Defer hardware purchases
Continuously implement performance stop-gap measures:
• Manual tuning
• Add/upgrade hardware
• Push-down (ELT)
![Page 8: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/8.jpg)
Fast: Intelligent Sort Algorithms
High Frequency and Impact
Syncsort has been the market leading sort technology since 1968
Sort impacts every aspect of ETL
[Diagram: ETL steps affected by sort: Source Extract, Compress & FTP (compression ratio increases); Partition Data; Joining Records; Merge & Transformation; Aggregation; Database Load & Index. Callouts: 6×; up to 40%, 40%, 60%, 50%, and 70% faster]
![Page 9: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/9.jpg)
Maximizing Performance with Optimum Resource Utilization
[Diagram: triangle with corners CPU, I/O, and Memory; regions labeled "CPU & Memory Bound", "CPU & I/O Bound", and "Disk & I/O Bound"; annotation: "Most ETL tools are stuck here"]
• Patented Algorithms: dynamically respond to CPU, memory & disk availability
• Direct I/O: bypasses the file system buffer, accessing data directly at block level for higher performance
• Compression: used for read/write and, crucially, active workspace (minimizes disk touches & transfer volume)
DMExpress Is Different
The Performance Triangle
[Diagram: ETL Process Optimizer, with components Buffer Management, Memory Cache Optimization, Algorithm Selection, I/O Optimization, Instruction Cache Optimization, and Partition & Pipeline Parallelism]
![Page 10: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/10.jpg)
DMExpress Dynamically Maximizes Throughput at Run Time
Data Integration with DMExpress (algorithms: automatic and dynamic)
■ Extremely efficient on commodity hardware
■ I/O operations at near disk speed
■ Automatic parallelism and pipelining
■ Automatic, efficient caching and hashing
■ Minimizes disk caching

Conventional Data Integration (algorithms: manual and static)
■ Scaling requires expensive hardware
■ I/O operations well below disk speed
■ Requires exhaustive tuning
■ Sub-optimal consumption of resources
■ Uses all memory, overflows to disk

[Charts: algorithms vs. processing time for each approach]
![Page 11: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/11.jpg)
[Diagram: optimizer inputs. Resource Analysis: Memory, CPU, I/O, File System. Data Analysis: Data Type, Record Format, # Records / Columns]
Efficient: Dynamic ETL Optimizer
Fully automatic, continuously self-tuning optimizer maximizes throughput and resource efficiencies
– Evaluates hardware, software, and data environment
– Determines optimal algorithmic flow at start-up
– Begins execution with auto-generated optimizer plan
– Continuously adjusts algorithms, memory use, parallelism based on application and run time environment
[Diagram: ETL Process Optimizer, with components Buffer Management, Memory Cache Optimization, Algorithm Selection, I/O Optimization, Instruction Cache Optimization, and Partition & Pipeline Parallelism]
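The slide describes an optimizer that inspects resources and data, then picks an algorithmic plan. The deck does not disclose how DMExpress does this, so the following is only a minimal sketch of the general idea, assuming a single decision (sort in memory or spill to an external merge sort). `Environment`, `DataProfile`, `plan_sort`, and `memory_fraction` are hypothetical names invented for illustration, not DMExpress APIs.

```python
from dataclasses import dataclass


@dataclass
class Environment:
    free_memory_bytes: int  # resource analysis: memory currently available
    cpu_cores: int          # resource analysis: cores available


@dataclass
class DataProfile:
    record_count: int       # data analysis: number of records
    record_bytes: int       # data analysis: average record size in bytes


def plan_sort(env: Environment, data: DataProfile, memory_fraction: float = 0.5) -> dict:
    """Pick a sort plan from the current environment (illustrative heuristic only)."""
    # Only claim a fraction of free memory so other tasks are not starved.
    budget = max(1, int(env.free_memory_bytes * memory_fraction))
    total = data.record_count * data.record_bytes
    if total <= budget:
        return {"algorithm": "in_memory_sort", "runs": 1, "workers": 1}
    # Data does not fit: split into sorted runs that each fit the budget.
    runs = -(-total // budget)  # ceiling division
    return {
        "algorithm": "external_merge_sort",
        "runs": runs,
        "workers": min(env.cpu_cores, runs),
    }


small = plan_sort(Environment(8 * 2**30, 4), DataProfile(1_000_000, 100))
large = plan_sort(Environment(1_000_000, 4), DataProfile(10_000, 1_000))
print(small)
print(large)
```

Calling `plan_sort` again with a fresh `Environment` during a run would be one simple way to model the "continuously adjusts" step; a real optimizer would weigh many more factors.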
![Page 12: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/12.jpg)
Design Once Inherit Performance
[Diagram: an ETL job feeding the EDW. Tasks: Sources, Read, Join, Aggregate, Write, Targets. Layers: Thread Management, Dynamic Optimizations]
• Each ETL task runs in a separate process
• Automatic, dynamic thread management for each task
• Automatic parallelism and pipelining
• Automatic, dynamic algorithm selection
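The Read, Join, Aggregate, Write chain on this slide is a pipeline: each task consumes records as the previous one produces them. As a rough sketch of that pattern (not of DMExpress itself, which per the slide runs each task in its own process), Python generators can stream records between stages inside one process. The sample data and the function names `read`, `join`, and `aggregate` are made up for illustration.

```python
from collections import defaultdict


def read(rows):
    """Read stage: stream source records one at a time."""
    for row in rows:
        yield row


def join(orders, customers_by_id):
    """Join stage: enrich each order with its customer's region (inner join)."""
    for order in orders:
        customer = customers_by_id.get(order["customer_id"])
        if customer is not None:
            yield {**order, "region": customer["region"]}


def aggregate(rows):
    """Aggregate stage: total amount per region."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


orders = [
    {"customer_id": 1, "amount": 10.0},
    {"customer_id": 2, "amount": 5.0},
    {"customer_id": 1, "amount": 20.0},
    {"customer_id": 9, "amount": 99.0},  # no matching customer, dropped by the join
]
customers = {1: {"region": "EU"}, 2: {"region": "US"}}

# Stages are chained lazily: no intermediate result is fully materialized
# between read and join.
result = aggregate(join(read(orders), customers))
print(result)
```

Because the stages are lazy, the join starts working on the first order before the read stage has seen the last one, which is the essence of pipelining.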
![Page 13: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/13.jpg)
DMExpress – White Boarding the Data Acceleration Sales
Architecture
![Page 14: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/14.jpg)
DMExpress Architecture Delivers Maximum Performance and Data Scalability with Automatic Dynamic Optimizations
[Architecture diagram. Components:]
• DMExpress Engine
• Graphical Development Environment
• Deployment
• Metadata
• Source/Target Connectivity
• Integration / Customization (SDK, Open APIs)
• Automatic Continuous Optimization (chart: algorithms vs. processing time)
• High Performance Transformations: Sort, Merge, Aggregate, Join / Lookup, Copy, Load Presort, Filter, Reformat, Partition
• User Defined Functions
• Built-in Functions: Numeric, Text, Date and Time, Logical, Advanced Text Processing, Data Partitioning
![Page 15: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/15.jpg)
Five Simple Steps to Deploy. Tuning Is NOT One of Them.
1. Install DMExpress
• Single install
• Takes less than 5 minutes
2. Choose “Task” Template
• Primary Tasks: Sort, Merge, Aggregate, Join / Lookup, Copy
• Secondary Tasks: Filter, Reformat, Partition
3. Fill in the blanks
• Connectivity
• Standard Functions: Numeric, Text, Date/Time, Logical
• User-defined Functions
4. Integrate
• Create Complete ETL “Jobs” by Combining Multiple “Tasks”
• Define Flows – from files to direct flows
5. Deploy
• Schedule
• Parameterize
• Monitor
![Page 16: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/16.jpg)
Syncsort DMExpress Is Simple but Powerful
Intuitive Graphical Interface Enables Development and Maintenance
→ No coding required
→ No tuning required
→ Easily build/edit jobs and tasks
→ Detect differences between development, test, and production environments
→ Users are fully functional within a few days
• Graphical Development Environment
• Expression Builder
• Job/Task Diff
![Page 17: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/17.jpg)
DMExpress Architecture
[Architecture diagram. Components: DMExpress Clients (Job Editor, Task Editor; Design Time View); DMExpress Engine on a Local Server or Remote Server (Services, Command Line; Windows / Unix / Linux); Data Sources / Targets; Flat File Based Metadata Repository; 3rd party version control tool (Check-in, Check-out)]
![Page 18: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/18.jpg)
DMExpress – White Boarding the Data Acceleration Sales
Use Cases
![Page 19: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/19.jpg)
Acceleration POC – Scenario A
Processing Time in Minutes of ‘High Load Jobs’

| | DataStage Parallel | DMExpress |
|---|---|---|
| Processing time (min) | 32 | 19 |
| Hardware | 4/6 cores (physical/virtual), Linux | 1 core (virtual), Linux |

1/2 the time, 1/6 the hardware
![Page 20: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/20.jpg)
Acceleration POC – Scenario B
Processing Time in Minutes of ‘Scenario B’

| | DataStage Server | DMExpress |
|---|---|---|
| Processing time (min) | 40.00 | 21.30 |
| Hardware | 14 cores (physical), HP-UX | 1 core (virtual), Linux |

1/2 the time, 1/14 the hardware
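To make the "1/2 the time" and "1/N the hardware" claims for Scenario A and Scenario B concrete, the sketch below combines elapsed time and core count into core-minutes. Two assumptions that the slides do not state: DataStage Parallel is counted at its 6 virtual cores, and "21.30" is read as decimal minutes.

```python
# Figures from the two POC slides; core counts and the reading of "21.30"
# as decimal minutes are assumptions made for this sketch.
runs = {
    "Scenario A": {"baseline_min": 32.0, "baseline_cores": 6, "dmx_min": 19.0, "dmx_cores": 1},
    "Scenario B": {"baseline_min": 40.0, "baseline_cores": 14, "dmx_min": 21.3, "dmx_cores": 1},
}


def core_minutes(minutes: float, cores: int) -> float:
    """Elapsed minutes multiplied by the number of cores used."""
    return minutes * cores


summary = {}
for name, r in runs.items():
    time_ratio = r["dmx_min"] / r["baseline_min"]
    resource_ratio = core_minutes(r["baseline_min"], r["baseline_cores"]) / core_minutes(
        r["dmx_min"], r["dmx_cores"]
    )
    summary[name] = {
        "time_ratio": round(time_ratio, 2),          # DMExpress time / baseline time
        "resource_ratio": round(resource_ratio, 1),  # baseline core-min / DMExpress core-min
    }
    print(name, summary[name])
```

Under these assumptions the elapsed-time ratios come out near 0.59 and 0.53, and the core-minute ratios near 10 and 26. The latter are close to the "10× faster than DataStage Parallel" and "26× faster than DataStage Server" figures on the examples slide, which suggests, though the deck does not say so, that those headline numbers are normalized per core.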
![Page 21: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/21.jpg)
Use Case 1: Global Information Service Provider
Business Challenge
• Severe competitive pressure from Google Finance, Yahoo! Finance, Morningstar, and others forced development of strategic new offerings
Environment
• Informatica 8.11 SP3, Oracle 10.2 RAC 6 nodes, DMExpress 5.2.15
• 16-core Linux machine
Technical Challenge
• Weekly reporting application on 8 million DUNS numbers
• Data sizes: 5 tables of ~1 TB each
• Bottleneck step was to join 5 tables and aggregate the output
Prior Attempts to Increase Performance
• Manual tuning of ETL routines: lots of consultants spent many months and dollars
• Converted the ETL mapping to ELT. No success: the process would abort with ORA-01555: Snapshot too old error
• Broke up the ELT process into 100,000-record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)
• Problem existed since February 2009, with many attempts and touch points; production in October
Solution
• DMExpress extracted five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours. Total run time for this step was 15 hours in DMExpress vs. 27 days
• DMExpress is invoked at the command line prior to Informatica
Benefits
• New offering launched on time
• Able to meet SLAs
• 2 weeks to finish POC
• In production in 6 weeks
![Page 22: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/22.jpg)
Use Case 2: Major Insurance Provider
Business Challenge
• Unable to complete processing within the weekend window to deliver new, highly personalized offers and pricing to agents via the agent marketing portal; this impacts conversion rates for promotions to policyholders
• Processing needs to start on Friday at 6 pm, yet the data load is finished only by Wednesday at 6 pm
Environment
• Informatica 7.x and 8.6.1, Trillium, Teradata; reporting with MicroStrategy and Hyperion/Brio; DMExpress 6.9; Maestro; Sun Solaris
Technical Challenge
• 500 GB of data, including joins and aggregations, needs to be processed during the weekend window
• Certain jobs would not even run: they had to be aborted (30+ hour runs). No alternative, as no tuning worked
• Very slow I/O when joins spill to disk. All of the memory on the system is grabbed. Virtual memory errors
• No capacity in Teradata to push down transformations
Prior Attempts to Increase Performance
• Tuning did not solve the problem
• Dynamically adjusting cache did not solve the bottleneck
Solution
• Output from Trillium is sent to DMExpress and Informatica to integrate and aggregate the data (joins and aggregations)
• Started out with 10 critical DMExpress jobs and has now expanded to 700+ DMExpress tasks and 200 DMExpress jobs
• Orchestrated within PowerCenter Workflow Manager (command task) and also called separately from Maestro
Benefits
• DMExpress completes within the weekend batch window
• Extremely simple and scalable approach; very short learning curve; 1 month to deploy DMExpress
• Significantly less memory used by DMExpress; more parallel jobs due to efficiency
• DMExpress takes 1/20th the disk space
![Page 23: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/23.jpg)
Case Study: Enabling Up to $3M in Data Integration Cost Savings

Before: PL/SQL Scripts (ELT)
• Read files
• Load into staging area, dedupe, and summarize using PL/SQL scripts and iWay Data Migrator
• Load into the Oracle production data warehouse for analysis & reporting
• Est. TCO over 3 years: $4.4M
• Total processing time: 2.35 hrs
• Complex architecture with PL/SQL, iWay Data Migrator and lots of Oracle staging
• Manual coding. Manual tuning. No reusability
• No scalability to support business goals

After: DMExpress (ETL)
• Read files
• Dedupe, summarize and load into Oracle data warehouse
• Analysis & reporting
• Est. TCO over 3 years: $1.5M
• Total processing time: 3 min
• One tool. One ETL engine. No staging
• No coding. No tuning. Reusable objects
• Scalable architecture supports business growth and profitability objectives

[Diagram: before and after data flows, avg. 13.5M rows per file/table. Before: Data Migrator and Oracle staging into Oracle, then Analytics. After: DMExpress into Oracle / Vertica, then Analytics]
![Page 24: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/24.jpg)
POC Results – Informatica
| | Elapsed time | Memory peak (MB) | Approximate CPU time | Max I/O read (MB/s) | Avg I/O read (MB/s) | Max I/O write (MB/s) | Avg I/O write (MB/s) |
|---|---|---|---|---|---|---|---|
| PowerCenter | 0:28:10 | 11,875 | 1:06:29.2 | 53 | 12 | 82 | 39 |
| DMExpress | 0:13:26 | 9,438 | 0:16:53.9 | 154 | 33 | 101 | 66 |
| DMExpress (Linux) | 0:05:43 | 9,957 | 0:16:21 | N/A | 83 | N/A | 142 |
[Three bar charts comparing PC, DMX, and DMX (Linux): Elapsed Time, Memory (GB), and CPU Time. Callouts: 52%, 80%, 21%, 16%, 75%, 75%]
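The six percentage callouts on this slide are not labeled in the transcript. A small calculation, sketched here from the POC table figures, shows that they match the reduction of each DMExpress run relative to PowerCenter, in the order elapsed time, memory peak, CPU time. The helper names `to_seconds` and `reduction_pct` are illustrative only.

```python
def to_seconds(hms: str) -> float:
    """Convert an H:MM:SS(.f) string to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def reduction_pct(baseline: float, value: float) -> int:
    """Percentage reduction of value relative to baseline, rounded to a whole percent."""
    return round((1 - value / baseline) * 100)


# Rows: PowerCenter, DMExpress, DMExpress (Linux), taken from the POC table.
elapsed = [to_seconds(t) for t in ("0:28:10", "0:13:26", "0:05:43")]
memory_mb = [11_875, 9_438, 9_957]
cpu = [to_seconds(t) for t in ("1:06:29.2", "0:16:53.9", "0:16:21")]

callouts = []
for metric in (elapsed, memory_mb, cpu):
    baseline, dmx, dmx_linux = metric
    callouts.append(reduction_pct(baseline, dmx))
    callouts.append(reduction_pct(baseline, dmx_linux))

print(callouts)
```

The computed values reproduce the slide's sequence 52%, 80%, 21%, 16%, 75%, 75%.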
![Page 25: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/25.jpg)
Benchmark Details DMExpress vs. Informatica
5 GB file, 45 M records

| Task | Current | DMX | Saving |
|---|---|---|---|
| Copy | 4 min 09 s | 0 min 50 s | 80% |
| Sort | 7 min 26 s | 1 min 19 s | 82% |
| Aggregate | 9 min 37 s | 1 min 09 s | 88% |
| Sort & Aggregate | 3 min 43 s | 1 min 37 s | 57% |

Max 88% reduction, min 57% reduction, avg 80% reduction

25 GB file, 225 M records

| Task | Current | DMX | Saving |
|---|---|---|---|
| Copy | 20 min 53 s | 4 min 12 s | 80% |
| Sort | 31 min 48 s | 6 min 17 s | 80% |
| Aggregate | 20 min 45 s | 4 min 30 s | 78% |
| Sort & Aggregate | 14 min 53 s | 6 min 38 s | 55% |

Max 80% reduction, min 55% reduction, avg 75% reduction
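The "Saving" column can be recomputed from the task times. The sketch below does that for both file sizes, taking "Current" as the baseline run and "DMX" as the DMExpress run. It also computes the reduction in total time across all four tasks, which is an assumption about how the "Avg" figure was derived; the deck does not say.

```python
def secs(minutes: int, seconds: int) -> int:
    """Convert minutes and seconds to seconds."""
    return minutes * 60 + seconds


# (current, DMX) times per task, in seconds, from the two benchmark tables.
benchmarks = {
    "5 GB": {
        "Copy": (secs(4, 9), secs(0, 50)),
        "Sort": (secs(7, 26), secs(1, 19)),
        "Aggregate": (secs(9, 37), secs(1, 9)),
        "Sort & Aggregate": (secs(3, 43), secs(1, 37)),
    },
    "25 GB": {
        "Copy": (secs(20, 53), secs(4, 12)),
        "Sort": (secs(31, 48), secs(6, 17)),
        "Aggregate": (secs(20, 45), secs(4, 30)),
        "Sort & Aggregate": (secs(14, 53), secs(6, 38)),
    },
}

savings = {}
overall = {}
for size, tasks in benchmarks.items():
    # Per-task saving, rounded to a whole percent as on the slide.
    savings[size] = {
        task: round((1 - dmx / current) * 100) for task, (current, dmx) in tasks.items()
    }
    total_current = sum(current for current, _ in tasks.values())
    total_dmx = sum(dmx for _, dmx in tasks.values())
    # Reduction in total elapsed time across all four tasks.
    overall[size] = round((1 - total_dmx / total_current) * 100, 1)
    print(size, savings[size], overall[size])
```

The per-task values match the slide. The total-time reductions come out at about 80.3% and 75.5%, close to the stated averages of 80% and 75%, whereas a plain mean of the four per-task savings would be lower (roughly 77% and 73%).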
![Page 26: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/26.jpg)
Ab Initio Benchmark
Scenario 1 (Copy / Filter)

| | Elapsed time | CPU time | Temp workspace | Records read | Records written | Data read (bytes) | Data written (bytes) |
|---|---|---|---|---|---|---|---|
| DMExpress | 47 minutes | 3 hours 44 min | 0 GB | 2,926,155,265 | 452,375,411 | 383,326,339,715 | 59,261,178,841 |
| Ab Initio | 66 minutes | 4 hours 38 min | 0 GB | 2,926,155,265 | 452,375,411 | 383,326,339,715 | 59,261,178,841 |

Scenario 2 (Sort)

| | Elapsed time | CPU time | Temp workspace | Records read | Records written | Data read (bytes) | Data written (bytes) |
|---|---|---|---|---|---|---|---|
| DMExpress | 1 hour 12 min | 7 hours 26 min | 60 GB | 2,926,155,265 | 2,926,155,265 | 383,326,339,715 | 383,326,339,715 |
| Ab Initio | 4 hours 42 min | 9 hours 48 min | 360 GB | 2,926,155,265 | 2,926,155,265 | 383,326,339,715 | 383,326,339,715 |

Scenario 3 (Aggregation / Rollup)

| | Elapsed time | CPU time | Temp workspace | Records read | Records written | Data read (bytes) | Data written (bytes) |
|---|---|---|---|---|---|---|---|
| DMExpress | 1 hour 21 min | 7 hours 10 min | 4 GB | 2,926,155,265 | 27,179,924 | 383,326,339,715 | 4,022,628,752 |
| Ab Initio | 2 hours | 10 hours 14 min | 360 GB | 2,926,155,265 | 27,179,924 | 383,326,339,715 | 4,022,628,752 |

Ab Initio tuned 8 ways
DMExpress with no tuning
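The elapsed-time and workspace ratios implied by the three Ab Initio scenarios can be worked out directly from the benchmark figures. This is a small sketch; the dictionary keys are illustrative names.

```python
# Elapsed minutes and temp workspace (GB) from the three benchmark scenarios.
scenarios = {
    "copy/filter": {"dmx_min": 47, "abinitio_min": 66, "dmx_ws_gb": 0, "abinitio_ws_gb": 0},
    "sort": {"dmx_min": 72, "abinitio_min": 282, "dmx_ws_gb": 60, "abinitio_ws_gb": 360},
    "aggregation/rollup": {"dmx_min": 81, "abinitio_min": 120, "dmx_ws_gb": 4, "abinitio_ws_gb": 360},
}

# How many times longer the Ab Initio run took than the DMExpress run.
speedup = {
    name: round(s["abinitio_min"] / s["dmx_min"], 2) for name, s in scenarios.items()
}

# Workspace ratio, skipped where DMExpress used no temp workspace at all.
workspace_ratio = {
    name: s["abinitio_ws_gb"] / s["dmx_ws_gb"]
    for name, s in scenarios.items()
    if s["dmx_ws_gb"] > 0
}

print(speedup)
print(workspace_ratio)
```

The sort scenario gives the largest elapsed-time ratio, about 3.9, in line with the "up to 4x faster" claim on the earlier chart slide. Note that the headline pair "4:42 h down to 1:12 h" and "360 GB down to 4 GB" on the examples slide draws the time from Scenario 2 and the workspace figure from Scenario 3.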
![Page 27: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/27.jpg)
DMExpress – White Boarding the Data Acceleration Sales
Metadata with Miti
![Page 28: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/28.jpg)
ETL to DMExpress acceleration / conversion
Parsing
• Informatica
• IBM DataStage
• PL/SQL
• Etc.
Processing
• Flow analysis
• Expression & type analysis
• Optimization
Output Generation
• DMExpress
• Documentation
Conversion Utility
• UNIX shell scripts
• Informatica workflows
• Informatica mappings
• Spreadsheets identifying the production workflows and mappings
• Timing information of the job executions over a two-month period
• Resource data points for the workflows
Automatic Conversion Utility
Cognizant Migration / Optimization COE
![Page 29: Optimization](https://reader035.vdocument.in/reader035/viewer/2022062418/5556ed4ad8b42a0f028b5118/html5/thumbnails/29.jpg)
DMExpress – White Boarding the Data Acceleration Sales P
DMX Live Demo