b-fundamentals of datastage parallelism

8/6/2019 B-Fundamentals of DataStage Parallelism


Overview of DataStage Parallelism


Server & Parallel Jobs

Parallel jobs are separate from server jobs: separate palette, different stages

A single project can contain parallel as well as server jobs; both can be called using the same sequencer

Use parallel jobs where there is a large volume of data to be processed, since managing parallel processes carries an overhead

Hardware support for parallelization

MPP: single box, multiple shared-nothing processors. Parallelization will give performance improvements

Clustered: multiple boxes. Parallelization will give performance improvements

SMP: processors with shared memory, single OS

  CPU-limited jobs: parallelization to increase the number of processors used will help improve performance

  Memory-limited jobs: in SMP systems (with shared memory), may need a hardware upgrade

  Disk I/O-limited jobs: some SMP systems allow scalability of disk I/O, so that throughput improves as the number of processors increases

Server jobs also support a limited degree of parallelization


Types of Parallelism

Internal Performance Enhancements

Parallel execution of independent flows

Pipelining: each row is processed and passed on to the next process in a pipeline

Partitioning: rows are split across different processes, each performing the same logic

[Figure: three diagrams. Pipelining: Rows 1 to 4 passing through Processes 1 to 4. Parallel execution of independent flows: Operators 1 to 7, with a wait point for the dependent process. Partitioning: data split into partitions, each handled by a copy (P1) of the same logical unit.]

Note that these options also depend on support from the hardware & OS, and on server settings & configuration
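The difference between pipelining and partitioning can be sketched in a few lines of Python. This is only a conceptual illustration, not how DataStage implements its operators: generators stand in for operators, everything runs in one process, and the names `read_rows`, `double_amount` and `keep_large` are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]


def read_rows(count: int) -> Iterator[Row]:
    """Source operator: emits rows one at a time."""
    for i in range(1, count + 1):
        yield {"id": i, "amount": i * 10}


def double_amount(rows: Iterable[Row]) -> Iterator[Row]:
    """Transform operator: handles each row as soon as it arrives."""
    for row in rows:
        yield {**row, "amount": row["amount"] * 2}


def keep_large(rows: Iterable[Row]) -> Iterator[Row]:
    """Filter operator: passes on rows whose amount is at least 40."""
    for row in rows:
        if row["amount"] >= 40:
            yield row


# Pipelining: operators are chained, so row 1 can reach keep_large
# before row 2 has even been read.
pipelined: List[Row] = list(keep_large(double_amount(read_rows(4))))

# Partitioning: the rows are split (here alternately, 2 ways) and the
# same operator chain runs once per partition.
all_rows = list(read_rows(4))
partitions = [all_rows[0::2], all_rows[1::2]]
partitioned = [list(keep_large(double_amount(p))) for p in partitions]

print(pipelined)
print(partitioned)
```

In a real parallel job each partition's chain would run as separate processes, which is where the management overhead mentioned earlier comes from.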


DataStage Enterprise Edition

Pipelining & Partitioning Combined

[Figure: Oracle databases in the combined pipelined and partitioned flow]

N-way partitioned data read, process & write

Pipelined processing for read, process & write

Optimize the parallelism

Better utilization of available CPU & hardware resources

Overhead of copying and preparing data, and of managing the processes


Pipelining

By default, DataStage combines operators where possible

E.g. two transform stages may be combined into a single operator

Saves data copying & preparation

This can be overridden at a project or stage level, if we wish to create separate processes for each operator


Data Partitioning

Default

The configuration file decides how many process instances of each operator are created, e.g. if 4 nodes are defined, there is a 4-way partition of the data

By default, Auto-Partitioning is set

DS chooses the optimum partitioning & repartitioning mechanism

Round-robin is applied at the first level followed by Same

If there is a need for key-based partitioning upstream or downstream, then alternative modes are chosen

e.g. in the case of a join, the data in the input link is sorted & partitioned by the join key

Degree of parallelism

This is decided by the Configuration file.

The configuration file used can be varied at the job level to suit different job requirements

Individual stages may also be executed on selected nodes by specifying node map constraints

Where the overhead of partitioning is not worth the performance improvement, the entire job or a specific stage may be executed sequentially.

Avoid Repartitioning & Redundant Sorting

The Designer palette indicates links at which these occur (when not auto partitioning)

Designers may try to optimize this by changing the order in which the stages occur or otherwise modifying the jobs

Sequential files are non-partitioned, so they are less optimal intermediate storage formats. DataSets may be used to preserve partitioning across jobs.


Configuration File

The Configuration File

System size & configuration details maintained external to the job design

Can be modified to suit development & production environments, handle hardware upgrades, etc. without redesigning/recompiling jobs

The configuration file describes available processing power in terms of processing nodes

It determines how many instances of a process will be produced when you compile a parallel job.

Minimum #Nodes < times #CPUs Minimum Recommended

Usual starting point for #Nodes = # CPUs

# Nodes < # CPUs if some CPUs left free for OS, DB and other applications

# Nodes > # CPUs for I/O intensive streams with poor CPU-usage

Associates the scratchdisk with each node


Configuration File

Sample

{
    node "node1"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
}
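Because the node definitions in the sample differ only in the node name, a file for a different degree of parallelism can be produced mechanically. The helper below is a hypothetical sketch, not a DataStage utility: `build_config` is an invented name, and it assumes every node runs on the same host and shares the same disk and scratchdisk paths, as in the sample.

```python
def build_config(node_count: int, fastname: str, disk: str, scratch: str) -> str:
    """Return configuration file text with node_count identical nodes.

    Assumption: all nodes share one host and the same resource paths,
    mirroring the two-node sample in the slides.
    """
    blocks = []
    for i in range(1, node_count + 1):
        blocks.append(
            f'    node "node{i}"\n'
            "    {\n"
            f'        fastname "{fastname}"\n'
            '        pools ""\n'
            f'        resource disk "{disk}" {{pools ""}}\n'
            f'        resource scratchdisk "{scratch}" {{pools ""}}\n'
            "    }\n"
        )
    return "{\n" + "".join(blocks) + "}\n"


# Values copied from the sample; only the node count changes.
config_text = build_config(
    4,
    "ibmsceai",
    "/home/dsadm/Ascential/DataStage/Datasets",
    "/home/dsadm/Ascential/DataStage/Scratch",
)
print(config_text)
```

Since the job design does not change, pointing a job at a 4-node file instead of the 2-node sample would give a 4-way partition without recompiling.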


Partitioning & Collecting

Partitioning Techniques: to be specified at each stage (or left as default)

General load-balancing techniques

  Round Robin: most efficient

  Random

  Same: use partition strategy of input feed, no repartitioning

  Entire: each node receives every row, e.g. to create a lookup file

  Hash By Key: e.g. to remove duplicates, aggregate over fewer groups

  Key Modulo

  Range: ensure related records come together

  DB2: same algorithm used by DB2

  Auto: leave it to DataStage. Usually initially Round Robin, then Same
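The key-based and Entire techniques listed above can be illustrated with a minimal Python sketch. It shows the idea only: `zlib.crc32` is a stand-in hash chosen for the example (DataStage's own hash function is not described in the slides), and the function names are hypothetical.

```python
import zlib
from typing import Any, Dict, List

Row = Dict[str, Any]


def hash_by_key(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Rows with equal key values always land in the same partition."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(str(row[key]).encode("utf-8"))  # stand-in hash
        parts[digest % n].append(row)
    return parts


def modulus(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Key Modulo: partition number = integer key value mod n."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


def entire(rows: List[Row], n: int) -> List[List[Row]]:
    """Entire: every partition receives a full copy of the rows."""
    return [list(rows) for _ in range(n)]


rows = [{"id": i, "grp": g} for i, g in enumerate("ABABAB", start=1)]

hashed = hash_by_key(rows, "grp", 3)
by_modulus = modulus(rows, "id", 3)
copies = entire(rows, 2)

print([[r["id"] for r in p] for p in by_modulus])
```

Hash and Modulo keep related rows together at the cost of possibly uneven partition sizes, whereas Round Robin and Random balance the load but ignore the key.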

Collecting

  Round Robin

  Ordered: read all records from first partition, then from second and so on

  Sorted Merge: read records based on one or more columns (collecting key)
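The three collection methods can be compared on the same partitioned input. The sketch below is illustrative only, with hypothetical function names; the sorted merge assumes each partition is already sorted on the collecting key.

```python
import heapq
from itertools import chain, zip_longest
from typing import Callable, List

_MISSING = object()  # marks exhausted partitions in round-robin collection


def ordered(partitions: List[List[int]]) -> List[int]:
    """Ordered: all of partition 0, then all of partition 1, and so on."""
    return list(chain.from_iterable(partitions))


def round_robin_collect(partitions: List[List[int]]) -> List[int]:
    """Round Robin: one record from each partition in turn."""
    out: List[int] = []
    for group in zip_longest(*partitions, fillvalue=_MISSING):
        out.extend(r for r in group if r is not _MISSING)
    return out


def sort_merge(partitions: List[List[int]], key: Callable) -> List[int]:
    """Sorted Merge: merge partitions that are each sorted on the key."""
    return list(heapq.merge(*partitions, key=key))


partitions = [[1, 5, 6], [2, 3], [4, 7]]

print(ordered(partitions))
print(round_robin_collect(partitions))
print(sort_merge(partitions, key=lambda r: r))
```

Only the sorted merge yields globally ordered output here; the other two preserve order within a partition but not across partitions.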

Comment on Slide 9 (111474, 6/2/2006): "usually...Otherwise determined by flow of jobs & requirements or the job stream"



Discussion on Partitioning Data Within A Job


Partitioning

Assume that a configuration file has been set

In the simplest case, when no specific options have been set, data is processed as partitions

Unless downstream stages require otherwise, the partition mechanism is Round Robin to begin with; subsequent stages have Same, i.e. no repartitioning happens

This normally gives maximum performance benefit & resource usage

Set/unset the Designer menu option Diagram->LinkMarkings




Sequential Files - Read

Normally read in sequence, i.e. one process reads the data & passes it to the next stage

Output data is partitioned round-robin into partitions

Processing Stages

Receive partitioned data & propagate it using the Same method

Sequential Files - Write

Partitioned data is collected & written sequentially

Read sequentially & Round-Robin partition

No repartitioning here

Auto Partition sets it to Same

Collected & written sequentially

Can sort on collection




Sequential Files

Executes in parallel when reading multiple files

For fixed-width files, can parallelize using multiple readers per node OR multiple nodes

Multiple files => 1 partition per file

On SMP systems: set Number of Readers Per Node

On cluster systems: set Read From Multiple Nodes
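One way to see why fixed-width files suit multiple readers: every record has the same length, so the file can be cut into record-aligned byte ranges without scanning for delimiters. The sketch below illustrates that idea only; it is not DataStage's reader implementation, and `reader_ranges` and `read_range` are hypothetical names.

```python
from typing import List, Tuple


def reader_ranges(file_size: int, record_len: int, n_readers: int) -> List[Tuple[int, int]]:
    """Return one (byte offset, byte length) range per reader, record-aligned."""
    total_records = file_size // record_len
    base, extra = divmod(total_records, n_readers)
    ranges = []
    start_record = 0
    for i in range(n_readers):
        count = base + (1 if i < extra else 0)  # spread the remainder
        ranges.append((start_record * record_len, count * record_len))
        start_record += count
    return ranges


def read_range(data: bytes, offset: int, length: int, record_len: int) -> List[bytes]:
    """Split one reader's byte range into fixed-width records."""
    chunk = data[offset:offset + length]
    return [chunk[i:i + record_len] for i in range(0, len(chunk), record_len)]


# Ten 4-byte records: b"0000", b"0001", ..., b"0009"
data = b"".join(f"{i:04d}".encode("ascii") for i in range(10))

ranges = reader_ranges(len(data), record_len=4, n_readers=3)
readers_output = [read_range(data, off, length, 4) for off, length in ranges]

print(ranges)
print(readers_output)
```

With delimited files the record boundaries are not known in advance, which is consistent with the slide limiting this option to fixed-width files.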




Back to EE_TRG_Demo_1

Note that we did not set any specific options for parallelism or partitioning

Read sequentially & apply partition (round-robin by default)

Icon for Auto Partition




DataStage inserts Operators for sort, partition, buffering, etc.

Aggregator Stage needs Key-partitioned data

Implicit Insert: Hash Partition

Join Stage expects all input links to be Key-Partitioned & Sorted. Partitioning mode must be the same

Implicit Insert: Hash Partition (if required**) & Sort on Join Key(s)

** Note that in this case data output from the aggregator is not partitioned again since it is already in the required partitioning format. It is only sorted
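The reason a join needs both inputs partitioned the same way on the join key can be sketched as follows: matching rows must meet in the same partition, after which each partition can be sorted and merge-joined independently. This is a conceptual sketch with hypothetical names and made-up sample rows; `zlib.crc32` is a stand-in for whatever hash DataStage uses.

```python
import zlib
from typing import Any, Dict, List

Row = Dict[str, Any]


def hash_partition(rows: List[Row], key: str, n: int) -> List[List[Row]]:
    """Send rows with equal key values to the same partition."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(str(row[key]).encode("utf-8")) % n].append(row)
    return parts


def merge_join(left: List[Row], right: List[Row], key: str) -> List[Row]:
    """Inner join of two inputs by sorting on the key and merging."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out: List[Row] = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of right rows sharing this key.
            j_end = j
            while j_end < len(right) and right[j_end][key] == lk:
                j_end += 1
            # Pair every left row with that key against the run.
            while i < len(left) and left[i][key] == lk:
                out.extend({**left[i], **r} for r in right[j:j_end])
                i += 1
            j = j_end
    return out


orders = [
    {"cust": "A", "order": 1},
    {"cust": "B", "order": 2},
    {"cust": "A", "order": 3},
    {"cust": "C", "order": 4},
]
customers = [
    {"cust": "A", "region": "North"},
    {"cust": "B", "region": "South"},
    {"cust": "D", "region": "East"},
]

n = 2
left_parts = hash_partition(orders, "cust", n)
right_parts = hash_partition(customers, "cust", n)

joined: List[Row] = []
for p in range(n):
    joined.extend(merge_join(left_parts[p], right_parts[p], "cust"))

print(sorted(joined, key=lambda r: r["order"]))
```

If one input were already hash-partitioned on the same key (as with the aggregator output in the note above), only the sort step would be needed for that input.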




USUALLY Auto mode works sufficiently well

When would we worry about the partitioning mechanism?

Some cases of debugging

Performance Tuning

Look at Link Icons to identify where partitioning, explicit repartitioning & collection have occurred

Advanced Users: look at DataStage log for

What partition has been implicitly applied

What and how the stages have been interpreted within the OSH

How data is distributed across the partitions

Where & why repartitioning occurs

Note that the level of reporting will depend on the environment variable settings

Tune parallelism

through the configuration file

running specific stages sequentially or on selected node pool(s)

Changing the partition mode

Enabling or disabling Operator Combinability, etc.




Most Stages have an Advanced tab with parallelization options

Sequential/Parallel

Combine operators (where possible) or Do not combine

Constraints or limitations on which nodes are to be used




Each Input Link into a stage has a Partitioning tab

Select Partitioning Type

Select Partitioning Key (if applicable)

Options on sorting incoming link data.

If the stage is executed sequentially & the preceding stage is parallel, then the Collection options are available




How can inappropriate partitioning cause wrong results to be produced?

If the partitioning mode, on say the Aggregation Stage, is for some reason explicitly set to a non-key mode (such as round-robin), the stage will return WRONG RESULTS

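Input data

  Grp Key   Amt Val
  A         10
  A         20
  A         30
  A         40
  B         50
  B         60
  B         70
  B         80

Round-Robin Partition

  Partition 1           Partition 2
  Grp Key   Amt Val     Grp Key   Amt Val
  A         10          A         20
  A         30          A         40
  B         50          B         60
  B         70          B         80

Aggregate within the partition

  Partition 1           Partition 2
  Grp Key   Amt Val     Grp Key   Amt Val
  A         40          A         60
  B         120         B         140

Wrong Output!

  Grp Key   Amt Val
  A         40
  B         120
  A         60
  B         140

Each group key appears twice with a partial sum, instead of once with the full sum (A 100, B 260).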
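The slide's numbers can be reproduced with a short Python sketch. It uses the eight rows from the example and a 2-way partition; the key-based partitioner below uses `ord(key) % n` purely as a simple stand-in for a hash, which is an assumption made for the illustration.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Row = Tuple[int, str]  # (Amt Val, Grp Key)

rows: List[Row] = [
    (10, "A"), (20, "A"), (30, "A"), (40, "A"),
    (50, "B"), (60, "B"), (70, "B"), (80, "B"),
]


def round_robin(rows: List[Row], n: int) -> List[List[Row]]:
    """Deal rows to partitions in turn, ignoring the key."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def key_partition(rows: List[Row], n: int) -> List[List[Row]]:
    """Send all rows of a group to one partition (stand-in hash)."""
    parts: List[List[Row]] = [[] for _ in range(n)]
    for amt, grp in rows:
        parts[ord(grp) % n].append((amt, grp))
    return parts


def aggregate(partition: List[Row]) -> Dict[str, int]:
    """Sum Amt Val per Grp Key within a single partition."""
    totals: Dict[str, int] = defaultdict(int)
    for amt, grp in partition:
        totals[grp] += amt
    return dict(totals)


rr_result = [aggregate(p) for p in round_robin(rows, 2)]
key_result = [aggregate(p) for p in key_partition(rows, 2)]

print(rr_result)   # partial sums: each group shows up in both partitions
print(key_result)  # full sums: each group shows up in exactly one partition
```

With round-robin the aggregator emits four rows of partial sums; with key-based partitioning each group is summed once, giving A 100 and B 260.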



Case Study 2
