b-fundamentals of datastage parallelism
8/6/2019 B-Fundamentals of DataStage Parallelism
1/21
Overview of DataStage Parallelism
Server & Parallel Jobs
Parallel jobs are separate from server jobs: they have a separate palette and different stages.
A single project can contain parallel as well as server jobs, and both can be called using the same sequencer.
Use parallel jobs where there is a large volume of data to be processed; there is an overhead for managing parallel processes.
Hardware support for parallelization
  MPP: single box, multiple shared-nothing processors. Parallelization will give performance improvements.
  Clustered: multiple boxes. Parallelization will give performance improvements.
  SMP: processors with shared memory, single OS.
    CPU-limited jobs: parallelization to increase the number of processors used will help improve performance.
    Memory-limited jobs: in SMP systems (with shared memory), may need a hardware upgrade.
    Disk I/O-limited jobs: some SMP systems allow scalability of disk I/O, so that throughput improves as the number of processors increases.
Server jobs also support a limited degree of parallelization.
Types of Parallelism
Internal Performance Enhancements
Parallel execution of independent flows
Pipelining: each row is processed and passed on to the next process in a pipeline
Partitioning: rows split across different processes, each performing the same logic
[Figures: (1) parallel execution of independent flows: Operators 1 to 7, with a wait point for a dependent process; (2) pipelining: Rows 1 to 4 passing through Processes 1 to 4; (3) partitioning: data split into partitions, each handled by an instance of the same logical unit P1.]
Note that these options also depend on support from the hardware & OS, and on server settings & configurations.
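The difference between pipelining and partitioning can be sketched in a few lines of Python. This is only a model of the data flow, not of how DataStage runs its processes: the code is single-threaded, and the names `read_rows`, `double_amount` and `round_robin` and the sample rows are hypothetical.

```python
def read_rows(n):
    """Source stage: yield rows one at a time (hypothetical sample data)."""
    for i in range(1, n + 1):
        yield {"id": i, "amount": i * 10}


def double_amount(rows):
    """Processing stage: handles each row as soon as the upstream stage yields it."""
    for row in rows:
        yield {**row, "doubled": row["amount"] * 2}


def round_robin(rows, n_partitions):
    """Deal rows across partitions in turn."""
    partitions = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        partitions[i % n_partitions].append(row)
    return partitions


# Pipelining: stages are chained, so a row moves on without waiting
# for the whole data set to pass through the previous stage.
pipelined = list(double_amount(read_rows(4)))

# Partitioning: rows are split, and the same logic runs once per partition.
partitions = round_robin(read_rows(4), 2)
partitioned = [list(double_amount(p)) for p in partitions]
```

Both routes produce the same set of output rows; only the way the work is divided differs.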
DataStage Enterprise Edition
Pipelining & Partitioning Combined
[Figure: data flow diagram; labels: Oracle]
N-way partitioned data read, process & write
Pipeline processed for read, process & write
Optimize the parallelism
Better utilization of available CPU & hardware resources
Overhead of copying and preparing data, and of managing the processes
Pipelining
By default, DataStage combines operators where possible
  E.g. two transform stages may be combined into a single operator
  Saves data copying & preparation
This can be overridden at a project or stage level, if we wish to create separate processes for each operator
Data Partitioning
Default
  The configuration file decides how many process instances of each operator are created, e.g. if 4 nodes are defined, there is a 4-way partition of the data.
  By default, Auto partitioning is set.
    DataStage chooses the optimum partitioning & repartitioning mechanism.
    Round Robin is applied at the first level, followed by Same.
    If there is a need for key-based partitioning upstream or downstream, then alternative modes are chosen, e.g. in the case of a join, the data on the input link is sorted & partitioned by the join key.
Degree of parallelism
  This is decided by the configuration file.
  The configuration file used can be varied at the job level to suit different job requirements.
  Individual stages may also be executed on selected nodes by specifying node map constraints.
  Where the overhead of partitioning is not worth the performance improvement, the entire job or a specific stage may be executed sequentially.
Avoid repartitioning & redundant sorting
  The Designer palette indicates links at which these occur (when not auto partitioning).
  Designers may try to optimize this by changing the order in which the stages occur or otherwise modifying the jobs.
  Sequential files are non-partitioned, so they are less optimal as intermediate storage formats. DataSets may be used to preserve partitioning across jobs.
Configuration File
The Configuration File
System size & configuration details are maintained external to the job design
Can be modified to suit development & production environments, handle hardware upgrades, etc. without redesigning/recompiling jobs
The configuration file describes available processing power in terms of processing nodes
It determines how many instances of a process will be produced when you compile a parallel job.
Minimum #Nodes < times #CPUs Minimum Recommended
Usual starting point: #Nodes = #CPUs
#Nodes < #CPUs if some CPUs are left free for the OS, DB and other applications
#Nodes > #CPUs for I/O-intensive streams with poor CPU usage
Associates the scratchdisk with each node
Configuration File
Sample
{
  node "node1"
  {
    fastname "ibmsceai"
    pools ""
    resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
  node "node2"
  {
    fastname "ibmsceai"
    pools ""
    resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
    resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
  }
}
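Because every node entry in the sample has the same shape, a file for a different number of nodes can be generated rather than typed by hand. The sketch below is a hypothetical helper, not a DataStage tool: `render_config` and its parameters are invented for illustration, and it only reproduces the layout of the sample above (one fastname, one disk resource and one scratchdisk resource per node, all in the default pool).

```python
def render_config(node_names, fastname, disk_path, scratch_path):
    """Build configuration-file text with one node entry per name.

    Mirrors the layout of the two-node sample only; real files may
    carry more resources, pools or per-node paths.
    """
    lines = ["{"]
    for name in node_names:
        lines += [
            f'  node "{name}"',
            "  {",
            f'    fastname "{fastname}"',
            '    pools ""',
            f'    resource disk "{disk_path}" {{pools ""}}',
            f'    resource scratchdisk "{scratch_path}" {{pools ""}}',
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


# Values taken from the sample; a 4-way file would pass four node names.
config_text = render_config(
    ["node1", "node2"],
    "ibmsceai",
    "/home/dsadm/Ascential/DataStage/Datasets",
    "/home/dsadm/Ascential/DataStage/Scratch",
)
print(config_text)
```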
Partitioning & Collecting
Partitioning techniques: to be specified at each stage (or left as default)
  General load-balancing techniques
    Round Robin: most efficient
    Random
  Same: use the partition strategy of the input feed, no repartitioning
  Entire: each node receives every row, e.g. to create a lookup file
  Hash By Key: e.g. to remove duplicates, aggregate over fewer groups
  Key Modulo
  Range: ensure related records come together
  DB2: same algorithm used by DB2
  Auto: leave it to DataStage. Usually Round Robin initially, then Same
Collecting
  Round Robin
  Ordered: read all records from the first partition, then from the second, and so on
  Sorted Merge: read records based on one or more columns (collecting key)
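To make the partitioning techniques listed above concrete, here is a minimal Python sketch of four of them. It is an illustration under stated assumptions, not DataStage's implementation: the function names and sample rows are hypothetical, and `zlib.crc32` stands in for whatever hash function the engine actually uses.

```python
import zlib


def round_robin(rows, n):
    """Row i goes to partition i mod n, giving evenly sized partitions."""
    return [rows[i::n] for i in range(n)]


def entire(rows, n):
    """Every partition receives a full copy of the rows."""
    return [list(rows) for _ in range(n)]


def hash_by_key(rows, n, key):
    """Rows with the same key value always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        # crc32 is a stand-in hash chosen for repeatable results.
        h = zlib.crc32(str(row[key]).encode("utf-8"))
        parts[h % n].append(row)
    return parts


def modulus(rows, n, key):
    """Partition number is the integer key value modulo n."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[row[key] % n].append(row)
    return parts


# Hypothetical sample: ids 1..8, the first four in group A, the rest in B.
rows = [{"id": i, "grp": "A" if i <= 4 else "B"} for i in range(1, 9)]
```

Round Robin balances the load but ignores the data; Hash By Key and the modulus variant keep related rows together, at the risk of uneven partitions when key values are skewed.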
Slide 9, comment 15 (111474, 6/2/2006): usually... Otherwise determined by flow of jobs & requirements or the job stream.
Discussion on Partitioning Data Within A Job
Partitioning
Assume that a configuration file has been set
In the simplest case, when no specific options have been set, data is processed as partitions
Unless downstream stages require otherwise:
  the partition mechanism is Round Robin to begin with
  subsequent stages have Same, i.e. no repartitioning happens
This normally gives maximum performance benefit & resource usage
Set/unset the Designer menu option Diagram->Link Markings
Partitioning
Sequential Files - Read
  Normally read in sequence, i.e. 1 process reads the data & passes it to the next stage
  Output data is partitioned round-robin into partitions
Processing Stages
  Receive partitioned data & propagate it using the Same method
Sequential Files - Write
  Partitioned data is collected & written sequentially
Diagram callouts:
  Source: read sequentially & round-robin partitioned
  Processing stages: no repartitioning here; Auto partition sets it to Same
  Target: collected & written sequentially; can sort on collection
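The read, process and collect flow described above can be modelled with plain Python lists. This is a sketch only: the function names are hypothetical, the "processing stage" is just a per-partition sort, and the three collectors imitate the Ordered, Round Robin and Sorted Merge methods rather than reproduce DataStage's operators.

```python
import heapq
from itertools import chain, zip_longest


def round_robin_partition(rows, n):
    """Sequential read followed by round-robin partitioning."""
    return [rows[i::n] for i in range(n)]


def collect_ordered(partitions):
    """All rows from the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def collect_round_robin(partitions):
    """One row from each partition in turn until all are exhausted."""
    missing = object()
    out = []
    for group in zip_longest(*partitions, fillvalue=missing):
        out.extend(r for r in group if r is not missing)
    return out


def collect_sorted_merge(partitions, key=None):
    """Merge partitions that are each already sorted on the collecting key."""
    return list(heapq.merge(*partitions, key=key))


rows = [5, 1, 4, 2, 3, 6]                      # read sequentially
partitions = round_robin_partition(rows, 2)    # [[5, 4, 3], [1, 2, 6]]
processed = [sorted(p) for p in partitions]    # same partitioning, no repartition

print(collect_ordered(processed))       # [3, 4, 5, 1, 2, 6]
print(collect_round_robin(processed))   # [3, 1, 4, 2, 5, 6]
print(collect_sorted_merge(processed))  # [1, 2, 3, 4, 5, 6]
```

Only the sorted merge yields globally ordered output, and only because each partition was sorted before collection.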
Partitioning
Sequential Files
  Executes in parallel when reading multiple files
  For fixed-width files, can parallelize using multiple readers per node OR multiple nodes
  Multiple files => 1 partition per file
  On SMP systems: set Number of Readers Per Node
  On cluster systems: set Read From Multiple Nodes
Partitioning
Back to EE_TRG_Demo_1
Note that we did not set any specific options for parallelism or partitioning
Read sequentially & apply partition (round-robin by default)
Icon for Auto Partition
Partitioning
DataStage inserts operators for sort, partition, buffering, etc.
Aggregator stage needs key-partitioned data
  Implicit insert: Hash Partition
Join stage expects all input links to be key-partitioned & sorted. The partitioning mode must be the same on every input link
  Implicit insert: Hash Partition (if required**) & Sort on Join Key(s)
** Note that in this case the data output from the aggregator is not partitioned again, since it is already in the required partitioning format. It is only sorted.
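The reason a join needs both inputs partitioned the same way on the join key can be shown with a small Python sketch. Everything here is hypothetical (the function names, the `orders` and `customers` rows, and the use of `zlib.crc32` as the hash); the matching step is a simple nested loop over sorted inputs, not a true single-pass sort-merge join.

```python
import zlib


def hash_partition(rows, n, key):
    """Send each row to the partition chosen by a hash of its key value."""
    parts = [[] for _ in range(n)]
    for row in rows:
        h = zlib.crc32(str(row[key]).encode("utf-8"))  # stand-in hash
        parts[h % n].append(row)
    return parts


def join_partition(left, right, key):
    """Inner-join the rows of one partition after sorting both sides on the key."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


orders = [
    {"cust": "c1", "amt": 10},
    {"cust": "c2", "amt": 20},
    {"cust": "c1", "amt": 30},
]
customers = [{"cust": "c1", "name": "Ann"}, {"cust": "c2", "name": "Bob"}]

# Same partitioner, same key, same partition count on both links, so
# matching rows are guaranteed to meet in the same partition.
joined = []
for left_part, right_part in zip(
    hash_partition(orders, 2, "cust"), hash_partition(customers, 2, "cust")
):
    joined += join_partition(left_part, right_part, "cust")
```

If one input were round-robin partitioned instead, an order and its customer could land in different partitions and the match would be silently lost.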
Partitioning
Usually, Auto mode works sufficiently well
When would we worry about the partitioning mechanism?
Some cases of debugging
Performance Tuning
Look at link icons to identify where partitioning, explicit repartitioning & collection have occurred
Advanced Users: look at DataStage log for
What partition has been implicitly applied
What and how the stages have been interpreted within the OSH
How data is distributed across the partitions
Where & why repartitioning occurs
Note that the level of reporting will depend on the environment variable settings
Tune parallelism
through the configuration file
running specific stages sequentially or on selected node pool(s)
Changing the partition mode
Enabling or disabling Operator Combinability, etc.
Partitioning
Most Stages have an Advanced tab with parallelization options
Sequential/Parallel
Combine operators (where possible) or Do not combine
Constraints or limitations on which nodes are to be used
Partitioning
Each Input Link into a stage has a Partitioning tab
Select Partitioning Type
Select Partitioning Key (if applicable)
Options on sorting incoming link data.
If the stage is executed sequentially & the preceding stage is parallel, then the Collection options are available
Partitioning
How can inappropriate partitioning cause wrong results to be produced?
If the partitioning mode on, say, the Aggregator stage is for some reason explicitly set to a non-key mode (such as round-robin), the stage will return WRONG RESULTS
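Input rows:
  Amt Val   Grp Key
  10        A
  20        A
  30        A
  40        A
  50        B
  60        B
  70        B
  80        B
After Round-Robin partition:
  Partition 1           Partition 2
  Amt Val   Grp Key     Amt Val   Grp Key
  10        A           20        A
  30        A           40        A
  50        B           60        B
  70        B           80        B
Aggregate within the partition:
  Partition 1           Partition 2
  Amt Val   Grp Key     Amt Val   Grp Key
  40        A           60        A
  120       B           140       B
Wrong output! Two rows per group instead of one (the correct totals are A = 100 and B = 260):
  Amt Val   Grp Key
  40        A
  120       B
  60        A
  140       B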
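The numbers on this slide can be reproduced with a short Python sketch. The rows are the eight (Amt Val, Grp Key) pairs from the slide; the function names are hypothetical, and `zlib.crc32` is only a stand-in for a key-based hash partitioner.

```python
import zlib
from collections import defaultdict

ROWS = [(10, "A"), (20, "A"), (30, "A"), (40, "A"),
        (50, "B"), (60, "B"), (70, "B"), (80, "B")]


def aggregate(rows):
    """Sum the amounts per group key within one partition."""
    totals = defaultdict(int)
    for amt, grp in rows:
        totals[grp] += amt
    return dict(totals)


def round_robin(rows, n):
    """Non-key partitioning: row i goes to partition i mod n."""
    return [rows[i::n] for i in range(n)]


def hash_by_group(rows, n):
    """Key partitioning: all rows of a group go to the same partition."""
    parts = [[] for _ in range(n)]
    for amt, grp in rows:
        parts[zlib.crc32(grp.encode("utf-8")) % n].append((amt, grp))
    return parts


# Round-robin splits each group across both partitions, so each
# partition emits its own partial total for A and for B.
rr_results = [aggregate(p) for p in round_robin(ROWS, 2)]
print(rr_results)  # [{'A': 40, 'B': 120}, {'A': 60, 'B': 140}]

# With key partitioning every group is aggregated exactly once.
merged = {}
for partial in (aggregate(p) for p in hash_by_group(ROWS, 2)):
    merged.update(partial)
print(merged == {"A": 100, "B": 260})  # True
```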
Case Study 2