b-fundamentals of datastage parallelism

Upload: sam2sung2

Post on 07-Apr-2018


  • 8/6/2019 B-Fundamentals of DataStage Parallelism


    Overview of DataStage Parallelism


    Server & Parallel Jobs

    Parallel jobs are separate from server jobs: separate palette, different stages

    A single project can contain parallel as well as server jobs; both can be called using the same sequencer

    Use parallel jobs where there is a large volume of data to be processed
        Overhead for management of parallel processes

    Hardware support for parallelization
        MPP: single box, multiple share-nothing processors. Parallelization will give performance improvements
        Clustered: multiple boxes. Parallelization will give performance improvements
        SMP: processors with shared memory, single OS
            CPU-limited jobs: parallelization to increase the number of processors used will help improve performance
            Memory-limited jobs: in SMP systems (with shared memory) may need a hardware upgrade
            Disk I/O-limited jobs: some SMP systems allow scalability of disk I/O, so that throughput improves as the number of processors increases

    Server jobs also support a limited degree of parallelization


    Types of Parallelism

    Internal performance enhancements

    Parallel execution of independent flows

    Pipelining: each row is processed and passed on to the next process in a pipeline

    Partitioning: rows are split across different processes, each performing the same logic

    [Figure: pipelining, with Rows 1 to 4 passing through Processes 1 to 4; parallel execution of independent flows, with Operators 1 to 7 grouped into logical units and a wait point for a dependent process; partitioning, with data split into partitions that each run the same steps (1a, 1b) before step 2]

    Note that these options also depend on
        support from the hardware & OS
        server settings & configurations
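The difference between pipelining and partitioning can be sketched in a few lines of Python. This is only an illustration of the two ideas, not how DataStage implements them: the stage functions (`read_rows`, `double`, `add_one`) are hypothetical, and Python generators run in a single thread, so the sketch shows the data flow rather than real parallel execution.

```python
def read_rows(rows):
    """Stage 1: emit rows one at a time."""
    for row in rows:
        yield row


def double(rows):
    """Stage 2: transform each row as soon as it arrives from stage 1."""
    for row in rows:
        yield row * 2


def add_one(rows):
    """Stage 3: transform each row as soon as it arrives from stage 2."""
    for row in rows:
        yield row + 1


def run_pipeline(rows):
    """Pipelining: each row flows through all stages without waiting
    for the previous stage to finish the whole data set."""
    return list(add_one(double(read_rows(rows))))


def run_partitioned(rows, n_partitions):
    """Partitioning: split rows round-robin, then run the same
    pipeline logic on every partition."""
    partitions = [[] for _ in range(n_partitions)]
    for i, row in enumerate(rows):
        partitions[i % n_partitions].append(row)
    return [run_pipeline(part) for part in partitions]


print(run_pipeline([1, 2, 3, 4]))        # [3, 5, 7, 9]
print(run_partitioned([1, 2, 3, 4], 2))  # [[3, 7], [5, 9]]
```

In a real parallel engine each partition's pipeline would run as its own set of processes; here they run one after another.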


    DataStage Enterprise Edition

    Pipelining & Partitioning Combined


    N-way partitioned data read, process & write

    Pipeline processed for read, process & write

    Optimize the parallelism
        Benefit: better utilization of available CPU & hardware resources
        Cost: overhead of copying and preparing data, and of managing the processes


    Pipelining

    By default, DataStage combines operators where possible
        E.g. two transform stages may be combined into a single operator
        Saves data copying & preparation

    This can be overridden at a project or stage level if we wish to create separate processes for each operator


    Data Partitioning

    Default
        The configuration file decides how many process instances of each operator are created, e.g. if 4 nodes are defined, there is a 4-way partition of data
        By default, Auto partitioning is set
            DS chooses the optimum partitioning & repartitioning mechanism
            Round-robin is applied at the first level, followed by Same
            If there is a need for key-based partitioning upstream or downstream, then alternative modes are chosen, e.g. in the case of a join, the data in the input link is sorted & partitioned by the join key

    Degree of parallelism
        This is decided by the configuration file
        The configuration file used can be varied at the job level to suit different job requirements
        Individual stages may also be executed on selected nodes by specifying node map constraints
        Where the overhead of partitioning is not worth the performance improvement, the entire job or a specific stage may be executed sequentially

    Avoid repartitioning & redundant sorting
        The Designer palette indicates links at which these occur (when not auto partitioning)
        Designers may try to optimize this by changing the order in which the stages occur or otherwise modifying the jobs
        Sequential files are non-partitioned, so they are less optimal as intermediate storage. DataSets may be used to preserve partitioning across jobs


    Configuration File

    The configuration file
        Keeps system size & configuration details external to the job design
        Can be modified to suit development & production environments, handle hardware upgrades, etc. without redesigning/recompiling jobs
        Describes available processing power in terms of processing nodes
        Determines how many instances of a process will be produced when you compile a parallel job

    Minimum #Nodes < times #CPUs Minimum Recommended

    Usual starting point for #Nodes = # CPUs

    # Nodes < # CPUs if some CPUs left free for OS, DB and other applications

    # Nodes > # CPUs for I/O intensive streams with poor CPU-usage

    Associates the scratchdisk with each node


    Configuration File

    Sample

    {
        node "node1"
        {
            fastname "ibmsceai"
            pools ""
            resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
            resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
        }
        node "node2"
        {
            fastname "ibmsceai"
            pools ""
            resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
            resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
        }
    }
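Since the number of nodes in the configuration file sets the degree of parallelism, it can be handy to count them before running a job. The following is a minimal sketch, assuming the file follows the layout of the sample above; the helper names (`node_names`, `degree_of_parallelism`) are hypothetical and the regular expression is not a full parser for the configuration syntax.

```python
import re

# The sample configuration from the slide, embedded as text for the example.
SAMPLE_CONFIG = '''
{
    node "node1"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
    node "node2"
    {
        fastname "ibmsceai"
        pools ""
        resource disk "/home/dsadm/Ascential/DataStage/Datasets" {pools ""}
        resource scratchdisk "/home/dsadm/Ascential/DataStage/Scratch" {pools ""}
    }
}
'''


def node_names(config_text):
    """Return the logical node names declared as: node "name"."""
    return re.findall(r'\bnode\s+"([^"]+)"', config_text)


def degree_of_parallelism(config_text):
    """One partition per declared node (the default behaviour described above)."""
    return len(node_names(config_text))


print(node_names(SAMPLE_CONFIG))             # ['node1', 'node2']
print(degree_of_parallelism(SAMPLE_CONFIG))  # 2
```

Both nodes in the sample share the same `fastname`, so this is a 2-way partition on a single machine.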


    Partitioning & Collecting

    Partitioning techniques: to be specified at each stage (or left as default)
        General load-balancing techniques
            Round Robin: most efficient
            Random
            Same: use partition strategy of input feed, no repartitioning
            Entire: each node receives every row, e.g. to create a lookup file
        Hash By Key: e.g. to remove duplicates, aggregate over fewer groups
        Key Modulo
        Range: ensure related records come together
        DB2: same algorithm used by DB2
        Auto: leave it to DataStage. Usually initially Round Robin, then Same

    Collecting
        Round Robin
        Ordered: read all records from first partition, then from second and so on
        Sorted Merge: read records based on one or more columns (collecting key)
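The partitioning and collecting methods listed above can be sketched as small Python functions. These are illustrations of the general techniques only: the function names are hypothetical, and `zlib.crc32` is an assumed stand-in for whatever hash function DataStage actually uses.

```python
import heapq
import zlib


def round_robin(rows, n):
    """Deal rows to partitions in turn: even load, ignores keys."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def hash_by_key(rows, n, key):
    """Rows with the same key always land in the same partition."""
    parts = [[] for _ in range(n)]
    for row in rows:
        digest = zlib.crc32(str(key(row)).encode())  # assumed stand-in hash
        parts[digest % n].append(row)
    return parts


def modulus(rows, n, key):
    """Key modulo: integer key value modulo the number of partitions."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[key(row) % n].append(row)
    return parts


def entire(rows, n):
    """Every partition receives every row (e.g. for a lookup)."""
    return [list(rows) for _ in range(n)]


def collect_ordered(parts):
    """All rows of the first partition, then the second, and so on."""
    return [row for part in parts for row in part]


def collect_round_robin(parts):
    """One row from each partition in turn until all are exhausted."""
    out, iters = [], [iter(p) for p in parts]
    while iters:
        alive = []
        for it in iters:
            try:
                out.append(next(it))
                alive.append(it)
            except StopIteration:
                pass
        iters = alive
    return out


def collect_sorted_merge(parts, key):
    """Merge partitions that are each already sorted on the collecting key."""
    return list(heapq.merge(*parts, key=key))


parts = round_robin([1, 2, 3, 4, 5], 2)
print(parts)                                          # [[1, 3, 5], [2, 4]]
print(collect_ordered(parts))                         # [1, 3, 5, 2, 4]
print(collect_sorted_merge(parts, key=lambda r: r))   # [1, 2, 3, 4, 5]
```

Note that sorted merge only yields a globally sorted result when every partition is already sorted on the collecting key.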



    Comment on slide 9 (Partitioning & Collecting), at "Usually" under Auto: usually... Otherwise determined by flow of jobs & requirements or the job stream (111474, 6/2/2006)


    Discussion on Partitioning Data Within A Job


    Partitioning

    Assume that a configuration file has been set

    In the simplest case, when no specific options have been set
        Data is processed as partitions
        Unless downstream stages require otherwise
            the partition mechanism is Round Robin to begin with
            subsequent stages have Same, i.e. no repartitioning happens
        This normally gives maximum performance benefit & resource usage

    Set/Unset Designer menu option: Diagram->Link Markings


    Partitioning

    Sequential Files - Read
        Normally read in sequence, i.e. 1 process reads the data & passes it to the next stage
        Output data is partitioned round-robin into partitions

    Processing Stages
        Receive partitioned data & propagate it using the Same method

    Sequential Files - Write
        Partitioned data is collected & written sequentially

    [Figure callouts: Read sequentially & round-robin partition. No repartitioning here: Auto partition sets it to Same. Collected & written sequentially; can sort on collection.]


    Partitioning

    Sequential Files
        Executes in parallel when reading multiple files
        For fixed-width files, can parallelize using multiple readers per node OR multiple nodes

    Multiple files => 1 partition per file

    On SMP systems: set Number of Readers Per Node

    On cluster systems: set Read From Multiple Nodes
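Fixed-width files lend themselves to multiple readers because every record boundary can be computed from the record length alone. The sketch below shows one plausible way to divide such a file into byte ranges, one per reader. It is an assumption for illustration: the slides do not describe how DataStage actually assigns ranges, and `reader_ranges` is a hypothetical helper.

```python
def reader_ranges(file_size, record_length, n_readers):
    """Return one (start, end) byte range per reader, aligned to
    record boundaries. Trailing bytes that do not fill a whole
    record are ignored."""
    total_records = file_size // record_length
    base, extra = divmod(total_records, n_readers)
    ranges, start = [], 0
    for i in range(n_readers):
        # The first `extra` readers take one additional record each.
        count = base + (1 if i < extra else 0)
        end = start + count * record_length
        ranges.append((start, end))
        start = end
    return ranges


# 10 records of 10 bytes each, split across 3 readers.
print(reader_ranges(100, 10, 3))  # [(0, 40), (40, 70), (70, 100)]
```

A delimited file offers no such shortcut, since record boundaries are only known after scanning the data.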


    Partitioning

    Back to EE_TRG_Demo_1

    Note that we did not set any specific options for parallelism or partitioning

    [Figure callouts: Read sequentially & apply partition (round-robin by default). Icon for Auto partition.]


    Partitioning

    DataStage inserts operators for sort, partition, buffering, etc.

    Aggregator stage needs key-partitioned data
        Implicit insert: Hash partition

    Join stage expects all input links to be key-partitioned & sorted. The partitioning mode must be the same on all of them
        Implicit insert: Hash partition (if required**) & sort on join key(s)

    ** Note that in this case data output from the aggregator is not partitioned again, since it is already in the required partitioning format. It is only sorted
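Why a join needs both inputs partitioned the same way on the join key, and sorted, can be seen in a small sketch. It hash-partitions two inputs with the same function, then sorts and merge-joins each pair of partitions independently. The names are hypothetical, rows are assumed to be `(key, value)` tuples, and `zlib.crc32` is an assumed stand-in for the real hash.

```python
import zlib


def hash_partition(rows, n):
    """Partition (key, value) rows by a hash of the key."""
    parts = [[] for _ in range(n)]
    for row in rows:
        parts[zlib.crc32(str(row[0]).encode()) % n].append(row)
    return parts


def merge_join(left, right):
    """Inner join of two (key, value) lists after sorting both on the key."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, j = [], 0
    for lkey, lval in left:
        # Skip right rows whose key is smaller than the current left key.
        while j < len(right) and right[j][0] < lkey:
            j += 1
        # Emit every right row that matches the current left key.
        k = j
        while k < len(right) and right[k][0] == lkey:
            out.append((lkey, lval, right[k][1]))
            k += 1
    return out


def partitioned_join(left, right, n):
    """Join partition i of the left input with partition i of the right."""
    joined = []
    for lpart, rpart in zip(hash_partition(left, n), hash_partition(right, n)):
        joined.extend(merge_join(lpart, rpart))
    return sorted(joined)


left = [("A", 1), ("B", 2), ("C", 3)]
right = [("A", "x"), ("B", "y"), ("C", "z")]
print(partitioned_join(left, right, 2))
# [('A', 1, 'x'), ('B', 2, 'y'), ('C', 3, 'z')]
```

Because both inputs use the same hash and partition count, matching keys always meet in the same partition. If one input were partitioned differently, some matches would sit in different partitions and be missed.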


    Partitioning

    USUALLY Auto mode works sufficiently well

    When would we worry about the partitioning mechanism?

    Some cases of debugging

    Performance Tuning

    Look at Link Icons to identify where partitioning, explicit repartitioning & collection have occurred

    Advanced Users: look at DataStage log for

    What partition has been implicitly applied

    What and how the stages have been interpreted within the OSH

    How data is distributed across the partitions

    Where & why repartitioning occurs

    Note that the level of reporting will depend on the environment variable settings

    Tune parallelism

    through the configuration file

    running specific stages sequentially or on selected node pool(s)

    Changing the partition mode

    Enabling or disabling Operator Combinability, etc.


    Partitioning

    Most stages have an Advanced tab with parallelization options
        Sequential/Parallel
        Combine operators (where possible) or Do not combine
        Constraints or limitations on which nodes are to be used


    Partitioning

    Each input link into a stage has a Partitioning tab
        Select partitioning type
        Select partitioning key (if applicable)
        Options on sorting incoming link data

    If the stage is executed sequentially & the preceding stage is parallel, then the Collection options are available


    Partitioning

    How can inappropriate partitioning cause wrong results to be produced?

    If the partitioning mode on, say, the Aggregator stage is for some reason explicitly set to a non-key mode (such as round-robin), the stage will return WRONG RESULTS

    Input data
        Amt Val   Grp Key
        10        A
        20        A
        30        A
        40        A
        50        B
        60        B
        70        B
        80        B

    Round-robin partition
        Partition 1            Partition 2
        Amt Val   Grp Key      Amt Val   Grp Key
        10        A            20        A
        30        A            40        A
        50        B            60        B
        70        B            80        B

    Aggregate within the partition
        Partition 1            Partition 2
        Amt Val   Grp Key      Amt Val   Grp Key
        40        A            60        A
        120       B            140       B

    Wrong output!
        Amt Val   Grp Key
        40        A
        120       B
        60        A
        140       B
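The wrong result can be reproduced with the slide's own numbers. The sketch below is an illustration, not DataStage code: it sums the amounts per group within each partition, first after round-robin partitioning and then after key-based partitioning. The function names are hypothetical, and `ord(key) % n` is an assumed stand-in for a hash of the single-letter group key.

```python
from collections import defaultdict

# (Amt Val, Grp Key) rows from the slide.
ROWS = [(10, "A"), (20, "A"), (30, "A"), (40, "A"),
        (50, "B"), (60, "B"), (70, "B"), (80, "B")]


def aggregate(rows):
    """Sum the amounts per group key within one partition."""
    totals = defaultdict(int)
    for amt, key in rows:
        totals[key] += amt
    return dict(totals)


def round_robin(rows, n):
    """Non-key partitioning: rows of one group are spread over partitions."""
    parts = [[] for _ in range(n)]
    for i, row in enumerate(rows):
        parts[i % n].append(row)
    return parts


def partition_by_key(rows, n):
    """Key-based partitioning: all rows of a group share one partition."""
    parts = [[] for _ in range(n)]
    for amt, key in rows:
        parts[ord(key) % n].append((amt, key))  # assumed stand-in hash
    return parts


wrong = [aggregate(p) for p in round_robin(ROWS, 2)]
right = [aggregate(p) for p in partition_by_key(ROWS, 2)]
print(wrong)  # [{'A': 40, 'B': 120}, {'A': 60, 'B': 140}]
print(right)  # [{'B': 260}, {'A': 100}]
```

With round-robin, each group appears twice in the collected output with partial sums (40 and 60 for A, 120 and 140 for B). With key-based partitioning each group is aggregated once, giving A = 100 and B = 260.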


    Case Study 2