DataStage Configuration File


DataStage Configuration File FAQ

1. APT_CONFIG_FILE is the environment variable through which DataStage determines the configuration file to be used (a project can have many configuration files), and this is how the file is normally selected in production. However, if this environment variable is not defined, how does DataStage determine which file to use?

   1. If the APT_CONFIG_FILE environment variable is not defined, DataStage looks for the default configuration file (config.apt) in the following locations:

      1. The current working directory.

      2. INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
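A minimal sketch of how the selection is usually made explicit; the paths below are illustrative assumptions, so substitute your own install and project locations:

    # Point the parallel engine at a specific configuration file
    export APT_CONFIG_FILE=/opt/IBM/InformationServer/Server/Configurations/4node.apt

    # If APT_CONFIG_FILE is left unset, the engine falls back to config.apt in the
    # current working directory and then in $APT_ORCHHOME/etc, as described above.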

2. What are the different options a logical node can have in the configuration file? (A sample node definition follows this list.)

   1. fastname – The physical node name that stages use to open connections for high-volume data transfers. The value of this option is usually the network name, which you can typically obtain with the Unix command 'uname -n'.

   2. pools – The names of the pools to which the node is assigned. Based on the characteristics of the processing nodes, you can group nodes into sets of pools.

      1. A pool can be associated with many nodes, and a node can be part of many pools.

      2. A node belongs to the default pool unless you explicitly specify a pools list for it and omit the default pool name ("") from that list.

      3. A parallel job, or a specific stage in a parallel job, can be constrained to run on a pool (a set of processing nodes).

         1. If both the job and a stage within the job are constrained to run on specific processing nodes, the stage runs on the nodes common to the stage constraint and the job constraint.

   3. resource – resource resource_type "location" [{pools "disk_pool_name"}] | resource resource_type "value". The resource_type can be canonicalhostname (the quoted Ethernet name of a node in a cluster that is not connected to the conductor node by the high-speed network), disk (a directory to which persistent data is read and written), scratchdisk (the quoted absolute path of a directory, local to the processing node, on a file system where intermediate data is temporarily stored), or an RDBMS-specific resource (e.g. DB2, INFORMIX, ORACLE).
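To show how these options fit together, here is a rough two-node sketch; the host names, pool names and paths are illustrative assumptions only:

    {
        node "node1"
        {
            fastname "etlhost1"
            pools ""
            resource disk "/ds/data/node1" {pools ""}
            resource scratchdisk "/ds/scratch/node1" {pools ""}
        }
        node "node2"
        {
            fastname "etlhost2"
            pools "" "sort"
            resource disk "/ds/data/node2" {pools ""}
            resource scratchdisk "/ds/scratch/node2" {pools ""}
        }
    }

Both nodes sit in the default pool "", and node2 additionally belongs to a "sort" pool so that sort-heavy stages can be constrained to it.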

3. How does DataStage decide on which processing node a stage should run?

   1. If a job or stage is not constrained to run on specific nodes, the parallel engine executes a parallel stage on all nodes defined in the default node pool (the default behavior).

   2. If the stage is constrained, the constrained processing nodes are chosen when executing the parallel stage (refer to 2.2.3 for more detail).

4. When configuring an MPP system, you specify the physical nodes in your system on which the parallel engine will run your parallel jobs; the node from which you start a job is called the conductor node. For the other nodes, you do not need to specify the physical node, and you need to copy the (.apt) configuration file only to the nodes from which you start parallel engine applications. It is possible that the conductor node is not connected to the high-speed network switch, while the other nodes are connected to each other through a very high-speed network. How do you configure your system so that you can still achieve optimized parallelism? (A sample layout follows this list.)

   1. Make sure that none of the stages are specified to run on the conductor node.

   2. Use the conductor node only to start the execution of parallel jobs.

   3. Make sure that the conductor node is not part of the default pool.
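A hedged sketch of such a layout, with hypothetical host names and paths: the conductor host appears only in its own "conductor" pool and is omitted from the default pool "", so stages run on the compute nodes while jobs are still started from the conductor.

    {
        node "conductor"
        {
            fastname "cond_host"
            pools "conductor"
            resource disk "/ds/data/cond" {pools ""}
            resource scratchdisk "/ds/scratch/cond" {pools ""}
        }
        node "node1"
        {
            fastname "compute1"
            pools ""
            resource disk "/ds/data/node1" {pools ""}
            resource scratchdisk "/ds/scratch/node1" {pools ""}
        }
        node "node2"
        {
            fastname "compute2"
            pools ""
            resource disk "/ds/data/node2" {pools ""}
            resource scratchdisk "/ds/scratch/node2" {pools ""}
        }
    }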

5. Although parallelization increases the throughput and speed of processing, why is maximum parallelization not necessarily the optimal parallelization?


   1. DataStage creates one process for every stage on each processing node. Hence, if the hardware resources are not available to support the maximum parallelization, the performance of the overall system goes down. For example, suppose we have an SMP system with three CPUs and a parallel job with four stages, and we define three logical nodes (one corresponding to each CPU). DataStage will then start 3 * 4 = 12 processes, which have to be managed by a single operating system, and significant time will be spent on context switching and process scheduling.
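For reference, a hypothetical three-logical-node SMP configuration of the kind described above gives every logical node the same fastname, because all three map to CPUs on one physical machine (host name and paths are assumptions):

    {
        node "node1"
        {
            fastname "smp_host"
            pools ""
            resource disk "/ds/data/n1" {pools ""}
            resource scratchdisk "/ds/scratch/n1" {pools ""}
        }
        node "node2"
        {
            fastname "smp_host"
            pools ""
            resource disk "/ds/data/n2" {pools ""}
            resource scratchdisk "/ds/scratch/n2" {pools ""}
        }
        node "node3"
        {
            fastname "smp_host"
            pools ""
            resource disk "/ds/data/n3" {pools ""}
            resource scratchdisk "/ds/scratch/n3" {pools ""}
        }
    }

With four stages in the job, each of these three logical nodes hosts four processes, which is where the 12 processes above come from.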

6. Since we can have different logical processing nodes, it is possible that some nodes will be more suitable for some stages while other nodes will be more suitable for other stages. So how do you decide which node is suitable for which stage? (A pool-based sketch follows this list.)

   1. If a stage performs a memory-intensive task, it should run on a node that has more memory and scratch disk space available to it. For example, sorting data is a memory-intensive task and should be run on such nodes.

   2. If a stage depends on a licensed version of software (e.g. the SAS stage, RDBMS-related stages, etc.), you need to associate that stage with a processing node that is physically mapped to the machine on which the licensed software is installed. (Assumption: the machine on which the licensed software is installed is connected to the other machines through a high-speed network.)

   3. If a job contains stages that exchange large amounts of data, they should be assigned to nodes where the stages can communicate by either shared memory (SMP) or a high-speed link (MPP) in the most optimized manner.
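One hedged way to express point 2 in the configuration file is to give the licensed host its own pool name and constrain only the relevant stage to that pool; the "sas" pool name, host names and paths below are assumptions:

    {
        node "node1"
        {
            fastname "etlhost1"
            pools ""
            resource disk "/ds/data/node1" {pools ""}
            resource scratchdisk "/ds/scratch/node1" {pools ""}
        }
        node "sasnode"
        {
            fastname "sashost"
            pools "" "sas"
            resource disk "/ds/data/sasnode" {pools ""}
            resource scratchdisk "/ds/scratch/sasnode" {pools ""}
        }
    }

The SAS-related stage would then be constrained to the "sas" node pool in its stage properties, while every other stage keeps running across the default pool.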

7. Nodes are basically a set of machines (especially in MPP systems). You start the execution of parallel jobs from the conductor node. The conductor node creates shells on the remote machines (depending on the processing nodes) and copies the same environment to them. However, it is possible to create a startup script that selectively changes the environment on a specific node. This script has the default name startup.apt, but, like the main configuration file, we can have many startup scripts; the appropriate one can be picked up using the environment variable APT_STARTUP_SCRIPT. What is the use of the APT_NO_STARTUP_SCRIPT environment variable?

   1. Using the APT_NO_STARTUP_SCRIPT environment variable, you can instruct the parallel engine not to run the startup script on the remote shells.
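A brief sketch of how these two variables are typically set; the script path is a hypothetical example:

    # Use a custom startup script instead of the default startup.apt
    export APT_STARTUP_SCRIPT=/ds/project/etc/startup_node_tuning.apt

    # Or instruct the engine not to run any startup script on the remote shells
    export APT_NO_STARTUP_SCRIPT=1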

8. What are the general guidelines one should follow when creating a configuration file so that optimal parallelization can be achieved?

   1. Consider avoiding the disk or disks that your input files reside on.

   2. Ensure that the different file systems mentioned as disk and scratchdisk resources hit disjoint sets of spindles, even if they are located on a RAID (Redundant Array of Inexpensive Disks) system.

   3. Know what is real disk and what is NFS:

      1. Real disks are directly attached, or are reachable over a SAN (storage area network - dedicated, just for storage, using low-level protocols).

      2. Never use NFS file systems for scratchdisk resources; remember that scratchdisk space is also used for temporary storage of files and data during processing.

      3. If you use NFS file system space for disk resources, you need to know what you are doing. For example, your final result files may need to be written out onto the NFS disk area, but that does not mean the intermediate data sets created and used temporarily in a multi-job sequence should use this NFS disk area. It is better to set up a "final" disk pool and constrain the result sequential file or data set to reside there, while letting intermediate storage go to local or SAN resources, not NFS. (A sketch of such a pool follows this list.)

   4. Know which disk areas are striped (RAID) and which are not. Where possible, avoid striping across areas that are already striped at the spindle level.
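A sketch of the "final" disk pool idea from point 3.3, with hypothetical local and NFS paths:

    {
        node "node1"
        {
            fastname "etlhost1"
            pools ""
            resource disk "/local/ds/data/node1" {pools ""}
            resource disk "/nfs/warehouse/results" {pools "final"}
            resource scratchdisk "/local/ds/scratch/node1" {pools ""}
        }
    }

Only the data sets or file stages explicitly constrained to the "final" disk pool land on the NFS area; everything else, including scratch space, stays on local or SAN storage.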


Creating a DataStage APT Config File - Basics

It seems many people are confused about APT_CONFIG_FILE, what nodes it should contain, and what impact it has on stages. So here is a basics series before going to the next level.

The first sample configuration file (screenshot not reproduced here) explains how to define a node pool and how to use reserved pool names, for example for DB2 or other specific partitions.

The second sample configuration file (screenshot not reproduced here) further shows how to use a pool and map it to a specific usage.


The next sample (screenshot not reproduced here) further shows how we can use pool constraints within any stage. More examples will be shown in the next post.

Finally, check that a created configuration file is correct by confirming its validity along the lines of orchadmin -check. More detailed examples will be put in this space soon to remove the confusion around the use of APT_CONFIG_FILE.
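A hedged example of that validation step, following the orchadmin -check form mentioned above; the paths are assumptions, and the exact invocation (and whether orchadmin reads APT_CONFIG_FILE or takes the file as an argument) should be confirmed against your engine version:

    # Validate the configuration file before pointing jobs at it
    export APT_ORCHHOME=/opt/IBM/InformationServer/Server/PXEngine
    export APT_CONFIG_FILE=/ds/project/configs/4node.apt
    $APT_ORCHHOME/bin/orchadmin -check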

Special Team Parameter Sets can remove some of the mystery from DataStage Parallel Job Environment Variables.

In a previous post I looked at How to Create, Use and Maintain DataStage 8 Parameter Sets. In this second of three posts on Parameter Sets I look at combining Environment Variables with Parameter Sets, and in the final post I look at User Defined Environment Variables and Parameter Sets.


Parameter Sets have the potential to make Environment Variables much easier to add to jobs and easier to use across a large number of jobs.

Environment Variables and Parameter Sets

Environment Variables are set every time you log into a computer. They are set for all Unix, Linux and Windows logins, and you can see them if you type "env" at a prompt. DataStage Parallel Jobs have a special set of Environment Variables that get added during a DataStage installation, and they are exposed through the DataStage Administrator so you can edit them more easily. You can see most of them documented in Chapter 6 of the Parallel Job Advanced Developer's Guide.

There used to be just two ways to set an environment variable for a job. 

1) In the Administrator tool you set it centrally, and that impacts every job that runs; this variable gets set in the session that starts any DataStage job (screenshot not reproduced here).


2) In the DataStage Designer you add the same environment variable to a job via the Add Job Parameter screen and use it like an override just for that job (screenshot not reproduced here).

Let's look at some of the problems with these environment variables prior to version 8:

1. Neither of those screens offers a very friendly user interface. The parameters are in a long list, they have long and technical names, and it's hard to work out how different parameters relate to each other.

2. You can bring in a DataStage guru who spends weeks fine-tuning your Environment Variable values for you in a performance testing environment - however, it only takes one dunce to come along, use Administrator to change a setting, and lose all that value.

3. It's very time consuming to add these environment variables to jobs.

4. If you use Sequence Jobs you will find yourself having to pass values through from the Sequence job level to the parallel job level in the Job Activity stage properties for every single parameter, leading to lots of time spent configuring Sequence jobs.

Parameter Sets change all this. Imagine if you could add Environment Variables to a job by choosing from a shorter list, with a group of environment variables under a name that indicates what that group of variables is trying to achieve.


By creating some "special team" Parameter Sets and adding environment variables to them, we simplify the creation and management of these values. A DataStage parallel guru sets them up at the beginning of a project, they are performance tested to verify that they work, and then all developers who follow can benefit from using those Parameter Sets. You need to recompile the job if you add or remove a Parameter Set, or a parameter from a Parameter Set, but apart from that no changes to the job are necessary.

I have created some example Parameter Sets full of Environment Variables to illustrate how this works. The first two scenarios show how to create a Parameter Set for very high and very low volume jobs. This lets you set up your project-wide variables to suit medium jobs or "all comers" and lets you override specific settings for the extremes of data volumes.

High Volume Job

The idea here is that you choose a typical high volume job, test the hell out of it using all the DataStage reporting and performance monitoring software, and then, via trial and error, tune some environment variables in a Parameter Set to deliver faster performance. You then apply that Parameter Set to all similar high volume jobs.

Testing will show whether you can use one Parameter Set for all high volume jobs or whether you need different Parameter Sets for different types of jobs - such as those that write to file versus those that write to a database.

For high volume jobs the first environment variables to look at are:

$APT_CONFIG_FILE: lets you define the biggest config file with the greatest number of nodes.


$APT_DUMP_SCORE: when switched on it creates a job run report that shows the partitioning used, the degree of parallelism, data buffering and inserted operators. Useful for finding out what your high volume job is doing.

$APT_PM_PLAYER_TIMING: this reporting option lets you see what each operator in a job is doing, especially how much data they are handling and how much CPU they are consuming.  Good for spotting bottlenecks.

One way to speed up very high volume jobs is to pre-sort the data and make sure it is not resorted in the DataStage job.  This is done by turning off auto sorting in high volume jobs:

APT_NO_SORT_INSERTION: stops the engine from automatically adding a sort operation to the start of a job that has stages needing sorted data, such as Remove Duplicates. You can also add a Sort stage to the job and set it to "Previously Sorted" to avoid this in a specific job path.

Buffering is another thing that can be tweaked. It controls how data is passed between stages; usually you just leave it alone, but on a very high volume job you might want custom settings:

APT_BUFFER_MAXIMUM_MEMORY: Sets the default value of the maximum memory buffer size.

APT_BUFFER_DISK_WRITE_INCREMENT: For systems where small to medium bursts of I/O are not desirable, the default 1 MB write-to-disk chunk size may be too small. APT_BUFFER_DISK_WRITE_INCREMENT controls this and can be set larger than 1048576 (1 MB). The setting may not exceed max_memory * 2/3.

APT_IO_MAXIMUM_OUTSTANDING: Sets the amount of memory, in bytes, allocated to a WebSphere DataStage job on every physical node for network communications. The default value is 2097152 (2 MB). When you are executing many partitions on a single physical node, this number may need to be increased.

APT_FILE_EXPORT_BUFFER_SIZE: if your high volume jobs are writing to sequential files you may be overheating your file system; increasing this value can deliver data to files in bigger chunks to combat long latency.

These are just some of the I/O and buffering settings.
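To make the idea concrete, the values a hypothetical PX_HIGH_VOLUME Parameter Set might carry are shown below as shell exports; every value is an illustrative assumption to be tuned by testing, not a recommendation:

    # Reporting switches for analysing a high volume job
    export APT_DUMP_SCORE=1               # write the job score (partitioning, operators) to the log
    export APT_PM_PLAYER_TIMING=1         # per-operator timing and CPU usage

    # Larger configuration file for the big jobs (path and node count are assumptions)
    export APT_CONFIG_FILE=/ds/project/configs/8node.apt

    # Only if the data really is pre-sorted upstream
    export APT_NO_SORT_INSERTION=1

    # Buffering tweaks - example figures only, kept within the limits described above
    export APT_BUFFER_DISK_WRITE_INCREMENT=2097152    # 2 MB write chunks instead of the default 1 MB
    export APT_IO_MAXIMUM_OUTSTANDING=4194304         # 4 MB per physical node for network I/O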

Low Volume Job

By default a low volume job will tend to run slowly on a massively scalable DataStage server. 

There are far fewer environment variables to set, as low volume jobs don't need any special configuration. Just make sure the job is not trying to partition data, as that could be overkill when you don't have a lot of data to process. Partitioning and repartitioning data on volumes of less than 1000 rows makes the job start and stop more slowly:

APT_EXECUTION_MODE: By default, the execution mode is parallel, with multiple processes. Set this variable to one of the following values to run an application in sequential execution mode: ONE_PROCESS, MANY_PROCESS or NO_SERIALIZE.

$APT_CONFIG_FILE: lets you define a config file that will run these little jobs on just one node so they don't try any partitioning and repartitioning.

$APT_IO_MAXIMUM_OUTSTANDING: when a job starts on a node it is allocated some memory for network communications - especially the partitioning and repartitioning between nodes. This is set to 2 MB by default, but when you have a squadron of very small jobs that don't partition you can reduce this size to make the job start faster and free up RAM.
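A hypothetical single-node configuration for these small jobs (host and paths are assumptions); pointing $APT_CONFIG_FILE at a file like this removes all partitioning and repartitioning:

    {
        node "node1"
        {
            fastname "etlhost1"
            pools ""
            resource disk "/ds/data/node1" {pools ""}
            resource scratchdisk "/ds/scratch/node1" {pools ""}
        }
    }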


Other Parameter Sets

You can set up all your default project Environment Variables to handle all data volumes in between.  You can still have a Parameter Set for medium volume jobs if you have specific config files you want to use. 

You might also create a ParameterSet called PX_MANY_STAGES which is for any job that has dozens of stages in it regardless of data volumes.

APT_THIN_SCORE: Setting this variable decreases the memory usage of steps with 100 operator instances or more by a noticeable amount. To use this optimization, set APT_THIN_SCORE=1 in your environment. There are no performance benefits in setting this variable unless you are running out of real memory at some point in your flow or the additional memory is useful for sorting or buffering. This variable does not affect any specific operators which consume large amounts of memory, but improves general parallel job memory handling.

This can be combined with the large volume Parameter Set in a job so you have extra configuration for high volume jobs with many stages.

You might also create a ParameterSet for a difficult type of source data file when default values don't work, e.g. PX_MFRAME_DATA:

APT_EBCDIC_VERSION: Certain operators, including the import and export operators, support the "ebcdic" property, specifying that field data is represented in the EBCDIC character set. The APT_EBCDIC_VERSION variable indicates the specific EBCDIC character set to use.

APT_IMPEXP_ALLOW_ZERO_LENGTH_FIXED_NULL: When set, allows zero length null_field value with fixed length fields. This should be used with care as poorly formatted data will cause incorrect results. By default a zero length null_field value will cause an error.

SAS is another operator that has a lot of configurable environment variables because when you are reading or writing native SAS datasets or running a SAS transformation you are handing some of the control over to SAS - these environment variables configure this interaction:

APT_HASH_TO_SASHASH: can output data hashed using sashash - the hash algorithm used by SAS.

APT_SAS_ACCEPT_ERROR: When a SAS procedure causes SAS to exit with an error, this variable prevents the SAS-interface operator from terminating. The default behavior is for WebSphere DataStage to terminate the operator with an error.

APT_NO_SAS_TRANSFORMS: WebSphere DataStage automatically performs certain types of SAS-specific component transformations, such as inserting an sasout operator and substituting sasRoundRobin for RoundRobin. Setting the APT_NO_SAS_TRANSFORMS variable prevents WebSphere DataStage from making these transformations.

You can group all known debug parameters into a single debug Parameter Set to make it easier for support to find:

APT_SAS_DEBUG: Set this to set debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.  Don't put this into your SAS Parameter Set as the support team might not be able to find it or know it exists.

APT_SAS_DEBUG_IO: Set this to set input/output debug in the SAS process coupled to the SAS stage. Messages appear in the SAS log, which may then be copied into the WebSphere DataStage log.

APT_SAS_SCHEMASOURCE_DUMP: When using SAS Schema Source, causes the command line to be written to the log when executing SAS. You use it to inspect the data contained in a -schemaSource. Set this if you are getting an error when specifying the SAS data set containing the schema source.


So a new developer who is handed a high volume job does not need to know anything about environment variables; they just need to add the right ParameterSet to the job. And if an experienced developer decides a new environment variable needs to be added to high volume jobs, they just add it to the central ParameterSet and recompile all the jobs that use it. The "Where Used" function will help identify those jobs.

ParameterSets and environment variables make a powerful combination.  ParameterSets can act as a layer that simplifies environment parameters and makes them easier to add to jobs.