
IBM WebSphere

DataStage

Introduction to DataStage 8.1

Course Contents

• Module 01: Introduction

• Module 02: Deployment

• Module 03: Administering DataStage

• Module 04: DataStage Designer

• Module 05: Creating Parallel jobs

• Module 06: Accessing Sequential data

• Module 07: Platform Architecture

• Module 08: Combining Data

• Module 09: Sorting and aggregating data

• Module 10: Transforming Data

• Module 11: Repository Functions

• Module 12: Working with Relational data

• Module 13: Metadata in Parallel Framework

• Module 14: Job Control

2/16/2010 2Training Material - Sample Copy

Course Objectives

• DataStage Clients and Server

• Setting up the parallel environment

• Importing metadata

• Building DataStage jobs

• Loading metadata into job stages

• Accessing Sequential data

• Accessing Relational data

• Introducing the Parallel framework architecture

• Transforming data

• Sorting and aggregating data

• Merging data

• Configuration files

2/16/2010 3Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 1: Introduction

2/16/2010 4Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• List and describe the uses of DataStage

• List and describe DataStage clients

• Describe the DataStage workflow

• List and compare different types of DataStage jobs

• Describe two types of parallelism exhibited by DataStage parallel jobs

2/16/2010 5Training Material - Sample Copy

What is IBM WebSphere DataStage?

• Design jobs for Extraction, Transformation, and Loading (ETL)

• Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations

• Import, export, create, and manage metadata for use within jobs

• Schedule, run, and monitor jobs all within DataStage

• Administer your DataStage development and execution environments

• Create batch (controlling) jobs

2/16/2010 6Training Material - Sample Copy

IBM Information Server

• Suite of applications, including DataStage, that:

- Shares a common repository

– DB2, by default

- Shares a common set of application services and functionality

– Provided by Metadata Server components hosted by an application server

- IBM WebSphere Application Server

– Provided services include:

• Security

• Repository

• Logging and reporting

• Metadata management

• Managed using web console clients

- Administration console

- Reporting console

2/16/2010 7Training Material - Sample Copy

IBM Information Server

2/16/2010 8Training Material - Sample Copy

Information Server Backbone

[Diagram: the Information Server backbone. WebSphere-hosted products: Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, Federation Server. Shared services: Metadata Access Services, Metadata Analysis Services, Metadata Server, Information Server Console.]

2/16/2010 9Training Material - Sample Copy

Comparison between DataStage 7.5 and DataStage 8.1

2/16/2010 10Training Material - Sample Copy

What's the Same

• DataStage Designer and Director work very much the same

- What you could do before you can still do

- Minor changes to menus and GUI

• Same job types are supported

- Parallel jobs

- Server jobs

- Mainframe jobs

- Job sequences

• All stages that existed in DataStage 7.5x are still supported

• Previous DataStage functionality is still supported

- Export DataStage components (dsx)

• Now occurs in designer

- Import Metadata

• Sequential files

• Database

• COBOL file definitions

• Job compile, execution, run-time log are the same

2/16/2010 11Training Material - Sample Copy

What’s Different Part I

• QualityStage is now embedded within the DataStage Designer

• DataStage and QualityStage are now hosted by the Metadata Server

- There is a layer of administration, logging, reporting, and security that occurs at the Metadata Server level (outside of DataStage)

• Managed by the administration and reporting consoles

- Repository is now managed by the Metadata Server and its services

• No longer a UniVerse database

• Repository model has been completely reorganized

- Installation and deployment are now more flexible (complex)

• More deployment options

• Metadata server, repository, DataStage engines can be on different machines and platforms

2/16/2010 12Training Material - Sample Copy

What's New or Different? Part II

• DataStage Manager is gone

– All Manager functionality has been moved into Designer

• DataStage permissions are implemented differently

• GUI: new icons, menu arrangements, etc.

• New stages

– Slowly Changing Dimension (SCD) stage

– Surrogate key management

– Connector stages

• Stage enhancements

– Lookup stage now supports range lookups

– Complex Flat File (CFF) stage now supports multiple record formats

– SQL builders accessible in connector stages

2/16/2010 13Training Material - Sample Copy

What's New or Different? Part III

• DataStage repository enhancements

– Flexible folder organization

– Repository search

– Enhanced DataStage components export

– Job and table definition difference reports

– Impact analysis

• New objects

– Parameter sets: Named sets of parameters

– Database connectors: Named sets of database connection property values

• New utilities

– Performance analyzer

– Resource Estimator

2/16/2010 14Training Material - Sample Copy

Components of DataStage 8.1

____________________________Module 1: Introduction

2/16/2010 15Training Material - Sample Copy

Client Logon

[Screenshot: the client logon dialog, showing the Information Server console address and the Information Server administrator ID]

2/16/2010 16Training Material - Sample Copy

Information Server Administration Console

[Screenshot: the web console, showing the Administration console and the Reporting console]

2/16/2010 17Training Material - Sample Copy

DataStage Architecture

[Diagram: DataStage clients connect to the engines (parallel engine and server engine), which share a common repository]

2/16/2010 18Training Material - Sample Copy

DataStage Administrator

2/16/2010 19Training Material - Sample Copy

DataStage Designer

New Menus / Icons

Repository Search

2/16/2010 20Training Material - Sample Copy

DataStage Director

• Nothing new here

2/16/2010 21Training Material - Sample Copy

Steps to Develop job in DataStage

• Define global and project properties in Administrator

• Import metadata into the Repository

• Designer Repository View lists the existing jobs

• Build job in Designer

• Compile job in Designer

• Run and monitor job in Director

-- Jobs can also be run in designer, but job log messages can not be viewed in designer

2/16/2010 22Training Material - Sample Copy

DataStage Project Repository

[Screenshot callouts: user-added folder, standard Jobs folder, standard Table Definitions folder]

2/16/2010 23Training Material - Sample Copy

Types of DataStage Jobs

• Parallel jobs

– Executed under control of DataStage Server runtime environment

– Built-in functionality for Pipeline and Partitioning Parallelism

– Compiled into OSH (Orchestrate Scripting Language)

– OSH executes Operators

• Executable C++ class instances

– Runtime monitoring in DataStage Director

• Job Sequences (Batch jobs, Controlling jobs)

– Master Server jobs that kick-off jobs and other activities

– Can kick-off Server or Parallel jobs

– Runtime monitoring in DataStage Director

• Server jobs (Requires Server Edition license)

– Executed by the DataStage Server Edition

– Compiled into Basic (interpreted pseudo-code)

– Runtime monitoring in DataStage Director

• Mainframe jobs (Requires Mainframe Edition license)

– Compiled into COBOL

– Executed on the Mainframe, outside of DataStage

2/16/2010 24Training Material - Sample Copy

Design Elements of Parallel Jobs

• Stages: Implemented as OSH operators (pre-built components)

• Types of Stages

• Passive stages (E and L of ETL)

– Read data

– Write data

– E.g., Sequential File, Oracle, Peek stages

• Processor (active) stages (T of ETL)

– Transform data

– Filter data

– Aggregate data

– Generate data

– Split / Merge data

– E.g., Transformer, Aggregator, Join, Sort stages

• Links

– “Pipes” through which the data moves from stage to stage

2/16/2010 25Training Material - Sample Copy

Quiz – True or False?

• DataStage Designer is used to build and compile your ETL jobs

• Manager is used to execute your jobs after you build them

• Director is used to execute your jobs after you build them

• Administrator is used to set global and project properties

2/16/2010 26Training Material - Sample Copy

Deployment of DataStage 8.1

____________________________Module 2: Deployment

2/16/2010 27Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to

• Identify the components of Information Server that need to be installed

• Describe what a deployment domain consists of

• Describe different domain deployment options

• Describe the installation process

• Start the information server

2/16/2010 28Training Material - Sample Copy

What Gets Deployed

An Information Server domain, consisting of the following:

• Metadata server, hosted by an IBM WebSphere Application Server instance

• One or more DataStage servers

– DataStage server includes both the parallel and server engines

• One DB2 UDB instance containing the repository database

• Information server clients

– Administration console

– Reporting console

– DataStage clients

• Administrator

• Designer

• Director

• Additional information server applications

– Information analyzer

– Business glossary

– Rational Data Architect

– Information services Director

– Federation Server

2/16/2010 29Training Material - Sample Copy

Deployment: Everything on One Machine

[Diagram: clients, the DB2 instance with the repository, the Metadata Server backbone, and the DataStage server all on a single machine]

• Here we have a single domain with the hosted applications all on one machine

• Additional client workstations can connect to this machine using TCP/IP

2/16/2010 30Training Material - Sample Copy

Deployment: DataStage on a Separate Machine

• Here the domain is split between two machines

- DataStage server

- Metadata Server and DB2 repository

[Diagram: the Metadata Server backbone and the DB2 instance with the repository on one machine, the DataStage server on a second machine, with clients connecting to both]

2/16/2010 31Training Material - Sample Copy

Metadata Server and DB2 on Separate Machines

• Here the domain is split between three machines

– DataStage server

– Metadata Server

– DB2 repository

[Diagram: the DataStage server, the DB2 instance with the repository, and the Metadata Server backbone each on its own machine, with clients connecting to all three]

2/16/2010 32Training Material - Sample Copy

Information Server Installation

2/16/2010 33Training Material - Sample Copy

Installation Configuration Layers

• Configuration layers include:

– Client

• DataStage and information server clients

– Engine

• DataStage and other application engines

– Domain

• Metadata server and hosted metadata server components

• Installed products domain components

– Repository

• Repository database server and database

– Documentation

• Selected layers are installed on the machine local to the installation

• Already existing components can be configured and used

– E.g., DB2, WebSphere Application Server

2/16/2010 34Training Material - Sample Copy

Information Server Start-up

– Start the Metadata Server

– From the Windows Start menu, click "Start the server" after selecting the profile to be used (e.g., default)

– From the command line, open the profile's bin directory

– Enter the startup command: startServer server1

> server1 is the default name of the application server hosting the Metadata Server

– Start the ASB agent

– From windows start menu, click “Start the agent” after selecting the information server folder

– Only required if DataStage and metadata server are on different machines.

– To begin work in DataStage, double-click on a DataStage client icon.

– To begin work in Administration and reporting consoles, double-click on the Web Console for Information Server icon
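As a command-line sketch (the install path and profile name below are assumptions; substitute your own WebSphere Application Server profile directory):

cd /opt/IBM/WebSphere/AppServer/profiles/default/bin
./startServer.sh server1        (on Windows: startServer.bat server1)

This starts the application server instance (server1) that hosts the Metadata Server.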

2/16/2010 35Training Material - Sample Copy

Starting the Metadata Server Backbone

Application Server Profiles folder Profile Start the Server

Startup command

2/16/2010 36Training Material - Sample Copy

Starting the ASB Agent

Start the agent

2/16/2010 37Training Material - Sample Copy

Testing the Installation

• You can partially test the installation by logging on to the information Server web console

2/16/2010 38Training Material - Sample Copy

Checkpoint

1. What application components make up a domain ?

2. Can a domain contain multiple DataStage servers ?

3. Does the DB2 instance and the repository database need to be on the same machine as the application server ?

4. Suppose DataStage is on a separate machine from the Application Server. What two components need to be running before you log onto DataStage?

2/16/2010 39Training Material - Sample Copy

Check point solutions

1. Metadata server hosted by the application server. One or more DataStage servers. One DB2 UDB instance containing the suite repository database

2. Yes. The DataStage servers must be on separate machines. They can be on different platforms, e.g., one server running on Windows and another running on Linux

3. No. The DB2 instance with the repository can reside on a separate machine/platform from the Application Server.

4. The Application Server and the ASB agent

2/16/2010 40Training Material - Sample Copy

IBM WebSphere DataStage 8.1

___________________________Module 3: Administering DataStage

2/16/2010 42Training Material - Sample Copy

Unit objectives

After completing this unit, you should be able to:

• Open the administrative console

• Create new user and groups

• Assign suite roles and product roles to users and groups

• Give user DataStage credentials

• Log on to DataStage Administrator

• Add a DataStage user on the permissions tab and specify the user’s role

• Specify DataStage global and project defaults

• List and describe important environment variables

2/16/2010 43Training Material - Sample Copy

Information server administrator console

• Web application for administering information server

• Used for :

– Domain management

– Session management

– Management of users and groups

– Logging management

– Scheduling management

2/16/2010 44Training Material - Sample Copy

Opening the Administrator Web Console

Information server

console address

Information

Server

administrator ID

2/16/2010 45Training Material - Sample Copy

Users and Group Management

2/16/2010 46Training Material - Sample Copy

User and group Management

• Suite authorization can be provided to users or groups

– Users that are members of a group acquire the authorizations of the group

• Authorizations are provided in the form of roles

– Two types of roles

• Suite roles: Apply to the suite

• Suite component roles: Apply to a specific product or component of Information Server, e.g., DataStage

• Suite roles

– Administrator

• Perform user and group management tasks

• Includes all the privileges of the Suite user role

– User

• Create views of scheduled tasks and logged messages

• Create and run reports

2/16/2010 47Training Material - Sample Copy

User and group Management (cont…)

• Suite component roles

– DataStage

• DataStage user

– Permissions are assigned within DataStage

> Developer, Operator, Super Operator, Production Manager

• DataStage administrator

• Full permission to work in DataStage Administrator, Designer and Director

– And so on for all products in the Suite

2/16/2010 48Training Material - Sample Copy

Creating a DataStage user ID

Administration console

Create new user

Users

2/16/2010 49Training Material - Sample Copy

Assigning DataStage Roles

Users

User ID Assign Suite role

Assign DataStage

User role

2/16/2010 50Training Material - Sample Copy

DataStage Credential Mapping

• Users given DataStage Administrator or DataStage user product roles in the suite administrator console do not automatically receive DataStage credentials

– Users with DataStage Administrator roles need to be mapped to a valid user on the DataStage server machine

• These DataStage users must have file access permissions to the DataStage engine/project files or Administrator rights on the operating system

– Users with DataStage user roles need to be mapped to a valid user on the DataStage server machine and need additional DataStage assigned permissions (developer or operator…)

2/16/2010 51Training Material - Sample Copy

DataStage Credential Mapping

Assign DataStage

User role

2/16/2010 52Training Material - Sample Copy

DataStage 8.1 Administrator

2/16/2010 53Training Material - Sample Copy

Module Objectives

• Setting project properties in Administrator

• Defining Environment Variables

• Importing / Exporting DataStage objects in Manager

• Importing Table Definitions defining sources and targets in Manager

2/16/2010 54Training Material - Sample Copy

Logging on to Administrator

[Screenshot callouts: host name and port number of the application server; DataStage administrator ID and password; name or IP address of the DataStage server machine]

2/16/2010 55Training Material - Sample Copy

Setting Project Properties

• Project Properties

• Projects can be created and deleted in Administrator

• Each project is associated with a directory on the DataStage Server

• Project properties, defaults, and environment variables are specified in Administrator; they can be overridden at the job level

2/16/2010 56Training Material - Sample Copy

Setting Project Properties

• To set project properties, log onto Administrator, select your project, and then click “Properties”

[Screenshot callouts: click to specify project properties; server projects; link to the Information Server administration console]

2/16/2010 57Training Material - Sample Copy

Project Properties General Tab

Enable Runtime

column Propagation

Specify auto-purge

Environment variable

setting

2/16/2010 58Training Material - Sample Copy

Environment Variables

User-defined

variables

Parallel job variables

2/16/2010 59Training Material - Sample Copy

Environment reporting variables

Display score

Display OSH

Display record counts
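As an illustration (confirm the exact names in your release under the Parallel category of environment variables): $APT_DUMP_SCORE writes the job score to the log, and $APT_RECORD_COUNTS logs the record counts processed per operator per partition.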

2/16/2010 60Training Material - Sample Copy

Permissions Tab

2/16/2010 61Training Material - Sample Copy

Adding users and Groups

Available users /

groups. Must have

a DataStage

product User role

Add DataStage

users

2/16/2010 62Training Material - Sample Copy

Specify DataStage Role

Added DataStage

user

Select DataStage

role

2/16/2010 63Training Material - Sample Copy

Tracing Tab

2/16/2010 64Training Material - Sample Copy

Parallel Tab

2/16/2010 65Training Material - Sample Copy

Sequence Tab

2/16/2010 66Training Material - Sample Copy

Check Point

1. Authorizations can be assigned to what two items?

2. What two types of authorization roles can be assigned to a user or group?

3. In addition to suite authorization to log on to DataStage, what else does a DataStage developer require to work in DataStage?

2/16/2010 67Training Material - Sample Copy

Check Point Solutions

1. Users and Groups. Members of the group acquire the authorizations of the group

2. Suite roles and Product roles

3. Must be mapped to a user with DataStage credentials

2/16/2010 68Training Material - Sample Copy

IBM WebSphere DataStage 8.1

___________________________Module 4: DataStage Designer

2/16/2010 69Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• Logon to DataStage

• Navigate around DataStage Designer

• Import and Export DataStage objects in to file

• Import a table definition for a Sequential file

2/16/2010 70Training Material - Sample Copy

Logging on to DataStage Designer

Host name, port

number of

application server

DataStage server

machine / project

2/16/2010 71Training Material - Sample Copy

Designer Work Area

[Screenshot callouts: menus, toolbar, repository window, parallel canvas, palette]

2/16/2010 72Training Material - Sample Copy

Importing and Exporting DataStage Objects

2/16/2010 73Training Material - Sample Copy

Importing and Exporting DataStage Objects

• What Is Metadata?

2/16/2010 74Training Material - Sample Copy

Repository Window

Default jobs folder

Default table

definitions folder

Search for objects

in the project

Project

2/16/2010 75Training Material - Sample Copy

Repository Contents

• Metadata

• Describing sources and targets: Table definitions

• Describing inputs / outputs from external routines

• Describing inputs and outputs to BuildOp and CustomOp stages

• DataStage objects

• Jobs

• Routines

• Compiled jobs / objects

• Stages

2/16/2010 76Training Material - Sample Copy

Import and Export

• Any object in Repository can be exported to a file

• Can export whole projects

• Use for backup

• Sometimes used for version control

• Can be used to move DataStage objects from one project to another

• Use to share DataStage jobs and projects with other developers

2/16/2010 77Training Material - Sample Copy

Export Procedure

• In Repository Export Window, click “Export>DataStage Components”

• Select DataStage objects for export

• Specify type of export:

– DSX: Default format

– XML: Enables processing of the export file by XML applications, e.g., for generating reports

• Specify file path on client machine

2/16/2010 78Training Material - Sample Copy

Export Window

Selected objects

Default jobs folder

Click to select

objects from the

repository

Begin export

Export type

2/16/2010 79Training Material - Sample Copy

Import Procedure

• In Repository, click “Import>DataStage Components”

– Or "Import>DataStage Components (XML)" if you are importing an XML-format export file

• Select DataStage objects for import

2/16/2010 80Training Material - Sample Copy

Import Options

[Screenshot callouts: import all objects in the file; display list to select from]

2/16/2010 81Training Material - Sample Copy

Importing Table definitions

• Table definitions describe the format and columns of files and tables

– Import format and column definitions from sequential files

– Import relational table column definitions

– Import COBOL files and Many other things

• Table definitions can be loaded into job stages

• Table definitions can be used to define Routine and Stage interfaces

2/16/2010 82Training Material - Sample Copy

Sequential File Import Procedure

• In Designer, click Import>Table Definitions>Sequential File Definitions

• Select directory containing sequential file and then the file

• Examine the format and column definitions and edit if necessary

2/16/2010 83Training Material - Sample Copy

Sequential Import Window

Select repository folder

Select File

Start import

Select directory containing files

2/16/2010 84Training Material - Sample Copy

Specify Format

Edit Columns

Delimiter

Select if first row

has column names

2/16/2010 85Training Material - Sample Copy

Edit Column Names and Types

Double-click to define extended properties

2/16/2010 86Training Material - Sample Copy

Extended Properties window

Available properties

Property categories

2/16/2010 87Training Material - Sample Copy

Table Definition General Tab

SourceType

Stored table

definition

2/16/2010 88Training Material - Sample Copy

Quiz - True or False?

• You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.

• The directory to which you export is on the DataStage client machine, not on the DataStage server machine.

2/16/2010 89Training Material - Sample Copy

IBM WebSphere DataStage 8.1

___________________________Module 5: Creating Parallel Jobs

2/16/2010 90Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• Design a simple Parallel job in Designer

• Define a job parameter

• Compile your job

• Run your job in Director

• View the job log

2/16/2010 91Training Material - Sample Copy

Creating Parallel Jobs

• What Is a Parallel Job?

– Executable DataStage program

– Created in DataStage Designer

– Can use components from Manager Repository

– Built using a graphical user interface

– Compiles into Orchestrate shell language (OSH) and object code (from generated C++)

2/16/2010 92Training Material - Sample Copy

Job Development Overview

• Import metadata defining sources and targets

-- Can be done within Designer or Manager

• In Designer, add stages defining data extractions and loads

• Add processing stages to define data transformations

• Add links defining the flow of data from sources to targets

• Compile the job

• In Director, validate, run, and monitor your job

-- Can also run the job in Designer

-- Can only view the job log in Director

2/16/2010 93Training Material - Sample Copy

Logging on to DataStage Designer

Host name, port number of

application server

DataStage server machine /

project

2/16/2010 94Training Material - Sample Copy

Designer Work Area

RepositoryMenus Toolbar

Parallel

canvas

Palette

2/16/2010 95Training Material - Sample Copy

Designer Toolbar

• Provides quick access to the main functions of Designer

Toolbar

2/16/2010 96Training Material - Sample Copy

Tools Palette

[Screenshot callouts: stage categories; stages]

2/16/2010 97Training Material - Sample Copy

Adding Stages and Links

• Drag stages from the Tools Palette to the diagram

-- Can also be dragged from Stage Type branch to the

diagram

• Draw links from source to target stage

--Right mouse over source stage

--Release mouse button over target stage

2/16/2010 98Training Material - Sample Copy

Job Creation Example Sequence

• Brief walkthrough of procedure

• Assumes table definition of source already exists in the repository

2/16/2010 99Training Material - Sample Copy

Create New Parallel Job

Open new

window

Parallel job

2/16/2010 100Training Material - Sample Copy

Drag Stages and Links From Palette

Row Generator

Peek

Compile

Job

properties

2/16/2010 101Training Material - Sample Copy

Renaming Links and Stages

• Click on a stage or link to rename it

• Meaningful names have many

benefits

– Documentation

– Clarity

– Fewer development errors

2/16/2010 102Training Material - Sample Copy

DataStage Designer Stages

2/16/2010 103Training Material - Sample Copy

Row Generator Stage

• Produces mock data for specified columns

• No inputs link; single output link

• On Properties tab, specify number of rows

• On the Columns tab, load or specify column definitions

- Click Edit Row over a column to specify the values to be generated for that column

- A number of algorithms for generating values are available depending on the data type

• Algorithms for Integer type

- Random: seed, limit

- Cycle: initial value, increment

• Algorithms for string type: Cycle, alphabet

• Algorithms for date type: Random, Cycle
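For example, an Integer column using the Cycle algorithm with an initial value of 1 and an increment of 1 generates 1, 2, 3, ... on successive rows, while Random with a limit of 100 generates pseudo-random integers between 0 and 100.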

2/16/2010 104Training Material - Sample Copy

Inside the Row Generator Stage

Properties tab

Set Property

value

Property

2/16/2010 105Training Material - Sample Copy

Columns Tab

Select table

Definition

View Data

Load a Table

Definition

2/16/2010 106Training Material - Sample Copy

Extended Properties

Specified Properties

and their values

Additional

properties to add

2/16/2010 107Training Material - Sample Copy

Peek Stage

• Displays field values

- Displayed in the job log or sent to an output link

- Skip records option

- Can control number of records to be displayed

- Shows data in each partition, labeled 0, 1, 2, …

• Useful stub stage for iterative job development

- Develop job to a stopping point and check the data

2/16/2010 108Training Material - Sample Copy

Peek Stage Properties

Output to Job log

2/16/2010 109Training Material - Sample Copy

Job Parameters

• Defined in Job Properties window

• Makes the job more flexible

• Parameters can be:

- Used in directory and file names

- Used to specify property values

- Used in constraints and derivations

• Parameter values are determined at run time

• When used for directory and file names or property values, surround with pound signs (#)

- E.g., #NumRows#

• Job parameters can reference DataStage and system environment variables

- Prefaced by $, e.g., $APT_CONFIG_FILE
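As a brief illustration (the directory, file, and parameter names are hypothetical), a Sequential File stage's File property could be set to

#SourceDir#/customers_#RunDate#.txt

so that both the directory and the date portion of the file name are resolved from job parameters at run time.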

2/16/2010 110Training Material - Sample Copy

Defining a Job Parameter

Parameter Tab

Parameter

2/16/2010 111Training Material - Sample Copy

Using a Job Parameter in a Stage

Job Parameter

surrounded with

pound signs

2/16/2010 112Training Material - Sample Copy

Adding Job Documentation

• Job Properties

- Short and long descriptions

- Shows in Manager

• Annotation stage

- Added from the Tools Palette

- Display formatted text descriptions on diagram

2/16/2010 113Training Material - Sample Copy

Job Properties Documentation

Documentation

2/16/2010 114Training Material - Sample Copy

Annotation Stage Properties

2/16/2010 115Training Material - Sample Copy

Compiling a Job

[Screenshot callout: Compile button]

2/16/2010 116Training Material - Sample Copy

Error or Successful Message

[Screenshot callouts: highlight stage with error; click for more info]

2/16/2010 117Training Material - Sample Copy

Running Jobs and Viewing the Job Log in Designer

2/16/2010 118Training Material - Sample Copy

Prerequisite to Job Execution

2/16/2010 119Training Material - Sample Copy

DataStage Director

• Use to run and schedule jobs

• View runtime messages

• Can invoke from DataStage Designer- Tools > Run Director

2/16/2010 120Training Material - Sample Copy

Run Options

Stop after number

of warnings

Stop after number

of rows

2/16/2010 121Training Material - Sample Copy

Run Options

2/16/2010 122Training Material - Sample Copy

Job Status View

2/16/2010 123Training Material - Sample Copy

Job Log View

Click the open-book icon to view log messages

[Screenshot callout: Peek messages]

2/16/2010 124Training Material - Sample Copy

Message Details

2/16/2010 125Training Material - Sample Copy

Other Director Options

• Schedule job to run on a particular date/time

• Clear job log of messages

• Set job log purging conditions

• Set Director options- Row limits

- Abort after x warnings

2/16/2010 126Training Material - Sample Copy

Running Jobs From Command Line

• dsjob -run -param numrows=10 dx444 GenDataJob

- Runs a job

- Use -run to run the job

- Use -param to specify parameters

- In this example, dx444 is the name of the project

- In this example, GenDataJob is the name of the job

• dsjob -logsum dx444 GenDataJob

- Displays a job's messages in the log

• Documented in “Parallel Job Advanced Developer’s Guide”
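A slightly fuller sketch (the -mode and -wait options are standard dsjob options, but verify them against your release's documentation; the project and job names are the sample ones above):

dsjob -run -mode NORMAL -param numrows=10 -wait dx444 GenDataJob
dsjob -logsum dx444 GenDataJob

The first command runs the job synchronously with the supplied parameter value; the second prints its log summary after the run completes.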

2/16/2010 127Training Material - Sample Copy

Check Points

1. Which stage can be used to display output data in the job log?

2. Which stage is used for documenting your job on the job Canvas?

3. What command is used to run jobs from the operating system command line?

2/16/2010 128Training Material - Sample Copy

Check Point Solution

1. Peek stage

2. Annotation stage

3. dsjob -run

2/16/2010 129Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 6: Accessing Sequential Data

2/16/2010 130Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Understand the stages for accessing different kinds of sequential data

• Sequential File stage

• Complex Flat File stage

• Create jobs that read from and write to sequential files

• Create reject Link

• Work with NULLs in sequential files

• Read from multiple files using file patterns

• Use multiple readers

2/16/2010 131Training Material - Sample Copy

Types of Sequential Data stages

• Sequential

Fixed or variable length

• Data Set

• Complex Flat File

2/16/2010 132Training Material - Sample Copy

How Sequential Data is Handled

• Import and export operators are generated

– Stages get translated into operators during the compile

• Import operators convert data from the external format, as described by the Table Definition, to the framework's internal format

– Internally, the format of data is described by schemas

• Export operators reverse the process

• Messages in the job log use the "import" / "export" terminology

– E.g., "100 records imported successfully; 2 rejected"

– E.g., "100 records exported successfully; 0 rejected"

– Records get rejected when they cannot be converted correctly during the import or export

2/16/2010 133Training Material - Sample Copy

Using the Sequential File Stage

• Both import and export of general files (text, binary) are performed by the sequential file stage

2/16/2010 134Training Material - Sample Copy

Features of Sequential File Stage

• Normally executes in sequential mode

• Executes in parallel when reading multiple files

• Can use multiple readers within a node

Reads chunks of a single file in parallel

• The stage needs to be told:

How file is divided into rows (record format)

How row is divided into columns (column format)
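For instance (the property name is as it appears on the stage; the value is illustrative), setting Number Of Readers Per Node = 4 on a single large fixed-length file lets four reader processes each scan a different chunk of the file in parallel.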

2/16/2010 135Training Material - Sample Copy

File Format Example

2/16/2010 136Training Material - Sample Copy

Sequential File Stage Rules

• One input link

• One stream output link

• Optionally, one reject link

– Will reject any records not matching metadata in the column definitions

– Example: You specify three columns separated by commas, but the row that’s read had no commas in it

2/16/2010 137Training Material - Sample Copy

Job Design using Sequential stages

2/16/2010 138Training Material - Sample Copy

Sequential Source Columns Tab

2/16/2010 139Training Material - Sample Copy

Input Sequential Stage Properties

2/16/2010 140Training Material - Sample Copy

Format Tab

2/16/2010 141Training Material - Sample Copy

Reading Using a File Pattern
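The slide's screenshot is not reproduced here. As an illustration (the path is hypothetical), setting Read Method = File Pattern with File Pattern = /data/source/cust_*.txt makes the stage read every file matching the pattern.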

2/16/2010 142Training Material - Sample Copy

Properties –Multiple Readers

2/16/2010 143Training Material - Sample Copy

Sequential Stage As a Target

2/16/2010 144Training Material - Sample Copy

Reject Link

• Reject mode =

– Continue: Continue reading records

– Fail: Abort job

– Output: Send down output link

• In a source stage

– All records not matching the metadata (column definitions) are rejected

• In a target stage

– All records that fail to be written for any reason

• Rejected records consist of one column, datatype = raw

2/16/2010 145Training Material - Sample Copy

Inside the Copy Stage

2/16/2010 146Training Material - Sample Copy

Reading and Writing Null Values

2/16/2010 147Training Material - Sample Copy

Working with NULLS

• Internally, NULL is represented by a special value outside the range of any existing, legitimate values

• If NULL is written to a non-nullable column, the job will abort

• Columns can be specified as nullable

– NULLs can be written to nullable columns

• You must "handle" NULLs written to non-nullable columns in a Sequential File stage

– You need to tell DataStage what value to write to the file

– Unhandled rows are rejected

• In a Sequential source stage, you can specify values you want DataStage to convert to NULLs
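As an illustration (the property name is the one on the stage's Format tab; the value is an assumption), adding the field-level property Null field value = NULL on a target Sequential File stage writes nulls as the literal string NULL; on a source stage the same property tells DataStage which incoming value to convert to NULL.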

2/16/2010 148Training Material - Sample Copy

Specifying a Value For Null

2/16/2010 149Training Material - Sample Copy

Dataset Stage

2/16/2010 150Training Material - Sample Copy

Dataset

• Binary data file

• Preserves partitioning

– Component dataset files are written on each partition

• Suffixed by .ds

• Referred to by a header file

• Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)

• Represents persistent data

• Key to good performance in a set of linked jobs

• No import / export conversions are needed

• No repartitioning needed

• Accessed using the Data Set stage

• Implemented with two types of components:

– Descriptor file: contains metadata and data location, but NOT the data itself

– Data file(s): contain the data

• Multiple files, one per partition (node)

2/16/2010 151Training Material - Sample Copy

Job with Dataset Stage

2/16/2010 152Training Material - Sample Copy

Displaying Data and Schema

2/16/2010 153Training Material - Sample Copy

Data and Schema Display

2/16/2010 154Training Material - Sample Copy

File Set

• Use to read and write to filesets

• Files suffixed by .fs

• Files are similar to a dataset

– Partitioned

– Implemented with header file and data files

• How filesets differ from datasets

– Data files are text files

• Hence readable by external applications

– Datasets have a proprietary data format which may change in future DataStage versions

2/16/2010 155Training Material - Sample Copy

Checkpoint

1. List three types of file data

2. What makes datasets perform better than other types of files in parallel jobs

3. What is the difference between a data set and a file set?

2/16/2010 156Training Material - Sample Copy

Checkpoint solutions

1. Sequential, dataset, complex flat files

2. They are partitioned and they store data in the native parallel format

3. Both are partitioned. Data sets store data in a binary format not readable by user applications. File sets are readable.

2/16/2010 157Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 7: Platform Architecture

2/16/2010 158Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Describe parallel processing architecture

• Describe pipeline parallelism

• Describe partition parallelism

• List and describe partitioning and collecting algorithms

• Describe configuration files

• Describe the parallel job compilation process

• Explain OSH

• Explain the Score

2/16/2010 159Training Material - Sample Copy

Key EE Concepts

2/16/2010 160Training Material - Sample Copy

Scalable Hardware Environments

2/16/2010 161Training Material - Sample Copy

Pipeline Parallelism

• Transform, clean, and load processes execute simultaneously

• Like a conveyor belt moving rows from process to process

– Start the downstream process while the upstream process is running

• Advantages:

– Reduces disk usage for staging areas

– Keeps processors busy

• Still has limits on scalability

2/16/2010 162Training Material - Sample Copy

Partition Parallelism

• Divide the incoming stream of data into subsets to be separately processed by an operation

– Subsets are called partitions (nodes)

• Each partition of data is processed by the same operation

– E.g., if the operation is Filter, each partition will be filtered in exactly the same way

• Facilitates near-linear scalability

– 8 times faster on 8 processors

– 24 times faster on 24 processors

– This assumes the data is evenly distributed

2/16/2010 163Training Material - Sample Copy

Three-Node Partitioning

• Here the data is partitioned into three partitions

• The operation is performed on each partition of data separately and in parallel

• If the data is evenly distributed, the data will be processed three times faster

2/16/2010 164Training Material - Sample Copy

EE Combines Partitioning and Pipelining

Within EE, pipelining, partitioning, and repartitioning are automatic. The job developer only identifies:

• Sequential vs. parallel operation (by stage)

• Method of data partitioning

• Configuration file (which identifies resources)

• Advanced stage options (buffer tuning, operator combining, etc.)

2/16/2010 165Training Material - Sample Copy

Job Design V. Execution

2/16/2010 166Training Material - Sample Copy

Configuration File

• The configuration file separates configuration (hardware / software) from job design

– Specified per job at runtime by $APT_CONFIG_FILE

– Change hardware and resources without changing the job design

• Defines the number of nodes (logical processing units) with their resources (need not match physical CPUs)

– Dataset, scratch, and buffer disk (file systems)

– Optional resources (Database, SAS, etc.)

– Advanced resource optimizations: "pools" (named subsets of nodes)

• Multiple configuration files can be used at runtime

– Optimizes overall throughput and matches job characteristics to overall hardware resources

– Allows runtime constraints on resource usage on a per-job basis

2/16/2010 167Training Material - Sample Copy

Example configuration File
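The slide's screenshot is not reproduced in this transcript. As a minimal sketch (host name and directory paths are placeholders), a two-node configuration file pointed to by $APT_CONFIG_FILE looks like this:

{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}

Adding nodes (or pointing $APT_CONFIG_FILE at a different file) changes the degree of parallelism without any change to the job design.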

2/16/2010 168Training Material - Sample Copy

Partitioning and Collecting

2/16/2010 169Training Material - Sample Copy

Partitioning and Collecting

• Partitioning breaks incoming rows into sets (partitions) of rows

• Each partition of rows is processed separately by the stage/operator

If the hardware and configuration file supports parallel processing, partitions of rows will be processed in parallel

• Collecting returns partitioned data back to a single stream

• Partitioning / Collecting occurs on stage Input links

• Partitioning / Collecting is implemented automatically

Based on stage and stage properties

How the data is partitioned / collected can be specified

2/16/2010 170Training Material - Sample Copy

Partitioning / Collecting Algorithms

• Partitioning algorithms include:

– Round robin

– Hash: Determine partition based on key value (requires key specification)

– Entire: Send all rows down all partitions

– Same: Preserve the existing partitioning

– Auto: Let DataStage choose the algorithm

• Collecting algorithms include:

– Round robin

– Sort Merge: Read in by key; presumes data is sorted by the key in each partition; builds a single sorted stream based on the key order

– Ordered: Read all records from the first partition, then the second, ...

2/16/2010 171Training Material - Sample Copy

Keyless vs. Keyed Partitioning Algorithms

• Keyless: Rows are distributed independently of data values

– Round Robin

– Entire

– Same

• Keyed: Rows are distributed based on values in the specified key

– Hash: Partition based on key. Example: the key is State; all "CA" rows go into the same partition and all "MA" rows go into the same partition. Two rows with the same state never go into different partitions

– Modulus: Partition based on the modulus of the key divided by the number of partitions; the key is a numeric type. Example: the key is OrderNumber (numeric); rows with the same order number will all go into the same partition

– DB2: Matches DB2 EEE partitioning

2/16/2010 172Training Material - Sample Copy

Round Robin and Random Partitioning

• Keyless partitioning methods

• Rows are evenly distributed across partitions

• Good for the initial import of data if no other partitioning is needed

• Useful for redistributing data

• Fairly low overhead

• Round Robin assigns rows to partitions like dealing cards

– The row/partition assignment will be the same for a given $APT_CONFIG_FILE

• Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

2/16/2010 173Training Material - Sample Copy

ENTIRE Partitioning

• Each partition gets a complete copy of the data

• Useful for distributing lookup and reference data

• May have performance impact on MPP / clustered environment

• On SMP platforms, Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data

• On MPP platforms, each server uses shared memory for a single local copy.

• ENTIRE is the default partitioning for Lookup reference links with “Auto” Partitioning.

• On SMP platforms, it is good practice to set this explicitly on the Normal Lookup reference link(s)

2/16/2010 174Training Material - Sample Copy

HASH Partitioning

• Keyed partitioning method

• Rows are distributed according to the values in the key column(s)

• Guarantees that rows with same key values go into the same partition

• Needed to prevent matching rows from “hiding” in other partitions

• E.g. Join, Merge

• Remove Duplicate

• Partition distribution is relatively equal if the data across the source key columns is evenly distributed

2/16/2010 175Training Material - Sample Copy

Modulus Partitioning

• Keyed partitioning method: rows are distributed according to the values in one integer key column

• Uses the modulus:

– partition = MOD(key_value, number of partitions)

– For example, with 4 partitions a key value of 10 goes to partition 10 MOD 4 = 2

• Faster than HASH

• Guarantees that rows with identical key values go in the same partition

• Partition size is relatively equal if the data within the key column is evenly distributed

2/16/2010 176Training Material - Sample Copy

Auto Partitioning

• DataStage inserts partition components as necessary to ensure correct results

– Before any stage with "Auto" partitioning

– Generally chooses ROUND ROBIN or SAME

– Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)

– Inserts ENTIRE on Normal (not Sparse) Lookup reference links

• Not always appropriate for MPP/clusters

• Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed

– DataStage has no visibility into Transformer logic

– Hash is required before Sort and Aggregator stages

– DataStage sometimes inserts un-needed partitioning; check the log

2/16/2010 177Training Material - Sample Copy

Partitioning Requirements for Related Records

• Misplaced records

– Using the Aggregator stage to sum customer sales by customer number

– If there are 25 customers, 25 records should be output

– But suppose records with the same customer numbers are spread across partitions

– This will produce more than 25 groups (records)

– Solution: Use the hash partitioning algorithm

• Partition imbalances

– If all the records are going down only one of the nodes, then the job is in effect running sequentially

2/16/2010 178Training Material - Sample Copy

Unequal Distribution Example

2/16/2010 179Training Material - Sample Copy

Partitioning / Collecting Link Icons

2/16/2010 180Training Material - Sample Copy

More Partitioning Icons

2/16/2010 181Training Material - Sample Copy

Partitioning Tab

2/16/2010 182Training Material - Sample Copy

Collecting Specification

2/16/2010 183Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 8: Combining Data

2/16/2010 184Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Combine data using the Lookup stage

• Define range lookups

• Combine data using Merge stage

• Combine data using the Join stage

• Combine data using the Funnel stage

2/16/2010 185Training Material - Sample Copy

Combining Data

Ways to combine data:

• Horizontally: Multiple input links; one output link made of columns from different input links

– Join

– Lookup

– Merge

• Vertically: One input link and one output link, combining groups of related records into a single record

– Aggregator

– Remove Duplicates

• Funneling: Multiple input streams funneled into a single output stream

– Funnel stage

2/16/2010 186Training Material - Sample Copy

Lookup, Merge, Join Stages

• These stages combine two or more input links

– Data is combined by designated "key" column(s)

• These stages differ mainly in:

– Memory usage

– Treatment of rows with unmatched key values

– Input requirements (sorted, de-duplicated)

2/16/2010 187Training Material - Sample Copy

Not all Links are Created Equal

• DataStage distinguishes between:

- The primary input (framework port 0)

- Secondary inputs, in some cases called "reference" links (other framework ports)

• Conventions:

• Tip: Check the "Link Ordering" tab to make sure the intended primary is listed first.

2/16/2010 188Training Material - Sample Copy

Lookup Stage

2/16/2010 189Training Material - Sample Copy

Lookup Features

• One Stream Input link (Source)• Multiple Reference links (Lookup files)• One output link• Optional Reject link

Only one per Lookup stage, regardless of number of reference links

• Lookup Failure optionsContinue, Drop, Fail, Reject

• Can return multiple matching rows• Hash tables are built in memory from the lookup files

Indexed by keyShould be small enough to fit into physical memory

2/16/2010 190Training Material - Sample Copy

Lookup Types

• Equality match

- Match values in the lookup key column of the reference link exactly to selected values in the source row

- Return the row or rows (if multiple matches are to be returned) that match

• Caseless match

- Like an equality match except that it is caseless; e.g., "abc" matches "AbC"

• Range on the reference link

- Two columns on the reference link define the range

- A match occurs when a selected value in the source row is within the range

• Range on the source link

- Two columns on the source link define the range

- A match occurs when a selected value in the reference link is within the range

2/16/2010 191Training Material - Sample Copy

The Lookup Stage

• Uses one or more key columns as an index into a table

• Usually contains other values associated with each key.

• The lookup table is created in memory before any lookup source rows are processed

2/16/2010 192Training Material - Sample Copy

Lookup from Sequential File Example

2/16/2010 193Training Material - Sample Copy

Lookup Stage With an Equality Match

2/16/2010 194Training Material - Sample Copy

Handling Lookup Failures

2/16/2010 195Training Material - Sample Copy

Lookup Failure Actions

• If the lookup fails to find a matching key column, one of these actions can be taken:

- Fail: The Lookup stage reports an error and the job fails immediately. This is the default.

- Drop: The input row with the failed lookup(s) is dropped

- Continue: The input row is transferred to the output, together with the successful table entries. The failed table entry(s) are not transferred, resulting in either default output values or null output values.

- Reject: The input row with the failed lookup(s) is transferred to a second output link, the "reject" link

• There is no option to capture unused table entries

- Compare with the Join and Merge stages

2/16/2010 196Training Material - Sample Copy

Lookup Stage Behavior

• We shall first use the simplest case, optimal input

- Two input links: "source" as primary, "lookup" as secondary

- Sorted on the key column (here "Citizen")

- Without duplicates on the key

2/16/2010 197Training Material - Sample Copy

Lookup Stage

2/16/2010 198Training Material - Sample Copy

The Lookup Stage

• Lookup tables should be small enough to fit into physical memory

• On an MPP, you should partition the lookup tables using the Entire partitioning method, or partition them by the same hash key as the source link

- Entire results in multiple copies (one for each partition)

• On an SMP, choose Entire or accept the default (which is Entire)

- Entire does not result in multiple copies, because memory is shared

2/16/2010 199Training Material - Sample Copy

Designing a Range Lookup Job

2/16/2010 200Training Material - Sample Copy

Range Lookup Job

2/16/2010 201Training Material - Sample Copy

Range on Reference Link

2/16/2010 202Training Material - Sample Copy

Selecting the Stream column

2/16/2010 203Training Material - Sample Copy

Range on Expression Editor

2/16/2010 204Training Material - Sample Copy

Range on Stream Link

2/16/2010 205Training Material - Sample Copy

Specifying the Range Lookup

2/16/2010 206Training Material - Sample Copy

Range Expression Editor

2/16/2010 207Training Material - Sample Copy

Join Stage

2/16/2010 208Training Material - Sample Copy

The Join Stage

• Four types:

– Inner

– Left outer

– Right outer

– Full outer

• 2 or more sorted input links, 1 output link

– "left" on primary input, "right" on secondary input

– Pre-sorting makes joins "lightweight": few rows need to be in RAM

• Follow the RDBMS-style relational model

– Cross-products in case of duplicates

– Matching entries are reusable for multiple matches

– Non-matching entries can be captured (Left, Right, Full)

• No fail/reject option for missed matches

2/16/2010 209Training Material - Sample Copy

Job with Join Stage

2/16/2010 210Training Material - Sample Copy

Join Stage Editor

• Link order: Immaterial for inner and full outer joins, but very important for left/right outer joins

• Multiple key columns allowed

• One of four variants: Inner, Left Outer, Right Outer, Full Outer

2/16/2010 211Training Material - Sample Copy

Join Stage Behavior

• We shall first use the simplest case, optimal input:

– Two input links: "left" as primary, "right" as secondary

– Sorted on the key column (here "Citizen")

– Without duplicates on the key

2/16/2010 212Training Material - Sample Copy

Inner Join

• Transfers rows from both data sets whose key columns contain equal values to the output link

• Treats both inputs symmetrically

Output of inner join on key Citizen

2/16/2010 213Training Material - Sample Copy

Left Outer Join

• Transfers all values from the left link and transfers values from the right link only where key columns match.

2/16/2010 214Training Material - Sample Copy

Left Outer Join

Check the Link Ordering tab to make sure the intended primary is listed first

2/16/2010 215Training Material - Sample Copy

Right Outer Join

• Transfers all values from the right link and transfers values from the left link only where key columns match

2/16/2010 216Training Material - Sample Copy

Full Outer Join

• Transfers rows from both data sets, whose key columns contain equal values, to the output link.

• It also transfers rows, whose key columns contain unequal values, from both input links to the output link.

• Treats both input symmetrically.

• Creates new columns, with new column names!

2/16/2010 217Training Material - Sample Copy

Merge Stage

2/16/2010 218Training Material - Sample Copy

Merge Stage

• Similar to Join stage

• Input links must be sorted

– Master link and one or more secondary links

– Master must be duplicate-free

• Light-weight

– Little memory required, because of the sort requirement

• Unmatched master rows can be kept or dropped

• Unmatched rows from secondary links can be captured in a reject link

2/16/2010 219Training Material - Sample Copy

Merge Stage Job

2/16/2010 220Training Material - Sample Copy

The Merge Stage

• Allows composite keys

• Multiple update links

• Matched update rows are consumed

• Unmatched updates in input port n can be captured in output port n

• Lightweight

2/16/2010 221Training Material - Sample Copy

Stage Editor

2/16/2010 222Training Material - Sample Copy

Comparison: Joins, Lookup, Merge
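The slide's comparison table is not reproduced in this transcript. Summarizing the differences already noted on the preceding slides: Lookup builds its reference data in memory and does not require sorted input, but cannot capture unused reference rows; Join requires sorted input on all links, is light on memory, and captures non-matches through its outer-join variants; Merge also requires sorted input (with a duplicate-free master), is lightweight, and can capture unmatched secondary rows in per-link reject links.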

2/16/2010 223Training Material - Sample Copy

Funnel Stage

2/16/2010 224Training Material - Sample Copy

What is a Funnel Stage?

• A processing stage that combines data from multiple input links to a single output link

• Useful to combine data from several identical data sources into a single large dataset

• Operates in three modes

– Continuous

– SortFunnel

– Sequence

2/16/2010 225Training Material - Sample Copy

Three Funnel modes

• Continuous:

– Combines the records of the input links in no guaranteed order

– It takes one record from each input link in turn; if data is not available on an input link, the stage skips to the next link rather than waiting

– Does not attempt to impose any order on the data it is processing

• Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sorting keys

• Sequence: Copies all records from the first input link to the output link, then all the records from the second input link, and so on

2/16/2010 226Training Material - Sample Copy

Sort Funnel Method

• Produces a sorted output (assuming the input links are all sorted on the key)

• Data from all input links must be sorted on the same key column

• Typically, data from all input links are hash partitioned before they are sorted

– Selecting the "Auto" partition type under the Input > Partitioning tab defaults to this

– Hash partitioning guarantees that all the records with the same key column values are located in the same partition and are processed on the same node

• Allows for multiple key columns

– 1 primary key column, n secondary key columns

– The Funnel stage first examines the primary key in each input record

– For multiple records with the same primary key value, it then examines the secondary keys to determine the order of the records it will output

2/16/2010 227Training Material - Sample Copy

Funnel Stage Example

2/16/2010 228Training Material - Sample Copy

Funnel Stage Properties

2/16/2010 229Training Material - Sample Copy

Check Point

1. Name three stages that horizontally join data?

2. Which stage uses the least amount of memory? Join or Lookup?

3. Which stage requires that the input data is sorted? Join or Lookup?

2/16/2010 230Training Material - Sample Copy

Check Point Solution

1. Lookup, Merge, Join

2. Lookup

3. Join

2/16/2010 231Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 9: Sorting and Aggregating Data

2/16/2010 232Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Sort data using in-stage sorts and the Sort stage

• Combine data using the Aggregator stage

• Combine data using the Remove Duplicates stage

2/16/2010 233Training Material - Sample Copy

Sort Stage

2/16/2010 234Training Material - Sample Copy

Sorting Data

Uses

• Some stages require sorted input
– Join, Merge stages require sorted input

• Some stages use less memory with sorted input
– E.g., Aggregator

Sorts can be done:

• Within stages
– On the input link Partitioning tab, set partitioning to anything other than Auto

• In a separate Sort stage
– Makes the sort more visible on the diagram

– Has more options

2/16/2010 235Training Material - Sample Copy

Sorting Alternatives

2/16/2010 236Training Material - Sample Copy

In-Stage Sorting

2/16/2010 237Training Material - Sample Copy

Sort Stage

2/16/2010 238Training Material - Sample Copy

Sort Keys

• Add one or more keys

• Specify sort mode for each key

– Sort: Sort by this key

– Don’t sort (previously sorted):

• Assume the data has already been sorted by this key

• Continue sorting by any secondary keys

• Specify sort order: ascending / descending

• Specify case sensitive or not

2/16/2010 239Training Material - Sample Copy

Sort Options

• Sort Utility
– DataStage – the default
– UNIX: don’t use; slower than the DataStage sort utility

• Stable
• Allow duplicates
• Memory usage
– Sorting takes advantage of the available memory for increased performance; uses disk if necessary
– Increasing the amount of memory can improve performance

• Create key change column
– Adds a column with a value of 1 / 0
– 1 indicates that the key value has changed
– 0 means that the key value hasn’t changed
– Useful for processing groups of rows in a Transformer
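
For example, with a single sort key CustID (hypothetical data), the key change column marks the first row of each group:
CustID 100 -> keyChange 1
CustID 100 -> keyChange 0
CustID 100 -> keyChange 0
CustID 200 -> keyChange 1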

2/16/2010 240Training Material - Sample Copy

Partitioning vs Sorting Keys

Partitioning keys are often different from sorting keys

• Keyed partitioning (e.g., Hash) is used to group related records into the same partition

• Sort keys are used to establish order within each partition

For example,

• Partition on HouseHoldID, sort on HouseHoldID, Entry Date

– Partitioning on HouseHoldID ensures that the same ID will not be spread across multiple partitions

– Sorting orders the records with the same ID by entry date
• Useful for deciding which of a group of duplicate records with the same ID should be retained

2/16/2010 241Training Material - Sample Copy

Aggregator Stage

2/16/2010 242Training Material - Sample Copy

Aggregator Stage

Purpose: Perform data aggregations

Specify:
• Zero or more key columns that define the aggregation units (or groups)
• Columns to be aggregated
• Aggregation functions, which include among many others:
– Count (nulls / no nulls)
– Sum
– Max / Min / Range
• The grouping method (hash table or pre-sort) is a performance issue
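
For example, grouping on a key column Store and summing a column Sales (hypothetical data):
Input:  (Store A, Sales 10), (Store A, Sales 15), (Store B, Sales 20)
Output: (Store A, SumSales 25), (Store B, SumSales 20)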

2/16/2010 243Training Material - Sample Copy

Job with Aggregator Stage

2/16/2010 244Training Material - Sample Copy

Aggregator Types

• Count rows
– Count rows in each group
– Put result in a specified output column

• Calculation
– Select a column
– Put result of the calculation in a specified output column

• Calculations include:
– Sum

– Count

– Min, max

– Mean

– Missing value count

– Non-missing value count

– Percent coefficient of variation

2/16/2010 245Training Material - Sample Copy

Count Row Aggregator Properties

2/16/2010 246Training Material - Sample Copy

Calculation type Aggregator properties

2/16/2010 247Training Material - Sample Copy

Grouping Methods

Hash (default)
• Calculations are made for all groups and stored in memory
– Hash table structure (hence the name)
• Results are written out after all input has been processed
• Input does not need to be sorted
• Useful when the number of unique groups is small
– The running tally for each group’s aggregations needs to fit into memory

Sort
• Requires the input data to be sorted by the grouping keys
– Does not perform the sort! Expects the sort
• Only a single aggregation group is kept in memory
– When a new group is seen, the current group is written out

2/16/2010 248Training Material - Sample Copy

Remove Duplicates Stage

2/16/2010 249Training Material - Sample Copy

Removing Duplicates

• Can be done by the Sort stage
– Use the Unique option
– No choice on which duplicate to keep
– Stable sort always retains the first row in the group
– Non-stable sort is indeterminate

OR

• Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
– Can choose to retain the first or last

2/16/2010 250Training Material - Sample Copy

Remove Duplicates Stage Job

2/16/2010 251Training Material - Sample Copy

Remove Duplicates Stage Properties

2/16/2010 252Training Material - Sample Copy

Check Point

1. What stage is used to perform calculations of column values grouped in specified ways?

2. In what two ways can sorts be performed?

3. What is a stable sort?

2/16/2010 253Training Material - Sample Copy

Check Point Solution

1. Aggregator Stage

2. Using the Sort stage; using in-stage sorts.

3. A stable sort preserves the previous (input) order of rows that have equal key values.

2/16/2010 254Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 10 : Transforming Data

2/16/2010 255Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Use the Transformer stage in parallel jobs

• Define constraints

• Define derivations

• Use stage variables

• Create a parameter set and use its parameters in constraints and derivations

2/16/2010 256Training Material - Sample Copy

Transformer Stage

2/16/2010 257Training Material - Sample Copy

Transformer Stage

• Column mappings

• Derivations
– Written in Basic

– Final compiled code is C++ generated object code

• Constraints
– Filter data

– Direct data down different output links

• For different processing or storage

• Expressions for constraints and derivations can reference
– Input columns

– Job parameters

– Functions

– System variables and constants

– Stage variables

– External routines

2/16/2010 258Training Material - Sample Copy

Job with a Transformer Stage

2/16/2010 259Training Material - Sample Copy

Inside the Transformer Stage

2/16/2010 260Training Material - Sample Copy

Defining a Constraint
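
A constraint is a boolean expression attached to an output link; a row flows down the link only when the expression evaluates to true. A minimal sketch, using hypothetical column names:
IsNotNull(In.CustID) And In.Amount > 0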

2/16/2010 261Training Material - Sample Copy

Defining a Derivation

2/16/2010 262Training Material - Sample Copy

IF THEN ELSE Derivation

• Use IF THEN ELSE to conditionally derive a value

• Format:

– IF <condition> THEN <expression1> ELSE <expression2>

– If the condition evaluates to true then the result of expression1 will be copied to the target column or stage variable

– If the condition evaluates to false then the result of expression2 will be copied to the target column or stage variable

Example:

– Suppose the source column is named In.OrderID and the target column is named Out.OrderID

– Replace In.OrderID values of 3000 by 4000

– IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID

2/16/2010 263Training Material - Sample Copy

String Functions and Operators

• Substring operator
– Format: <string>[loc, length]
– Example:
• Suppose In.Description contains the string “Orange Juice”
• In.Description[8,5] → “Juice”

• UpCase(<string>) / DownCase(<string>)
– Example: UpCase(In.Description) → “ORANGE JUICE”

• Len(<string>)
– Example: Len(In.Description) → 12

2/16/2010 264Training Material - Sample Copy

Checking for NULLs

• Nulls can be introduced into the data flow from lookups
– Mismatches (lookup failures) can produce nulls

• Can be handled in constraints, derivations, stage variables, or a combination of these

• NULL functions
– Testing for NULL

• IsNull (<column>)

• IsNotNull (<column>)

– Replace NULL with a value:

• NullToValue(<column>, <value>)

– Set to NULL:

• Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
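
For example, a lookup failure can be handled before it reaches the output column; a minimal sketch using a hypothetical stage variable and column names:
svDescription derivation: IF IsNull(lkpIn.Description) THEN "UNKNOWN" ELSE lkpIn.Description
Out.Description derivation: svDescription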

2/16/2010 265Training Material - Sample Copy

Transformer Functions

• Date & Time

• Logical

• Null Handling

• Number

• String

• Type Conversion

2/16/2010 266Training Material - Sample Copy

Transformer Execution Order

• Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

2/16/2010 267Training Material - Sample Copy

Transformer Stage Variables

• Derivations execute in order from top to bottom

– Later stage variables can reference earlier stage variables

– Earlier stage variables can also reference later stage variables

– In that case the referenced variable still holds the value derived from the previous row that came into the Transformer

• Multi-purpose

– Counters

– Store values from previous rows to make comparisons

– Store derived values to be used in multiple target field derivations

– Can be used to control execution of constraints
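
A minimal sketch of the previous-row comparison technique, with hypothetical stage variable and column names declared in this top-to-bottom order:
svIsNewKey derivation: IF In.CustID <> svPrevKey THEN 1 ELSE 0
svPrevKey derivation:  In.CustID
Because svIsNewKey is evaluated first, svPrevKey still holds the key value from the previous row when the comparison is made.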

2/16/2010 268Training Material - Sample Copy

Transformer Reject Links

2/16/2010 269Training Material - Sample Copy

Otherwise Link

2/16/2010 270Training Material - Sample Copy

Defining an Otherwise Link

2/16/2010 271Training Material - Sample Copy

Specifying Link Ordering

2/16/2010 272Training Material - Sample Copy

Parameter Sets

2/16/2010 273Training Material - Sample Copy

Parameter Sets

• Store a collection of parameters in a named object

• One or more value files can be named and specified

– A value file stores values for the specified parameters

– Values are picked up at run time

• Parameter sets can be added to the job parameters specified on the parameter tab of the job properties
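
For example, a parameter set named DBConnect (hypothetical) containing Server, User, and Password parameters can be referenced in a stage property as #DBConnect.Server#; selecting a value file such as DEV or PROD at run time supplies the actual values.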

2/16/2010 274Training Material - Sample Copy

Creating a New parameter set

2/16/2010 275Training Material - Sample Copy

Parameter Tab

2/16/2010 276Training Material - Sample Copy

Value Tab

2/16/2010 277Training Material - Sample Copy

Adding parameter set to job properties

2/16/2010 278Training Material - Sample Copy

Using parameter set parameters

2/16/2010 279Training Material - Sample Copy

Check point

• What occurs first? Derivations or Constraints?

• Can stage variables be referenced in constraints?

• Where should you test for nulls within a Transformer? Stage variable derivations or output column derivations?

2/16/2010 280Training Material - Sample Copy

Check point solution

• Constraints

• Yes

• Stage variable derivations. Reference the stage variable in Output column derivations

2/16/2010 281Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 11 : Repository Functions

2/16/2010 282Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Perform a simple Find

• Perform an Advanced Find

• Perform Impact Analysis

• Compare the differences between two Table Definitions

• Compare the differences between two jobs

2/16/2010 283Training Material - Sample Copy

Searching the Repository

2/16/2010 284Training Material - Sample Copy

Quick Find

2/16/2010 285Training Material - Sample Copy

Found Results

2/16/2010 286Training Material - Sample Copy

Advanced Find Window

2/16/2010 287Training Material - Sample Copy

Advanced Find Filtering Options

• Type: type of object
– E.g., Job, Table Definition, etc.

• Creation: range of dates
– E.g., up to a week ago

• Last Modification: range of dates
– E.g., up to a week ago

• Where Used: objects that use specified objects
– E.g., a job that uses a specified table definition

• Dependencies of: objects that are dependencies of specified objects
– E.g., a table definition that is referenced in a specified job

• Options
– Case sensitivity
– Search within the last result set

2/16/2010 288Training Material - Sample Copy

Using the Found Result

2/16/2010 289Training Material - Sample Copy

Impact Analysis

2/16/2010 290Training Material - Sample Copy

Performing Impact Analysis

• Find where table definitions are used
– Right-click over a stage or table definition
– Select “Find where table definition used” or
– Select “Find where table definition used (deep)”

• Deep includes additional object types
– Displays a list of objects using the table definition

• Find object dependencies
– Select “Find dependencies” or
– Select “Find dependencies (deep)”
– Displays a list of objects that the selected object depends on

• Graphical functionality
– Display the dependency path
– Collapse selected objects
– Move the graphical objects
– “Bird’s-eye” view

2/16/2010 291Training Material - Sample Copy

Initiating Impact Analysis for a Stage

2/16/2010 292Training Material - Sample Copy

Display the Dependency graphically

2/16/2010 293Training Material - Sample Copy

Display the dependency path

2/16/2010 294Training Material - Sample Copy

Generating an HTML Report

2/16/2010 295Training Material - Sample Copy

Job and Table Difference Reports

2/16/2010 296Training Material - Sample Copy

Finding the Difference between Two Jobs

• Example: Job1 is saved as Job2. Changes are made to Job2. What changes have been made?

– Here Job1 may be a production job; Job2 is a copy of the production job after enhancements or other changes have been made to it

2/16/2010 297Training Material - Sample Copy

Initiating the Comparison

2/16/2010 298Training Material - Sample Copy

Comparison Results

2/16/2010 299Training Material - Sample Copy

Saving to an HTML file

2/16/2010 300Training Material - Sample Copy

Comparing Table Definitions

• Same procedure as when comparing jobs

2/16/2010 301Training Material - Sample Copy

Check Point

• You can compare the differences between what two kinds of objects?

• What “Wild card” characters can be used in a Find?

• You have a job whose name begins with “abc”. You can’t remember the rest of the name or where the job is located. What would be the fastest way to export the job to a file?

• Name three filters you can use in an Advanced Find?

2/16/2010 302Training Material - Sample Copy

Check Point Solution

• Jobs. Table Definitions.

• Asterisk(*). It stands for any zero or more characters.

• Do a Find for objects matching “abc*”. Filter by type Job. Locate the job in the result set, right-click it, and then click Export.

• Type of object, creation date range, last modified date range, where used, dependencies of, and other options including case sensitivity and searching within the last result set.

2/16/2010 303Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 12: Working with Relational Data

2/16/2010 304Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Import table definitions for relational tables

• Create Data connections

• Use Connector stages in a job

• Use Sql Builder to define SQL Select statements

• Use Sql Builder to define SQL Insert and Update statements

• Use DB2 Enterprise stage

2/16/2010 305Training Material - Sample Copy

Working with Relational Data

• Importing relational data
– Import using ODBC or orchdbutil
• orchdbutil is preferred, in order to get correct type conversions

• Data Connection objects
– Store database connection information in a named object

• Stages available to access relational data
– Connector stages
• Parallel support
• Most functionality
• Consistent GUI and functionality across all relational types
– Enterprise stages
• Parallel support
– Plug-in stages
• Functionality ported from DataStage Server jobs

• Selecting data
– Build SELECT statements using SQL Builder

• Writing data
– Build INSERT, UPDATE, DELETE statements using SQL Builder

2/16/2010 306Training Material - Sample Copy

Importing Table Definitions

2/16/2010 307Training Material - Sample Copy

Importing Table Definitions

• Can import using ODBC or Orchestrate schema definitions
– Orchestrate schema imports are better because the data types are more accurate

• Import -> Table Definitions -> Orchestrate Schema Definitions

• Import -> Table Definitions -> ODBC Table Definitions

2/16/2010 308Training Material - Sample Copy

Orchestrate Schema Import

2/16/2010 309Training Material - Sample Copy

ODBC Import

2/16/2010 310Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 13: Metadata in the Parallel Framework

2/16/2010 318Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Explain Schemas

• Create Schemas

• Explain Runtime Column Propagation (RCP)

• Turn RCP ON and OFF

• Build a job that reads data from a sequential file using a Schema

• Build a shared Container

2/16/2010 319Training Material - Sample Copy

Schema

• Alternative way of specifying column definitions and record formats
– Similar to a Table Definition

• Written in a plain text file

• Can be imported as a Table Definition

• Can be created from a Table Definition

• Can be used in place of a Table Definition in a Sequential File stage
– Requires RCP

– Schema file path can be parameterized

• Enables a single job to process files with different column definitions
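
A schema file is plain text written in the Orchestrate schema format. A minimal sketch, with hypothetical column names and record properties that would need to match the actual file layout:
record
  {final_delim=end, delim=',', quote=double}
(
  CustID: int32;
  CustName: string[max=30];
  Balance: nullable decimal[8,2];
)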

2/16/2010 320Training Material - Sample Copy

Creating a Schema

• Using a text editor
– Follow the correct syntax for definitions
– Not recommended

• Import from an existing DataSet or FileSet
– In the Designer: Import -> Table Definitions -> Orchestrate Schema Definitions
– Select the check box for a file with a .fs or .ds extension

• Import from a database table

• Create from a Table Definition
– Click Parallel on the Layout tab

2/16/2010 321Training Material - Sample Copy

Importing a Schema

2/16/2010 322Training Material - Sample Copy

Creating a Schema from a Table Definition

2/16/2010 323Training Material - Sample Copy

Reading a sequential file using Schema

2/16/2010 324Training Material - Sample Copy

Runtime Column Propagation (RCP)

• When RCP is turned on:

– Columns of data can flow through a stage without being explicitly defined in the stage

– Target columns in a stage need not have any columns explicitly mapped to them

• No column mapping enforcement at design time

– Input columns are mapped to unmapped columns by name

• How implicit columns get into a job

– Read a file using a schema in a Sequential File stage

– Read a database table using “Select *”

– Explicitly define as an output column in a stage earlier in the flow

2/16/2010 325Training Material - Sample Copy

Runtime Column Propagation (RCP)

• Benefits of RCP

– Job flexibility
• Job can process input with different layouts

– Ability to create reusable components in shared containers
• Component logic can apply to a single named column
• All other columns flow through untouched

2/16/2010 326Training Material - Sample Copy

Enabling Runtime Column Propagation (RCP)

• Project level
– DataStage Administrator, Parallel tab

• Job level
– Job Properties, General tab

• Stage level
– Link Output Column tab

• Settings at a lower level override settings at a higher level
– E.g.: disable at the project level, but enable for a given job
– E.g.: enable at the job level, but disable for a given stage

2/16/2010 327Training Material - Sample Copy

Enabling RCP at Project Level

2/16/2010 328Training Material - Sample Copy

Enabling RCP at Job Level

2/16/2010 329Training Material - Sample Copy

Enabling RCP at Stage Level

• Sequential File stage
– Output Columns tab

• Transformer
– Open Stage Properties
– Stage Properties Output tab

2/16/2010 330Training Material - Sample Copy

When RCP is Disabled

• DataStage Designer enforces Stage Input to Output column mapping

2/16/2010 331Training Material - Sample Copy

When RCP is Enabled

• DataStage does not enforce mapping rules

• Runtime error if no incoming columns match unmapped target column names

2/16/2010 332Training Material - Sample Copy

Shared Containers

2/16/2010 333Training Material - Sample Copy

Shared Containers

• Encapsulate job design components into a shared container

• Provide reusable job design components

• Example
– Apply stored Transformer business logic

2/16/2010 334Training Material - Sample Copy

Creating a Shared Container

• Select Stages from an existing job

• Click Edit -> Construct Container -> Shared

2/16/2010 335Training Material - Sample Copy

Using a Shared Container in a Job

2/16/2010 336Training Material - Sample Copy

Mapping Input/output Links to the Container

2/16/2010 337Training Material - Sample Copy

Check Point

• What are the two benefits of RCP?

• What can you use to encapsulate stages and links in a job to make them reusable?

2/16/2010 338Training Material - Sample Copy

Check Point Solution

• Job Flexibility. Ability to create reusable components

• Shared Containers

2/16/2010 339Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 14: Job Control

2/16/2010 340Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:
• Use the DataStage Job Sequencer to build a job that controls a sequence of jobs
• Use Sequencer links and stages to control the sequence of a set of jobs
• Use Sequencer triggers and stages to control the conditions under which jobs run
• Pass information in job parameters from the master controlling job to the controlled jobs
• Define user variables
• Enable restart
• Handle errors and exceptions

2/16/2010 341Training Material - Sample Copy

What is a Job Sequence?

• A master controlling job that controls the execution of a set of subordinate jobs
• Passes values to subordinate job parameters
• Controls the order of execution (links)
• Specifies conditions under which the subordinate jobs get executed (triggers)
• Specifies complex flow of control
– Loops
– All / Any
– Wait for file
• Performs system activities
– Email
– Execute system commands and executables
• Can include restart checkpoints

2/16/2010 342Training Material - Sample Copy

Basics for Creating a new Job Sequence

• Open a new job sequence
– Specify whether it is restartable

• Add stages
– Stages to execute jobs
– Stages to execute system commands and executables
– Special purpose stages

• Add links

• Specify the order in which jobs are to be executed
– Specify triggers

– Triggers specify the condition under which control passes across a link

• Specify error handling

• Enable/ Disable Restart Checkpoints

2/16/2010 343Training Material - Sample Copy

Job Sequencer Stages

• Run stages
– Job Activity: run a job
– Execute Command: run a system command
– Notification Activity: send an email

• Flow control stages
– Sequencer: go if All / Any
– Wait For File: go when a file exists / doesn’t exist
– Start Loop / End Loop
– Nested Condition: go if a condition is satisfied

• Error handling
– Exception Handler
– Terminator

• Variables
– User Variables

2/16/2010 344Training Material - Sample Copy

Example

2/16/2010 345Training Material - Sample Copy

Sequence Properties

2/16/2010 346Training Material - Sample Copy

Job Activity Stage Properties

2/16/2010 347Training Material - Sample Copy

Job Activity Trigger
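
As a sketch of a custom (expression-based) trigger, assuming a Job Activity named Job_LoadDim (hypothetical name), control passes along the link only when the job finished with an OK status:
Job_LoadDim.$JobStatus = DSJS.RUNOK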

2/16/2010 348Training Material - Sample Copy

Execute Command Stage

2/16/2010 349Training Material - Sample Copy

Notification Activity Stage

2/16/2010 350Training Material - Sample Copy

User Variables Stages

2/16/2010 351Training Material - Sample Copy

Referencing the User Variable

2/16/2010 352Training Material - Sample Copy

Flow of Control Stages

2/16/2010 353Training Material - Sample Copy

Wait for File Stage

2/16/2010 354Training Material - Sample Copy

Sequencer Stage

2/16/2010 355Training Material - Sample Copy

Nested Condition Stage

2/16/2010 356Training Material - Sample Copy

Loop Stages

(Screenshot callouts: counter values; pass counter value; reference link back to the Start Loop stage)

2/16/2010 357Training Material - Sample Copy

Error Handling

2/16/2010 358Training Material - Sample Copy

Handling Activities that fail

Control passes to the Exception Handler stage when an activity fails

2/16/2010 359Training Material - Sample Copy

Exception Handler Stage

Control goes here if any activity fails

2/16/2010 360Training Material - Sample Copy

Restart

2/16/2010 361Training Material - Sample Copy

Enable Restart

Enable checkpoints to be added

2/16/2010 362Training Material - Sample Copy

Disable Checkpoint at a Stage

Don’t checkpoint this activity

2/16/2010 363Training Material - Sample Copy

Check Point

• Which stage is used to run jobs in a job sequence?

• Does the Exception Handler stage support an input link?

2/16/2010 364Training Material - Sample Copy

Check Point Solution

• Job Activity Stage

• No, control is automatically passed to the stage when an exception occurs

2/16/2010 365Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 1: Complex Flat File Stage

2/16/2010 366Training Material - Sample Copy

Module objectives

After completing this module, you should be able to:
• Import table definitions from a COBOL copybook
• Design a job that extracts data from a COBOL file containing multiple record types
• Specify in a Complex Flat File (CFF) stage the column layouts of each record type
• Specify in a CFF stage how to identify when a record of a specific type is read
• Select in a CFF stage which columns from the different record types are to be output from the stage

2/16/2010 367Training Material - Sample Copy

Complex Flat File Stage

• Process data in a COBOL file
– File is described by a COBOL file description
– File can contain multiple record types

• COBOL copybooks with multiple record formats can be imported as COBOL File Definitions
– Each format is stored as a separate DataStage table definition

• Columns can be loaded for each record type

• On the Records ID tab, you specify how to identify each type of record

• Columns from any or all record types can be selected for output
– This allows columns of data from multiple records of different types to be combined into a single output record

2/16/2010 368Training Material - Sample Copy

Sample COBOL Copybook

CLIENT record format

POLICY record format

COVERAGE record format

2/16/2010 369Training Material - Sample Copy

Importing a COBOL File Definition

(Screenshot callouts: Level 01 column position; Level 01 items)

2/16/2010 370Training Material - Sample Copy

COBOL Table Definitions

Level numbers

2/16/2010 371Training Material - Sample Copy

COBOL File Layout

COBOL layout

Layout tab

2/16/2010 372Training Material - Sample Copy

Specifying a Date Mask

Select date mask

2/16/2010 373Training Material - Sample Copy

Example Data File with Multiple Formats

Record Type = ‘1’ CLIENT record

Record Type = ‘2’ POLICY record

Record Type = ‘3’ COVERAGE record

2/16/2010 374Training Material - Sample Copy

Sample Job with CFF Stage

CFF Stage

2/16/2010 375Training Material - Sample Copy

File Options tab

Data file

2/16/2010 376Training Material - Sample Copy

Records Tab

Load columns for record type

Set as master

Active record type

Add another record type

2/16/2010 377Training Material - Sample Copy

Record ID Tab

Condition that identifies the type of record
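
For example, using the sample data file shown earlier, a condition such as RECORD_TYPE = '2' (with a hypothetical column name that is present in every record format) identifies POLICY records.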

2/16/2010 378Training Material - Sample Copy

Selection Tab

2/16/2010 379Training Material - Sample Copy

Record Options Tab

2/16/2010 380Training Material - Sample Copy

Layout Tab

Layout tab

COBOL layout

2/16/2010 381Training Material - Sample Copy

View Data

CLIENT columns POLICY columns Coverage columns

2/16/2010 382Training Material - Sample Copy

Processing Multi-Format Records

Derivations identify which type of record is

coming into the Transformer

Stage variable in the Transformer

2/16/2010 383Training Material - Sample Copy

Transformer Constraints

2/16/2010 384Training Material - Sample Copy

Checkpoint

1. What types of files contain the metadata that is typically loaded into the CFF stage?

2. Does the CFF stage support variable length records?

3. How does DataStage know which type of record it is reading from a file containing records of different formats?

4. What does selecting a record type as master accomplish?

5. How many record types can be designated Master?

2/16/2010 385Training Material - Sample Copy

Checkpoint solutions

1. COBOL copybooks or COBOL file definitions.

2. Yes, it can read files containing multiple record formats, each of a different physical length.

3. On the records ID tab, you define constraints that identify the record type. These must reference fields common to all record formats.

4. When a master record is read, all outputs columns are emptied before the master record contents are written.

5. Only one.

2/16/2010 386Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 2: Slowly Changing Dimension

2/16/2010 387Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• Design a job that creates a surrogate key source key file

• Design a job that updates a surrogate key source key file from a dimension table

• Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions

2/16/2010 388Training Material - Sample Copy

Surrogate Key Generation Stage

2/16/2010 389Training Material - Sample Copy

Surrogate Key Generator Stage

• Use to create and update the surrogate key state file

• Surrogate key state file

– One file per dimension table

– Stores the last used surrogate key integer for the dimension table

– Binary file

2/16/2010 390Training Material - Sample Copy

Example Job to Create Surrogate State Files

(Screenshot callouts: create the surrogate state file for the Store dimension table; create the surrogate state file for the Product dimension table)

2/16/2010 391Training Material - Sample Copy

Editing the Surrogate Key Generator Stage

Path to state file

Create the state file

2/16/2010 392Training Material - Sample Copy

Example Job to Update the Surrogate State File

2/16/2010 393Training Material - Sample Copy

Specifying the Update Information

Table column containing surrogate key values

Update the state file

2/16/2010 394Training Material - Sample Copy

Slowly Changing Dimension Stage

2/16/2010 395Training Material - Sample Copy

Slowly Changing Dimension Stage

• Used for processing a star schema
• Performs a lookup into a star schema dimension table
– Multiple SCD stages can be daisy-chained to process multiple dimension tables
• Inserts new rows into the dimension table as required
• Updates existing rows in the dimension table as required
– Type 1 fields of a matching row are overwritten
– Type 2 fields of a matching row are retained as history rows
• A new record with the new field value is added to the dimension table and made the current record
• Generally used in conjunction with the Surrogate Key Generator stage
– Creates a surrogate key state file that keeps track of the previously used surrogate keys

2/16/2010 396Training Material - Sample Copy

Star Schema Database Structure and Mapping

2/16/2010 397Training Material - Sample Copy

Example Slowly Changing Dimension Job

(Screenshot callouts: check for matching StoreDim rows; perform Type 1 and Type 2 updates to the StoreDim table; check for matching Product rows; perform Type 1 and Type 2 updates to the Product table)

2/16/2010 398Training Material - Sample Copy

Working in the SCD Stage

• Five “Fast Path” pages to edit

• Select the output link
– This is the link coming out of the SCD stage that is not used to update the dimension table

• Specify the purpose codes
– Fields to match by
• Business key fields and the source fields to match to them
– Surrogate key
– Type 1 fields
– Type 2 fields
– Current indicator for Type 2
– Effective Date, Expire Date for Type 2

• Surrogate key management
– Location of the state file

• Dimension update specification

• Output mappings

2/16/2010 399Training Material - Sample Copy

Selecting the Output Link

Select the output link

2/16/2010 400Training Material - Sample Copy

Specifying the Purpose Codes

Lookup key mapping

Type 1 field

Type 2 field

Surrogate key

Fields used for Type2 handling

2/16/2010 401Training Material - Sample Copy

Surrogate Key Management

Path to state file

Initial surrogate key value

Number of values to retrieve at one time

2/16/2010 402Training Material - Sample Copy

Dimension Update Specification

Function used to retrieve the next surrogate key value

Value that means current

Function used to calculate the history date range

2/16/2010 403Training Material - Sample Copy

Output Mappings

2/16/2010 404Training Material - Sample Copy

Checkpoint

1. How many Slowly Changing Dimension stages are needed to process a star schema with 4 dimension tables?

2. How many surrogate key state files are needed to process a star schema with 4 dimension tables?

3. What’s the difference between a Type1 and a Type 2 dimension field attribute ?

4. What additional fields are needed for handling a Type 2 Slowly Changing Dimension field attribute ?

2/16/2010 405Training Material - Sample Copy

Checkpoint solutions

1. Four SCD stages are needed, one for each dimension table. Each SCD stage does a lookup and update to its table.

2. Four surrogate key state files are needed. One for each dimension table. A separate state file is used for each.

3. Type 1 is a simple update. The value in the dimension record field is overwritten with the new value. Type 2 retains the value in a history record. A new record is created with the current value.

4. Three additional fields are needed: the current indicator flags whether a given record contains the current Type 2 value or an earlier value. The Effective Date and Expire Date fields are used to specify when the given record is applicable.

2/16/2010 406Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 3: Installation Run-Through

2/16/2010 407Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Run-through the installation process

• Start the Information Server

2/16/2010 408Training Material - Sample Copy

Start the Installation

2/16/2010 409Training Material - Sample Copy

Installation and Response File Selection

2/16/2010 410Training Material - Sample Copy

Information Server Domain Location

Destination Folder

2/16/2010 411Training Material - Sample Copy

Installation Layers

2/16/2010 412Training Material - Sample Copy

License file Selection

2/16/2010 413Training Material - Sample Copy

Product Selection

2/16/2010 414Training Material - Sample Copy

Installation Type

2/16/2010 415Training Material - Sample Copy

Repository Installation

2/16/2010 416Training Material - Sample Copy

Repository Configuration

2/16/2010 417Training Material - Sample Copy

Application Server Installation

2/16/2010 418Training Material - Sample Copy

Application Server Administration

2/16/2010 419Training Material - Sample Copy

Information Server Administration

2/16/2010 420Training Material - Sample Copy

DataStage Projects

2/16/2010 421Training Material - Sample Copy

Language Selection

2/16/2010 422Training Material - Sample Copy

DB2 Server Selection

2/16/2010 423Training Material - Sample Copy

DB2 Instance Owner

2/16/2010 424Training Material - Sample Copy

ODBC Drivers

2/16/2010 425Training Material - Sample Copy

National Language Support

2/16/2010 426Training Material - Sample Copy

Information Server User IDs

• Database owner: e.g., xmeta

• DB2 instance owner: e.g., db2admin

• Application server ID: e.g., appserv

• Information Server administrator: e.g., admin

• Be sure all user IDs have passwords that have not expired

2/16/2010 427Training Material - Sample Copy

Testing the Installation

• You can partially test the installation by logging onto the Information Server web console

2/16/2010 428Training Material - Sample Copy

Checkpoint

1. List the five installation layers that can be installed.

2/16/2010 429Training Material - Sample Copy

Checkpoint solutions

• Client, Engine, Domain, Repository, and Documentation.

2/16/2010 430Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 4: Solution Development Jobs

2/16/2010 431Training Material - Sample Copy

Module objectives

After completing this module, you should be able to:

• List and describe the Warehouse jobs

• Understand the stages and techniques used in the Warehouse jobs

2/16/2010 432Training Material - Sample Copy

Introduction to the Solution Development Exercises

2/16/2010 433Training Material - Sample Copy

Solution Development Jobs

• Series of 4 jobs extracted from production jobs

• Use a variety of stages in interesting, realistic configurations
– Sort, Aggregator stages
– Join, Lookup stages
– Peek, Filter stages
– Modify stage
– Oracle stage

• Contain useful techniques
– Use of Peeks
– Datasets used to “connect” jobs
– Use of project environment variables in job parameters
– Fork joins
– Lookups for auditing

2/16/2010 434Training Material - Sample Copy

Warehouse Job 01

2/16/2010 435Training Material - Sample Copy

Warehouse Job 02

2/16/2010 436Training Material - Sample Copy

Warehouse Job 03

2/16/2010 437Training Material - Sample Copy

Warehouse Job 04

2/16/2010 438Training Material - Sample Copy

Warehouse Job 02 With Lookup

2/16/2010 439Training Material - Sample Copy

Checkpoint

1. What is a Fork-Join?

2/16/2010 440Training Material - Sample Copy

Checkpoint solutions

1. A development technique that forks a stream into two outputs and later joins these two streams back together again.

2/16/2010 441Training Material - Sample Copy