
IBM WebSphere

DataStage

Introduction to DataStage 8.1

Course Contents

• Module 01: Introduction

• Module 02: Deployment

• Module 03: Administering DataStage

• Module 04: DataStage Designer

• Module 05: Creating Parallel jobs

• Module 06: Accessing Sequential data

• Module 07: Platform Architecture

• Module 08: Combining Data

• Module 09: Sorting and aggregating data

• Module 10: Transforming Data

• Module 11: Repository Functions

• Module 12: Working with Relational data

• Module 13: Metadata in Parallel Framework

• Module 14: Job Control

2/16/2010 2Training Material - Sample Copy

Course Objectives

• DataStage Clients and Server

• Setting up the parallel environment

• Importing metadata

• Building DataStage jobs

• Loading metadata into job stages

• Accessing Sequential data

• Accessing Relational data

• Introducing the Parallel framework architecture

• Transforming data

• Sorting and aggregating data

• Merging data

• Configuration files

2/16/2010 3Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 1: Introduction

2/16/2010 4Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• List and describe the uses of DataStage

• List and describe DataStage clients

• Describe the DataStage workflow

• List and compare different types of DataStage jobs

• Describe two types of parallelism exhibited by DataStage parallel jobs

2/16/2010 5Training Material - Sample Copy

What is IBM WebSphere DataStage?

• Design jobs for Extraction, Transformation, and Loading (ETL)

• Ideal tool for data integration projects – such as, data warehouses, data marts, and system migrations

• Import, export, create, and manage metadata for use within jobs

• Schedule, run, and monitor jobs all within DataStage

• Administer your DataStage development and execution environments

• Create batch (controlling) jobs

2/16/2010 6Training Material - Sample Copy

IBM Information Server

• Suite of applications, including DataStage, that:

- Shares a common repository

– DB2, by default

- Shares a common set of application services and functionality

– Provided by Metadata Server components hosted by an application server

- IBM WebSphere Application Server

– Provided services include:

• Security

• Repository

• Logging and reporting

• Metadata management

• Managed using web console clients

- Administration console

- Reporting console

2/16/2010 7Training Material - Sample Copy

IBM Information Server

2/16/2010 8Training Material - Sample Copy

Information Server Backbone

[Diagram: the Information Server backbone. WebSphere-hosted products: Information Services Director, Business Glossary, Information Analyzer, DataStage, QualityStage, Federation Server. Shared services: Metadata Access Services, Metadata Analysis Services, Metadata Server, Information Server Console.]

2/16/2010 9Training Material - Sample Copy

Comparison between DataStage 7.5 and DataStage 8.1

2/16/2010 10Training Material - Sample Copy

What's the Same

• DataStage Designer and Director work very much the same

- What you could do before you can still do

- Minor changes to menus and GUI

• Same job types are supported

- Parallel jobs

- Server jobs

- Mainframe jobs

- Job sequences

• All stages that existed in DataStage 7.5x are still supported

• Previous DataStage functionality is still supported

- Export DataStage components (dsx)

• Now occurs in designer

- Import Metadata

• Sequential files

• Database

• COBOL file definitions

• Job compile, execution, run-time log are the same

2/16/2010 11Training Material - Sample Copy

What’s Different Part I

• QualityStage is now embedded within the DataStage Designer

• DataStage and QualityStage are now hosted by the Metadata Server

- There is a layer of administration, logging, reporting, and security that occurs at the Metadata Server level (outside of DataStage)

• Managed by the administration and reporting consoles

- Repository is now managed by the Metadata Server and its services

• No longer a UniVerse database

• Repository model has been completely reorganized

- Installation and deployment are now more flexible (complex)

• More deployment options

• Metadata server, repository, DataStage engines can be on different machines and platforms

2/16/2010 12Training Material - Sample Copy

What's New or Different? Part II

• DataStage Manager is gone

– All Manager functionality has been moved into Designer

• DataStage permissions are implemented differently

• GUI: new icons, menu arrangements, etc.

• New stages

– Slowly Changing Dimension (SCD) stage

– Surrogate key management

– Connector stages

• Stage enhancements

– Lookup stage now supports range lookups

– Complex Flat File (CFF) stage now supports multiple record formats

– SQL builders accessible in connector stages

2/16/2010 13Training Material - Sample Copy

What's New or Different? Part III

• DataStage repository enhancements

– Flexible folder organization

– Repository search

– Enhanced DataStage components export

– Job and table definition difference reports

– Impact analysis

• New objects

– Parameter sets: Named sets of parameters

– Database connectors: Named sets of database connection property values

• New utilities

– Performance analyzer

– Resource Estimator

2/16/2010 14Training Material - Sample Copy

Components of DataStage 8.1

____________________________Module 1: Introduction

2/16/2010 15Training Material - Sample Copy

Client Logon

[Screenshot: the client logon dialog, showing the Information Server console address and the Information Server administrator ID]

2/16/2010 16Training Material - Sample Copy

Information Server Administration Console

[Screenshot: the web console, showing the Administration console and the Reporting console]

2/16/2010 17Training Material - Sample Copy

DataStage Architecture

[Diagram: DataStage clients connect to the engines (parallel engine and server engine), which share a common repository]

2/16/2010 18Training Material - Sample Copy

DataStage Administrator

2/16/2010 19Training Material - Sample Copy

DataStage Designer

New Menus / Icons

Repository Search

2/16/2010 20Training Material - Sample Copy

DataStage Director

• Nothing new here

2/16/2010 21Training Material - Sample Copy

Steps to Develop job in DataStage

• Define global and project properties in Administrator

• Import metadata into the Repository

• Designer Repository View lists the existing jobs

• Build job in Designer

• Compile job in Designer

• Run and monitor job in Director

-- Jobs can also be run in designer, but job log messages can not be viewed in designer

2/16/2010 22Training Material - Sample Copy

DataStage Project Repository

[Screenshot callouts: user-added folder, standard Jobs folder, standard Table Definitions folder]

2/16/2010 23Training Material - Sample Copy

Types of DataStage Jobs

• Parallel jobs

– Executed under control of DataStage Server runtime environment

– Built-in functionality for Pipeline and Partitioning Parallelism

– Compiled into OSH (Orchestrate Scripting Language)

– OSH executes Operators

• Executable C++ class instances

– Runtime monitoring in DataStage Director

• Job Sequences (Batch jobs, Controlling jobs)

– Master Server jobs that kick-off jobs and other activities

– Can kick-off Server or Parallel jobs

– Runtime monitoring in DataStage Director

• Server jobs (Requires Server Edition license)

– Executed by the DataStage Server Edition

– Compiled into Basic (interpreted pseudo-code)

– Runtime monitoring in DataStage Director

• Mainframe jobs (Requires Mainframe Edition license)

– Compiled into COBOL

– Executed on the Mainframe, outside of DataStage

2/16/2010 24Training Material - Sample Copy

Design Elements of Parallel Jobs

• Stages: Implemented as OSH operators (pre-built components)

• Types of Stages

• Passive stages (E and L of ETL)

– Read data

– Write data

– E.g., Sequential File, Oracle, Peek stages

• Processor (active) stages (T of ETL)

– Transform data

– Filter data

– Aggregate data

– Generate data

– Split / Merge data

– E.g., Transformer, Aggregator, Join, Sort stages

• Links

– “Pipes” through which the data moves from stage to stage

2/16/2010 25Training Material - Sample Copy

Quiz – True or False?

• DataStage Designer is used to build and compile your ETL jobs

• Manager is used to execute your jobs after you build them

• Director is used to execute your jobs after you build them

• Administrator is used to set global and project properties

2/16/2010 26Training Material - Sample Copy

Deployment of DataStage 8.1

____________________________Module 2: Deployment

2/16/2010 27Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to

• Identify the components of Information Server that need to be installed

• Describe what a deployment domain consists of

• Describe different domain deployment options

• Describe the installation process

• Start the information server

2/16/2010 28Training Material - Sample Copy

What Gets Deployed

An Information Server domain, consisting of the following:

• Metadata server, hosted by an IBM WebSphere Application Server instance

• One or more DataStage servers

– DataStage server includes both the parallel and server engines

• One DB2 UDB instance containing the repository database

• Information server clients

– Administration console

– Reporting console

– DataStage clients

• Administrator

• Designer

• Director

• Additional information server applications

– Information analyzer

– Business glossary

– Rational Data Architect

– Information services Director

– Federation Server

2/16/2010 29Training Material - Sample Copy

Deployment: Everything on One Machine

[Diagram: clients, the DB2 instance with the repository, the Metadata Server backbone, and the DataStage server all on a single machine]

• Here we have a single domain with the hosted applications all on one machine

• Additional client workstations can connect to this machine using TCP/IP

2/16/2010 30Training Material - Sample Copy

Deployment: DataStage on a Separate Machine

• Here the domain is split between two machines

- DataStage server

- Metadata Server and DB2 repository

[Diagram: the Metadata Server backbone and the DB2 instance with the repository on one machine, the DataStage server on a second machine, with clients connecting to both]

2/16/2010 31Training Material - Sample Copy

Metadata Server and DB2 on Separate Machines

• Here the domain is split between three machines

– DataStage server

– Metadata Server

– DB2 repository

[Diagram: the DataStage server, the DB2 instance with the repository, and the Metadata Server backbone each on its own machine, with clients connecting to all three]

2/16/2010 32Training Material - Sample Copy

Information Server Installation

2/16/2010 33Training Material - Sample Copy

Installation Configuration Layers

• Configuration layers include:

– Client

• DataStage and information server clients

– Engine

• DataStage and other application engines

– Domain

• Metadata server and hosted metadata server components

• Installed products domain components

– Repository

• Repository database server and database

– Documentation

• Selected layers are installed on the machine local to the installation

• Already existing components can be configured and used

– E.g., DB2, WebSphere Application Server

2/16/2010 34Training Material - Sample Copy

Information Server Start-up

– Start the Metadata Server

– From the Windows Start menu, click "Start the server" after selecting the profile to be used (e.g., default)

– From the command line, open the profile's bin directory

– Enter the startup command: startServer server1

> server1 is the default name of the application server hosting the Metadata Server

– Start the ASB agent

– From windows start menu, click “Start the agent” after selecting the information server folder

– Only required if DataStage and metadata server are on different machines.

– To begin work in DataStage, double-click on a DataStage client icon.

– To begin work in Administration and reporting consoles, double-click on the Web Console for Information Server icon
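As a command-line sketch (the install path and profile name below are assumptions; substitute your own WebSphere Application Server profile directory):

cd /opt/IBM/WebSphere/AppServer/profiles/default/bin
./startServer.sh server1        (on Windows: startServer.bat server1)

This starts the application server instance (server1) that hosts the Metadata Server.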

2/16/2010 35Training Material - Sample Copy

Starting the Metadata Server Backbone

Application Server Profiles folder Profile Start the Server

Startup command

2/16/2010 36Training Material - Sample Copy

Starting the ASB Agent

Start the agent

2/16/2010 37Training Material - Sample Copy

Testing the Installation

• You can partially test the installation by logging on to the information Server web console

2/16/2010 38Training Material - Sample Copy

Checkpoint

1. What application components make up a domain ?

2. Can a domain contain multiple DataStage servers ?

3. Does the DB2 instance and the repository database need to be on the same machine as the application server ?

4. Suppose DataStage is on a separate machine from the Application Server. What two components need to be running before you log onto DataStage?

2/16/2010 39Training Material - Sample Copy

Check point solutions

1. Metadata server hosted by the application server. One or more DataStage servers. One DB2 UDB instance containing the suite repository database

2. Yes. The DataStage servers must be on separate machines. They can be on different platforms, e.g., one server running on Windows and another running on Linux

3. No. The DB2 instance with the repository can reside on a separate machine/platform from the Application Server.

4. The Application Server and the ASB agent

2/16/2010 40Training Material - Sample Copy

IBM WebSphere DataStage 8.1

___________________________Module 3: Administering DataStage

2/16/2010 42Training Material - Sample Copy

Unit objectives

After completing this unit, you should be able to:

• Open the administrative console

• Create new user and groups

• Assign suite roles and product roles to users and groups

• Give user DataStage credentials

• Log on to DataStage Administrator

• Add a DataStage user on the permissions tab and specify the user’s role

• Specify DataStage global and project defaults

• List and describe important environment variables

2/16/2010 43Training Material - Sample Copy

Information server administrator console

• Web application for administering information server

• Used for :

– Domain management

– Session management

– Management of users and groups

– Logging management

– Scheduling management

2/16/2010 44Training Material - Sample Copy

Opening the Administrator Web Console

Information server

console address

Information

Server

administrator ID

2/16/2010 45Training Material - Sample Copy

Users and Group Management

2/16/2010 46Training Material - Sample Copy

User and group Management

• Suite authorization can be provided to users or groups

– Users that are members of a group acquire the authorizations of the group

• Authorizations are provided in the form of roles

– Two types of roles

• Suite roles: Apply to the suite

• Suite component roles: Apply to a specific product or component of Information Server, e.g., DataStage

• Suite roles

– Administrator

• Perform user and group management tasks

• Includes all the privileges of the Suite user role

– User

• Create views of scheduled tasks and logged messages

• Create and run reports

2/16/2010 47Training Material - Sample Copy

User and group Management (cont…)

• Suite component roles

– DataStage

• DataStage user

– Permissions are assigned within DataStage

> Developer, Operator, Super Operator, Production Manager

• DataStage administrator

• Full permission to work in DataStage Administrator, Designer and Director

– And so on for all products in the Suite

2/16/2010 48Training Material - Sample Copy

Creating a DataStage user ID

Administration console

Create new user

Users

2/16/2010 49Training Material - Sample Copy

Assigning DataStage Roles

Users

User ID Assign Suite role

Assign DataStage

User role

2/16/2010 50Training Material - Sample Copy

DataStage Credential Mapping

• Users given DataStage Administrator or DataStage user product roles in the suite administrator console do not automatically receive DataStage credentials

– Users with DataStage Administrator roles need to be mapped to a valid user on the DataStage server machine

• These DataStage users must have file access permissions to the DataStage engine/project files or Administrator rights on the operating system

– Users with DataStage user roles need to be mapped to a valid user on the DataStage server machine and need additional DataStage assigned permissions (developer or operator…)

2/16/2010 51Training Material - Sample Copy

DataStage Credential Mapping

Assign DataStage

User role

2/16/2010 52Training Material - Sample Copy

DataStage 8.1 Administrator

2/16/2010 53Training Material - Sample Copy

Module Objectives

• Setting project properties in Administrator

• Defining Environment Variables

• Importing / Exporting DataStage objects in Manager

• Importing Table Definitions defining sources and targets in Manager

2/16/2010 54Training Material - Sample Copy

Logging on to Administrator

[Screenshot callouts: host name and port number of the application server; DataStage administrator ID and password; name or IP address of the DataStage server machine]

2/16/2010 55Training Material - Sample Copy

Setting Project Properties

• Project Properties

• Projects can be created and deleted in Administrator

• Each project is associated with a directory on the DataStage Server

• Project properties, defaults, and environment variables are specified in Administrator; they can be overridden at the job level

2/16/2010 56Training Material - Sample Copy

Setting Project Properties

• To set project properties, log onto Administrator, select your project, and then click “Properties”

[Screenshot callouts: click to specify project properties; server projects; link to the Information Server administration console]

2/16/2010 57Training Material - Sample Copy

Project Properties General Tab

Enable Runtime

column Propagation

Specify auto-purge

Environment variable

setting

2/16/2010 58Training Material - Sample Copy

Environment Variables

User-defined

variables

Parallel job variables

2/16/2010 59Training Material - Sample Copy

Environment reporting variables

Display score

Display OSH

Display record counts
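As an illustration (confirm the exact names in your release under the Parallel category of environment variables): $APT_DUMP_SCORE writes the job score to the log, and $APT_RECORD_COUNTS logs the record counts processed per operator per partition.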

2/16/2010 60Training Material - Sample Copy

Permissions Tab

2/16/2010 61Training Material - Sample Copy

Adding users and Groups

Available users /

groups. Must have

a DataStage

product User role

Add DataStage

users

2/16/2010 62Training Material - Sample Copy

Specify DataStage Role

Added DataStage

user

Select DataStage

role

2/16/2010 63Training Material - Sample Copy

Tracing Tab

2/16/2010 64Training Material - Sample Copy

Parallel Tab

2/16/2010 65Training Material - Sample Copy

Sequence Tab

2/16/2010 66Training Material - Sample Copy

Check Point

1. Authorizations can be assigned to what two items?

2. What two types of authorization roles can be assigned to a user or group?

3. In addition to suite authorization to log on to DataStage, what else does a DataStage developer require to work in DataStage?

2/16/2010 67Training Material - Sample Copy

Check Point Solutions

1. Users and Groups. Members of the group acquire the authorizations of the group

2. Suite roles and Product roles

3. Must be mapped to a user with DataStage credentials

2/16/2010 68Training Material - Sample Copy

IBM WebSphere DataStage 8.1

___________________________Module 4: DataStage Designer

2/16/2010 69Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• Logon to DataStage

• Navigate around DataStage Designer

• Import and Export DataStage objects in to file

• Import a table definition for a Sequential file

2/16/2010 70Training Material - Sample Copy

Logging on to DataStage Designer

Host name, port

number of

application server

DataStage server

machine / project

2/16/2010 71Training Material - Sample Copy

Designer Work Area

[Screenshot callouts: menus, toolbar, repository window, parallel canvas, palette]

2/16/2010 72Training Material - Sample Copy

Importing and Exporting DataStage Objects

2/16/2010 73Training Material - Sample Copy

Importing and Exporting DataStage Objects

• What Is Metadata?

2/16/2010 74Training Material - Sample Copy

Repository Window

Default jobs folder

Default table

definitions folder

Search for objects

in the project

Project

2/16/2010 75Training Material - Sample Copy

Repository Contents

• Metadata

• Describing sources and targets: Table definitions

• Describing inputs / outputs from external routines

• Describing inputs and outputs to BuildOp and CustomOp stages

• DataStage objects

• Jobs

• Routines

• Compiled jobs / objects

• Stages

2/16/2010 76Training Material - Sample Copy

Import and Export

• Any object in Repository can be exported to a file

• Can export whole projects

• Use for backup

• Sometimes used for version control

• Can be used to move DataStage objects from one project to another

• Use to share DataStage jobs and projects with other developers

2/16/2010 77Training Material - Sample Copy

Export Procedure

• In Repository Export Window, click “Export>DataStage Components”

• Select DataStage objects for export

• Specify type of export:

– DSX: Default format

– XML: Enables processing of the export file by XML applications, e.g., for generating reports

• Specify file path on client machine

2/16/2010 78Training Material - Sample Copy

Export Window

Selected objects

Default jobs folder

Click to select

objects from the

repository

Begin export

Export type

2/16/2010 79Training Material - Sample Copy

Import Procedure

• In Repository, click “Import>DataStage Components”

– Or "Import>DataStage Components (XML)" if you are importing an XML-format export file

• Select DataStage objects for import

2/16/2010 80Training Material - Sample Copy

Import Options

[Screenshot callouts: import all objects in the file; display list to select from]

2/16/2010 81Training Material - Sample Copy

Importing Table definitions

• Table definitions describe the format and columns of files and tables

– Import format and column definitions from sequential files

– Import relational table column definitions

– Import COBOL files and Many other things

• Table definitions can be loaded into job stages

• Table definitions can be used to define Routine and Stage interfaces

2/16/2010 82Training Material - Sample Copy

Sequential File Import Procedure

• In Designer, click Import>Table Definitions>Sequential File Definitions

• Select directory containing sequential file and then the file

• Examine the format and column definitions and edit if necessary

2/16/2010 83Training Material - Sample Copy

Sequential Import Window

Select repository folder

Select File

Start import

Select directory containing files

2/16/2010 84Training Material - Sample Copy

Specify Format

Edit Columns

Delimiter

Select if first row

has column names

2/16/2010 85Training Material - Sample Copy

Edit Column Names and Types

Double-click to define extended properties

2/16/2010 86Training Material - Sample Copy

Extended Properties window

Available properties

Property categories

2/16/2010 87Training Material - Sample Copy

Table Definition General Tab

SourceType

Stored table

definition

2/16/2010 88Training Material - Sample Copy

Quiz - True or False?

• You can export DataStage objects such as jobs, but you can’t export metadata, such as field definitions of a sequential file.

• The directory to which you export is on the DataStage client machine, not on the DataStage server machine.

2/16/2010 89Training Material - Sample Copy

IBM WebSphere DataStage 8.1

___________________________Module 5: Creating Parallel Jobs

2/16/2010 90Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• Design a simple Parallel job in Designer

• Define a job parameter

• Compile your job

• Run your job in Director

• View the job log

2/16/2010 91Training Material - Sample Copy

Creating Parallel Jobs

• What Is a Parallel Job?

– Executable DataStage program

– Created in DataStage Designer

– Can use components from Manager Repository

– Built using a graphical user interface

– Compiles into Orchestrate shell language (OSH) and object code (from generated C++)

2/16/2010 92Training Material - Sample Copy

Job Development Overview

• Import metadata defining sources and targets

-- Can be done within Designer or Manager

• In Designer, add stages defining data extractions and loads

• Add processing stages to define data transformations

• Add links defining the flow of data from sources to targets

• Compile the job

• In Director, validate, run, and monitor your job

-- Can also run the job in Designer

-- Can only view the job log in Director

2/16/2010 93Training Material - Sample Copy

Logging on to DataStage Designer

Host name, port number of

application server

DataStage server machine /

project

2/16/2010 94Training Material - Sample Copy

Designer Work Area

RepositoryMenus Toolbar

Parallel

canvas

Palette

2/16/2010 95Training Material - Sample Copy

Designer Toolbar

• Provides quick access to the main functions of Designer

Toolbar

2/16/2010 96Training Material - Sample Copy

Tools Palette

[Screenshot callouts: stage categories; stages]

2/16/2010 97Training Material - Sample Copy

Adding Stages and Links

• Drag stages from the Tools Palette to the diagram

-- Can also be dragged from Stage Type branch to the

diagram

• Draw links from source to target stage

--Right mouse over source stage

--Release mouse button over target stage

2/16/2010 98Training Material - Sample Copy

Job Creation Example Sequence

• Brief walkthrough of procedure

• Assumes table definition of source already exists in the repository

2/16/2010 99Training Material - Sample Copy

Create New Parallel Job

Open new

window

Parallel job

2/16/2010 100Training Material - Sample Copy

Drag Stages and Links From Palette

Row Generator

Peek

Compile

Job

properties

2/16/2010 101Training Material - Sample Copy

Renaming Links and Stages

• Click on a stage or link to rename it

• Meaningful names have many

benefits

– Documentation

– Clarity

– Fewer development errors

2/16/2010 102Training Material - Sample Copy

DataStage Designer Stages

2/16/2010 103Training Material - Sample Copy

Row Generator Stage

• Produces mock data for specified columns

• No inputs link; single output link

• On Properties tab, specify number of rows

• On the Columns tab, load or specify column definitions

- Click Edit Row over a column to specify the values to be generated for that column

- A number of algorithms for generating values are available depending on the data type

• Algorithms for Integer type

- Random: seed, limit

- Cycle: initial value, increment

• Algorithms for string type: Cycle, alphabet

• Algorithms for date type: Random, Cycle
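For example, an Integer column using the Cycle algorithm with an initial value of 1 and an increment of 1 generates 1, 2, 3, ... on successive rows, while Random with a limit of 100 generates pseudo-random integers between 0 and 100.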

2/16/2010 104Training Material - Sample Copy

Inside the Row Generator Stage

Properties tab

Set Property

value

Property

2/16/2010 105Training Material - Sample Copy

Columns Tab

Select table

Definition

View Data

Load a Table

Definition

2/16/2010 106Training Material - Sample Copy

Extended Properties

Specified Properties

and their values

Additional

properties to add

2/16/2010 107Training Material - Sample Copy

Peek Stage

• Displays field values

- Displayed in the job log or sent to an output link

- Skip records option

- Can control number of records to be displayed

- Shows data in each partition, labeled 0, 1, 2, …

• Useful stub stage for iterative job development

- Develop job to a stopping point and check the data

2/16/2010 108Training Material - Sample Copy

Peek Stage Properties

Output to Job log

2/16/2010 109Training Material - Sample Copy

Job Parameters

• Defined in Job Properties window

• Makes the job more flexible

• Parameters can be:

- Used in directory and file names

- Used to specify property values

- Used in constraints and derivations

• Parameter values are determined at run time

• When used for directory and file names or property values, surround with pound signs (#)

- E.g., #NumRows#

• Job parameters can reference DataStage and system environment variables

- Prefaced by $, e.g., $APT_CONFIG_FILE
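As a brief illustration (the directory, file, and parameter names are hypothetical), a Sequential File stage's File property could be set to

#SourceDir#/customers_#RunDate#.txt

so that both the directory and the date portion of the file name are resolved from job parameters at run time.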

2/16/2010 110Training Material - Sample Copy

Defining a Job Parameter

Parameter Tab

Parameter

2/16/2010 111Training Material - Sample Copy

Using a Job Parameter in a Stage

Job Parameter

surrounded with

pound signs

2/16/2010 112Training Material - Sample Copy

Adding Job Documentation

• Job Properties

- Short and long descriptions

- Shows in Manager

• Annotation stage

- Added from the Tools Palette

- Display formatted text descriptions on diagram

2/16/2010 113Training Material - Sample Copy

Job Properties Documentation

Documentation

2/16/2010 114Training Material - Sample Copy

Annotation Stage Properties

2/16/2010 115Training Material - Sample Copy

Compiling a Job

[Screenshot callout: Compile button]

2/16/2010 116Training Material - Sample Copy

Error or Successful Message

[Screenshot callouts: highlight stage with error; click for more info]

2/16/2010 117Training Material - Sample Copy

Running Jobs and Viewing the Job Log in Designer

2/16/2010 118Training Material - Sample Copy

Prerequisite to Job Execution

2/16/2010 119Training Material - Sample Copy

DataStage Director

• Use to run and schedule jobs

• View runtime messages

• Can invoke from DataStage Designer- Tools > Run Director

2/16/2010 120Training Material - Sample Copy

Run Options

Stop after number

of warnings

Stop after number

of rows

2/16/2010 121Training Material - Sample Copy

Run Options

2/16/2010 122Training Material - Sample Copy

Job Status View

2/16/2010 123Training Material - Sample Copy

Job Log View

Click the open-book icon to view log messages

[Screenshot callout: Peek messages]

2/16/2010 124Training Material - Sample Copy

Message Details

2/16/2010 125Training Material - Sample Copy

Other Director Options

• Schedule job to run on a particular date/time

• Clear job log of messages

• Set job log purging conditions

• Set Director options- Row limits

- Abort after x warnings

2/16/2010 126Training Material - Sample Copy

Running Jobs From Command Line

• dsjob -run -param numrows=10 dx444 GenDataJob

- Runs a job

- Use -run to run the job

- Use -param to specify parameters

- In this example, dx444 is the name of the project

- In this example, GenDataJob is the name of the job

• dsjob -logsum dx444 GenDataJob

- Displays a job's messages in the log

• Documented in “Parallel Job Advanced Developer’s Guide”
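A slightly fuller sketch (the -mode and -wait options are standard dsjob options, but verify them against your release's documentation; the project and job names are the sample ones above):

dsjob -run -mode NORMAL -param numrows=10 -wait dx444 GenDataJob
dsjob -logsum dx444 GenDataJob

The first command runs the job synchronously with the supplied parameter value; the second prints its log summary after the run completes.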

2/16/2010 127Training Material - Sample Copy

Check Points

1. Which stage can be used to display output data in the job log?

2. Which stage is used for documenting your job on the job Canvas?

3. What command is used to run jobs from the operating system command line?

2/16/2010 128Training Material - Sample Copy

Check Point Solution

1. Peek stage

2. Annotation stage

3. dsjob -run

2/16/2010 129Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 6: Accessing Sequential Data

2/16/2010 130Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Understand the stages for accessing different kinds of sequential data

• Sequential File stage

• Complex Flat File stage

• Create jobs that read from and write to sequential files

• Create reject Link

• Work with NULLs in sequential files

• Read from multiple files using file patterns

• Use multiple readers

2/16/2010 131Training Material - Sample Copy

Types of Sequential Data stages

• Sequential

Fixed or variable length

• Data Set

• Complex Flat File

2/16/2010 132Training Material - Sample Copy

How Sequential Data is Handled

• Import and export operators are generated

– Stages get translated into operators during the compile

• Import operators convert data from the external format, as described by the Table Definition, to the framework's internal format

– Internally, the format of data is described by schemas

• Export operators reverse the process

• Messages in the job log use the "import" / "export" terminology

– E.g., "100 records imported successfully; 2 rejected"

– E.g., "100 records exported successfully; 0 rejected"

– Records get rejected when they cannot be converted correctly during the import or export

2/16/2010 133Training Material - Sample Copy

Using the Sequential File Stage

• Both import and export of general files (text, binary) are performed by the sequential file stage

2/16/2010 134Training Material - Sample Copy

Features of Sequential File Stage

• Normally executes in sequential mode

• Executes in parallel when reading multiple files

• Can use multiple readers within a node

Reads chunks of a single file in parallel

• The stage needs to be told:

How file is divided into rows (record format)

How row is divided into columns (column format)
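For instance (the property name is as it appears on the stage; the value is illustrative), setting Number Of Readers Per Node = 4 on a single large fixed-length file lets four reader processes each scan a different chunk of the file in parallel.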

2/16/2010 135Training Material - Sample Copy

File Format Example

2/16/2010 136Training Material - Sample Copy

Sequential File Stage Rules

• One input link

• One stream output link

• Optionally, one reject link

– Will reject any records not matching metadata in the column definitions

– Example: You specify three columns separated by commas, but the row that’s read had no commas in it

2/16/2010 137Training Material - Sample Copy

Job Design using Sequential stages

2/16/2010 138Training Material - Sample Copy

Sequential Source Columns Tab

2/16/2010 139Training Material - Sample Copy

Input Sequential Stage Properties

2/16/2010 140Training Material - Sample Copy

Format Tab

2/16/2010 141Training Material - Sample Copy

Reading Using a File Pattern
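The slide's screenshot is not reproduced here. As an illustration (the path is hypothetical), setting Read Method = File Pattern with File Pattern = /data/source/cust_*.txt makes the stage read every file matching the pattern.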

2/16/2010 142Training Material - Sample Copy

Properties –Multiple Readers

2/16/2010 143Training Material - Sample Copy

Sequential Stage As a Target

2/16/2010 144Training Material - Sample Copy

Reject Link

• Reject mode =

– Continue: Continue reading records

– Fail: Abort job

– Output: Send down output link

• In a source stage

– All records not matching the metadata (column definitions) are rejected

• In a target stage

– All records that fail to be written for any reason

• Rejected records consist of one column, datatype = raw

2/16/2010 145Training Material - Sample Copy

Inside the Copy Stage

2/16/2010 146Training Material - Sample Copy

Reading and Writing Null Values

2/16/2010 147Training Material - Sample Copy

Working with NULLS

• Internally, NULL is represented by a special value outside the range of any existing, legitimate values

• If NULL is written to a non-nullable column, the job will abort

• Columns can be specified as nullable

– NULLs can be written to nullable columns

• You must "handle" NULLs written to non-nullable columns in a Sequential File stage

– You need to tell DataStage what value to write to the file

– Unhandled rows are rejected

• In a Sequential source stage, you can specify values you want DataStage to convert to NULLs
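As an illustration (the property name is the one on the stage's Format tab; the value is an assumption), adding the field-level property Null field value = NULL on a target Sequential File stage writes nulls as the literal string NULL; on a source stage the same property tells DataStage which incoming value to convert to NULL.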

2/16/2010 148Training Material - Sample Copy

Specifying a Value For Null

2/16/2010 149Training Material - Sample Copy

Dataset Stage

2/16/2010 150Training Material - Sample Copy

Dataset

• Binary data file

• Preserves partitioning

– Component dataset files are written on each partition

• Suffixed by .ds

• Referred to by a header file

• Managed by the Data Set Management utility from the GUI (Manager, Designer, Director)

• Represents persistent data

• Key to good performance in a set of linked jobs

• No import / export conversions are needed

• No repartitioning needed

• Accessed using the Data Set stage

• Implemented with two types of components:

– Descriptor file: contains metadata and data location, but NOT the data itself

– Data file(s): contain the data

• Multiple files, one per partition (node)

2/16/2010 151Training Material - Sample Copy

Job with Dataset Stage

2/16/2010 152Training Material - Sample Copy

Displaying Data and Schema

2/16/2010 153Training Material - Sample Copy

Data and Schema Display

2/16/2010 154Training Material - Sample Copy

File Set

• Use to read and write to filesets

• Files suffixed by .fs

• Files are similar to a dataset

– Partitioned

– Implemented with header file and data files

• How filesets differ from datasets

– Data files are text files

• Hence readable by external applications

– Datasets have a proprietary data format which may change in future DataStage versions

2/16/2010 155Training Material - Sample Copy

Checkpoint

1. List three types of file data

2. What makes datasets perform better than other types of files in parallel jobs

3. What is the difference between a data set and a file set?

2/16/2010 156Training Material - Sample Copy

Checkpoint solutions

1. Sequential, dataset, complex flat files

2. They are partitioned and they store data in the native parallel format

3. Both are partitioned. Data sets store data in a binary format not readable by user applications. File sets are readable.

2/16/2010 157Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 7: Platform Architecture

2/16/2010 158Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Describe parallel processing architecture

• Describe pipeline parallelism

• Describe partition parallelism

• List and describe partitioning and collecting algorithms

• Describe configuration files

• Describe the parallel job compilation process

• Explain OSH

• Explain the Score

2/16/2010 159Training Material - Sample Copy

Key EE Concepts

2/16/2010 160Training Material - Sample Copy

Scalable Hardware Environments

2/16/2010 161Training Material - Sample Copy

Pipeline Parallelism

• Transform, clean, and load processes execute simultaneously

• Like a conveyor belt moving rows from process to process

– Start the downstream process while the upstream process is running

• Advantages:

– Reduces disk usage for staging areas

– Keeps processors busy

• Still has limits on scalability

2/16/2010 162Training Material - Sample Copy

Partition Parallelism

• Divide the incoming stream of data into subsets to be separately processed by an operation

– Subsets are called partitions (nodes)

• Each partition of data is processed by the same operation

– E.g., if the operation is Filter, each partition will be filtered in exactly the same way

• Facilitates near-linear scalability

– 8 times faster on 8 processors

– 24 times faster on 24 processors

– This assumes the data is evenly distributed

2/16/2010 163Training Material - Sample Copy

Three-Node Partitioning

• Here the data is partitioned into three partitions

• The operation is performed on each partition of data separately and in parallel

• If the data is evenly distributed, the data will be processed three times faster

2/16/2010 164Training Material - Sample Copy

EE Combines Partitioning and Pipelining

Within EE, pipelining, partitioning, and repartitioning are automatic. The job developer only identifies:

• Sequential vs. parallel operation (by stage)

• Method of data partitioning

• Configuration file (which identifies resources)

• Advanced stage options (buffer tuning, operator combining, etc.)

2/16/2010 165Training Material - Sample Copy

Job Design V. Execution

2/16/2010 166Training Material - Sample Copy

Configuration File

• The configuration file separates configuration (hardware / software) from job design

– Specified per job at runtime by $APT_CONFIG_FILE

– Change hardware and resources without changing the job design

• Defines the number of nodes (logical processing units) with their resources (need not match physical CPUs)

– Dataset, scratch, and buffer disk (file systems)

– Optional resources (Database, SAS, etc.)

– Advanced resource optimizations: "pools" (named subsets of nodes)

• Multiple configuration files can be used at runtime

– Optimizes overall throughput and matches job characteristics to overall hardware resources

– Allows runtime constraints on resource usage on a per-job basis

2/16/2010 167Training Material - Sample Copy

Example configuration File
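The slide's screenshot is not reproduced in this transcript. As a minimal sketch (host name and directory paths are placeholders), a two-node configuration file pointed to by $APT_CONFIG_FILE looks like this:

{
  node "node1"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
  node "node2"
  {
    fastname "etl_host"
    pools ""
    resource disk "/data/datasets" {pools ""}
    resource scratchdisk "/data/scratch" {pools ""}
  }
}

Adding nodes (or pointing $APT_CONFIG_FILE at a different file) changes the degree of parallelism without any change to the job design.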

2/16/2010 168Training Material - Sample Copy

Partitioning and Collecting

2/16/2010 169Training Material - Sample Copy

Partitioning and Collecting

• Partitioning breaks incoming rows into sets (partitions) of rows

• Each partition of rows is processed separately by the stage/operator

If the hardware and configuration file supports parallel processing, partitions of rows will be processed in parallel

• Collecting returns partitioned data back to a single stream

• Partitioning / Collecting occurs on stage Input links

• Partitioning / Collecting is implemented automatically

Based on stage and stage properties

How the data is partitioned / collected can be specified

2/16/2010 170Training Material - Sample Copy

Partitioning / Collecting Algorithms

• Partitioning algorithms include:

– Round robin

– Hash: Determine partition based on key value (requires key specification)

– Entire: Send all rows down all partitions

– Same: Preserve the existing partitioning

– Auto: Let DataStage choose the algorithm

• Collecting algorithms include:

– Round robin

– Sort Merge: Read in by key; presumes data is sorted by the key in each partition; builds a single sorted stream based on the key order

– Ordered: Read all records from the first partition, then the second, ...

2/16/2010 171Training Material - Sample Copy

Keyless vs. Keyed Partitioning Algorithms

• Keyless: Rows are distributed independently of data values

– Round Robin

– Entire

– Same

• Keyed: Rows are distributed based on values in the specified key

– Hash: Partition based on key. Example: the key is State; all "CA" rows go into the same partition and all "MA" rows go into the same partition. Two rows with the same state never go into different partitions

– Modulus: Partition based on the modulus of the key divided by the number of partitions; the key is a numeric type. Example: the key is OrderNumber (numeric); rows with the same order number will all go into the same partition

– DB2: Matches DB2 EEE partitioning

2/16/2010 172Training Material - Sample Copy

Round Robin and Random Partitioning

• Keyless partitioning methods

• Rows are evenly distributed across partitions

• Good for the initial import of data if no other partitioning is needed

• Useful for redistributing data

• Fairly low overhead

• Round Robin assigns rows to partitions like dealing cards

– The row/partition assignment will be the same for a given $APT_CONFIG_FILE

• Random has slightly higher overhead, but assigns rows in a non-deterministic fashion between job runs

2/16/2010 173Training Material - Sample Copy

ENTIRE Partitioning

• Each partition gets a complete copy of the data

• Useful for distributing lookup and reference data

• May have performance impact on MPP / clustered environment

• On SMP platforms, Lookup stage (only) uses shared memory instead of duplicating ENTIRE reference data

• On MPP platforms, each server uses shared memory for a single local copy.

• ENTIRE is the default partitioning for Lookup reference links with “Auto” Partitioning.

• On SMP platforms, it is good practice to set this explicitly on the Normal Lookup reference link(s)

2/16/2010 174Training Material - Sample Copy

HASH Partitioning

• Keyed partitioning method

• Rows are distributed according to the values in the key column(s)

• Guarantees that rows with same key values go into the same partition

• Needed to prevent matching rows from “hiding” in other partitions

• E.g. Join, Merge

• Remove Duplicate

• Partition distribution is relatively equal if the data across the source key columns is evenly distributed

2/16/2010 175Training Material - Sample Copy

Modulus Partitioning

• Keyed partitioning method: rows are distributed according to the values in one integer key column

• Uses the modulus:

– partition = MOD(key_value, number of partitions)

– For example, with 4 partitions a key value of 10 goes to partition 10 MOD 4 = 2

• Faster than HASH

• Guarantees that rows with identical key values go in the same partition

• Partition size is relatively equal if the data within the key column is evenly distributed

2/16/2010 176Training Material - Sample Copy

Auto Partitioning

• DataStage inserts partition components as necessary to ensure correct results

– Before any stage with "Auto" partitioning

– Generally chooses ROUND ROBIN or SAME

– Inserts HASH on stages that require matched key values (e.g., Join, Merge, Remove Duplicates)

– Inserts ENTIRE on Normal (not Sparse) Lookup reference links

• Not always appropriate for MPP/clusters

• Since DataStage has limited awareness of your data and business rules, explicitly specify HASH partitioning when needed

– DataStage has no visibility into Transformer logic

– Hash is required before Sort and Aggregator stages

– DataStage sometimes inserts un-needed partitioning; check the log

2/16/2010 177Training Material - Sample Copy

Partitioning Requirements for Related Records

• Misplaced records

– Using the Aggregator stage to sum customer sales by customer number

– If there are 25 customers, 25 records should be output

– But suppose records with the same customer numbers are spread across partitions

– This will produce more than 25 groups (records)

– Solution: Use the hash partitioning algorithm

• Partition imbalances

– If all the records are going down only one of the nodes, then the job is in effect running sequentially

2/16/2010 178Training Material - Sample Copy

Unequal Distribution Example

2/16/2010 179Training Material - Sample Copy

Partitioning / Collecting Link Icons

2/16/2010 180Training Material - Sample Copy

More Partitioning Icons

2/16/2010 181Training Material - Sample Copy

Partitioning Tab

2/16/2010 182Training Material - Sample Copy

Collecting Specification

2/16/2010 183Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 8: Combining Data

2/16/2010 184Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Combine data using the Lookup stage

• Define range lookups

• Combine data using Merge stage

• Combine data using the Join stage

• Combine data using the Funnel stage

2/16/2010 185Training Material - Sample Copy

Combining Data

Ways to combine data:

• Horizontally: Multiple input links; one output link made of columns from different input links

– Join

– Lookup

– Merge

• Vertically: One input link and one output link, combining groups of related records into a single record

– Aggregator

– Remove Duplicates

• Funneling: Multiple input streams funneled into a single output stream

– Funnel stage

2/16/2010 186Training Material - Sample Copy

Lookup, Merge, Join Stages

• These stages combine two or more input links

– Data is combined by designated "key" column(s)

• These stages differ mainly in:

– Memory usage

– Treatment of rows with unmatched key values

– Input requirements (sorted, de-duplicated)

2/16/2010 187Training Material - Sample Copy

Not all Links are Created Equal

• DataStage distinguishes between:

- The primary input (framework port 0)

- Secondary inputs, in some cases called "reference" links (other framework ports)

• Conventions:

• Tip: Check the "Link Ordering" tab to make sure the intended primary is listed first.

2/16/2010 188Training Material - Sample Copy

Lookup Stage

2/16/2010 189Training Material - Sample Copy

Lookup Features

• One Stream Input link (Source)• Multiple Reference links (Lookup files)• One output link• Optional Reject link

Only one per Lookup stage, regardless of number of reference links

• Lookup Failure optionsContinue, Drop, Fail, Reject

• Can return multiple matching rows• Hash tables are built in memory from the lookup files

Indexed by keyShould be small enough to fit into physical memory

2/16/2010 190Training Material - Sample Copy

Lookup Types

• Equality match

- Match values in the lookup key column of the reference link exactly to selected values in the source row

- Return the row or rows (if multiple matches are to be returned) that match

• Caseless match

- Like an equality match except that it is caseless; e.g., "abc" matches "AbC"

• Range on the reference link

- Two columns on the reference link define the range

- A match occurs when a selected value in the source row is within the range

• Range on the source link

- Two columns on the source link define the range

- A match occurs when a selected value in the reference link is within the range

2/16/2010 191Training Material - Sample Copy

The Lookup Stage

• Uses one or more key columns as an index into a table

• Usually contains other values associated with each key.

• The lookup table is created in memory before any lookup source rows are processed

2/16/2010 192Training Material - Sample Copy

Lookup from Sequential File Example

2/16/2010 193Training Material - Sample Copy

Lookup Stage With an Equality Match

2/16/2010 194Training Material - Sample Copy

Handling Lookup Failures

2/16/2010 195Training Material - Sample Copy

Lookup Failure Actions

• If the lookup fails to find a matching key column, one of these actions can be taken:

- Fail: The Lookup stage reports an error and the job fails immediately. This is the default.

- Drop: The input row with the failed lookup(s) is dropped

- Continue: The input row is transferred to the output, together with the successful table entries. The failed table entry(s) are not transferred, resulting in either default output values or null output values.

- Reject: The input row with the failed lookup(s) is transferred to a second output link, the "reject" link

• There is no option to capture unused table entries

- Compare with the Join and Merge stages

2/16/2010 196Training Material - Sample Copy

Lookup Stage Behavior

• We shall first use the simplest case, optimal input

- Two input links: "source" as primary, "lookup" as secondary

- Sorted on the key column (here "Citizen")

- Without duplicates on the key

2/16/2010 197Training Material - Sample Copy

Lookup Stage

2/16/2010 198Training Material - Sample Copy

The Lookup Stage

• Lookup tables should be small enough to fit into physical memory

• On an MPP, you should partition the lookup tables using the Entire partitioning method, or partition them by the same hash key as the source link

- Entire results in multiple copies (one for each partition)

• On an SMP, choose Entire or accept the default (which is Entire)

- Entire does not result in multiple copies, because memory is shared

2/16/2010 199Training Material - Sample Copy

Designing a Range Lookup Job

2/16/2010 200Training Material - Sample Copy

Range Lookup Job

2/16/2010 201Training Material - Sample Copy

Range on Reference Link

2/16/2010 202Training Material - Sample Copy

Selecting the Stream column

2/16/2010 203Training Material - Sample Copy

Range on Expression Editor

2/16/2010 204Training Material - Sample Copy

Range on Stream Link

2/16/2010 205Training Material - Sample Copy

Specifying the Range Lookup

2/16/2010 206Training Material - Sample Copy

Range Expression Editor

2/16/2010 207Training Material - Sample Copy

Join Stage

2/16/2010 208Training Material - Sample Copy

The Join Stage

• Four types:

– Inner

– Left outer

– Right outer

– Full outer

• 2 or more sorted input links, 1 output link

– "left" on primary input, "right" on secondary input

– Pre-sorting makes joins "lightweight": few rows need to be in RAM

• Follow the RDBMS-style relational model

– Cross-products in case of duplicates

– Matching entries are reusable for multiple matches

– Non-matching entries can be captured (Left, Right, Full)

• No fail/reject option for missed matches

2/16/2010 209Training Material - Sample Copy

Job with Join Stage

2/16/2010 210Training Material - Sample Copy

Join Stage Editor

• Link order: Immaterial for inner and full outer joins, but very important for left/right outer joins

• Multiple key columns allowed

• One of four variants: Inner, Left Outer, Right Outer, Full Outer

2/16/2010 211Training Material - Sample Copy

Join Stage Behavior

• We shall first use the simplest case, optimal input:

– Two input links: "left" as primary, "right" as secondary

– Sorted on the key column (here "Citizen")

– Without duplicates on the key

2/16/2010 212Training Material - Sample Copy

Inner Join

• Transfers rows from both data sets whose key columns contain equal values to the output link

• Treats both inputs symmetrically

Output of inner join on key Citizen

2/16/2010 213Training Material - Sample Copy

Left Outer Join

• Transfers all values from the left link and transfers values from the right link only where key columns match.

2/16/2010 214Training Material - Sample Copy

Left Outer Join

Check the Link Ordering tab to make sure the intended primary is listed first

2/16/2010 215Training Material - Sample Copy

Right Outer Join

• Transfers all values from the right link and transfers values from the left link only where key columns match

2/16/2010 216Training Material - Sample Copy

Full Outer Join

• Transfers rows from both data sets, whose key columns contain equal values, to the output link.

• It also transfers rows, whose key columns contain unequal values, from both input links to the output link.

• Treats both input symmetrically.

• Creates new columns, with new column names!

2/16/2010 217Training Material - Sample Copy

Merge Stage

2/16/2010 218Training Material - Sample Copy

Merge Stage

• Similar to Join stage

• Input links must be sorted

– Master link and one or more secondary links

– Master must be duplicate-free

• Light-weight

– Little memory required, because of the sort requirement

• Unmatched master rows can be kept or dropped

• Unmatched rows from secondary links can be captured in a reject link

2/16/2010 219Training Material - Sample Copy

Merge Stage Job

2/16/2010 220Training Material - Sample Copy

The Merge Stage

• Allows composite keys

• Multiple update links

• Matched update rows are consumed

• Unmatched updates in input port n can be captured in output port n

• Lightweight

2/16/2010 221Training Material - Sample Copy

Stage Editor

2/16/2010 222Training Material - Sample Copy

Comparison: Joins, Lookup, Merge
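The slide's comparison table is not reproduced in this transcript. Summarizing the differences already noted on the preceding slides: Lookup builds its reference data in memory and does not require sorted input, but cannot capture unused reference rows; Join requires sorted input on all links, is light on memory, and captures non-matches through its outer-join variants; Merge also requires sorted input (with a duplicate-free master), is lightweight, and can capture unmatched secondary rows in per-link reject links.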

2/16/2010 223Training Material - Sample Copy

Funnel Stage

2/16/2010 224Training Material - Sample Copy

What is a Funnel Stage?

• A processing stage that combines data from multiple input links to a single output link

• Useful to combine data from several identical data sources into a single large dataset

• Operates in three modes

– Continuous

– SortFunnel

– Sequence

2/16/2010 225Training Material - Sample Copy

Three Funnel modes

• Continuous:

– Combines the records of the input links in no guaranteed order

– It takes one record from each input link in turn; if data is not available on an input link, the stage skips to the next link rather than waiting

– Does not attempt to impose any order on the data it is processing

• Sort Funnel: Combines the input records in the order defined by the value(s) of one or more key columns; the order of the output records is determined by these sorting keys

• Sequence: Copies all records from the first input link to the output link, then all the records from the second input link, and so on

2/16/2010 226Training Material - Sample Copy

Sort Funnel Method

• Produces a sorted output (assuming the input links are all sorted on the key)

• Data from all input links must be sorted on the same key column

• Typically, data from all input links are hash partitioned before they are sorted

– Selecting the "Auto" partition type under the Input > Partitioning tab defaults to this

– Hash partitioning guarantees that all the records with the same key column values are located in the same partition and are processed on the same node

• Allows for multiple key columns

– 1 primary key column, n secondary key columns

– The Funnel stage first examines the primary key in each input record

– For multiple records with the same primary key value, it then examines the secondary keys to determine the order of the records it will output

2/16/2010 227Training Material - Sample Copy

Funnel Stage Example

2/16/2010 228Training Material - Sample Copy

Funnel Stage Properties

2/16/2010 229Training Material - Sample Copy

Check Point

1. Name three stages that horizontally join data?

2. Which stage uses the least amount of memory? Join or Lookup?

3. Which stage requires that the input data is sorted? Join or Lookup?

2/16/2010 230Training Material - Sample Copy

Check Point Solution

1. Lookup, Merge, Join

2. Lookup

3. Join

2/16/2010 231Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 9: Sorting and Aggregating Data

2/16/2010 232Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Sort data using in-stage sorts and the Sort stage

• Combine data using the Aggregator stage

• Combine data using the Remove Duplicates stage

2/16/2010 233Training Material - Sample Copy

Sort Stage

2/16/2010 234Training Material - Sample Copy

Sorting Data

Uses

• Some stages require sorted input
– Join, Merge stages require sorted input

• Some stages use less memory with sorted input
– E.g., Aggregator

Sorts can be done:

• Within stages
– On the input link Partitioning tab, set partitioning to anything other than Auto

• In a separate Sort stage
– Makes the sort more visible on the diagram

– Has more options

2/16/2010 235Training Material - Sample Copy

Sorting Alternatives

2/16/2010 236Training Material - Sample Copy

In-Stage Sorting

2/16/2010 237Training Material - Sample Copy

Sort Stage

2/16/2010 238Training Material - Sample Copy

Sort Keys

• Add one or more keys

• Specify sort mode for each key

– Sort: Sort by this key

– Don’t sort (previously sorted):

• Assume the data has already been sorted by this key

• Continue sorting by any secondary keys

• Specify sort order: ascending / descending

• Specify case sensitive or not

2/16/2010 239Training Material - Sample Copy

Sort Options

• Sort Utility
– DataStage – the default
– UNIX: don’t use; slower than the DataStage sort utility

• Stable
• Allow duplicates
• Memory usage
– Sorting takes advantage of the available memory for increased performance; uses disk if necessary
– Increasing the amount of memory can improve performance

• Create key change column
– Adds a column with a value of 1 / 0
– 1 indicates that the key value has changed
– 0 means that the key value hasn’t changed
– Useful for processing groups of rows in a Transformer
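
For example, with a single sort key CustID (hypothetical data), the key change column marks the first row of each group:
CustID 100 -> keyChange 1
CustID 100 -> keyChange 0
CustID 100 -> keyChange 0
CustID 200 -> keyChange 1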

2/16/2010 240Training Material - Sample Copy

Partitioning vs Sorting Keys

Partitioning keys are often different from sorting keys

• Keyed partitioning (e.g., Hash) is used to group related records into the same partition

• Sort keys are used to establish order within each partition

For example,

• Partition on HouseHoldID, sort on HouseHoldID, Entry Date

– Partitioning on HouseHoldID ensures that the same ID will not be spread across multiple partitions

– Sorting orders the records with the same ID by entry date
• Useful for deciding which of a group of duplicate records with the same ID should be retained

2/16/2010 241Training Material - Sample Copy

Aggregator Stage

2/16/2010 242Training Material - Sample Copy

Aggregator Stage

Purpose: Perform data aggregations

Specify:
• Zero or more key columns that define the aggregation units (or groups)
• Columns to be aggregated
• Aggregation functions, which include among many others:
– Count (nulls / no nulls)
– Sum
– Max / Min / Range
• The grouping method (hash table or pre-sort) is a performance issue
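
For example, grouping on a key column Store and summing a column Sales (hypothetical data):
Input:  (Store A, Sales 10), (Store A, Sales 15), (Store B, Sales 20)
Output: (Store A, SumSales 25), (Store B, SumSales 20)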

2/16/2010 243Training Material - Sample Copy

Job with Aggregator Stage

2/16/2010 244Training Material - Sample Copy

Aggregator Types

• Count rows
– Count rows in each group
– Put result in a specified output column

• Calculation
– Select a column
– Put result of the calculation in a specified output column

• Calculations include:
– Sum

– Count

– Min, max

– Mean

– Missing value count

– Non-missing value count

– Percent coefficient of variation

2/16/2010 245Training Material - Sample Copy

Count Row Aggregator Properties

2/16/2010 246Training Material - Sample Copy

Calculation type Aggregator properties

2/16/2010 247Training Material - Sample Copy

Grouping Methods

Hash (default)
• Calculations are made for all groups and stored in memory
– Hash table structure (hence the name)
• Results are written out after all input has been processed
• Input does not need to be sorted
• Useful when the number of unique groups is small
– The running tally for each group’s aggregations needs to fit into memory

Sort
• Requires the input data to be sorted by the grouping keys
– Does not perform the sort! Expects the sort
• Only a single aggregation group is kept in memory
– When a new group is seen, the current group is written out

2/16/2010 248Training Material - Sample Copy

Remove Duplicates Stage

2/16/2010 249Training Material - Sample Copy

Removing Duplicates

• Can be done by the Sort stage
– Use the Unique option
– No choice on which duplicate to keep
– Stable sort always retains the first row in the group
– Non-stable sort is indeterminate

OR

• Remove Duplicates stage
– Has more sophisticated ways to remove duplicates
– Can choose to retain the first or last

2/16/2010 250Training Material - Sample Copy

Remove Duplicates Stage Job

2/16/2010 251Training Material - Sample Copy

Remove Duplicates Stage Properties

2/16/2010 252Training Material - Sample Copy

Check Point

1. What stage is used to perform calculations of column values grouped in specified ways?

2. In what two ways can sorts be performed?

3. What is a stable sort?

2/16/2010 253Training Material - Sample Copy

Check Point Solution

1. Aggregator Stage

2. Using the Sort stage; using in-stage sorts.

3. A stable sort preserves the previous (input) order of rows that have equal key values.

2/16/2010 254Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 10 : Transforming Data

2/16/2010 255Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Use the Transformer stage in parallel jobs

• Define constraints

• Define derivations

• Use stage variables

• Create a parameter set and use its parameters in constraints and derivations

2/16/2010 256Training Material - Sample Copy

Transformer Stage

2/16/2010 257Training Material - Sample Copy

Transformer Stage

• Column mappings

• Derivations
– Written in Basic

– Final compiled code is C++ generated object code

• Constraints
– Filter data

– Direct data down different output links

• For different processing or storage

• Expressions for constraints and derivations can reference
– Input columns

– Job parameters

– Functions

– System variables and constants

– Stage variables

– External routines

2/16/2010 258Training Material - Sample Copy

Job with a Transformer Stage

2/16/2010 259Training Material - Sample Copy

Inside the Transformer Stage

2/16/2010 260Training Material - Sample Copy

Defining a Constraint
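
A constraint is a boolean expression attached to an output link; a row flows down the link only when the expression evaluates to true. A minimal sketch, using hypothetical column names:
IsNotNull(In.CustID) And In.Amount > 0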

2/16/2010 261Training Material - Sample Copy

Defining a Derivation

2/16/2010 262Training Material - Sample Copy

IF THEN ELSE Derivation

• Use IF THEN ELSE to conditionally derive a value

• Format:

– IF <condition> THEN <expression1> ELSE <expression2>

– If the condition evaluates to true then the result of expression1 will be copied to the target column or stage variable

– If the condition evaluates to false then the result of expression2 will be copied to the target column or stage variable

Example:

– Suppose the source column is named In.OrderID and the target column is named Out.OrderID

– Replace In.OrderID values of 3000 by 4000

– IF In.OrderID = 3000 THEN 4000 ELSE In.OrderID

2/16/2010 263Training Material - Sample Copy

String Functions and Operators

• Substring operator
– Format: <string>[loc, length]
– Example:
• Suppose In.Description contains the string “Orange Juice”
• In.Description[8,5] → “Juice”

• UpCase(<string>) / DownCase(<string>)
– Example: UpCase(In.Description) → “ORANGE JUICE”

• Len(<string>)
– Example: Len(In.Description) → 12

2/16/2010 264Training Material - Sample Copy

Checking for NULLs

• Nulls can be introduced into the data flow from lookups
– Mismatches (lookup failures) can produce nulls

• Can be handled in constraints, derivations, stage variables, or a combination of these

• NULL functions
– Testing for NULL

• IsNull (<column>)

• IsNotNull (<column>)

– Replace NULL with a value:

• NullToValue(<column>, <value>)

– Set to NULL:

• Example: IF In.Col = 5 THEN SetNull() ELSE In.Col
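
For example, a lookup failure can be handled before it reaches the output column; a minimal sketch using a hypothetical stage variable and column names:
svDescription derivation: IF IsNull(lkpIn.Description) THEN "UNKNOWN" ELSE lkpIn.Description
Out.Description derivation: svDescription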

2/16/2010 265Training Material - Sample Copy

Transformer Functions

• Date & Time

• Logical

• Null Handling

• Number

• String

• Type Conversion

2/16/2010 266Training Material - Sample Copy

Transformer Execution Order

• Derivations in stage variables are executed first

• Constraints are executed before derivations

• Column derivations in earlier links are executed before later links

• Derivations in higher columns are executed before lower columns

2/16/2010 267Training Material - Sample Copy

Transformer Stage Variables

• Derivations execute in order from top to bottom

– Later stage variables can reference earlier stage variables

– Earlier stage variables can also reference later stage variables

– In that case the referenced variable still holds the value derived from the previous row that came into the Transformer

• Multi-purpose

– Counters

– Store values from previous rows to make comparisons

– Store derived values to be used in multiple target field derivations

– Can be used to control execution of constraints
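
A minimal sketch of the previous-row comparison technique, with hypothetical stage variable and column names declared in this top-to-bottom order:
svIsNewKey derivation: IF In.CustID <> svPrevKey THEN 1 ELSE 0
svPrevKey derivation:  In.CustID
Because svIsNewKey is evaluated first, svPrevKey still holds the key value from the previous row when the comparison is made.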

2/16/2010 268Training Material - Sample Copy

Transformer Reject Links

2/16/2010 269Training Material - Sample Copy

Otherwise Link

2/16/2010 270Training Material - Sample Copy

Defining an Otherwise Link

2/16/2010 271Training Material - Sample Copy

Specifying Link Ordering

2/16/2010 272Training Material - Sample Copy

Parameter Sets

2/16/2010 273Training Material - Sample Copy

Parameter Sets

• Store a collection of parameters in a named object

• One or more value files can be named and specified

– A value file stores values for the specified parameters

– Values are picked up at run time

• Parameter sets can be added to the job parameters specified on the parameter tab of the job properties
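
For example, a parameter set named DBConnect (hypothetical) containing Server, User, and Password parameters can be referenced in a stage property as #DBConnect.Server#; selecting a value file such as DEV or PROD at run time supplies the actual values.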

2/16/2010 274Training Material - Sample Copy

Creating a New parameter set

2/16/2010 275Training Material - Sample Copy

Parameter Tab

2/16/2010 276Training Material - Sample Copy

Value Tab

2/16/2010 277Training Material - Sample Copy

Adding parameter set to job properties

2/16/2010 278Training Material - Sample Copy

Using parameter set parameters

2/16/2010 279Training Material - Sample Copy

Check point

• What occurs first? Derivations or Constraints?

• Can stage variables be referenced in constraints?

• Where should you test for nulls within a Transformer? Stage variable derivations or output column derivations?

2/16/2010 280Training Material - Sample Copy

Check point solution

• Constraints

• Yes

• Stage variable derivations. Reference the stage variable in Output column derivations

2/16/2010 281Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 11 : Repository Functions

2/16/2010 282Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Perform a simple Find

• Perform an Advanced Find

• Perform Impact Analysis

• Compare the differences between two Table Definitions

• Compare the differences between two jobs

2/16/2010 283Training Material - Sample Copy

Searching the Repository

2/16/2010 284Training Material - Sample Copy

Quick Find

2/16/2010 285Training Material - Sample Copy

Found Results

2/16/2010 286Training Material - Sample Copy

Advanced Find Window

2/16/2010 287Training Material - Sample Copy

Advanced Find Filtering Options

• Type: type of object
– E.g., Job, Table Definition, etc.

• Creation: range of dates
– E.g., up to a week ago

• Last Modification: range of dates
– E.g., up to a week ago

• Where Used: objects that use specified objects
– E.g., a job that uses a specified table definition

• Dependencies of: objects that are dependencies of specified objects
– E.g., a table definition that is referenced in a specified job

• Options
– Case sensitivity
– Search within the last result set

2/16/2010 288Training Material - Sample Copy

Using the Found Result

2/16/2010 289Training Material - Sample Copy

Impact Analysis

2/16/2010 290Training Material - Sample Copy

Performing Impact Analysis

• Find where table definitions are used
– Right-click over a stage or table definition
– Select “Find where table definition used” or
– Select “Find where table definition used (deep)”

• Deep includes additional object types
– Displays a list of objects using the table definition

• Find object dependencies
– Select “Find dependencies” or
– Select “Find dependencies (deep)”
– Displays a list of objects that the selected object depends on

• Graphical functionality
– Display the dependency path
– Collapse selected objects
– Move the graphical objects
– “Bird’s-eye” view

2/16/2010 291Training Material - Sample Copy

Initiating Impact Analysis for a Stage

2/16/2010 292Training Material - Sample Copy

Display the Dependency graphically

2/16/2010 293Training Material - Sample Copy

Display the dependency path

2/16/2010 294Training Material - Sample Copy

Generating an HTML Report

2/16/2010 295Training Material - Sample Copy

Job and Table Difference Reports

2/16/2010 296Training Material - Sample Copy

Finding the Difference between Two Jobs

• Example: Job1 is saved as Job2. Changes are made to Job2. What changes have been made?

– Here Job1 may be a production job; Job2 is a copy of the production job after enhancements or other changes have been made to it

2/16/2010 297Training Material - Sample Copy

Initiating the Comparison

2/16/2010 298Training Material - Sample Copy

Comparison Results

2/16/2010 299Training Material - Sample Copy

Saving to an HTML file

2/16/2010 300Training Material - Sample Copy

Comparing Table Definitions

• Same procedure as when comparing jobs

2/16/2010 301Training Material - Sample Copy

Check Point

• You can compare the differences between what two kinds of objects?

• What “Wild card” characters can be used in a Find?

• You have a job whose name begins with “abc”. You can’t remember the rest of the name or where the job is located. What would be the fastest way to export the job to a file?

• Name three filters you can use in an Advanced Find?

2/16/2010 302Training Material - Sample Copy

Check Point Solution

• Jobs. Table Definitions.

• Asterisk(*). It stands for any zero or more characters.

• Do a Find for objects matching “abc*”. Filter by type Job. Locate the job in the result set, right-click it, and then click Export.

• Type of object, creation date range, last modified date range, where used, dependencies of, and other options including case sensitivity and searching within the last result set.

2/16/2010 303Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 12: Working with Relational Data

2/16/2010 304Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Import table definitions for relational tables

• Create Data connections

• Use Connector stages in a job

• Use Sql Builder to define SQL Select statements

• Use Sql Builder to define SQL Insert and Update statements

• Use DB2 Enterprise stage

2/16/2010 305Training Material - Sample Copy

Working with Relational Data

• Importing relational data
– Import using ODBC or orchdbutil
• orchdbutil is preferred, in order to get correct type conversions

• Data Connection objects
– Store database connection information in a named object

• Stages available to access relational data
– Connector stages
• Parallel support
• Most functionality
• Consistent GUI and functionality across all relational types
– Enterprise stages
• Parallel support
– Plug-in stages
• Functionality ported from DataStage Server jobs

• Selecting data
– Build SELECT statements using SQL Builder

• Writing data
– Build INSERT, UPDATE, DELETE statements using SQL Builder

2/16/2010 306Training Material - Sample Copy

Importing Table Definitions

2/16/2010 307Training Material - Sample Copy

Importing Table Definitions

• Can import using ODBC or Orchestrate schema definitions
– Orchestrate schema imports are better because the data types are more accurate

• Import -> Table Definitions -> Orchestrate Schema Definitions

• Import -> Table Definitions -> ODBC Table Definitions

2/16/2010 308Training Material - Sample Copy

Orchestrate Schema Import

2/16/2010 309Training Material - Sample Copy

ODBC Import

2/16/2010 310Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 13: Metadata in the Parallel Framework

2/16/2010 318Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Explain Schemas

• Create Schemas

• Explain Runtime Column Propagation (RCP)

• Turn RCP ON and OFF

• Build a job that reads data from a sequential file using a Schema

• Build a shared Container

2/16/2010 319Training Material - Sample Copy

Schema

• Alternative way of specifying column definitions and record formats
– Similar to a Table Definition

• Written in a plain text file

• Can be imported as a Table Definition

• Can be created from a Table Definition

• Can be used in place of a Table Definition in a Sequential File stage
– Requires RCP

– Schema file path can be parameterized

• Enables a single job to process files with different column definitions
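
A schema file is plain text written in the Orchestrate schema format. A minimal sketch, with hypothetical column names and record properties that would need to match the actual file layout:
record
  {final_delim=end, delim=',', quote=double}
(
  CustID: int32;
  CustName: string[max=30];
  Balance: nullable decimal[8,2];
)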

2/16/2010 320Training Material - Sample Copy

Creating a Schema

• Using a text editor
– Follow the correct syntax for definitions
– Not recommended

• Import from an existing DataSet or FileSet
– In the Designer: Import -> Table Definitions -> Orchestrate Schema Definitions
– Select the check box for a file with a .fs or .ds extension

• Import from a database table

• Create from a Table Definition
– Click Parallel on the Layout tab

2/16/2010 321Training Material - Sample Copy

Importing a Schema

2/16/2010 322Training Material - Sample Copy

Creating a Schema from a Table Definition

2/16/2010 323Training Material - Sample Copy

Reading a sequential file using Schema

2/16/2010 324Training Material - Sample Copy

Runtime Column Propagation (RCP)

• When RCP is turned on:

– Columns of data can flow through a stage without being explicitly defined in the stage

– Target columns in a stage need not have any columns explicitly mapped to them

• No column mapping enforcement at design time

– Input columns are mapped to unmapped columns by name

• How implicit columns get into a job

– Read a file using a schema in a Sequential File stage

– Read a database table using “Select *”

– Explicitly define as an output column in a stage earlier in the flow

2/16/2010 325Training Material - Sample Copy

Runtime Column Propagation (RCP)

• Benefits of RCP

– Job flexibility
• Job can process input with different layouts

– Ability to create reusable components in shared containers
• Component logic can apply to a single named column
• All other columns flow through untouched

2/16/2010 326Training Material - Sample Copy

Enabling Runtime Column Propagation (RCP)

• Project level
– DataStage Administrator, Parallel tab

• Job level
– Job Properties, General tab

• Stage level
– Link Output Column tab

• Settings at a lower level override settings at a higher level
– E.g.: disable at the project level, but enable for a given job
– E.g.: enable at the job level, but disable for a given stage

2/16/2010 327Training Material - Sample Copy

Enabling RCP at Project Level

2/16/2010 328Training Material - Sample Copy

Enabling RCP at Job Level

2/16/2010 329Training Material - Sample Copy

Enabling RCP at Stage Level

• Sequential File stage
– Output Columns tab

• Transformer
– Open Stage Properties
– Stage Properties Output tab

2/16/2010 330Training Material - Sample Copy

When RCP is Disabled

• DataStage Designer enforces Stage Input to Output column mapping

2/16/2010 331Training Material - Sample Copy

When RCP is Enabled

• DataStage does not enforce mapping rules

• Runtime error if no incoming columns match unmapped target column names

2/16/2010 332Training Material - Sample Copy

Shared Containers

2/16/2010 333Training Material - Sample Copy

Shared Containers

• Encapsulate job design components into a shared container

• Provide reusable job design components

• Example
– Apply stored Transformer business logic

2/16/2010 334Training Material - Sample Copy

Creating a Shared Container

• Select Stages from an existing job

• Click Edit -> Construct Container -> Shared

2/16/2010 335Training Material - Sample Copy

Using a Shared Container in a Job

2/16/2010 336Training Material - Sample Copy

Mapping Input/output Links to the Container

2/16/2010 337Training Material - Sample Copy

Check Point

• What are the two benefits of RCP?

• What can you use to encapsulate stages and links in a job to make them reusable?

2/16/2010 338Training Material - Sample Copy

Check Point Solution

• Job Flexibility. Ability to create reusable components

• Shared Containers

2/16/2010 339Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Module 14: Job Control

2/16/2010 340Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:
• Use the DataStage Job Sequencer to build a job that controls a sequence of jobs
• Use Sequencer links and stages to control the sequence of a set of jobs
• Use Sequencer triggers and stages to control the conditions under which jobs run
• Pass information in job parameters from the master controlling job to the controlled jobs
• Define user variables
• Enable restart
• Handle errors and exceptions

2/16/2010 341Training Material - Sample Copy

What is a Job Sequence?

• A master controlling job that controls the execution of a set of subordinate jobs
• Passes values to subordinate job parameters
• Controls the order of execution (links)
• Specifies conditions under which the subordinate jobs get executed (triggers)
• Specifies complex flow of control
– Loops
– All / Any
– Wait for file
• Performs system activities
– Email
– Execute system commands and executables
• Can include restart checkpoints

2/16/2010 342Training Material - Sample Copy

Basics for Creating a new Job Sequence

• Open a new job sequence
– Specify whether it is restartable

• Add stages
– Stages to execute jobs
– Stages to execute system commands and executables
– Special purpose stages

• Add links

• Specify the order in which jobs are to be executed
– Specify triggers

– Triggers specify the condition under which control passes across a link

• Specify error handling

• Enable/ Disable Restart Checkpoints

2/16/2010 343Training Material - Sample Copy

Job Sequencer Stages

• Run stages
– Job Activity: run a job
– Execute Command: run a system command
– Notification Activity: send an email

• Flow control stages
– Sequencer: go if All / Any
– Wait For File: go when a file exists / doesn’t exist
– Start Loop / End Loop
– Nested Condition: go if a condition is satisfied

• Error handling
– Exception Handler
– Terminator

• Variables
– User Variables

2/16/2010 344Training Material - Sample Copy

Example

2/16/2010 345Training Material - Sample Copy

Sequence Properties

2/16/2010 346Training Material - Sample Copy

Job Activity Stage Properties

2/16/2010 347Training Material - Sample Copy

Job Activity Trigger
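
As a sketch of a custom (expression-based) trigger, assuming a Job Activity named Job_LoadDim (hypothetical name), control passes along the link only when the job finished with an OK status:
Job_LoadDim.$JobStatus = DSJS.RUNOK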

2/16/2010 348Training Material - Sample Copy

Execute Command Stage

2/16/2010 349Training Material - Sample Copy

Notification Activity Stage

2/16/2010 350Training Material - Sample Copy

User Variables Stages

2/16/2010 351Training Material - Sample Copy

Referencing the User Variable

2/16/2010 352Training Material - Sample Copy

Flow of Control Stages

2/16/2010 353Training Material - Sample Copy

Wait for File Stage

2/16/2010 354Training Material - Sample Copy

Sequencer Stage

2/16/2010 355Training Material - Sample Copy

Nested Condition Stage

2/16/2010 356Training Material - Sample Copy

Loop Stages

(Screenshot callouts: counter values; pass counter value; reference link back to the Start Loop stage)

2/16/2010 357Training Material - Sample Copy

Error Handling

2/16/2010 358Training Material - Sample Copy

Handling Activities that fail

Control passes to the Exception Handler stage when an activity fails

2/16/2010 359Training Material - Sample Copy

Exception Handler Stage

Control goes here if any activity fails

2/16/2010 360Training Material - Sample Copy

Restart

2/16/2010 361Training Material - Sample Copy

Enable Restart

Enable checkpoints to be added

2/16/2010 362Training Material - Sample Copy

Disable Checkpoint at a Stage

Don’t checkpoint this activity

2/16/2010 363Training Material - Sample Copy

Check Point

• Which stage is used to run jobs in a job sequence?

• Does the Exception Handler stage support an input link?

2/16/2010 364Training Material - Sample Copy

Check Point Solution

• Job Activity Stage

• No, control is automatically passed to the stage when an exception occurs

2/16/2010 365Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 1: Complex Flat File Stage

2/16/2010 366Training Material - Sample Copy

Module objectives

After completing this module, you should be able to:
• Import table definitions from a COBOL copybook
• Design a job that extracts data from a COBOL file containing multiple record types
• Specify in a Complex Flat File (CFF) stage the column layouts of each record type
• Specify in a CFF stage how to identify when a record of a specific type is read
• Select in a CFF stage which columns from the different record types are to be output from the stage

2/16/2010 367Training Material - Sample Copy

Complex Flat File Stage

• Process data in a COBOL file
– File is described by a COBOL file description
– File can contain multiple record types

• COBOL copybooks with multiple record formats can be imported as COBOL File Definitions
– Each format is stored as a separate DataStage table definition

• Columns can be loaded for each record type

• On the Records ID tab, you specify how to identify each type of record

• Columns from any or all record types can be selected for output
– This allows columns of data from multiple records of different types to be combined into a single output record

2/16/2010 368Training Material - Sample Copy

Sample COBOL Copybook

CLIENT record format

POLICY record format

COVERAGE record format

2/16/2010 369Training Material - Sample Copy

Importing a COBOL File Definition

(Screenshot callouts: Level 01 column position; Level 01 items)

2/16/2010 370Training Material - Sample Copy

COBOL Table Definitions

Level numbers

2/16/2010 371Training Material - Sample Copy

COBOL File Layout

COBOL layout

Layout tab

2/16/2010 372Training Material - Sample Copy

Specifying a Date Mask

Select date mask

2/16/2010 373Training Material - Sample Copy

Example Data File with Multiple Formats

Record Type = ‘1’ CLIENT record

Record Type = ‘2’ POLICY record

Record Type = ‘3’ COVERAGE record

2/16/2010 374Training Material - Sample Copy

Sample Job with CFF Stage

CFF Stage

2/16/2010 375Training Material - Sample Copy

File Options tab

Data file

2/16/2010 376Training Material - Sample Copy

Records Tab

Load columns for record type

Set as master

Active record type

Add another record type

2/16/2010 377Training Material - Sample Copy

Record ID Tab

Condition that identifies the type of record
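
For example, using the sample data file shown earlier, a condition such as RECORD_TYPE = '2' (with a hypothetical column name that is present in every record format) identifies POLICY records.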

2/16/2010 378Training Material - Sample Copy

Selection Tab

2/16/2010 379Training Material - Sample Copy

Record Options Tab

2/16/2010 380Training Material - Sample Copy

Layout Tab

Layout tab

COBOL layout

2/16/2010 381Training Material - Sample Copy

View Data

CLIENT columns POLICY columns Coverage columns

2/16/2010 382Training Material - Sample Copy

Processing Multi-Format Records

Derivations identify which type of record is

coming into the Transformer

Stage variable in the Transformer

2/16/2010 383Training Material - Sample Copy

Transformer Constraints

2/16/2010 384Training Material - Sample Copy

Checkpoint

1. What types of files contain the metadata that is typically loaded into the CFF stage?

2. Does the CFF stage support variable length records?

3. How does DataStage know which type of record it is reading from a file containing records of different formats?

4. What does selecting a record type as master accomplish?

5. How many record types can be designated Master?

2/16/2010 385Training Material - Sample Copy

Checkpoint solutions

1. COBOL copybooks or COBOL file definitions.

2. Yes, it can read files containing multiple record formats, each of a different physical length.

3. On the records ID tab, you define constraints that identify the record type. These must reference fields common to all record formats.

4. When a master record is read, all outputs columns are emptied before the master record contents are written.

5. Only one.

2/16/2010 386Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 2: Slowly Changing Dimension

2/16/2010 387Training Material - Sample Copy

Unit Objectives

After completing this unit, you should be able to:

• Design a job that creates a surrogate key source key file

• Design a job that updates a surrogate key source key file from a dimension table

• Design a job that processes a star schema database with Type 1 and Type 2 slowly changing dimensions

2/16/2010 388Training Material - Sample Copy

Surrogate Key Generation Stage

2/16/2010 389Training Material - Sample Copy

Surrogate Key Generator Stage

• Use to create and update the surrogate key state file

• Surrogate key state file

– One file per dimension table

– Stores the last used surrogate key integer for the dimension table

– Binary file

2/16/2010 390Training Material - Sample Copy

Example Job to Create Surrogate State Files

(Screenshot callouts: create the surrogate state file for the Store dimension table; create the surrogate state file for the Product dimension table)

2/16/2010 391Training Material - Sample Copy

Editing the Surrogate Key Generator Stage

Path to state file

Create the state file

2/16/2010 392Training Material - Sample Copy

Example Job to Update the Surrogate State File

2/16/2010 393Training Material - Sample Copy

Specifying the Update Information

Table column containing surrogate key values

Update the state file

2/16/2010 394Training Material - Sample Copy

Slowly Changing Dimension Stage

2/16/2010 395Training Material - Sample Copy

Slowly Changing Dimension Stage

• Used for processing a star schema
• Performs a lookup into a star schema dimension table
– Multiple SCD stages can be daisy-chained to process multiple dimension tables
• Inserts new rows into the dimension table as required
• Updates existing rows in the dimension table as required
– Type 1 fields of a matching row are overwritten
– Type 2 fields of a matching row are retained as history rows
• A new record with the new field value is added to the dimension table and made the current record
• Generally used in conjunction with the Surrogate Key Generator stage
– Creates a surrogate key state file that keeps track of the previously used surrogate keys

2/16/2010 396Training Material - Sample Copy

Star Schema Database Structure and Mapping

2/16/2010 397Training Material - Sample Copy

Example Slowly Changing Dimension Job

(Screenshot callouts: check for matching StoreDim rows; perform Type 1 and Type 2 updates to the StoreDim table; check for matching Product rows; perform Type 1 and Type 2 updates to the Product table)

2/16/2010 398Training Material - Sample Copy

Working in the SCD Stage

• Five “Fast Path” pages to edit

• Select the output link
– This is the link coming out of the SCD stage that is not used to update the dimension table

• Specify the purpose codes
– Fields to match by
• Business key fields and the source fields to match to them
– Surrogate key
– Type 1 fields
– Type 2 fields
– Current indicator for Type 2
– Effective Date, Expire Date for Type 2

• Surrogate key management
– Location of the state file

• Dimension update specification

• Output mappings

2/16/2010 399Training Material - Sample Copy

Selecting the Output Link

Select the output link

2/16/2010 400Training Material - Sample Copy

Specifying the Purpose Codes

Lookup key mapping

Type 1 field

Type 2 field

Surrogate key

Fields used for Type2 handling

2/16/2010 401Training Material - Sample Copy

Surrogate Key Management

Path to state file

Initial surrogate key value

Number of values to retrieve at one time

2/16/2010 402Training Material - Sample Copy

Dimension Update Specification

Function used to retrieve the next surrogate key value

Value that means current

Function used to calculate the history date range

2/16/2010 403Training Material - Sample Copy

Output Mappings

2/16/2010 404Training Material - Sample Copy

Checkpoint

1. How many Slowly Changing Dimension stages are needed to process a star schema with 4 dimension tables?

2. How many surrogate key state files are needed to process a star schema with 4 dimension tables?

3. What’s the difference between a Type1 and a Type 2 dimension field attribute ?

4. What additional fields are needed for handling a Type 2 Slowly Changing Dimension field attribute ?

2/16/2010 405Training Material - Sample Copy

Checkpoint solutions

1. Four SCD stages are needed, one for each dimension table. Each SCD stage does a lookup and update to its table.

2. Four surrogate key state files are needed. One for each dimension table. A separate state file is used for each.

3. Type 1 is a simple update. The value in the dimension record field is overwritten with the new value. Type 2 retains the value in a history record. A new record is created with the current value.

4. Three additional fields are needed: the current indicator flags whether a given record contains the current Type 2 value or an earlier value. The Effective Date and Expire Date fields are used to specify when the given record is applicable.

2/16/2010 406Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 3: Installation Run-Through

2/16/2010 407Training Material - Sample Copy

Module Objectives

After completing this module, you should be able to:

• Run-through the installation process

• Start the Information Server

2/16/2010 408Training Material - Sample Copy

Start the Installation

2/16/2010 409Training Material - Sample Copy

Installation and Response File Selection

2/16/2010 410Training Material - Sample Copy

Information Server Domain Location

Destination Folder

2/16/2010 411Training Material - Sample Copy

Installation Layers

2/16/2010 412Training Material - Sample Copy

License file Selection

2/16/2010 413Training Material - Sample Copy

Product Selection

2/16/2010 414Training Material - Sample Copy

Installation Type

2/16/2010 415Training Material - Sample Copy

Repository Installation

2/16/2010 416Training Material - Sample Copy

Repository Configuration

2/16/2010 417Training Material - Sample Copy

Application Server Installation

2/16/2010 418Training Material - Sample Copy

Application Server Administration

2/16/2010 419Training Material - Sample Copy

Information Server Administration

2/16/2010 420Training Material - Sample Copy

DataStage Projects

2/16/2010 421Training Material - Sample Copy

Language Selection

2/16/2010 422Training Material - Sample Copy

DB2 Server Selection

2/16/2010 423Training Material - Sample Copy

DB2 Instance Owner

2/16/2010 424Training Material - Sample Copy

ODBC Drivers

2/16/2010 425Training Material - Sample Copy

National Language Support

2/16/2010 426Training Material - Sample Copy

Information Server User IDs

• Database owner: e.g., xmeta

• DB2 instance owner: e.g., db2admin

• Application server ID: e.g., appserv

• Information Server administrator: e.g., admin

• Be sure all user IDs have passwords that have not expired

2/16/2010 427Training Material - Sample Copy

Testing the Installation

• You can partially test the installation by logging onto the Information Server web console

2/16/2010 428Training Material - Sample Copy

Checkpoint

1. List the five installation layers that can be installed.

2/16/2010 429Training Material - Sample Copy

Checkpoint solutions

• Client, Engine, Domain, Repository, and Documentation.

2/16/2010 430Training Material - Sample Copy

IBM WebSphere DataStage 8.1

____________________________Special Topics 4: Solution Development Jobs

2/16/2010 431Training Material - Sample Copy

Module objectives

After completing this module, you should be able to:

• List and describe the Warehouse jobs

• Understand the stages and techniques used in the Warehouse jobs

2/16/2010 432Training Material - Sample Copy

Introduction to the Solution Development Exercises

2/16/2010 433Training Material - Sample Copy

Solution Development Jobs

• Series of 4 jobs extracted from production jobs

• Use a variety of stages in interesting, realistic configurations
– Sort, Aggregator stages
– Join, Lookup stages
– Peek, Filter stages
– Modify stage
– Oracle stage

• Contain useful techniques
– Use of Peeks
– Datasets used to “connect” jobs
– Use of project environment variables in job parameters
– Fork joins
– Lookups for auditing

2/16/2010 434Training Material - Sample Copy

Warehouse Job 01

2/16/2010 435Training Material - Sample Copy

Warehouse Job 02

2/16/2010 436Training Material - Sample Copy

Warehouse Job 03

2/16/2010 437Training Material - Sample Copy

Warehouse Job 04

2/16/2010 438Training Material - Sample Copy

Warehouse Job 02 With Lookup

2/16/2010 439Training Material - Sample Copy

Checkpoint

1. What is a Fork-Join?

2/16/2010 440Training Material - Sample Copy

Checkpoint solutions

1. A development technique that forks a stream into two outputs and later joins these two streams back together again.

2/16/2010 441Training Material - Sample Copy