DataStage / QualityStage Fundamentals


Page 1: DataStage Quality Stage Fundamentals

IBM Information Server 8.x

DataStage / QualityStage Fundamentals

ValueCap Systems

ValueCap Systems - Proprietary

Page 2: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Introduction

What is Information Server (IIS) 8.x?

• Suite of applications that share a common repository
• Common set of Application Services (hosted by the WebSphere application server)
• Data Integration toolset (ETL, profiling, and data quality)
• Employs a scalable parallel processing Engine
• Supports an N-tier layered architecture
• Newer version of the data integration/ETL tool set offered by IBM
• Web browser interface to manage security and authentication

2

Page 3: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Product Suite

IIS is organized into 4 layers:

• Client: Administration, Analysis, Development, and User Interface
• Metadata Repository: Single repository for each install. Can reside in a DB2, Oracle, or SQL Server database. Stores configuration, design, and runtime metadata. DB2 is the supplied database.
• Domain: Common Services. Requires WebSphere Application Server. Single domain for each install.
• Engine: Core engine that runs all ETL jobs. The engine install includes connectors, packs, job monitors, performance monitors, the log service, etc.

Note: the Metadata Repository, Domain, and Engine can reside on the same server or on separate servers. Multiple engines can exist in a single Information Server install.

3

Page 4: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Detailed IS Architecture

[Architecture diagram] The diagram arranges the components into the four layers: the Client layer (DataStage & QualityStage clients, Admin Console client, Reporting Console client), the Domain layer (WebSphere Application Server hosting the IBM WebSphere Metadata Server with its Metadata Services and Import/Export Manager, plus Metadata Workbench, WebSphere Business Glossary, Information Analyzer, and FastTrack), the Metadata Repository layer (the metadata DB and the IADB profiling database), and the Engine layer (the DataStage & QualityStage engine). External data sources such as Erwin and Cognos feed metadata into the repository.

4

Page 5: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Information Server 8.1 Components

Core Components:

• Information Analyzer profiles and establishes an understanding of source systems, and monitors data rules on an ongoing basis to eliminate the risk of proliferating incorrect and inaccurate data.
• QualityStage standardizes and matches information across heterogeneous sources.
• DataStage® extracts, transforms, and loads data between multiple sources and targets.
• Metadata Server provides unified management, analysis, and interchange of metadata through a shared repository and services infrastructure.
• Business Glossary defines data stewards, creates and manages business terms and definitions, and relates these to physical data assets.
• Metadata Workbench provides unified management, analysis, and interchange of metadata through the shared repository and services infrastructure.
• FastTrack: easy-to-use import/export features allow business users to take advantage of the familiar Microsoft® Excel® interface and create new specifications.
• Federation Server defines integrated views across diverse and distributed information sources, including cost-based query optimization and integrated caching.
• Information Services Director enables information access and integration processes to be published as reusable services in a SOA.

5

Page 6: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Information Server 8.1 Components

Optional Components:

• Rational Data Architect provides enterprise data modeling and information integration design capabilities.
• Replication Server provides high-speed, event-based replication between databases for high availability, disaster recovery, data synchronization, and data distribution.
• Data Event Publisher detects and responds to data changes in source systems, publishing changes to subscribed systems or feeding changed data into other modules for event-based processing.
• InfoSphere Change Data Capture: log-based Change Data Capture (CDC) technology, acquired in the DataMirror acquisition, detects and delivers changed data across heterogeneous data sources such as DB2®, Oracle, SQL Server, and Sybase. Supports service-oriented architectures (SOAs) by packaging real-time data transactions into XML documents and delivering them to and from messaging middleware such as WebSphere MQ.
• DataStage Pack for SAP BW (DataStage BW Pack): a companion product of IBM Information Server. The pack was originally developed to support SAP BW and currently supports both SAP BW and SAP BI. The GUIs of the DataStage BW Pack are installed on the DataStage Client; the runtime part of the pack is installed on the DataStage Server.

6

Page 7: DataStage Quality Stage Fundamentals

IBM Information Server 8.x

DataStage / QualityStage Fundamentals

ValueCap Systems

ValueCap Systems - Proprietary

Page 8: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Course Objectives

Upon completion of this course, you will be able to:

• Understand principles of parallel processing and scalability
• Understand how to create and manage a scalable job using DataStage
• Implement your business logic as a DataStage job
• Build, compile, and execute DataStage jobs
• Execute your DataStage jobs in parallel
• Enhance DataStage functionality by creating your own stages
• Import and export DataStage jobs

8

Page 9: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview (Page 10)
2. Parallel Framework Overview (Page 73)
3. Data Import and Export (Page 116)
4. Data Partitioning, Sorting, and Collection (Page 252)
5. Data Transformation and Manipulation (Page 309)
6. Data Combination (Page 364)
7. Custom Components: Wrappers (Page 420)
8. Custom Components: Buildops (Page 450)
9. Additional Topics (Page 477)
10. Glossary (Page 526)

9


Page 10: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

IS DataStage Overview

In this section we will discuss:

• Product History
• Product Architecture
• Project setup and configuration
• Job Design
• Job Execution
• Managing Jobs and Job Metadata

10

Page 11: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Product History

Prior to IBM's acquisition of Ascential Software, Ascential had performed a series of its own acquisitions:

• Ascential started off as VMark before it became Ardent Software and introduced DataStage as an ETL solution
• Ardent was then acquired by Informix and, through a reversal in fortune, Ardent management took over Informix
• Informix was then sold to IBM and Ascential Software was spun out with approximately $1 billion in the bank as a result
• Ascential Software kept DataStage as its cash cow product, but started focusing on a bigger picture: Data Integration for the Enterprise

Page 12: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Product History (Continued…)

With plenty of money in the bank and a weakening economy, Ascential embarked upon a phase of acquisitions to fulfill its vision of becoming the leading Data Integration software provider.

• DataStage Standard Edition was the original DataStage product and is also known as DataStage Server Edition. 'Server' will be going away with the Hawk release later in 2006.
• DataStage Enterprise Edition was originally Orchestrate, which had been renamed to Parallel Extender after the Torrent acquisition.
• Vality's Integrity was renamed to QualityStage.
• DataStage TX was originally known as Mercator and renamed when purchased by Ascential.
• ProfileStage was once Metagenix's Metarecon software.

12

Page 13: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Product History (Continued…)

By 2004, Ascential had completed its acquisitions and turned its focus onto completely integrating the acquired technologies.

Ascential's Data Integration Suite (diagram): DISCOVER — ProfileStage™ discovers data content and structure; PREPARE — QualityStage™ standardizes, matches, and corrects data; TRANSFORM and DELIVER — DataStage™ transforms, enriches, and delivers data. All three run on the Parallel Execution Engine and share Meta Data Management, Real-Time Integration Services, Enterprise Connectivity and Event Management, and a Service-Oriented Architecture.

13

Page 14: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Product History (Continued…)

In 2005, IBM acquired Ascential. In November of 2006, IBM released Information Server version 8, which included WebSphere Application Server, DataStage, QualityStage, and other tools, some of which are part of the standard install, and some of which are optional:

• FastTrack
• Metadata Workbench
• Information Analyzer (formerly ProfileStage)
• WebSphere Federation Server
• … and others

14

Page 15: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

“Old” DataStage Client/Server Architecture

• 4 clients, 1 server…

Clients (Microsoft® Windows NT/2K/XP/2003): Designer, Director, Manager, Administrator

Server (Windows or UNIX — AIX, Solaris, TRU64, HP-UX, USS): hosts the DataStage Enterprise Edition Framework and the DataStage Repository

15

Page 16: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

New DataStage Client/Server Architecture

• 3 clients, 1 (or more) server(s)…

Clients (Microsoft® Windows XP/2003/Vista): Designer, Director, Administrator

Server (Windows or UNIX — Linux, AIX, Solaris, HP-UX, USS): hosts the DataStage Enterprise Edition Framework, the Common Repository, and the Application Server

• The Common Repository can be on a separate server
• The default J2EE-compliant Application Server is WebSphere Application Server
• Clients now handle both DataStage and QualityStage
• No more Manager client

16

Page 17: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Clients: Administrator

DataStage Administrator

• Manage licensing details
• Create, update, and administer projects and users
• Manage environment variable settings for the entire project

17

Page 18: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Administrator – Logon

When first connecting to the Administrator, you will need to provide the following:

• Server address where the DataStage repository was installed
• Your userid
• Your password
• Assigned project

18

Page 19: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Administrator – Projects

Next, click on 'Add' to create a new DataStage project.

• In this course, each student will create his/her own project
• In typical development environments, many developers can work on the same project

Project paths/locations can be customized, e.g.:

C:\IBM\InformationServer\Projects\ANALYZEPROJECT
C:\IBM\InformationServer\Projects\

19

Page 20: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Administrator – Projects

Once a project has been created, it is populated with default settings. To change these defaults, click on the Properties button to bring up the Project Properties window.

Next, click on the Environment button…

C:\IBM\InformationServer\Projects\Sample

20

Page 21: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Administrator – Environment

This window displays all of the default environment variable settings, as well as the user-defined environment variables.

Do not change any values for now… and click OK when done.

21

Page 22: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Administrator – Other Options

Useful options to set for all projects include:

• Enable job administration in Director – this allows various administrative actions to be performed on jobs via the Director interface
• Enable Runtime Column Propagation for Parallel Jobs (aka RCP) – a feature which allows column metadata to be automatically propagated at runtime. More on this later…

22

Page 23: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Clients: Designer

DataStage Designer

• Develop DataStage jobs or modify existing jobs
• Compile jobs
• Execute jobs
• Monitor job performance
• Manage table definitions
• Import table definitions
• Manage job metadata
• Generate job reports

23

Page 24: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Login

After logging in, you should see a similar screen:

24

Page 25: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Where to Start?

Select to create a new DataStage job

Open any existing DataStage job

Open a job that you were recently working on…

For the majority of lab exercises, you will be selecting 'Parallel Job' or using the Existing and Recent tabs.

Page 26: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Elements

Indicates 'Parallel Canvas' (i.e. a parallel DataStage job)

These 'boxes' can be docked in various locations within this interface. Just click and drag them around…

Icons can be made larger by right-clicking inside the palette to access the menu. Categories can be edited and customized as well.

The DataStage Designer user interface can be customized to your preferences. Here are just a few of the options…

26

Page 27: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Toolbar

New Job | Open Existing Job | Save / Save All | Job Properties | Job Compile | Run Job | Grid Lines | Snap to Grid | Link Markers | Zoom In / Out

These are some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.

27

Page 28: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Paradigm

Left-click and drag the stage(s) onto the canvas.

You can also left-click on the stage once, then position your mouse cursor on the canvas and left-click again to place the chosen stage there.

28

Page 29: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Paradigm

To create the link, you can right-click on the 'upstream' stage and drag the mouse pointer to the 'downstream' stage. This will create a link as shown here.

Alternatively, you can select the link icon from the 'General' category in your Palette by left-clicking on it.

29

Page 30: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Design Feedback

When “Show stage validation errors” under the Diagram menu is selected (the default) DataStage Designer uses visual cues to alert users that there’s something wrong.

Placing the mouse cursor over an exclamation mark on a stage will display a message indicating what the problem is.

A red link indicates that the link cannot be left ‘dangling’ and must have a source and/or target attached to it.

30

Page 31: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Labels

You may notice that the default labels that are created on the stages and links are not very intuitive.

You can easily change them by left-clicking once on the stage or link and then start typing a more appropriate label. This is considered to be a best practice. You will understand why shortly.

Labels can also be changed by right-clicking on the stage or link and selecting the ‘Rename’ option.

31

Page 32: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Stage Properties

Double-clicking on any stage on the canvas, or right-clicking and selecting 'Properties', will bring up the options dialogue for that particular stage.

Almost all stages will require you to open and edit their properties and set them to appropriate values.

However, almost all property dialogues follow the same paradigm.

32

Page 33: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Stage Properties

Here’s an example of a fairly common stage properties dialogue box.

The Properties tab will always contain the stage specific options. Mandatory entries will be highlighted red.

The Input tab allows you to view the incoming data layout as well as define data partitioning (we will cover this in detail later).

The Output tab allows you to view and map the outgoing data layout.

33

Page 34: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Stage Input

Once you’ve changed the link label to something more appropriate, it will make it easier to track your metadata. This is especially true if there are multiple inputs or outputs.

We will discuss partitioning in detail later…

Another useful feature on the Input properties tab is the fact that you can see what the incoming data layout looks like.

34

Page 35: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Stage Output

On the Output tab, there is a Mapping tab and another Columns tab.

Note that the columns are missing on the Output side. Where did they go? We saw them on the Input, right?

The answer lies in the Mapping tab. This is the Source to Target mapping paradigm you will find throughout DataStage. It is a means of propagating design-time metadata from source to target…

35

Page 36: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Field Mapping

Source to Target mapping is achieved by two methods in DataStage:

• Left-clicking and dragging a field or collection of fields from the Source side (left) to the Target side (right).
• Left-clicking on the 'Columns' bar on the Source side and dragging it into the Target side. This is illustrated above.

When performed correctly, you will see the Target side populated with some or all of the fields from the Source side, depending on your selection.

36

Page 37: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Field Mapping

Once the mapping is complete, you can go back into the Output Columns tab and you will notice that all of the fields you’ve mapped from Source to Target now appear under the Columns tab.

You may have also noticed the ‘Runtime column propagation’ option below the columns. This is here because we enabled it in the Administrator. If you do not see this option, it is likely because it did not get enabled.

37

Page 38: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – RCP

What is Runtime Column Propagation?

� Powerful feature which allows you to bypass Source

to Target mapping

� At runtime (not design time), it will automatically

propagate all source columns to the target for all

stages in your job.

� What this means: if you are reading in a database

table with 200 columns/fields, and your business

logic only affects 2 of those columns, then you only

need to specify 2 out of 200 columns and

subsequently enable RCP to handle the rest.

38
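
To make the idea concrete, here is a minimal sketch in plain Python (not DataStage code, and the column names are invented for illustration): the transform names only the columns it touches and passes every other column through unchanged, which is what RCP does for unmapped source columns at runtime.

def standardize_names(record):
    # touch only the 2 columns the business logic cares about...
    out = dict(record)
    out["first_name"] = record["first_name"].upper()
    out["last_name"] = record["last_name"].upper()
    # ...every other column (however many there are) passes through untouched
    return out

row = {"first_name": "ada", "last_name": "lovelace", "city": "London", "cust_id": 7}
print(standardize_names(row))
# -> {'first_name': 'ADA', 'last_name': 'LOVELACE', 'city': 'London', 'cust_id': 7}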

Page 39: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Mapping vs RCP

So, why map when you can RCP?

• Design time vs runtime consideration
• When working on a job flow that affects many fields, it is easier to have the metadata 'there' to work with
• Mapping also provides explicit documentation of what is happening
• Note that RCP can be combined with Mapping
  • Enable RCP by default, and then turn it off when you only want to propagate a subset of fields. Do this by only mapping the fields you need.
  • It is often better to keep RCP enabled at all times, but be careful when you only want to keep certain columns and not others!

39

Page 40: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Table Definitions

Table Definitions in DataStage are the same as a table layout or schema. You can manually enter everything, and these can be saved for re-use later…

Specify the 'location' where the table definition is to be saved. Once saved, the table definition can be accessed from the repository view.

40

Page 41: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Metadata Import

Table Definitions can also be automatically generated by translating definitions stored in various formats. Popular options include COBOL copybooks and RDBMS table layouts.

RDBMS layouts can be accessed via a couple of options:

• ODBC Table Definitions
• Orchestrate Schema Definitions (via the orchdbutil option)
• Plug-in Meta Data Definitions

41

Page 42: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Job Properties

The Parameters tab allows users to add environment variables – both pre-defined and user-defined.

Once selected, a variable will show up in the Job Properties window. The default value can be altered to a different value.

Parameters can be used to control job behavior as well as be referenced within stages, allowing simple adjustment of properties without having to modify the job itself.

42

Page 43: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Job Compile/Run

Before a job can be executed, it must first be saved and compiled. Compilation will validate that all necessary options are set and defined within each of the stages in the job.

To run the job, just click the Run button on the Designer toolbar (next to the Compile button). Alternatively, you can also click the Run button from within the Director.

The Director contains the job run log, which provides much more detail than the Designer will.

43

Page 44: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Job Statistics

As a job is executing, you can right-click on the canvas and select 'Show performance statistics' to monitor your job's performance.

Note that the link colors signify job status: blue means it is running and green means it has completed. If a link is red, then the job has aborted due to error.

44

Page 45: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Export

The Designer is also used for exporting and importing DataStage jobs, table definitions, routines, containers, etc.

Items can be exported in one of two formats: DSX or XML. DSX is DataStage's internal format. Both formats can be opened and viewed in a standard text editor.

We do not recommend altering the contents unless you really know what you are doing!

45

Page 46: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Export

You can export the contents of the entire project, or individual components. You can also export items into an existing file by selecting the ‘Append to existing file’ option.

Exported projects, depending on the total number of jobs, can grow to be several megabytes. However, these files can be easily compressed.

46

Page 47: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Designer – Import

Previously exported items can be imported via the Designer. You can choose to import everything or only selected content.

DSX files from previous versions of DataStage can also be imported. The upgrade to the current version will occur on the fly as the content is being imported into the repository.

47

Page 48: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Clients: Director

DataStage Director:

• Execute DataStage jobs
• Compile jobs
• Reset jobs
• Schedule jobs
• Monitor job performance
• Review job logs

48

Page 49: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Director – Access

The easiest way to access the Director is from within the Designer. This bypasses the need to log in again.

Alternatively, you can double-click on the Director icon to bring up the Director interface.

49

Page 50: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Director – Interface

The Director's default interface shows a list of jobs along with their status.

You will be able to see whether jobs are compiled, how long they took to run, and when they were last run.

50

Page 51: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Director – Toolbar

Open Project | Job Status View | Job Scheduler | Job Log | Reset Job | Run Job

These are some of the useful icons you will become very familiar with as you get to know DataStage. Note that if you let the mouse pointer hover over any icon, a tooltip will appear.

51

Page 52: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Director – Interface

Whenever a job runs, you can view the job log in the Director.

Current entries are in black, whereas previous runs will show up in blue.

Double-click on any entry to access more details. What you see here is often just a summary view…

52

Page 53: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Director – Monitor

To enable job monitoring from within the Director, go to the Tools menu and select 'New Monitor'.

You can set the update intervals as well as specify which statistics you would like to see.

Colors correspond to status: blue means a stage is running, green means it has finished, and red indicates a failure.

53

Page 54: DataStage Quality Stage Fundamentals

IBM Information Server DataStage /

QualityStage Fundamentals – Labs

ValueCap Systems

ValueCap Systems - Proprietary

Page 55: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 1A: Project Setup & Configuration

55

Page 56: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 1A Objective

• Learn to set up and configure a simple project for IBM Information Server DataStage / QualityStage

56

Page 57: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a New Project

• Log into the DataStage Administrator using the userid and password provided to you by the instructor. Steps are outlined in the course material.
• Click on the 'Add' button to create a new project. Your instructor may advise you on a project name – do not change the default project directory.
• Click OK when finished.

57

Page 58: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Project Setup

• Click on the new project you have just created and select the 'Properties' button.
• Under the 'General' tab, check the boxes next to:
  • Enable job administration in the Director
  • Enable Runtime Column Propagation for Parallel Jobs
• Next, click on the 'Environment' button to bring up the Environment Variables editor.

58

Page 59: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Environment Variable Settings

• The Environment Variables editor should be similar to the screenshot shown here.
• We only need to change a couple of values…
  • APT_CONFIG_FILE – the instructor will provide the value.
  • Click on Reporting and set APT_DUMP_SCORE to TRUE.
  • The instructor will provide details if any other environment variable needs to be defined.

59

Page 60: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

• Setting APT_CONFIG_FILE defines the default configuration file used by jobs in the project.
• Setting APT_DUMP_SCORE will enable additional diagnostic information to appear in the Director log.
• Click the OK button when finished editing Environment Variables.
• Click OK and then Close to exit the Administrator.
• You have now finished configuring your project.

60

Page 61: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 1B: Designer Walkthrough

61

Page 62: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 1B Objective

� Become familiar with DataStage Designer.

62

Page 63: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Getting Into the Designer

• Log into the DataStage Designer using the userid and password provided to you by the instructor. Be sure to select the project you just created when logging in.
• Once connected, select the 'Parallel Job' option and click on OK.
• You should see a blank canvas with the Parallel label in the upper left-hand corner.

Page 64: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Create a Simple Job

• We will construct the following simple job:
• Use the techniques covered in the lecture material to build the job.
• The job consists of a Row Generator stage and a Peek stage.
  • For the Row Generator, you will need to enter the following table definition:
• Alter the stage and link labels to match the diagram above.

Page 65: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Compile and Run the Job

• Save the job as 'lab1b'.
• Click on the Compile button.
  • Did the job compile successfully? If not, can you determine why not? Try to correct the problem(s) in order to get the job to compile.
• Once the job has compiled successfully, right-click on the canvas and select 'Show performance statistics'.
• Click on the Job Run button. Once your job finishes executing, you should see the following output:

Page 66: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 1C: Director Walkthrough

66

Page 67: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 1C Objective

� Become familiar with DataStage Director.

67

Page 68: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Getting Into the Director

• Log into the DataStage Director using the userid and password provided to you by the instructor. You can also use the shortcut shown in the course materials.
• Once connected, you should see the status of lab1b, which was just executed from within the Designer:

68

Page 69: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing the Job Log

• Click on the Job Log button on the toolbar to access the log for lab1b.
• The log should be very similar to the screenshot here:
• There should not be any red (error) icons.

69

Page 70: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Director Job Log

• Take a closer look at some of the entries in the log. Double-click on the following highlighted selections:
• The first one shows the configuration file being used. The next few entries show the output of the Peek stage.
• Also note the Job Status.

70

Page 71: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Stage Output

• The Peek stage output in the Director log should be similar to the following:
• The Peek stage is similar to inserting a 'Print' statement into the middle of a program.
• Where did this data come from? The data was generated by the Row Generator stage! You will learn more about this powerful stage in later sections & labs.

Page 72: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview (Page 10)
2. Parallel Framework Overview (Page 73)
3. Data Import and Export (Page 116)
4. Data Partitioning, Sorting, and Collection (Page 252)
5. Data Transformation and Manipulation (Page 309)
6. Data Combination (Page 364)
7. Custom Components: Wrappers (Page 420)
8. Custom Components: Buildops (Page 450)
9. Additional Topics (Page 477)
10. Glossary (Page 526)

72


Page 73: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Framework Overview

In this section we will discuss:

• Hardware and Software Scalability
• Traditional processing
• Parallel processing
• Configuration File overview
• Parallel Datasets

73

Page 74: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Scalability

Scalability is a term often used in product marketing but seldom well defined:

• Hardware vendors claim their products are highly scalable
  • Computers
  • Storage
  • Network
• Software vendors claim their products are highly scalable
  • RDBMS
  • Middleware

74

Page 75: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Scalability Defined

How should scalability be defined? Well, that depends on the product. For Parallel DataStage:

• The ability to process a fixed amount of data in decreasing amounts of time as hardware resources (CPU, memory, storage) are increased
• Could also be defined as the ability to process growing amounts of data by increasing hardware resources accordingly

75

Page 76: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Scalability Illustrated

[Chart: run time vs. hardware resources (CPU, memory, etc.), showing linear, poor, and super-linear scalability curves]

• Linear scalability: runtime decreases as the amount of hardware resources is increased. For example, a job that takes 8 hours to run on 1 CPU will take 4 hours on 2 CPUs, 2 hours on 4 CPUs, and 1 hour on 8 CPUs. (A small worked calculation follows below.)
• Poor scalability: results when running time no longer improves as additional hardware resources are added.
• Super-linear scalability: occurs when the job performs better than linear as the amount of hardware resources is increased.

(assumes that data volumes remain constant)

76
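
The linear-scaling example above can be expressed with the usual speedup and efficiency arithmetic. A small illustrative calculation in plain Python, using the numbers from the 8-hour example:

def speedup(t1, tn):
    return t1 / tn

def efficiency(t1, tn, cpus):
    return speedup(t1, tn) / cpus

# 8h on 1 CPU, 4h on 2, 2h on 4, 1h on 8 -> speedup doubles with CPUs, efficiency stays 1.0
for cpus, hours in [(1, 8), (2, 4), (4, 2), (8, 1)]:
    print(cpus, "CPUs:", speedup(8, hours), "x speedup,", efficiency(8, hours, cpus), "efficiency")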

Page 77: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hardware Scalability

Hardware vendors achieve scalability by:

• Using multiple processors
• Having large amounts of memory
• Installing fast storage mechanisms
• Leveraging a fast backplane
• Using very high bandwidth, high speed networking solutions

77

Page 78: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Examples of Scalable Hardware

• SMP – 1 physical machine with 2 or more processors and shared memory.
• MPP – 2 or more SMPs interconnected by a high bandwidth, high speed switch. Memory between 'nodes' of an MPP is not shared.
• Cluster – more than 2 computers connected together by a network. Similar to MPP.
• Grid – several computers networked together. Computers can be dynamically assigned to run jobs.

78

Page 79: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Software Scalability

Software scalability can occur via:

• Executing on scalable hardware
• Effective memory utilization
• Minimizing disk I/O
• Data partitioning
• Multi-threading
• Multi-processing

79

Page 80: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Software Scalability – DS EE

Parallel DataStage achieves scalability in a variety of ways:

• Data pipelining
• Data partitioning
• Minimizing disk I/O
• In-memory processing

We will explore these concepts in detail!

80

Page 81: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

The Parallel Framework

The Engine layer consists, in large part, of the Parallel Framework (aka Orchestrate).

• The Framework was written in C++ and has a published and documented API
• DS/QS jobs run on top of the Framework via OSH
• OSH is a scripting language much like Korn shell
• The Designer client will generate OSH automatically
• The Framework relies on a configuration file to determine the level of parallelism during job execution

81

Page 82: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Framework

[Diagram: a DataStage job executes on the Parallel Framework at runtime; the Configuration File contains a virtual map of available system resources.]

The Framework references the Configuration File to determine the degree of parallelism for the job at runtime.

82

Page 83: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Traditional Processing

Suppose we are interested in implementing the following business logic, where A, B, and C represent specific data transformation processes:

    file -> A -> B -> C -> RDBMS

Manual implementation of the business logic typically results in the following, with a disk staging area between each process and a loader invoked at the end:

    file -> A -> disk -> B -> disk -> C -> disk -> invoke loader -> RDBMS

While the above solution works and eventually delivers the correct results, problems will occur when data volumes increase and/or batch windows decrease!

• Disk I/O is the slowest link in the chain.
• Sequential processing prohibits scalability.

83

Page 84: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Pipelining

What if, instead of persisting data to disk between processes, we could move the data between processes in memory?

    file -> A -> B -> C -> RDBMS   (no disk staging area, no separate loader step)

The application will certainly run faster simply because we are now avoiding the disk I/O that was previously present.

This concept is called data pipelining. Data continuously flows from source to target, through the individual transformation processes. The downstream process no longer has to wait for all of the data to be written to disk – it can begin processing as soon as the upstream process is finished with the first record! A small illustrative sketch follows.

84
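
A minimal sketch of the idea in plain Python (not DataStage): each process is written as a generator, so a record flows to the downstream stage as soon as the upstream stage has finished with it, with no staging to disk between A, B, and C. The stage logic shown is invented purely for illustration.

def stage_a(records):               # e.g. cleanse
    for rec in records:
        yield rec.strip()

def stage_b(records):               # e.g. standardize
    for rec in records:
        yield rec.upper()

def stage_c(records):               # e.g. format for the target
    for rec in records:
        yield rec + "\n"

source = iter(["alpha ", " beta", "gamma "])        # stands in for the source file
for row in stage_c(stage_b(stage_a(source))):        # records stream through in memory
    print(row, end="")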

Page 85: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Partitioning

Parallel processing would not be possible without data partitioning. We will devote an entire lecture to this subject later in this course. For now:

• Think of partitioning as the act of distributing records into separate partitions for the purpose of dividing the processing burden from one processor to many. (A small sketch follows.)

[Diagram: a partitioner splits a data file into partitions holding records 1–1000, 1001–2000, 2001–3000, and 3001–4000.]

85
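
A minimal plain-Python sketch of the idea (round-robin is just one of several partitioning methods, which are covered in a later section): records are dealt out across N partitions so the processing burden is shared.

def round_robin_partition(records, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for i, rec in enumerate(records):
        partitions[i % n_partitions].append(rec)    # deal records out like cards
    return partitions

parts = round_robin_partition(range(1, 4001), 4)    # 4000 records, 4 partitions
print([len(p) for p in parts])                      # [1000, 1000, 1000, 1000]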

Page 86: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Processing

By combining data pipelining and partitioning, you can achieve what people typically envision as being parallel processing:

    input file -> A -> B -> C -> RDBMS   (executed across multiple partitions)

In this model, data flows from source to target, upstream stage to downstream stage, while remaining in the same partition throughout the entire job. This is often referred to as partitioned parallelism.

86

Page 87: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Pipeline Parallel Processing

There is, however, a more powerful way to perform parallel processing. We call this "spaghetti" pipeline parallelism:

    input file -> A -> B -> C -> RDBMS   (records free to cross partitions between stages)

What makes pipeline parallelism powerful is the following:

• Records are not bound to any given partition
• Records can flow down any partition
• Prevents backup and hotspots from occurring in any given partition

The parallel framework does this by default!

87

Page 88: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Pipeline Parallelism Example

Suppose you are traveling from point A to point B along a 6-lane tollway. Between the start and end points, there are 3 toll stations your car must pass through and pay a toll.

• During your journey, you will most likely change lanes. These lanes are just like partitions.
• During your journey, you will likely use the toll station with the least number of cars.
• Think about the fact that other cars are doing the same!
• Each car is like a record; toll stations are processes.
• What would happen if you were stuck in a single lane during the entire journey?

This is a simple real-world example of pipeline parallelism!

88

Page 89: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Configuration Files

Configuration files are used by the Parallel Framework to determine the degree of parallelism for a given job.

• Configuration files are plain text files which reside on the server side
• Several configuration files can co-exist; however, only one can be referenced at a time by a job
• Configuration files have a minimum of one processing node defined and no maximum
• They can be edited through the Designer, vi, or other text editors
• The syntax is pretty simple and highly repetitive

89

Page 90: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Configuration File Example

Here is a sample configuration file which will allow a job to run 4-way parallel. The path will be different for Windows installations.

{
    node "node_1" {
        fastname "dev_server"
        pool ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
    }
    node "node_2" {
        fastname "dev_server"
        pool ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
    }
    node "node_3" {
        fastname "dev_server"
        pool ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
    }
    node "node_4" {
        fastname "dev_server"
        pool ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
    }
}

• fastname – hostname for the ETL server; can also use an IP address
• node – label for each node; can be anything, but needs to be different for each node
• resource disk – location for parallel dataset storage; used to spread I/O; can have multiple entries per node
• resource scratchdisk – location for temporary scratch file storage; used to spread I/O; can have multiple entries per node

90
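
Because the degree of parallelism is simply the number of node entries, a small helper like the following plain-Python sketch (not part of DataStage; the file path is just an example) can tell you how many ways parallel a job using a given configuration file would run:

import re

def degree_of_parallelism(config_path):
    with open(config_path) as f:
        text = f.read()
    # each logical processing node is declared as:  node "name" { ... }
    return len(re.findall(r'\bnode\s+"[^"]+"\s*\{', text))

# print(degree_of_parallelism("/data/configs/dev_4node.apt"))   # -> 4 for the sample above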

Page 91: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Reading & Writing Parallel Datasets

Suppose that in each scenario illustrated below, we are reading in or writing out 4,000 records. Which performs better?

• Reading: a single data file pushed through a partitioner that splits the records across 4 partitions (1–1000, 1001–2000, 2001–3000, 3001–4000), versus 4 pre-partitioned data files read directly, one per partition.

-- OR --

• Writing: 4 partitions funneled through a collector into a single data file, versus each partition writing its own data file.

91

Page 92: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Dataset Advantage

Being able to read and write data in parallel will almost always be faster and more scalable than reading or writing data sequentially.

Parallel Datasets perform better because:

• data I/O is distributed instead of sequential, thus removing a bottleneck
• data is stored using a format native to the Parallel Framework, thus eliminating the need for the Framework to re-interpret data contents
• data can be stored and read back in a pre-partitioned and sorted manner

92

Page 93: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Dataset Mechanics

• Datasets are made up of several small fragments or data files
• Fragments are stored per the resource disk entries in the configuration file
  • This is where distributing the I/O becomes important!
• Datasets are very much dependent on configuration files
  • It's a good practice to read the dataset using the same configuration file that was originally used to create it

93

Page 94: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Using Parallel Datasets

Parallel datasets should use a '.ds' extension. The '.ds' file is only a descriptor file containing the metadata and the location of the actual data files.

When writing data to a parallel dataset, be sure to specify whether to create, overwrite, append, or insert.

94

Page 95: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Browsing Datasets

The Dataset viewer can be accessed from the 'Tools' menu in the Designer. Use the Dataset viewer to see all metadata as well as the records stored within the dataset; it can also be used for deleting datasets.

Alternatively, if all you want to do is browse the records in the dataset, you can use the View Data button in the properties window for the dataset stage.

95

Page 96: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 2A: Simple Configuration File

96

Page 97: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 2A Objectives

• Learn to create a simple configuration file and validate its contents.
• Note: You will need to leverage skills learned during previous labs to complete subsequent labs.

97

Page 98: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Configuration File

• Log into the DataStage Designer using your assigned userid and password.
• Click on the Tools menu and select Configurations…

98

Page 99: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Configuration File Editor

• The Configuration File editor should pop up, similar to the one you see here.
• Click on 'New' and select 'default'. We will use this as our starting point to create another config file.

99

Page 100: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Checking the Configuration File

• Once you have opened the default configuration file, click on the 'Check' button at the bottom. This action will validate the contents of the configuration file.
  • Always do this after you have created a configuration file. If it fails this simple test, then there is no way any job will run using this configuration file!
• What is in your configuration file will depend on the hardware environment you are using (i.e. number of CPUs).
• For example, on a 4-CPU system, you will likely see a configuration file with 4 node entries defined.

100

Page 101: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Editing the Configuration File

• At this point, how many nodes do you see defined in your default configuration file?
  • Remember, this dictates how many ways parallel your job will run. If you see 8 node entries, then your job will run 8-way parallel.
• Regardless of how many CPUs your system has, edit the configuration file and create as many node entries as you have CPUs.
  • The default may already have the nodes defined.
  • Copy and paste is the fastest way to do this if you need to add nodes. Keep in mind that node names need to be unique, while everything else can stay the same! Pay attention to the { }'s!!!
  • Your instructor may choose to provide you with alternate 'resource disk' and 'resource scratchdisk' locations to use.

101

Page 102: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Save and Check the Config File

• Once you have finished editing the configuration file, click on the Save button and save it as something other than default.
  • Suggestions include using your initials along with the number of nodes defined. This helps prevent other students from accidentally using the wrong configuration file.
    o For example: JD_Config_4node.apt
• Once you have saved your configuration file, click on the 'Check' button again at the bottom. This action will validate the contents of your configuration file.
  • Again, always do this after you have created a configuration file. If it fails this simple test, then there is no way any job will run using this configuration file!
  • If the validation fails, use the error message to determine what the problem is. Correct the problem and repeat the above step.

102

Page 103: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Save and Check the Configuration File

• Next, re-edit the configuration file you just created (and validated) and remove all node entries except for the first one. (A single-node example follows below.)
• Check it again and, if no errors are returned, save it as a 1-node configuration using the same nomenclature you applied to the multi-node configuration file you previously created.
  • For example: JD_Config_1node.apt
  • Note: when you check the configuration, it may prompt you to save it first. You can check the configuration without saving it first, but always remember to save it once it passes the validation test.
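
For reference, a single-node version of the sample configuration shown earlier would look roughly like this ('dev_server' and the directory paths are placeholders; use the values from your own environment):

{
    node "node_1" {
        fastname "dev_server"
        pool ""
        resource disk "/data/work" {}
        resource scratchdisk "/data/scratch" {}
    }
}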

Page 104: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Checking the Configuration File

What does Parallel DataStage do when it is checking the config file?

• Validates syntax
  • Correct placement of all { }, " ", ' ', etc.
  • Correct spelling and use of keywords such as node, fastname, resource disk, resource scratchdisk, pool, etc.
• Validates information
  • The fastname entry should match the hostname or IP
  • rsh permissions, if necessary, are in place
  • Read and write permissions exist for all of your resource disk and scratchdisk entries

104

Page 105: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Changing Default Settings

Exit the Designer and go into the Administrator – be sure to select your project and not someone else's.

• Enter the Environment editor
  • Find and set APT_CONFIG_FILE to the 1-node configuration file you just created. This makes it the default for your project.
  • Find and set APT_DUMP_SCORE to TRUE. This will enable additional diagnostic information to appear in the Director log.
• Click the OK button when finished editing Environment Variables.
• Click OK and then Close to exit the Administrator.
• You have now finished configuring your project.

105

Page 106: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 2B: Applying the Configuration File

to a Simple DataStage Job

106

Page 107: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 2B Objective

• Use your newly created configuration files to test a simple DataStage application.

107

Page 108: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Create Lab2B Using Lab1B

• Open the job you created in Lab 1B – it should be called 'lab1b'.
• Save the job again using Save As – use the name 'lab2b'.
• Next, find the job properties icon:
  • Click on the job properties icon to bring up the Job Properties window.

108

Page 109: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Editing Job Parameters

• Click on the Parameters tab.
• Find and click on the 'Add Environment Variable' button.
• You will see the big (and sometimes confusing) list of environment variables. Take some time to browse through these.
• Find and select APT_CONFIG_FILE.

109

Page 110: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining APT_CONFIG_FILE

• Once selected, you will return to the Job Properties window.
• Verify that the value for APT_CONFIG_FILE is the same as the 1-node configuration file you defined previously in Lab 2A.
• Save, compile, and run your job.

110

Page 111: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Running Using Parameters

• When you run your job, you should see the following Job Run Options dialogue:
• Note that it shows you the default configuration file being used, which happens to be the one defined previously in the Administrator.
• Keep this value for now, and just click on 'Run'.
• Go to the Director to view the job run log.

111

Page 112: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Director Log Output

• Look for a similar entry in the job log for lab2b:
• Double-click on it.
• You should see the contents of the 1-node configuration file used.
• Click on 'Close' to exit from the dialogue.
• Click on Run again and this time change the APT_CONFIG_FILE parameter to the multiple-node configuration file you defined in Lab 2A.
• Click the 'Run' button.

112

Page 113: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Director Log Output

• Again, look for a similar entry in the job log for lab2b:
• Double-click on it.
• You should see the contents of the multiple-node configuration file used.
• Click on 'Close' to exit from the dialogue.
• You have just successfully run your job sequentially and in parallel by simply changing the configuration file!

113

Page 114: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Using APT_DUMP_SCORE

• Another way to verify the degree of parallelism is to look at the following output in your job log:
• The entries 'Peek,0' and 'Peek,1' show up as a result of you having set APT_DUMP_SCORE to TRUE.
• The numbers 0 and 1 signify partition numbers. So if you have a job running 4-way parallel, you should see numbers 0 through 3.

114

Page 115: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview (Page 10)
2. Parallel Framework Overview (Page 73)
3. Data Import and Export (Page 116)
4. Data Partitioning, Sorting, and Collection (Page 252)
5. Data Transformation and Manipulation (Page 309)
6. Data Combination (Page 364)
7. Custom Components: Wrappers (Page 420)
8. Custom Components: Buildops (Page 450)
9. Additional Topics (Page 477)
10. Glossary (Page 526)

115


Page 116: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Import and Export

In this section we will discuss:

• Data Generation, Copy, and Peek
• Data Sources and Targets
  • Flat Files
  • Parallel Datasets vs Filesets
  • RDBMS
  • Other
• Related Stages

116

Page 117: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Generating Columns and Rows

• DataStage allows you to easily test any job you develop by providing an easy way to generate data. (A conceptual sketch follows below.)
  • The Row Generator generates as many records as you want.
  • The Column Generator generates extra fields within existing records. It must first have input records.
• To use either stage, you will need to have a table or column definition.
• You can generate as little as 1 record with 1 column.
• Columns can be of any supported data type:
  • Integer, Float, Double, Decimal, Character, Varchar, Date, and Timestamp

117
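
As a mental model only (plain Python, not DataStage; the column names are invented), a row generator takes a column definition and produces any number of rows, using rules such as cycling through values, incrementing integers, or stepping a date forward from an epoch – the same kinds of options described on the next few pages:

import itertools
from datetime import date, timedelta

def generate_rows(n):
    names = itertools.cycle(["aaaaaaaaaa", "bbbbbbbbbb", "cccccccccc"])   # 10-byte strings, 'cycle' style
    for i in range(n):
        yield {
            "cust_name": next(names),                              # cycle through fixed values
            "cust_id": i,                                          # incrementing integer
            "start_date": date(1960, 1, 1) + timedelta(days=i),    # epoch plus an increment
        }

for row in generate_rows(3):
    print(row)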

Page 118: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Row Generator

• The Row Generator is an excellent stage to use when building jobs in DataStage. It allows you to test the behavior of various stages within the product.
• To configure the Row Generator, you must define at least 1 column. Looking at what we did for the job in Lab 1B, we see that 3 columns were defined:
• We could have also loaded an existing table definition instead of entering our own.

118

Page 119: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Row Generator

• Suppose we want to stick with the 3-column table definition we created. As you saw in Lab 2B, the Row Generator will produce records with miscellaneous 10-byte character, integer, and date values.
• There is, however, a way to specify the values to be generated. To do so, double-click on the number next to the column name.

119

Page 120: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Column Metadata Editor

• The Column Metadata editor allows you to provide specific data generation instructions for each and every field.
• Options vary by data type.
• Frequent options include cycling through user-defined values, random values, incremental values, and the alphabetic algorithm.

120

Page 121: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Character Generator Options

For a Character or Varchar type, when you click on 'Algorithm' you will have 2 options:

• cycle – cycle through only the specific values you specify.
• alphabet – methodically cycle through the characters of the alphabet. This is the default behavior.

121

Page 122: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Number Generator Options

For an Integer, Decimal, or Float type, your 2 options are:

• cycle – cycle through numbers beginning at the initial value and incrementing by the increment value. You can also define an upper limit.
• random – randomly generate numerical values. You can define an upper limit and a seed for the random number generator. You can also use the signed option to generate negative numbers.

Note: in addition, with Decimal types, you also have the option of defining percent zero and percent invalid.

122

Page 123: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Other Data Type Generator Options

• Date, Time, and Timestamp data types have some useful options:
  • Epoch: the earliest date to use. For example, the default value is 1960-01-01.
  • Scale Factor: specifies a multiplier to the increment value for time. For example, a scale factor of 60 and an increment of 1 means the field increments by 60 seconds.
  • Use Current Date: the generator will insert the current date value for all rows. This cannot be used with other options.

123

Page 124: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Column Generator

� The Column Generator is an excellent stage to use when you need

to insert a new column or set of columns into a record layout.

� Column Generator requires you to specify the name of the column

first, and then in the output-mapping tab, you will need to map

source to target.

� In the output-columns tab, you will need to customize the column(s)

added the same way as it is done in the Row Generator.

• For example, if you are generating a dummy key, you would want to

make it an Integer type with an initial value of 0 and increment of 1.

• When running this in parallel, you can start with an initial value of ‘part’

and increment of ‘partcount’. ‘part’ is defined in the Framework as the

partition number and ‘partcount’ is the number of partitions.

124

Page 125: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Making Copies of Data

Copy Stage (in the Processing Palette):

� Incredibly flexible with little or no overhead at runtime

� Often used to create a duplicate of the incoming data:

� Can also be used to terminate a flow:

• Records get written out to /dev/null

• Useful when you don’t care about the target

or just want to test part of the flow.

125

Page 126: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Take a Look at the Data

Peek Stage (in the Development / Debug Palette):

� Often used to help debug a job flow

� Can be inserted virtually anywhere in a job flow

• Must have an input data source

� Outputs a fixed number of records into the job log

• For example, in Lab 2B:

• Output volume can be

controlled

� Can also be used to terminate any job flow.

� Similar in behavior to inserting a ‘print’ statement into

your source code.

126

Page 127: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Importing Data from Outside DataStage

If you have access to real data, then you probably will

not have a lot of use for the Row Generator!

� DataStage can read in or import data from a large

variety of data sources:

• Flat Files

• Complex Files

• RDBMSs

• SAS datasets

• Queues

• Parallel Datasets & Filesets

• FTP

• Named Pipes

• Compressed Files

• Etc…

127

Page 128: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Importing Data

� There are 2 primary means of importing data from

external sources:

• Automatically – DataStage automatically reads the table

definition and applies it to the incoming data. Examples

include RDBMSs, SAS datasets, and parallel datasets.

• Manually – user must define the table definition that

corresponds to the data to be imported. These table

definitions can be entered manually or imported from an

existing copy book or schema file. Examples include flat

files and complex files.

128

Page 129: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Manual Data Import

When DataStage reads in data from an external source, there are 2

steps that will always take place:

� Recordization

• DataStage carves out the entire record based on the table definition

being used.

• Record delimiter is defined within table definition

� Columnization

• DataStage parses through the record it just carved out and

separates out the columns, again based on the table definition

provided.

• Column delimiters are also defined within the table definition

This can become very troublesome if you don’t know the correct layout

of your data!

129

Page 130: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Data Types

In order to properly set up a table definition, you must first

understand the internal data types used within DataStage:

� Integer: Signed or unsigned, 8-, 16-, 32- or 64-bit integer. In the

Designer you will see TinyInt, SmallInt, Integer, and BigInt instead.

� Floating Point: Single- (32 bits) or double-precision (64 bits); IEEE. In

the Designer you will see Float, and Double instead.

� String: Character string of fixed or a variable length. In the Designer

you will see Char and VarChar instead.

� Decimal: Numeric representation compatible with the IBM packed

decimal format. Decimal numbers consist of a precision (number of

decimal digits) greater than 1 with no maximum, and scale (fixed

position of decimal point) between 0 and the precision.

130

Page 131: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Data Types (continued)

� Date: Numeric representation compatible with RDBMS notion

of date (year, month and day). The default format is

month/day/year. This is represented by the default format

string of: %mm/%dd/%yyyy

� Time: Time of day with either one second or one microsecond

resolution. Time values range from 00:00:00 to

23:59:59.999999.

� Timestamp: Single field containing both a date and time.

� Raw: Untyped collection of contiguous bytes of a fixed or a

variable length. Optionally aligned. In the Designer you will

see Binary.

131

Page 132: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Data Types (continued)

� Subrecords (subrec): Nested form of field definition that

consists of multiple nested fields, similar to COBOL record

levels or C structs. A subrecord itself does not define any

storage; instead, the fields of the subrecord define storage. The

fields in a subrecord can be of any data type, including tagged.

In addition, you can also nest subrecords and vectors of

subrecords, to any depth of nesting.

� Tagged Subrecord (tagged): Any one of a mutually

exclusive list of possible data types, including subrecord and

tagged fields. Similar to COBOL redefines or C unions, but more

type-safe. Defining a record with a tagged type allows each

record of a data set to have a different data type for the tagged

column.

132

Page 133: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Null Handling

All DataStage data types are nullable

� Tags and subrecs are not nullable, but their fields are

� Null fields do not have a value

� DataStage null is represented by an out-of-band indicator

� Nulls can be detected by a stage

� Nulls can be converted to/from a value

� Null fields can be ignored by a stage, can trigger error, or other

action

• Exporting a nullable field to a flat file without first defining how to

handle the null will cause an error.
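
As a hedged sketch of what this looks like in practice: if a flat-file table definition were written in the Orchestrate-style record schema notation, a nullable column would need a null representation before it could be exported. The column names below are hypothetical, and the property name null_field is an assumption to verify against your release:

record
(
  cust_id: int32;
  middle_name: nullable string[max=20] {null_field='NULL'};
)

Here the out-of-band null is written to the file as the literal string NULL, and the same value is recognized as a null on import.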

133

Page 134: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Import Example

Suppose you have the following data:

Last, First, Purchase_DT, Item, Amount, Total

Smith,John,2004-02-27,widget #2,21,185.20

Doe,Jane,2005-07-03,widget #1,7,92.87

Adams,Sam,2006-01-15,widget #9,43,492.93

� What would your table definition look like for this

data?

• You need column names, which are provided for you

• You need data types for each column

• You need to specify ‘,’ as the column delimiter

• You need to specify newline as the record delimiter
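
One plausible table definition for this data, sketched in the Orchestrate-style record schema notation (the string lengths and the int32/decimal choices are assumptions – other types that match the actual values would also work):

record {record_delim='\n', delim=',', final_delim=end}
(
  Last: string[max=20];
  First: string[max=20];
  Purchase_DT: date;
  Item: string[max=20];
  Amount: int32;
  Total: decimal[8,2];
)

The record_delim and delim properties correspond to the record and column delimiters called out above.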

134

Page 135: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Import Example (continued)

It is critical that you fill out the Format options correctly; otherwise, DataStage will not be able to perform the necessary recordization and columnization!

Data types must also match the data itself; otherwise, the columnization step will fail.

Sequential File Stage

135

Page 136: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Import Example (continued)

� Once all of the information is properly filled out, you

can press the ‘View Data’ button to see a sample of

your data and at the same time, validate that your

table definition is correct.

� If your table definition is not correct, then the View

Data operation will fail.

136

Page 137: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Import Example (continued)

� The table definition we used above worked for the data

we were given. Was this the only table definition that

would have worked? No, but this was the best one…

• VarChar is perhaps the most flexible data type, so we could have

defined all columns as VarChars.

• All numeric and date/time types can be imported as Char or

VarChar as well, but the reverse is rarely true.

• Decimal types can typically be imported as Float or Double and

vice versa, but be careful with precision – you may lose data!

• Integer types can also be imported as Decimal, Float, or Double.

137

Page 138: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Import – Reject Handling

Data is not always clean. When unexpected or invalid values come up, you can:

• Continue – default option. It will discard any records where a field does not import correctly

• Fail – abort the job as soon as an invalid field value is encountered

• Output – send reject records down a reject link to a Dataset. They can also be passed on to other stages for further processing.

138

Page 139: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Exporting Data to Disk

Once the data has been read into DataStage and

processed, it is then typically written out somewhere.

These targets can be the same as the sources which

originally produced the data or a completely different

target.

� Exporting data to a flat file is easier than importing it from a flat file,

simply because DataStage will use the table definition that has been

propagated downstream to define the data layout within the output target

file.

� You can easily edit the formatting properties within the Sequential File

stage for items such as null handling, delimiters, quotes, etc…

� Consider using a parallel dataset instead of flat file to stage data on disk!

Much faster and easier if there is another DSEE application which will

consume the data downstream.

139

Page 140: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Export to Flat File Example

Here’s an example of what it takes to set up the Sequential File stage to export data to a flat file.

140

Page 141: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Export to Parallel Dataset Example

With DataStage Parallel Datasets, regardless of whether the dataset is a source or a target, all you need to specify is its name and location! There is no need to worry about data types, handling nulls, or delimiters.

141

Page 142: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Automatic Data Import

Besides flat files and other manual sources, DataStage can also

import data from a Parallel Dataset or RDBMS… without the need to

first define a table definition!

� Parallel Datasets are self-describing datasets native to DataStage

• Easiest way to read and write data

� RDBMSs often store table definitions internally

• For example, the DESCRIBE or DESCRIBE TABLE command often

returns the table definition associated with the given table

� DataStage has the ability to:

• Automatically extract the table definition during design time.

• Automatically extract the table definition to match the data at runtime,

and propagate that table definition downstream using RCP

142

Page 143: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Datasets vs Parallel Filesets

Parallel Dataset vs Parallel Fileset

� Primary difference is format

• Parallel datasets are stored in a native DataStage format

• Parallel filesets are stored as ASCII

� Parallel filesets use a .fs extension vs .ds for parallel

datasets

• The .fs file is also a descriptor file; however, it is ASCII and

only contains the location of each fragment and the layout.

� Parallel datasets are faster than parallel filesets

• Parallel datasets avoid the recordization and columnization

process because data is already stored in a native format.

143

Page 144: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel Datasets vs RDBMS

Parallel Dataset vs RDBMS

� Logically and functionally very similar

• Parallel datasets have data that is partitioned and stored

across several disks

• Table definition (aka schema) is stored and associated with

the table

� Parallel datasets can sometimes be faster than

loading/extracting a RDBMS. Some conditions that

can make this happen:

• Non-partitioned RDBMS tables

• Remote location of RDBMS

• Sequential RDBMS access mechanism

144

Page 145: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Importing RDBMS Table Definitions

Select from DB2, Oracle, or Informix

There are a couple of options for importing an RDBMS table definition for use during design time. Import Orchestrate Schema is one option.

• Once you enter all the necessary parameters, you can click on the Next button to import the table definition.

• Once imported, the table definition can be used at design time

145

Page 146: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Importing RDBMS Table Definitions

Other options for importing an RDBMS table definition include using ODBC or Plug-In Metadata access.

The ODBC option requires that the correct ODBC driver be set up…

The Plug-In Metadata option requires that it be set up during install.

Once set up, each option guides you through a simple process to import the table definition and save it for future reuse.

146

Page 147: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Using Saved Table Definitions

Table definition icon shows up on link

There are 2 ways to reference a saved table definition in a job. The first is to select it from the repository tree view on the left side, and then drag and drop it onto the link.

The presence of the icon on the link signifies that a table definition is present, or that metadata is present on the link.

Why do this when DataStage can do this automatically at runtime? Sometimes it is easier or more straightforward to have the metadata available at design time.

147

Page 148: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Using Saved Table Definitions

Another way to access saved table definitions is to use the Load button on the Output tab of any given stage. Note that you can also do this on the Input tab, but that is the same as loading it on the Output tab of the upstream (preceding) stage.

148

Page 149: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Loading Table Definitions

When loading a previously saved table definition, the column selection dialogue will appear. This allows you to optionally eliminate certain columnswhich you do not wantto carry over.

This is useful when youare only reading in somecolumns or your selectclause only has somecolumns.

149

Page 150: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

RDBMS Connectivity

DataStage offers an array of options for RDBMS connectivity, ranging from ODBC to highly scalable native interfaces. For handling large data volumes, DataStage’s highly scalable native database interfaces are the best way to go. While the icons may appear similar, always look for the ‘_enterprise’ label.

� DB2 – parallel extract, load, upsert, and lookup.

� Oracle – parallel extract, load, upsert, and lookup.

� Teradata – parallel extract and load

� Sybase – sequential extract, parallel load, upsert, and lookup

� Informix – parallel extract and load

150

Page 151: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Parallel RDBMS Interface

Query or Application

Usually a query is submitted to a database sequentially, and the database then distributes the query to execute it in parallel. The output, however, is returned sequentially. Similarly, when loading data, data is loaded sequentially first, before being distributed by the database.

DataStage avoids this bottleneck by establishing parallel connections into the database and executing queries, extracting data, and loading data in parallel. The degree of parallelism changes depending on the database configuration (i.e. the number of partitions that are set up).

Parallel DataStage

151

Page 152: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage and RDBMS Scalability

Query or Application

While the database itself may be highly scalable, the overall solution, which includes the application accessing the database, is not. Any sequential bottleneck in an end-to-end solution will limit its ability to scale!

DataStage’s native parallel connectivity into the database is the key enabler for a truly scalable end-to-end solution.

Parallel DataStage

152

Page 153: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Extracting from the RDBMS

Extracting data from DB2, Oracle, Teradata, and Sybase is pretty straightforward. The stage properties dialogue is very much the same for each database, despite its different behavior under the covers.

For all database stages, to extract data you will need to provide the following:

� Read Method – full table scan or user-defined query

• Table – table name if using Table as the Read Method

� User – user id (optional with DB2)

� Password – password (optional with DB2)

� Server/Database – used by some databases for establishing connectivity

� Options – database-specific options

153

Page 154: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Loading to the RDBMS

Loading data into DB2, Oracle, Teradata, and Sybase is also pretty straightforward. The stage properties dialogue is very much the same for each database, despite its different behavior under the covers.

For all database stages, to load data you will need to provide the following:

� Table – name of table to be loaded

� Write Method – write, load, upsert. Details will be discussed shortly.

� Write Mode – Append, Create, Replace, and Truncate

� User – user id (optional with DB2)

� Password – password (optional with DB2)

� Server/Database – used by some databases for establishing connectivity

� Options – database-specific options

154

Page 155: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Write Methods Explained

� Write/Load – often the default option. Used to

append data into an existing table, create and load

data into a target table, or drop an existing table, re-

create it, and load data into it. The mechanics of the

load itself depend on the database.

� Upsert – update or insert data into the database.

There is also an option to delete data from the target

table.

� Lookup – perform a lookup against a table inside the

database. This is useful when the lookup table is

much larger than the input data.

155

Page 156: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Write Modes Explained

� Append – default option. Append data into an

existing table.

� Create – creates a table using the table definition

provided by the stage. If the table already exists, then the

job will fail. Data is inserted into the created table.

� Replace – if target table exists, drop the table first. If

table does not exist, create it. Insert data into the

created table.

� Truncate – delete all records from the target table,

but do not drop the table. Insert data into the empty

table.

156

Page 157: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Connectivity

DataStage Oracle Enterprise Stage

157

Page 158: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Configuration for Oracle

To establish connectivity to Oracle, certain environment variables and stage options need to be defined:

� Environment Variables (defined via DataStage Administrator)

• ORACLE_SID – name of the ORACLE database to access

• ORACLE_HOME – location for ORACLE home

• PATH – append $ORACLE_HOME/bin

• LIBPATH or LD_LIBRARY_PATH – append $ORACLE_HOME/lib32 or $ORACLE_HOME/lib64, depending on the operating system. Path must be spelled out.

� Stage Options

• User – Oracle user-id

• Password – Oracle user password

• DB Options – can also accept SQL*Loader parameters such as:

o DIRECT = TRUE, PARALLEL = TRUE, …

158

Page 159: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Specifics for Extracting Oracle

Extracts from Oracle:

� Default option (depending on the version used) is to use the

SQL Builder interface, which allows you to use a graphical

interface to create a custom query.

• Note: the query generated will run sequentially by default.

� User-Defined Query option allows you to enter your own query

or copy and paste an existing query.

• Note: the custom query will run sequentially by default.

� Running SQL queries in parallel requires the use of the following

option:

• Partition Table option – enter name of the table containing the

partitioning strategy you are looking to match

159

Page 160: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Oracle Parallel Extract

Both sets of options above will yield identical results.

� Leaving out the Partition Table option would cause the extract to

execute sequentially.

160

Page 161: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Specifics for Loading Oracle

There are 2 ways to put data into Oracle:

� Load (default option) – leverage the Oracle SQL*Loader

technology to load data into Oracle in parallel.

• Load uses the Direct Path load method by default

• Fastest way to load data into Oracle

• Select Append, Create, Replace, or Truncate mode

� Upsert – update or insert data in an Oracle table

• Runs in parallel

• Uses standard SQL Insert and Update statements

• Use auto-generated or user-defined SQL

Can also use DELETE option to remove data from target Oracle

table

161

Page 162: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Oracle Index Maintenance

Loading to range/hash partitioned table in parallel is supported,

however, if the table is indexed:

� Rebuild – can be used to rebuild global indexes. Can specify

NOLOGGING (speeds up rebuild by eliminating the log during index

rebuild) and COMPUTE STATISTICS to provide stats on the index.

� Maintenance is supported for local indexes partitioned the same way

the table is partitioned

� Don’t use both the rebuild and maintenance options in the same stage –

either the global or local index must be dropped prior to the load.

� Using DB Options

• DIRECT=TRUE,PARALLEL=TRUE, SKIP_INDEX_MAINTENANCE=YES – to

allow the Oracle stage to run in parallel using direct path mode but indexes on

the table will be “unusable” after the load.

162

Page 163: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages

� Column Import – Only import a subset of the columns in a record,

leaving the rest as ‘raw’ or string. This is useful when you have a very

wide record and only plan on referencing a few columns.

� Column Export – Combine 2 or more columns into a single

column.

� Combine Records – Combines records in which particular key-

column values are identical into vectors of subrecords. As input, the

stage takes a data set in which one or more columns are chosen as

keys. All adjacent records whose key columns contain the same

value are gathered into the same record as subrecords.

� Make Subrecord – Combines specified vectors in an input data

set into a vector of subrecords whose columns have the names and

data types of the original vectors. Specify the vector columns to be

made into a vector of subrecords and the name of the new subrecord.

163

Page 164: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages (Continued)

� Split Subrecord – Inverse of Make Subrecord. Creates one new vector

column for each element of the original subrecord. Each top-level vector

column that is created has the same number of elements as the subrecord

from which it was created. The stage outputs columns of the same name and

data type as those of the columns that comprise the subrecord.

� Make Vector – Combines specified columns of an input data record into a

vector of columns. The stage has the following requirements:

• The input columns must form a numeric sequence, and must all be of the same type.

• The numbers must increase by one.

• The columns must be named column_name0 to column_namen, where column_name

starts the name of a column and 0 and n are the first and last of its consecutive

numbers.

• The columns do not have to be in consecutive order.

All these columns are combined into a vector of the same length as the number

of columns (n+1). The vector is called column_name. Any input columns that

do not have a name of that form will not be included in the vector but will be

output as top level columns.
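
A rough before-and-after sketch, using hypothetical column names, of what the Make Vector stage does (record schema notation; the exact vector syntax may vary by release):

Input record:   record ( acct: string[8]; bal0: int32; bal1: int32; bal2: int32 )
Output record:  record ( acct: string[8]; bal[3]: int32 )

The three columns bal0, bal1, and bal2 form the required numeric sequence, so they are combined into a 3-element vector named bal; acct does not match the naming pattern and passes through as a top-level column.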

164

Page 165: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages (Continued)

� Split Vector – Promotes the elements of a fixed-length vector to a

set of similarly named top-level columns. The stage creates columns

of the format name0 to nameN, where name is the original vector’s

name and 0 and N are the first and last elements of the vector.

� Promote Subrecord – Promotes the columns of an input

subrecord to top-level columns. The number of output records equals

the number of subrecord elements. The data types of the input

subrecord columns determine those of the corresponding top-level

columns.

� DRS – Dynamic Relational Stage. DRS reads data from any

DataStage stage and writes it to one of the supported relational

databases. It also reads data from any of the supported relational

databases and writes it to any DataStage stage. It supports the

following relational databases: DB2/UDB, Informix, Microsoft SQL

Server, Oracle, and Sybase. It also supports a generic ODBC.

165

Page 166: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages (Continued)

� ODBC – Access or write data to remote sources via an ODBC interface.

� Stored Procedure – allows a stored procedure to be used as:

• A source, returning a rowset

• A target, passing a row to a stored procedure to write

• A transform, invoking logic within the database

The Stored Procedure stage supports input and output parameters or

arguments. It can process the returned value after the stored procedure is run.

It also provides status codes indicating whether the stored procedure completed

successfully and, if not, allowing for error handling. Currently supports DB2,

Oracle, and Sybase.

� Complex Flat File – As a source stage it imports data from one or more

complex flat files, including MVS datasets with QSAM and VSAM files. A

complex flat file may contain one or more GROUPs, REDEFINES, OCCURS or

OCCURS DEPENDING ON clauses. When used as a target, the stage exports

data to one or more complex flat files. It does not write to MVS datasets.

166

Page 167: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3A: Flat File Import

167

Page 168: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3A Objectives

� Learn to create a table definition to match the

contents of the flat file

� Read in the flat file using the Sequential File stage

and the table definition just created.

168

Page 169: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

The Data Files

� There are 4 data files you will be importing. You will be using

these files for future labs. The files contain Major League

Baseball data.

• Batting.csv – player hitting statistics

• Pitching.csv – pitcher statistics

• Salaries.csv – player salaries

• Master.csv – player details

� The files all have the following format

• 1st row in each file contains the column names

• Data is in ASCII format

• Records are newline delimited

• Columns are comma separated

169

Page 170: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Batters File

� The layout of the Batting.csv file is:

� Open the file using vi or any other text editor to view contents

– note the contents and data types

� Create a table definition for this data, save it as batting.

Column Name   Description
playerID      Player ID code
yearID        Year
teamID        Team
lgID          League
G             Games
AB            At Bats
R             Runs
H             Hits
DB            Doubles
TP            Triples
HR            Homeruns
RBI           Runs Batted In
SB            Stolen Bases
IBB           Intentional walks

Tips:

1. Use a data type that most closely

matches the data. For example, for the

Games column, use Integer instead of

Char or VarChar!

2. When using a VarChar type, always specify

a ‘maximum’ length by filling in a

number in the Length column

3. When defining numerical types such as

Integer or Float, there’s no need to fill in

length or scale values. You only do this

for Decimal types.
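
If you want something to compare your work against, here is one plausible definition sketched in record schema notation – treat it as a hint rather than the answer, and verify the maximum string lengths against the actual file (they are assumptions here):

record {record_delim='\n', delim=',', final_delim=end}
(
  playerID: string[max=10];
  yearID: int32;
  teamID: string[max=3];
  lgID: string[max=2];
  G: int32;
  AB: int32;
  R: int32;
  H: int32;
  DB: int32;
  TP: int32;
  HR: int32;
  RBI: int32;
  SB: int32;
  IBB: int32;
)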

170

Page 171: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Pitchers File

� The layout of the Pitching.csv file is:

� Open the file using vi or any other text editor to view contents

– note the contents and data types

� Create a table definition for this data, save it as pitching.

Column Name   Description
playerID      Player ID code
yearID        Year
teamID        Team
lgID          League
W             Wins
L             Losses
SHO           Shutouts
SV            Saves
SO            Strikeouts
ERA           Earned Run Average

Tips:

1. Be careful to choose the right data type

for the ERA column. Your choices

should boil down to Float vs Decimal…

171

Page 172: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Salary File

� The layout of the Salaries.csv file is:

� Open the file using vi or any other text editor to view contents

– note the contents and data types

� Create a table definition for this data, save it as salaries.

Column Name   Description
yearID        Year
teamID        Team
lgID          League
playerID      Player ID code
salary        Salary

Tips:

1. Salary value is in whole dollars. Again

be sure to select the best data type.

While it may be tempting to use Decimal,

the Framework is more efficient at

processing Integer and Float types. Those

are considered ‘native’ to the Framework.

172

Page 173: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Master File

� The layout of the Master.csv file is:

� Open the file using vi or any other text editor to view contents

– note the contents and data types

� Create a table definition for this data, save it as master.

Tips:

1. Treat birthYear, birthMonth, &

birthDay as Integer types for now.

2. Be sure to specify the correct Date

format string: %mm/%dd/%yyyy

Column Name   Description
playerID      A unique code assigned to each player
birthYear     Year player was born
birthMonth    Month player was born
birthDay      Day player was born
nameFirst     Player's first name
nameLast      Player's last name
debut         Date player made first major league appearance
finalGame     Date player made last major league appearance

173

Page 174: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Table Definitions

� Create the following flow by linking a Sequential File

stage to a Peek stage:

� Next, find the batting table

definition you created,

click and drag the table

onto the link

� On the link:

• Look for the icon that signifies the presence of a table

definition

174

Page 175: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Table Definition

� In the Sequential File stage properties:

• Fill in the File option with the correct path and filename. For

example: C:\student01\training\data\Batting.csv

• Click on the Format tab and review the settings. Are these

consistent with what you see in the Batting.csv data file?

• In the Columns tab, you will note that the table definition you

previously selected and dragged onto the link is now

present. Alternatively, you could have used the Load button

to bring it in or typed it in all over again!

• Next, click on the View Data button to see if you got

everything correct!

Click OK to view data!

175

Page 176: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing Data

� If everything went well, you should see the View Data

window pop up:

� If you get an error instead, take a look at the error

message to determine the location and nature of the

error. Make the necessary corrections and try again.

176

Page 177: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing lab3a

� Save the job as ‘lab3a_batting’

� Compile the job and then click on the run button.

� Go into the Director and take a look at the job log.

• Look out for Warnings and Errors !!!

• Errors are “fatal” and must be resolved.

• Warnings can be an issue. In this case, it could be warning

you that certain records failed to import. This is a bad thing!

� Typical mistakes include formatting and data type

mismatches

• Verify that the column delimiter is correct. Everything should

be comma separated

• Are you using the correct data types?

177

Page 178: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab3a_batting Results

� For your ‘lab3a_batting’ job:

• You should see Import complete. 25076 records imported

successfully, 0 rejected.

• There should be no rejected records!

• Find the Peek output line in the Director’s Log. Double-click on it.

It should look like the following:

178

Page 179: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Importing Rest of the Files

� Repeat the process for the Pitching,

Salaries, and Master files.

• Save the jobs as lab3a_pitching,

lab3a_salaries, and lab3a_master

accordingly

� When finished, your job should

resemble one of the diagrams on

the right.

• Be sure to rename the stages accordingly.

� Make sure that View Data works for

each and every input file.

179

Page 180: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Validating Results

� For your ‘lab3a_pitching’ job:

• You should see Import complete. 11917 records

imported successfully, 0 rejected.

• There should be no rejected records!

� For your ‘lab3a_salaries’ job:

• You should see Import complete. 17277 records

imported successfully, 0 rejected.

• There should be no rejected records!

� For your ‘lab3a_master’ job:

• You should see Import complete. 3817 records

imported successfully, 0 rejected.

• There should be no rejected records!

180

Page 181: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3B: Exporting to a Flat File

181

Page 182: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3B Objective

� Write out the imported data files to ASCII flat files and

parallel datasets

� Use different formatting properties

182

Page 183: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Create Lab 3B Using Lab 3A

� Open the jobs you created in Lab 3A – lab3a_batting,

lab3a_pitching, lab3a_salaries, and lab3a_master

� Save each job again using Save As – use the names

lab3b_batting, lab3b_pitching, lab3b_salaries, and

lab3b_master accordingly.

183

Page 184: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Edit lab3b_batting

� Go to lab3b_batting and edit the job to look like the

following:

� To do so, perform the following steps:

• Click on the Peek stage and delete it

• Attach the Copy stage in its place

• Place a Sequential File stage and a Dataset stage after the

copy

• Draw a link between the copy and the 2 output stages

• Update the link and stage names accordingly

184

Page 185: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Edit lab3b_batting

� In the Copy stage’s Output Mapping tab, map the

source columns to the target columns for both output

links:

185

Page 186: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Source to Target Mapping

right-click

� Once the mapping is complete, you should see the

table definition icon present on the links.

� An easier way to do this would be to right-click on the

Copy stage and use the Auto-map columns feature.

186

Page 187: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Output Stage Properties

� In the Parallel Dataset stage options, fill in the

appropriate path and filename for where the dataset

descriptor file should reside.

• For example:

• Use Batting.ds as the filename

� For the Sequential File stage, fill in the appropriate

path and filename for where the data file should

reside.

• For example:

• Use Batting.txt as the filename

187

Page 188: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Sequential File Formatting

� In the Sequential File stage properties Format tab,

change the Delimiter option to “|” (pipe character):
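
In record schema terms, this is equivalent to changing the column delimiter property on the propagated table definition – a rough sketch with the column list abbreviated (the lengths and types shown are the same assumptions as before):

record {delim='|', final_delim=end, record_delim='\n'}
(
  playerID: string[max=10];
  yearID: int32;
  ...
)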

� Save and compile lab3b_batting

� Run the job and view the results in the Director

188

Page 189: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing the Results

� In the Director’s Log, there should be no warnings or

errors.

� Look for the following output in the Log:

� View the Batting.txt from the Unix prompt and verify

that the columns are now pipe “|” delimited:

aasedo01|1985|BAL|AL|54|0|0|0|0|0|0|0|0|0

ackerji01|1985|TOR|AL|61|0|0|0|0|0|0|0|0|0

agostju01|1985|CHA|AL|54|0|0|0|0|0|0|0|0|0

aikenwi01|1985|TOR|AL|12|20|2|4|1|0|1|5|0|0

alexado01|1985|TOR|AL|36|0|0|0|0|0|0|0|0|0

allenga01|1985|TOR|AL|14|34|2|4|1|0|0|3|0|0

allenne01|1985|NYA|AL|17|0|0|0|0|0|0|0|0|0

armasto01|1985|BOS|AL|103|385|50|102|17|5|23|64|0|4

armstmi01|1985|NYA|AL|9|0|0|0|0|0|0|0|0|0

atherke01|1985|OAK|AL|56|0|0|0|0|0|0|0|0|0

189

Page 190: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Understanding the Output

� How does DataStage know which columns to write

out and in what order?

� If you look in the Columns tab under the output

Sequential File stage properties, you will see that the

table definition from the source has been propagated

to the target:

• DataStage uses

this along with

the formatting

to create the

output file

190

Page 191: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Exporting Rest of the Files

� Repeat the process for lab3b_pitching,

lab3b_salaries, and lab3b_master accordingly

� The number of records read in each time should

match the number of records written out

� Make sure there are no warnings or errors in the

Director’s Log for each job.

191

Page 192: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab3A and Lab3B Review

Congratulations! You have successfully:

� Created table definitions to describe existing data

layouts

� Imported data from flat files using the newly created

table definitions and the Sequential File stage

� Exported data to flat files using the Sequential File

stage

� Written data to Parallel Datasets using the Parallel

Dataset stage.

192

Page 193: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

PLACEHOLDER SLIDE

Insert appropriate set of database

connectivity slides here depending on

customer environment:

� DB2

� Oracle

� Teradata

� Sybase

193

Page 194: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Connectivity

DataStage DB2 Enterprise Stage

194

Page 195: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C: Inserting Into RDBMS

195

Page 196: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C Objective

� Insert the data stored within the Datasets created in

Lab 3B into the database

196

Page 197: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Jobs for Lab 3C

� In this lab you will use the data you wrote out to the

Datasets in Lab 3B as the source data to be loaded

into the target database table.

� The instructor should have pre-configured the

necessary settings for database connectivity.

• Confirm this with the instructor

• If database is not pre-configured, then obtain the necessary

connectivity details such as database server, database

name, location, etc…

197

Page 198: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Setting Up Parameters

� To make things easier, we will use job parameters.

� Go to the Administrator and open your project properties.

� Access the Environment Variable settings and create the

following 3 parameters:

Use your own directory

path, userid, and password

(NOTE: userid and

password may not be

necessary, depending on

DB2 setup)

198

Page 199: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – DB2

� Create a new job and pull together the following

stages (Dataset, Peek, and DB2 Enterprise):

� Rename the links and stages accordingly

� In the Dataset stage:

• Use the FILEPATH

parameter along with the Dataset filename created

earlier in Lab 3B

• Load the ‘Batting’ table definition in the Columns tab. While

this step is optional, it does provide design time metadata.

199

Page 200: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – DB2 (continued)

� In the DB2 Enterprise stage:

• For the Table option, precede the table name with your

initials. If using a shared database environment, this will

prevent conflicts.

200

Page 201: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – DB2 (continued)

� Make sure to map source columns to target columns,

going from Dataset to Peek to DB2.

� Save the job as lab3c_batting

� Compile and run the job

� Verify that there are no errors in the Director’s Log.

� Use the Job Performance Statistics to verify that

25076 records were loaded:

• Log into the database and issue a

select count(*) from tablename

query to double-check the record count

201

Page 202: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – DB2 (continued)

� Repeat the process for Pitching, Salaries, and Master

in jobs lab3c_pitching, lab3c_salaries, and

lab3c_master respectively

� The number of records read in each time should

match the number of records written out

� Make sure there are no errors in the Director’s Log

for each job.

• There may be some warnings concerning data conversion.

These are not critical.

202

Page 203: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D: Extracting From RDBMS

203

Page 204: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D Objective

� Extract Batting, Pitching, Salaries, and Master tables

from the Database

204

Page 205: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Jobs for Lab 3D

� In this lab you will extract the data you loaded into the

database in Lab 3C

� You will leverage the same setup for database

connectivity as Lab 3C

• Confirm this with the instructor

• If database is not pre-configured, then obtain the necessary

connectivity details such as database server, database

name, location, etc…

� Use the same USERID and PASSWORD job

parameters (if needed) from Lab 3C

205

Page 206: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – DB2

� Create a new job and pull together the following

stages (DB2 Enterprise and Peek):

� For the DB2 Enterprise stage:

• Use the Table Read Method instead of SQL-Builder

206

Page 207: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – DB2 (continued)

� Optionally, you can load the Batting table definition in

the Columns tab for the DB2 Enterprise stage

properties.

• DataStage automatically extracts the table definition from the

database and uses RCP to propagate it downstream

• You can try the job with and without the table definition

loaded in the Columns tab

• If you load the Batting table definition, you may receive some

type conversion warnings at runtime in the Director’s Log

which can be ignored for this job.

207

Page 208: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – DB2 (continued)

� Save the job as lab3d_batting

� Compile and run the job

� Verify that there are no errors in the Director’s Log.

� Use the Job Performance Statistics to verify that

25076 records were extracted.

208

Page 209: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – DB2 (continued)

� Repeat the process for Pitching, Salaries, and Master

in jobs lab3d_pitching, lab3d_salaries, and

lab3d_master respectively

� The number of records read in each time should

match the number of records written out

� Make sure there are no errors in the Director’s Log

for each job.

• There may be some warnings concerning data conversion.

These are not critical.

209

Page 210: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab3C and Lab3D Review

Congratulations! You have successfully:

� Loaded Batting, Pitching, Salaries, and Master data

into the respective database tables

� Extracted Batting, Pitching, Salaries, and Master

tables from the database

210

Page 211: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Connectivity

DataStage Oracle Enterprise Stage

211

Page 212: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C: Inserting Into RDBMS

212

Page 213: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C Objective

� Insert the data stored within the Datasets created in

Lab 3B into the database

213

Page 214: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Jobs for Lab 3C

� In this lab you will use the data you wrote out to the

Datasets in Lab 3B as the source data to be loaded

into the target database table.

� The instructor should have pre-configured the

necessary settings for database connectivity.

• Confirm this with the instructor

• If database is not pre-configured, then obtain the necessary

connectivity details such as database server, database

name, location, etc…

214

Page 215: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Setting Up Parameters

� To make things easier, we will use job parameters.

� Go to the Administrator and open your project properties.

� Access the Environment Variable settings and create the

following 3 parameters:

Use your own directory

path, userid, and password

215

Page 216: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Oracle

� Create a new job and pull together the following

stages (Dataset, Peek, and Oracle Enterprise):

� Rename the links and stages accordingly

� In the Dataset stage:

• Use the FILEPATH

parameter along with the Dataset filename created

earlier in Lab 3B

• Load the ‘Batting’ table definition in the Columns tab. While

this step is optional, it does provide design time metadata.

216

Page 217: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Oracle (continued)

� In the Oracle Enterprise stage:

• Configure settings as shown below, using the USERID and

PASSWORD job parameters as shown below

• For the Table option, precede the table name with your

initials. If using a shared database environment, this will

prevent conflicts.

217

Page 218: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Oracle (continued)

� Make sure to map source columns to target columns,

going from Dataset to Peek to Oracle.

� Save the job as lab3c_batting

� Compile and run the job

� Verify that there are no errors in the Director’s Log.

� Use the Job Performance Statistics to verify that

25076 records were loaded:

• Log into the database and

issue a select count(*) from tablename query to

double-check the record count

218

Page 219: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Oracle (continued)

� Repeat the process for Pitching, Salaries, and Master

in jobs lab3c_pitching, lab3c_salaries, and

lab3c_master respectively

� The number of records read in each time should

match the number of records written out

� Make sure there are no errors in the Director’s Log

for each job.

• There may be some warnings concerning data conversion.

These are not critical.

219

Page 220: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D: Extracting From RDBMS

220

Page 221: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D Objective

� Extract Batting, Pitching, Salaries, and Master tables

from the Database

221

Page 222: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Jobs for Lab 3D

� In this lab you will extract the data you loaded into the

database in Lab 3C

� You will leverage the same setup for database

connectivity as Lab 3C

• Confirm this with the instructor

• If database is not pre-configured, then obtain the necessary

connectivity details such as database server, database

name, location, etc…

� Use the same USERID and PASSWORD job

parameters from Lab 3C

222

Page 223: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Oracle

� Create a new job and pull together the following

stages (Oracle Enterprise and Peek):

� For the Oracle Enterprise stage:

• Configure settings as shown below, using the USERID and

PASSWORD job parameters as shown below

• Use the Table Read Method instead of SQL-Builder

223

Page 224: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Oracle (continued)

� Optionally, you can load the Batting table definition in

the Columns tab for the Oracle Enterprise stage

properties.

• DataStage automatically extracts the table definition from the

database and uses RCP to propagate it downstream

• You can try the job with and without the table definition

loaded in the Columns tab

• If you load the Batting table definition, you may receive some

type conversion warnings at runtime in the Director’s Log

which can be ignored for this job.

224

Page 225: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Oracle (continued)

� Save the job as lab3d_batting

� Compile and run the job

� Verify that there are no errors in the Director’s Log.

� Use the Job Performance Statistics to verify that

25076 records were extracted:

225

Page 226: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Oracle (continued)

� Repeat the process for Pitching, Salaries, and Master

in jobs lab3d_pitching, lab3d_salaries, and

lab3d_master respectively

� The number of records read in each time should

match the number of records written out

� Make sure there are no errors in the Director’s Log

for each job.

• There may be some warnings concerning data conversion.

These are not critical.

226

Page 227: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab3C and Lab3D Review

Congratulations! You have successfully:

� Loaded Batting, Pitching, Salaries, and Master data

into the respective database tables

� Extracted Batting, Pitching, Salaries, and Master

tables from the database

227

Page 228: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Connectivity

DataStage Teradata Enterprise Stage

228

Page 229: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C: Inserting Into RDBMS

229

Page 230: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C Objective

� Insert the data stored within the Datasets created in

Lab 3B into the database

230

Page 231: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Jobs for Lab 3C

� In this lab you will use the data you wrote out to the

Datasets in Lab 3B as the source data to be loaded

into the target database table.

� The instructor should have pre-configured the

necessary settings for database connectivity.

• Confirm this with the instructor

• If database is not pre-configured, then obtain the necessary

connectivity details such as database server, database

name, location, etc…

231

Page 232: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Setting Up Parameters

� To make things easier, we will use job parameters.

� Go to the Administrator and open your project properties.

� Access the Environment Variable settings and create the

following 3 parameters:

Use your own directory

path, userid, and password

232

Page 233: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Teradata

� Create a new job and pull together the following

stages (Dataset, Peek, and Teradata Enterprise):

� Rename the links and stages accordingly

� In the Dataset stage:

• Use the FILEPATH

parameter along with the Dataset filename created

earlier in Lab 3B

• Load the ‘Batting’ table definition in the Columns tab. While

this step is optional, it does provide design time metadata.

233

Page 234: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Teradata (continued)

� In the Teradata Enterprise stage:

• For the Table option, precede the table name with your

initials. If using a shared database environment, this will

prevent conflicts.

NOTE: You may also need to

specify a Database option and

provide the name of the

database to connect to.

234

Page 235: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Teradata (continued)

� Make sure to map source columns to target columns,

going from Dataset to Peek to Teradata.

� Save the job as lab3c_batting

� Compile and run the job

� Verify that there are no errors in the Director’s Log.

� Use the Job Performance Statistics to verify that

25076 records were loaded:

• Log into the database and issue a

select count(*) from tablename

query to double-check the record count

235

Page 236: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3C – Teradata (continued)

� Repeat the process for Pitching, Salaries, and Master

in jobs lab3c_pitching, lab3c_salaries, and

lab3c_master respectively

� The number of records read in each time should

match the number of records written out

� Make sure there are no errors in the Director’s Log

for each job.

• There may be some warnings concerning data conversion.

These are not critical.

236

Page 237: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D: Extracting From RDBMS

237

Page 238: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D Objective

� Extract Batting, Pitching, Salaries, and Master tables

from the Database

238

Page 239: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Jobs for Lab 3D

� In this lab you will extract the data you loaded into the

database in Lab 3C

� You will leverage the same setup for database

connectivity as Lab 3C

• Confirm this with the instructor

• If database is not pre-configured, then obtain the necessary

connectivity details such as database server, database

name, location, etc…

� Use the same USERID and PASSWORD job

parameters from Lab 3C

239

Page 240: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Teradata

� Create a new job and pull together the following

stages (Teradata Enterprise and Peek):

� For the Teradata Enterprise stage:

• Configure settings as shown below, using the USERID and

PASSWORD job parameters as shown below

• Use the Table Read Method

• May need to specify the Database option

240

Page 241: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Teradata (continued)

� Optionally, you can load the Batting table definition in

the Columns tab for the Teradata Enterprise stage

properties.

• DataStage automatically extracts the table definition from the

database and uses RCP to propagate it downstream

• You can try the job with and without the table definition

loaded in the Columns tab

• If you load the Batting table definition, you may receive some

type conversion warnings at runtime in the Director’s Log

which can be ignored for this job.

241

Page 242: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Teradata (continued)

� Save the job as lab3d_batting

� Compile and run the job

� Verify that there are no errors in the Director’s Log.

� Use the Job Performance Statistics to verify that

25076 records were extracted.

242

Page 243: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3D – Teradata (continued)

� Repeat the process for Pitching, Salaries, and Master

in jobs lab3d_pitching, lab3d_salaries, and

lab3d_master respectively

� The number of records read in each time should

match the number of records written out

� Make sure there are no errors in the Director’s Log

for each job.

• There may be some warnings concerning data conversion.

These are not critical.

243

Page 244: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab3C and Lab3D Review

Congratulations! You have successfully:

� Loaded Batting, Pitching, Salaries, and Master data

into the respective database tables

� Extracted Batting, Pitching, Salaries, and Master

tables from the database

244

Page 245: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3E: Importing a COBOL

Copybook

245

Page 246: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3E Objective

� Import a COBOL copybook and save it as a table

definition

� Compare the DataStage table definition to the

copybook

246

Page 247: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 3E – Copybook

� We will import the following COBOL copybook:

01 CLIENT-RECORD.

05 FIRST-NAME PIC X(16).

05 LAST-NAME PIC X(20).

05 GENDER PIC X(1).

05 BIRTH-DATE PIC X(10).

05 INCOME PIC 9999999V99 COMP-3.

05 STATE PIC X(2).

05 RECORD-ID PIC 999999999 COMP.

� The copybook is located in a file called customer.cfd

� You will need to have a copy of this file locally on

your computer in order to import it into DataStage

247

Page 248: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Importing the Copybook

� To import a COBOL copybook:

• Right-click on Table Definitions

• Select Import

• Select COBOL File Definitions…

Click on Import to translate the copybook into a DataStage table definition.

248

Page 249: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Copybook Imported

� If no errors occur, then your copybook has been

successfully imported and translated into

a DataStage table definition.

� Double-click on the

newly created table definition

to view it

249

Page 250: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing the Translated Copybook

� You can click on the

Layout tab to view the

copybook in its original

format as well as its

newly translated

format!

� This is the DataStage

internal schema format

• Clicking on the

Columns tab will show

the table definition in

a grid format

250

Page 251: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview (Page 10)

2. Parallel Framework Overview (Page 73)

3. Data Import and Export (Page 116)

4. Data Partitioning, Sorting, and Collection (Page 252)

5. Data Transformation and Manipulation (Page 309)

6. Data Combination (Page 364)

7. Custom Components: Wrappers (Page 420)

8. Custom Components: Buildops (Page 450)

9. Additional Topics (Page 477)

10. Glossary (Page 526)

251

Page 252: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Partitioning, Sorting and Collection

In this section we will discuss:

� Data Partitioning

� Sorting Data

� Duplicate Removal

� Data Collection

� Funnel Stage

252

Page 253: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Partitioning

� In Chapter 2 we very briefly touched upon the topic of

partitioning:

� We also discussed the fact that parallelism would not

be possible without partitioning.

• For example, how would the following be accomplished

without partitioning the data as it comes in from the

sequential input file?

[Diagram: a DataFile feeds a Partitioner that splits the stream into four partitions (Records 1 - 1000, 1001 - 2000, 2001 - 3000, 3001 - 4000); the example job reads a sequential Input file through stages A, B, and C into an RDBMS.]

253

Page 254: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

What Is Data Partitioning?

� Data Partitioning, simply put, is a means of distributing records

amongst partitions.

• A partition is like a division or logical grouping

• Several partitioning algorithms exist

� When you sit down at a card game, how are the cards dealt out?

• The dealer typically distributes the cards evenly to all players.

• Each player winds up with an equal number of cards.

� When partitioning data, it is often desirable to achieve a balance

of records in each partition

• Too many records landing in any one partition is referred to as a data

skew.

• Data skews cause overall processing to take longer to finish.

254
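As an illustration only (plain Python, not DataStage syntax), the sketch below counts how many records landed in each partition and reports the worst partition against the average; a ratio well above 1.0 signals a data skew. The record counts are made up for the example.

from collections import Counter

def partition_balance(partition_ids):
    counts = Counter(partition_ids)              # records per partition
    avg = sum(counts.values()) / len(counts)
    return counts, max(counts.values()) / avg    # ratio near 1.0 = well balanced

counts, skew = partition_balance([0, 1, 2, 3] * 250 + [0] * 600)
print(counts, skew)   # partition 0 holds 850 of 1,600 records; ratio is about 2.1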

Page 255: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Partitioner Overview

� In DataStage, there are many options for partitioning

data:

• Auto (Default)

• Random

• Roundrobin

• Same

• Entire

• Modulus

• Range

• Hash

• DB2

255

Page 256: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Auto Partitioning

� By default, partitioning is always set to Auto

• Auto means the Framework will decide the most optimal

partitioning algorithm based on what the job is doing.

� Partitioning is accessed from the same location for

any given stage with an input link attached:

256

Page 257: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Random Partitioning

� The records are partitioned randomly, based on the output of a

random number generator. No further information is required.

� Suppose we have the following record:

� The randomly partitioned records may look like:

playerID - varchar yearID - integer teamID - char[3] ERA - float

behenri01 1985 CLE 7.78

blackbu02 1985 KCA 4.33

blylebe01 1985 MIN 3.00

bordiri01 1985 NYA 3.21

butchjo01 1985 MIN 4.98

candejo01 1985 CAL 3.80

clancji01 1985 TOR 3.78

clarkbr01 1985 CLE 6.32

clarkst02 1985 TOR 4.50

coopedo01 1985 NYA 5.40

aasedo01 1985 BAL 3.78

armstmi01 1985 NYA 3.07

beckwjo01 1985 KCA 4.07

boddimi01 1985 BAL 4.07

brownma02 1985 MIN 6.89

brownmi01 1985 BOS 21.6

camacer01 1985 CLE 8.10

caudibi01 1985 TOR 2.99

ackerji01 1985 TOR 3.23

alexado01 1985 TOR 3.45

atherke01 1985 OAK 4.30

barklje01 1985 CLE 5.27

birtsti01 1985 OAK 4.01

blylebe01 1985 CLE 3.26

agostju01 1985 CHA 3.58

allenne01 1985 NYA 2.76

bairdo01 1985 DET 6.24

bannifl01 1985 CHA 4.87

barojsa01 1985 SEA 5.98

beattji01 1985 SEA 7.29

beller01 1985 BAL 4.76

berenju01 1985 DET 5.59

bestka01 1985 SEA 1.95

Partition #1 Partition #2 Partition #3 Partition #4

257

Page 258: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Roundrobin Partitioning

� Records are distributed very evenly amongst all

partitions. Use this method (or Auto) when in doubt.

� The roundrobin partitioned records may look like:

aasedo01 1985 BAL 3.78

allenne01 1985 NYA 2.76

bannifl01 1985 CHA 4.87

beckwjo01 1985 KCA 4.07

bestka01 1985 SEA 1.95

blylebe01 1985 MIN 3.00

boydoi01 1985 BOS 3.79

burrira01 1985 ML4 4.81

camacer01 1985 CLE 8.10

cerutjo01 1985 TOR 5.40

ackerji01 1985 TOR 3.23

armstmi01 1985 NYA 3.07

barklje01 1985 CLE 5.27

behenri01 1985 CLE 7.78

birtsti01 1985 OAK 4.01

boddimi01 1985 BAL 4.07

brownma02 1985 MIN 6.89

burttde01 1985 MIN 3.81

candejo01 1985 CAL 3.80

clancji01 1985 TOR 3.78

agostju01 1985 CHA 3.58

atherke01 1985 OAK 4.30

barojsa01 1985 SEA 5.98

Beller01 1985 BAL 4.76

blackbu02 1985 KCA 4.33

boggsto01 1985 TEX 11.57

brownmi01 1985 BOS 21.6

butchjo01 1985 MIN 4.98

Carych01 1985 DET 3.42

clarkbr01 1985 CLE 6.32

alexado01 1985 TOR 3.45

bairdo01 1985 DET 6.24

beattji01 1985 SEA 7.29

berenju01 1985 DET 5.59

blylebe01 1985 CLE 3.26

bordiri01 1985 NYA 3.21

burnsbr01 1985 CHA 3.96

bystrma01 1985 NYA 5.71

caudibi01 1985 TOR 2.99

clarkst02 1985 TOR 4.50

Partition #1 Partition #2 Partition #3 Partition #4

258

Page 259: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Same Partitioning

� Preserves whatever partitioning is already in

place.

• Data remains in the same partition throughout the flow (aka

partitioned parallelism), or until data becomes repartitioned

on purpose

• Does not care about how data was previously partitioned

• Sets the Preserve Partitioning flag to prevent automatic

repartitioning later

� Same partitioning is most useful for preserving sort

order

259

Page 260: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Entire Partitioning

� Places a complete copy of the data into each partition:

� Entire partitioning is useful for making a copy of the

data available on all processing nodes of a shared

nothing environment.

• No shared memory between processing nodes

• Entire forces a copy to be pushed out to each node

• Lookup stage does this

[Diagram: with an Auto partitioner, a 100,000-record DataFile splits into four partitions of 25,000 records each (1 - 25,000, 25,001 - 50,000, 50,001 - 75,000, 75,001 - 100,000); with an Entire partitioner, every partition receives the full set of Records 1 - 100,000.]

260

Page 261: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Modulus Partitioning

� Distributes records using a modulus function on the key column

selected from the available list.

• (field value) mod (number of partitions n) = 0, 1, ..., n-1

� Using our previous example record,

we will perform a modulus partition on yearID.

� Results would look like this (Modulus+1=part#):

playerID - varchar   yearID - integer   teamID - char[3]   ERA - float

aasedo01 1988 BAL 4.05

alexado01 1988 DET 4.32

allenne01 1988 NYA 3.84

anderal02 1988 MIN 2.45

anderri02 1988 KCA 4.24

aquinlu01 1988 KCA 2.79

atherke01 1988 MIN 3.41

augusdo01 1988 ML4 3.09

bailesc01 1988 CLE 4.9

bairdo01 1988 TOR 4.05

. . .

aasedo01 1985 BAL 3.78

ackerji01 1985 TOR 3.23

agostju01 1985 CHA 3.58

alexado01 1985 TOR 3.45

allenne01 1985 NYA 2.76

armstmi01 1985 NYA 3.07

atherke01 1985 OAK 4.3

bairdo01 1985 DET 6.24

bannifl01 1985 CHA 4.87

barklje01 1985 CLE 5.27

. . .

aasedo01 1986 BAL 2.98

ackerji01 1986 TOR 4.35

agostju01 1986 CHA 7.71

agostju01 1986 MIN 8.85

akerfda01 1986 OAK 6.75

alexado01 1986 TOR 4.46

allenne01 1986 CHA 3.82

anderal02 1986 MIN 5.55

andujjo01 1986 OAK 3.82

aquinlu01 1986 TOR 6.35

. . .

aasedo01 1987 BAL 2.25

akerfda01 1987 CLE 6.75

aldrija01 1987 ML4 4.94

alexado01 1987 DET 1.53

allenne01 1987 CHA 7.07

allenne01 1987 NYA 3.65

anderal02 1987 MIN 10.95

anderri02 1987 KCA 13.85

andersc01 1987 TEX 9.53

andujjo01 1987 OAK 6.08

. . .

Partition #1 Partition #2 Partition #3 Partition #4

261
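A minimal sketch of the idea in plain Python (not DataStage syntax), assuming records are simple dictionaries keyed by the column names used above: each record goes to partition (yearID mod 4), matching the modulus+1 numbering in the sample output.

def modulus_partition(records, key, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        parts[rec[key] % n_partitions].append(rec)   # (field value) mod n
    return parts

rows = [{"playerID": "aasedo01", "yearID": y} for y in (1985, 1986, 1987, 1988)]
for i, part in enumerate(modulus_partition(rows, "yearID", 4), start=1):
    print("Partition", i, [r["yearID"] for r in part])
# Partition 1 gets 1988 (mod 0), Partition 2 gets 1985 (mod 1), and so on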

Page 262: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Range Partitioning

� Partitions data into approximately equal size

partitions based on one or more partitioning keys.

• Range partitioning is often a preprocessing step to

performing a total sort on a dataset

• Requires extra pass through the data to create range map

Suppose we want to range partition based on a baseball pitcher's Earned Run Average (ERA). We first have to create a range map file as shown here.

262
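A minimal sketch in plain Python (not DataStage syntax) of how a range map can work: boundary values are sampled from the key, and each record is assigned to the range its ERA falls into. The boundary-picking rule here is only illustrative, not the Framework's actual algorithm.

from bisect import bisect_right

def build_range_map(sample_keys, n_partitions):
    s = sorted(sample_keys)
    # boundary values taken at the 1/n, 2/n, ... points of the sample
    return [s[len(s) * i // n_partitions] for i in range(1, n_partitions)]

def range_partition(key, boundaries):
    return bisect_right(boundaries, key)          # partition number 0..n-1

era_sample = [0.00, 3.43, 3.44, 4.48, 4.49, 5.91, 5.93, 21.6]
boundaries = build_range_map(era_sample, 4)
print(boundaries)                                  # [3.44, 4.49, 5.93]
print(range_partition(4.48, boundaries))           # 1 (the second partition)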

Page 263: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Range Partitioning (Continued)

� Once the range map is created, it must also be referenced from

within the job that is performing the range partitioning:

� Sorting typically occurs whenever range partitioning is

performed, in order best group records belonging in the same

range.

• Actual range values are determined by the Framework using an

algorithm that attempts to achieve an optimal distribution of

records.

263

Page 264: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Range Partitioning (Continued)

� Using the same example record, once the data has

been range partitioned and sorted on just ERA, the

results would resemble the following:

� Range partitioning is very effective for producing

balanced partitions and can be efficient if data

characteristics do not change over time.

benneer01 1995 CAL 0.00

mercejo02 2003 MON 0.00

dascedo01 1990 CHN 0.00

chenbr01 2002 NYN 0.00

stricsc01 2002 MON 0.00

brownke03 1990 NYN 0.00

seoja01 2002 NYN 0.00

gaettga01 1997 SLN 0.00

damicje01 1999 MIL 0.00

ceronri01 1987 NYA 0.00

. . .

clarkma01 1996 NYN 3.43

moyerja01 2001 SEA 3.43

downske01 1990 SFN 3.43

ballaje01 1989 BAL 3.43

bonesri01 1994 ML4 3.43

wehrmda01 1985 CHA 3.43

tomkobr01 1997 CIN 3.43

loiseri01 1998 PIT 3.44

coneda01 1999 NYA 3.44

remlimi01 2004 CHN 3.44

. . .

desseel01 2001 CIN 4.48

cruzne01 2002 HOU 4.48

marotmi01 2002 DET 4.48

towerjo01 2003 TOR 4.48

zitoba01 2004 OAK 4.48

tomkobr01 2005 SFN 4.48

roberna01 2005 DET 4.48

clancji01 1988 TOR 4.49

hudsoch02 1988 NYA 4.49

johnto01 1988 NYA 4.49

. . .

foulkke01 2005 BOS 5.91

searara01 1985 ML4 5.92

willica01 1994 MIN 5.92

smallma01 1996 HOU 5.92

welleto01 2004 CHN 5.92

bairdo01 1987 PHI 5.93

staplda02 1988 ML4 5.93

wengedo01 1998 SDN 5.93

broxtjo01 2005 LAN 5.93

smithmi03 1987 MIN 5.94

. . .

Partition #1 Partition #2 Partition #3 Partition #4

264

Page 265: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hash Partitioning

� Partitions records based on the value of a key

column or columns

• All records with the same key column value will wind up in

the same partition

• Hash partitioning is often a preprocessing step to performing

a total sort on a dataset

• Poorly chosen partition key(s) can result in a data skew –

that is, the majority of the records wind up in one or two

partitions while the rest of the partitions receive no data.

o For example, hash partitioning on gender yields only two or three

distinct key values, so all of the records land in just a few partitions

while the remaining partitions stay empty. Skews are bad!

265
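A minimal sketch in plain Python (not DataStage syntax), assuming records are dictionaries: hashing the chosen key columns modulo the number of partitions guarantees that rows with the same key values land in the same partition.

def hash_partition(records, keys, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        p = hash(tuple(rec[k] for k in keys)) % n_partitions
        parts[p].append(rec)                      # same key -> same partition
    return parts

rows = [{"playerID": "aasedo01", "teamID": "BAL", "ERA": e} for e in (2.25, 2.98, 3.78)]
rows += [{"playerID": "ackerji01", "teamID": "TOR", "ERA": e} for e in (3.23, 4.35)]
for i, part in enumerate(hash_partition(rows, ["playerID", "teamID"], 4), start=1):
    print("Partition", i, [(r["playerID"], r["ERA"]) for r in part])
# All aasedo01/BAL rows share one partition and all ackerji01/TOR rows another
# (Python salts string hashes per run, so which partition that is can vary by run).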

Page 266: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hash Partitioning (Continued)

� To select the key(s) to perform a Hash partitioning

on, click on the Input Tab found in stage properties.

� Select the Hash

partition type and

then select the

partitioning key(s)

� Check the Sort

box to also sort the data once it has been partitioned

� NOTE: Selection order matters! The first key

selected will always act as the primary key

266

Page 267: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hash Partitioning (Continued)

� Using the same example record, once the data has

been hash partitioned and sorted on playerID and

teamID, the results would resemble the following:

� Note that all records with the same playerID and

teamID value are now in the same partition.

aasedo01 1986 BAL 2.98

aasedo01 1985 BAL 3.78

aasedo01 1988 BAL 4.05

aasedo01 1987 BAL 2.25

aasedo01 1990 LAN 4.97

abbotpa01 1991 MIN 4.75

abbotpa01 1990 MIN 5.97

abbotpa01 1992 MIN 3.27

abbotpa01 2002 SEA 11.96

abbotpa01 2000 SEA 4.22

. . .

aasedo01 1989 NYN 3.94

abbotky01 1991 CAL 4.58

abbotky01 1996 CAL 20.25

abbotky01 1992 PHI 5.13

abbotky01 1995 PHI 3.81

aceveju01 2003 TOR 4.26

ackerji01 1990 TOR 3.83

ackerji01 1985 TOR 3.23

ackerji01 1991 TOR 5.20

ackerji01 1986 TOR 4.35

. . .

aardsda01 2004 SFN 6.75

abbotji01 1989 CAL 3.92

abbotji01 1991 CAL 2.89

abbotji01 1995 CAL 4.15

abbotji01 1992 CAL 2.77

abbotji01 1996 CAL 7.48

abbotji01 1990 CAL 4.51

abbotji01 1999 MIL 6.91

abbotpa01 1993 CLE 6.38

abbotpa01 2003 KCA 5.29

. . .

abbotji01 1998 CHA 4.55

abbotji01 1995 CHA 3.36

abbotji01 1993 NYA 4.37

abbotji01 1994 NYA 4.55

abbotpa01 2004 TBA 6.70

abregjo01 1985 CHN 6.38

aceveju01 2001 FLO 2.54

aceveju01 2003 NYA 7.71

aceveju01 1998 SLN 2.56

aceveju01 1999 SLN 5.89

. . .

Partition #1 Partition #2 Partition #3 Partition #4

267

Page 268: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DB2 Partitioning

� Distributes the data using the same partitioning

algorithm as DB2.

• Must supply specific DB2 table via the Partition Properties

NOTE: DB2 Enterprise stage automatically invokes the DB2 partitioner prior to loading data in parallel.

268

Page 269: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Partitioning… And Then Sorting!

� When sorting data, always make sure it’s pre-

partitioned using either Range or Hash partitioners.

• Why? What happens when data is sorted without first either

Range or Hash partitioning it? The result will be useless in

steps like de-duping or merging, since the data will not be

truly sorted – records that belong together in the same

partition would likely be found on different partitions.

• When running sequentially, partitioning is not necessary!

� Partitioning and Sorting are very expensive

operations

• Requires lots of CPU and disk I/O

• Do not unnecessarily partition and sort the data!

269

Page 270: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Sorting Techniques

� There are 2 ways to sort data

• Easiest way is to specify the sort key(s) on the input link

properties for any given stage that supports an input link.

• The actual Sort stage is also an option for sorting data

within a flow. Functionally, it is identical.

� Sorting requires data to be pre-partitioned using

either Range or Hash

� Sorting sets the Preserve Partitioning flag which

forces Same partitioning to occur downstream.

• This avoids messing up the sorted order of the records

270

Page 271: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Sort Properties

� Common properties for sorting include

• Unique – removes duplicates where duplicates are

determined by the specified key fields being sorted on. For

example, if sorting on playerID and teamID, then all records

with the same playerID and teamID will be considered

identical and only 1 will be kept.

• Stable – preserves the existing order of records that compare

equal on the new sort key. For example, if the data is

already sorted on playerID and teamID, and the new sort

key is ERA, records with the same ERA stay in playerID,

teamID order. If the option is not set, then only the ERA

order is guaranteed.

271

Page 272: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Sort Stage Properties

� The Sort Stage is more flexible than the sort option on the stage

input link properties.

� Sort Stage allows for more advanced options to be leveraged:

• Restrict Memory Usage –

by default, DS uses 20MB

of memory per partition for

sorting. It is a good idea to

increase this amount if there

is plenty of memory

available. Sorting on disk is

much slower than sorting in memory.

• Create Cluster Key Change Column – creates a marker to

indicate each time a key change occurs.

o Useful for applications needing to identify key changes in order to apply

business logic to individual record groups

272

Page 273: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Removing Duplicates

� The Remove Duplicates stage removes duplicate

records based on specified keys.

• Records must be Hash partitioned and Sorted on the same keys

• Must specify at least 1 key

• Selected key columns define a duplicate record

• Can choose to keep first or last record

• Similar to using the

Unique option under

Sort options

273
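A minimal sketch in plain Python (not the Remove Duplicates stage itself): once records are grouped and sorted on the key, keeping the first or last record per key is a single pass over each group.

from itertools import groupby

def remove_duplicates(sorted_records, keys, keep="first"):
    keyfn = lambda rec: tuple(rec[k] for k in keys)
    for _, group in groupby(sorted_records, key=keyfn):   # requires sorted input
        group = list(group)
        yield group[0] if keep == "first" else group[-1]

rows = [{"playerID": "aasedo01", "teamID": "BAL", "yearID": y}
        for y in (1985, 1986, 1987, 1988)]
print(list(remove_duplicates(rows, ["playerID", "teamID"], keep="last")))
# keeps only the 1988 record for the aasedo01/BAL key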

Page 274: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Collection

� Data collection is the opposite of data partitioning:

� All records from all partitions are gathered into a

single partition.

� Collectors are used when:

• Writing out to a sequential file

• Processing data through a stage that runs sequentially

[Diagram: four partitions (Records 1 - 1000, 1001 - 2000, 2001 - 3000, 3001 - 4000) pass through a Collector and are written to a single sequential DataFile.]

274

Page 275: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Collection Methods

� In DataStage, there are a few options for collecting

data:

• Auto (Default)

• Roundrobin

• Ordered

• Sort Merge

275

Page 276: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Auto Collector

� By default, collecting is always set to Auto

• Auto means the Framework will decide the most optimal

collecting algorithm based on what the job is doing.

� Collector type is accessed from the same location for

any given sequential stage with an input link

attached:

276

Page 277: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Roundrobin Collector

� Collects records from multiple partitions in a

roundrobin manner.

• Collects a record from the first partition, then the second,

then the third, etc… until it reaches the last partition and then

starts over again.

• Extremely fast!

• Typically same as Auto

277

Page 278: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Ordered Collector

� Reads all records from the first partition, then all

records from the second partition, and so on until all

partitions have been read.

• Useful for maintaining sort order – if data was previously

partition-sorted, then the outcome will be a sorted single

partition.

• Could be slow if some partitions get ‘backed up’.

278

Page 279: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Sort Merge Collector

� Reads records in an order based on one or more key columns

of the record.

• Will maintain the sorted order

• Must select at least 1 collecting key column

• Collector key(s) should match partition-sorting key(s)

• Similar to Ordered collector

� Sort Merge not only acts as a collector, but also manages data

flow from many partitions to fewer partitions

• For example, a job can run 8-way parallel and then slow down to 4-

way parallel. To accomplish this, the Framework leverages the

Sort Merge to maintain the sort order and partitioning strategy.

279
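A minimal sketch in plain Python (not DataStage syntax) of the sort-merge idea: each partition is already sorted descending on RBI, and the merge interleaves them so the single output stream stays in RBI order. The player records are taken from the earlier examples.

import heapq

# each partition is already sorted descending on RBI (the collecting key)
part_0 = [("sosasa01", 158), ("belleal01", 148)]
part_1 = [("ramirma02", 165), ("thomafr04", 143)]

merged = heapq.merge(part_0, part_1, key=lambda rec: rec[1], reverse=True)
print(list(merged))
# [('ramirma02', 165), ('sosasa01', 158), ('belleal01', 148), ('thomafr04', 143)]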

Page 280: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Link Indicators

� The icons found on links are an indicator of what is

happening in terms of partitioning or collecting.

• Auto partitioning

• Sequential to Parallel, data is being partitioned

• Data re-partitioning

• Same partitioning

• Partition and Sort

• Sort and Collect data

• Collect data

280

Page 281: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Funnel Stage

� Collects many links and outputs only 1 link

• All input links must possess the exact same data layout

� Do not confuse with Collectors!!!

• Funnel keeps everything running

in parallel – many links come

in, one link goes out. Each link

represents many partitions

• Collectors go from parallel to sequential – one link with many

partitions comes in, and all data is put into a single sequential

partition on the output.

281
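A minimal sketch in plain Python (not DataStage syntax) contrasting the two: one simple way to picture the funnel is combining the rows of corresponding partitions from each input link, so the output stays parallel, while a collector flattens everything into one sequential stream. The link and row names are made up for illustration.

link_1 = [["a1", "a2"], ["a3"]]          # 2 partitions from one input link
link_2 = [["b1"], ["b2", "b3"]]          # 2 partitions from another input link

# Funnel: combine link rows partition by partition, output stays 2-way parallel
funneled = [p1 + p2 for p1, p2 in zip(link_1, link_2)]

# Collector: flatten everything into one sequential partition
collected = [row for link in (link_1, link_2) for part in link for row in part]

print(funneled)    # [['a1', 'a2', 'b1'], ['a3', 'b2', 'b3']]
print(collected)   # ['a1', 'a2', 'a3', 'b1', 'b2', 'b3']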

Page 282: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 4A: Data Partitioning

282

Page 283: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 4A Objectives

� Learn more about the Peek stage

� Learn to invoke different partitioners

� Observe outcome from different partitioners –

Roundrobin, Entire, and Hash

283

Page 284: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Changing Default Settings

� Go to the Designer and use the Configuration File editor to create

a 4 node configuration file as was done in Lab2A.

• You may be able to skip this step if one was already previously

created.

• Be sure to Save and Check the configuration file accordingly

� In the Administrator, select Project Properties and enter the

Environment editor

• Find and set APT_CONFIG_FILE to the 4node configuration file you

just created. This makes it the default for your project.

• Make sure APT_DUMP_SCORE is set to True.

� Click OK button when finished editing Environment Variables.

� Click OK and then Close to exit the Administrator.

284

Page 285: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Peek Stage Behavior

� We will be using the Peek stage throughout this lab.

Keep in mind the following:

• Peek outputs 10 rows per partition

by default in the Director log.

• You can specify for it to output as

many or as few records per

partition as you would like to see

• Peek displays the column names by default – this can be

disabled

• Peek displays all columns by default – you can specify only

the columns you would like to look at.

285

Page 286: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating Lab 4A - Roundrobin

� Open lab3a_batting

� Re-name and Save-As lab4a

� Under the Peek stage Input properties, change the

Partition type to ‘Roundrobin’

• Save and compile

• Run the job

• Open the Director and view

the job log

• Compare your output to the

one on the next slide. Should

be similar. Note that there is no distinct pattern.

286

Page 287: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Roundrobin Partitioning Output

Batting_Peek,0: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:bairdo01 yearID:1985 teamID:DET lgID:AL G:21 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:bandoch01 yearID:1985 teamID:CLE lgID:AL G:73 AB:173 R:11 H:24 DB:4 TP:1 HR:0 RBI:13 SB:0 IBB:0

Batting_Peek,0: playerID:barklje01 yearID:1985 teamID:CLE lgID:AL G:21 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:beattji01 yearID:1985 teamID:SEA lgID:AL G:18 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:beller01 yearID:1985 teamID:BAL lgID:AL G:4 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:berenju01 yearID:1985 teamID:DET lgID:AL G:31 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:bestka01 yearID:1985 teamID:SEA lgID:AL G:15 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:ackerji01 yearID:1985 teamID:TOR lgID:AL G:61 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:allenga01 yearID:1985 teamID:TOR lgID:AL G:14 AB:34 R:2 H:4 DB:1 TP:0 HR:0 RBI:3 SB:0 IBB:0

Batting_Peek,1: playerID:atherke01 yearID:1985 teamID:OAK lgID:AL G:56 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:bakerdo01 yearID:1985 teamID:DET lgID:AL G:15 AB:27 R:4 H:5 DB:1 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,1: playerID:bannial01 yearID:1985 teamID:TEX lgID:AL G:57 AB:122 R:17 H:32 DB:4 TP:1 HR:1 RBI:6 SB:8 IBB:0

Batting_Peek,1: playerID:barojsa01 yearID:1985 teamID:SEA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:beckwjo01 yearID:1985 teamID:KCA lgID:AL G:49 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:bellge02 yearID:1985 teamID:TOR lgID:AL G:157 AB:607 R:87 H:167 DB:28 TP:6 HR:28 RBI:95 SB:21 IBB:6

Batting_Peek,1: playerID:bergmda01 yearID:1985 teamID:DET lgID:AL G:69 AB:140 R:8 H:25 DB:2 TP:0 HR:3 RBI:7 SB:0 IBB:0

Batting_Peek,1: playerID:biancbu01 yearID:1985 teamID:KCA lgID:AL G:81 AB:138 R:21 H:26 DB:5 TP:1 HR:1 RBI:6 SB:1 IBB:0

Batting_Peek,2: playerID:agostju01 yearID:1985 teamID:CHA lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:allenne01 yearID:1985 teamID:NYA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:ayalabe01 yearID:1985 teamID:CLE lgID:AL G:46 AB:76 R:10 H:19 DB:7 TP:0 HR:2 RBI:15 SB:0 IBB:1

Batting_Peek,2: playerID:bakerdu01 yearID:1985 teamID:OAK lgID:AL G:111 AB:343 R:48 H:92 DB:15 TP:1 HR:14 RBI:52 SB:2 IBB:0

Batting_Peek,2: playerID:bannifl01 yearID:1985 teamID:CHA lgID:AL G:34 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:barrema02 yearID:1985 teamID:BOS lgID:AL G:156 AB:534 R:59 H:142 DB:26 TP:0 HR:5 RBI:56 SB:7 IBB:3

Batting_Peek,2: playerID:behenri01 yearID:1985 teamID:CLE lgID:AL G:4 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:beniqju01 yearID:1985 teamID:CAL lgID:AL G:132 AB:411 R:54 H:125 DB:13 TP:5 HR:8 RBI:42 SB:4 IBB:3

Batting_Peek,2: playerID:bernato01 yearID:1985 teamID:CLE lgID:AL G:153 AB:500 R:73 H:137 DB:26 TP:3 HR:11 RBI:59 SB:17 IBB:2

Batting_Peek,2: playerID:birtsti01 yearID:1985 teamID:OAK lgID:AL G:29 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0

Batting_Peek,3: playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4

Batting_Peek,3: playerID:baineha01 yearID:1985 teamID:CHA lgID:AL G:160 AB:640 R:86 H:198 DB:29 TP:3 HR:22 RBI:113 SB:1 IBB:8

Batting_Peek,3: playerID:balbost01 yearID:1985 teamID:KCA lgID:AL G:160 AB:600 R:74 H:146 DB:28 TP:2 HR:36 RBI:88 SB:1 IBB:4

Batting_Peek,3: playerID:barfije01 yearID:1985 teamID:TOR lgID:AL G:155 AB:539 R:94 H:156 DB:34 TP:9 HR:27 RBI:84 SB:22 IBB:5

Batting_Peek,3: playerID:baylodo01 yearID:1985 teamID:NYA lgID:AL G:142 AB:477 R:70 H:110 DB:24 TP:1 HR:23 RBI:91 SB:0 IBB:6

Batting_Peek,3: playerID:bellbu01 yearID:1985 teamID:TEX lgID:AL G:84 AB:313 R:33 H:74 DB:13 TP:3 HR:4 RBI:32 SB:3 IBB:1

Batting_Peek,3: playerID:bentobu01 yearID:1985 teamID:CLE lgID:AL G:31 AB:67 R:5 H:12 DB:4 TP:0 HR:0 RBI:7 SB:0 IBB:2

Batting_Peek,3: playerID:berrada01 yearID:1985 teamID:NYA lgID:AL G:48 AB:109 R:8 H:25 DB:5 TP:1 HR:1 RBI:8 SB:1 IBB:0

Batting_Peek,3: playerID:blackbu02 yearID:1985 teamID:KCA lgID:AL G:33 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

287

Page 288: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Entire Partitioning

� Go back into the Peek stage Input properties, and

change the Partition type to ‘Entire’

• Save and compile the job

• Run the job

• Go back to the Director and view the job log

• Compare your output to the one on the next slide. Note that

each output is identical, since Entire places a copy of the

data into each partition.

288

Page 289: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Entire Partitioning Output

Batting_Peek,0: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:ackerji01 yearID:1985 teamID:TOR lgID:AL G:61 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:agostju01 yearID:1985 teamID:CHA lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0

Batting_Peek,0: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:allenga01 yearID:1985 teamID:TOR lgID:AL G:14 AB:34 R:2 H:4 DB:1 TP:0 HR:0 RBI:3 SB:0 IBB:0

Batting_Peek,0: playerID:allenne01 yearID:1985 teamID:NYA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4

Batting_Peek,0: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:atherke01 yearID:1985 teamID:OAK lgID:AL G:56 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:ackerji01 yearID:1985 teamID:TOR lgID:AL G:61 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:agostju01 yearID:1985 teamID:CHA lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0

Batting_Peek,1: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:allenga01 yearID:1985 teamID:TOR lgID:AL G:14 AB:34 R:2 H:4 DB:1 TP:0 HR:0 RBI:3 SB:0 IBB:0

Batting_Peek,1: playerID:allenne01 yearID:1985 teamID:NYA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4

Batting_Peek,1: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:atherke01 yearID:1985 teamID:OAK lgID:AL G:56 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:ackerji01 yearID:1985 teamID:TOR lgID:AL G:61 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:agostju01 yearID:1985 teamID:CHA lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0

Batting_Peek,2: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:allenga01 yearID:1985 teamID:TOR lgID:AL G:14 AB:34 R:2 H:4 DB:1 TP:0 HR:0 RBI:3 SB:0 IBB:0

Batting_Peek,2: playerID:allenne01 yearID:1985 teamID:NYA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4

Batting_Peek,2: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,2: playerID:atherke01 yearID:1985 teamID:OAK lgID:AL G:56 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:aasedo01 yearID:1985 teamID:BAL lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:ackerji01 yearID:1985 teamID:TOR lgID:AL G:61 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:agostju01 yearID:1985 teamID:CHA lgID:AL G:54 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:aikenwi01 yearID:1985 teamID:TOR lgID:AL G:12 AB:20 R:2 H:4 DB:1 TP:0 HR:1 RBI:5 SB:0 IBB:0

Batting_Peek,3: playerID:alexado01 yearID:1985 teamID:TOR lgID:AL G:36 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:allenga01 yearID:1985 teamID:TOR lgID:AL G:14 AB:34 R:2 H:4 DB:1 TP:0 HR:0 RBI:3 SB:0 IBB:0

Batting_Peek,3: playerID:allenne01 yearID:1985 teamID:NYA lgID:AL G:17 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:armasto01 yearID:1985 teamID:BOS lgID:AL G:103 AB:385 R:50 H:102 DB:17 TP:5 HR:23 RBI:64 SB:0 IBB:4

Batting_Peek,3: playerID:armstmi01 yearID:1985 teamID:NYA lgID:AL G:9 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,3: playerID:atherke01 yearID:1985 teamID:OAK lgID:AL G:56 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

289

Page 290: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hash Partitioning (and Sorting)

� Go back into the Peek stage Input properties, and

change the Partition type to ‘Hash’ and click on Sort

• Select RBI as the key to

Hash and Sort on

• Save and compile the job

• Run the job

• Go back to the Director

and view the job log

• Compare your output to

the one on the next slide. Note that data is grouped and

sorted by RBI. All records with the same RBI value will be

in the same partition.

290

Page 291: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hash and Sort Partitioning Output

Batting_Peek,0: playerID:brunato01 yearID:1994 teamID:ML4 lgID:AL G:16 AB:28 R:2 H:6 DB:2 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:sageraj01 yearID:1995 teamID:COL lgID:NL G:10 AB:3 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:burkeja02 yearID:2005 teamID:CHA lgID:AL G:1 AB:1 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:levinal01 yearID:2005 teamID:SFN lgID:NL G:9 AB:2 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:ortizra01 yearID:2005 teamID:CIN lgID:NL G:30 AB:54 R:1 H:4 DB:2 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:percitr01 yearID:2005 teamID:DET lgID:AL G:26 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:shielsc01 yearID:2005 teamID:LAA lgID:AL G:78 AB:1 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:seleaa01 yearID:2005 teamID:SEA lgID:AL G:21 AB:3 R:1 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:schoesc01 yearID:2005 teamID:TOR lgID:AL G:80 AB:0 R:0 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,0: playerID:washbja01 yearID:2005 teamID:LAA lgID:AL G:29 AB:4 R:1 H:0 DB:0 TP:0 HR:0 RBI:0 SB:0 IBB:0

Batting_Peek,1: playerID:lawleto01 yearID:1986 teamID:SLN lgID:NL G:46 AB:39 R:5 H:11 DB:1 TP:0 HR:0 RBI:3 SB:8 IBB:0

Batting_Peek,1: playerID:liefeje01 yearID:2003 teamID:TBA lgID:AL G:9 AB:25 R:4 H:3 DB:1 TP:0 HR:1 RBI:3 SB:0 IBB:1

Batting_Peek,1: playerID:howitda01 yearID:1991 teamID:OAK lgID:AL G:21 AB:42 R:5 H:7 DB:1 TP:0 HR:1 RBI:3 SB:0 IBB:0

Batting_Peek,1: playerID:perezol01 yearID:2005 teamID:PIT lgID:NL G:20 AB:33 R:1 H:6 DB:0 TP:0 HR:0 RBI:3 SB:1 IBB:0

Batting_Peek,1: playerID:nievejo01 yearID:2001 teamID:ANA lgID:AL G:29 AB:53 R:5 H:13 DB:3 TP:1 HR:2 RBI:3 SB:0 IBB:0

Batting_Peek,1: playerID:humphmi01 yearID:1991 teamID:NYA lgID:AL G:25 AB:40 R:9 H:8 DB:0 TP:0 HR:0 RBI:3 SB:2 IBB:0

Batting_Peek,1: playerID:ortizjo02 yearID:2001 teamID:OAK lgID:AL G:11 AB:42 R:4 H:7 DB:0 TP:0 HR:0 RBI:3 SB:1 IBB:0

Batting_Peek,1: playerID:youngge02 yearID:1994 teamID:SLN lgID:NL G:16 AB:41 R:5 H:13 DB:3 TP:2 HR:0 RBI:3 SB:2 IBB:0

Batting_Peek,1: playerID:lopezme01 yearID:1999 teamID:KCA lgID:AL G:7 AB:20 R:2 H:8 DB:0 TP:1 HR:0 RBI:3 SB:0 IBB:0

Batting_Peek,1: playerID:loretma01 yearID:1995 teamID:ML4 lgID:AL G:19 AB:50 R:13 H:13 DB:3 TP:0 HR:1 RBI:3 SB:1 IBB:0

Batting_Peek,2: playerID:landrce01 yearID:1991 teamID:CHN lgID:NL G:56 AB:86 R:28 H:20 DB:2 TP:1 HR:0 RBI:6 SB:27 IBB:0

Batting_Peek,2: playerID:drabedo01 yearID:1992 teamID:PIT lgID:NL G:35 AB:89 R:5 H:14 DB:3 TP:0 HR:0 RBI:6 SB:0 IBB:0

Batting_Peek,2: playerID:friasha01 yearID:2000 teamID:ARI lgID:NL G:75 AB:112 R:18 H:23 DB:5 TP:0 HR:2 RBI:6 SB:2 IBB:0

Batting_Peek,2: playerID:gilkebe01 yearID:2000 teamID:ARI lgID:NL G:38 AB:73 R:6 H:8 DB:1 TP:0 HR:2 RBI:6 SB:0 IBB:2

Batting_Peek,2: playerID:hamilda02 yearID:2000 teamID:NYN lgID:NL G:43 AB:105 R:20 H:29 DB:4 TP:1 HR:1 RBI:6 SB:2 IBB:0

Batting_Peek,2: playerID:cowenal01 yearID:1986 teamID:SEA lgID:AL G:28 AB:82 R:5 H:15 DB:4 TP:0 HR:0 RBI:6 SB:1 IBB:0

Batting_Peek,2: playerID:deloslu01 yearID:1989 teamID:KCA lgID:AL G:28 AB:87 R:6 H:22 DB:3 TP:1 HR:0 RBI:6 SB:0 IBB:0

Batting_Peek,2: playerID:lopesda01 yearID:1987 teamID:HOU lgID:NL G:47 AB:43 R:4 H:10 DB:2 TP:0 HR:1 RBI:6 SB:2 IBB:2

Batting_Peek,2: playerID:mckayco01 yearID:2004 teamID:SLN lgID:NL G:35 AB:74 R:7 H:17 DB:2 TP:0 HR:0 RBI:6 SB:0 IBB:0

Batting_Peek,2: playerID:zambrca01 yearID:2005 teamID:CHN lgID:NL G:34 AB:80 R:8 H:24 DB:6 TP:2 HR:1 RBI:6 SB:0 IBB:0

Batting_Peek,3: playerID:weaveje01 yearID:2002 teamID:DET lgID:AL G:2 AB:7 R:0 H:2 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:tocajo01 yearID:2001 teamID:NYN lgID:NL G:13 AB:17 R:3 H:3 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:valdeis01 yearID:1997 teamID:LAN lgID:NL G:28 AB:57 R:0 H:5 DB:1 TP:0 HR:0 RBI:1 SB:1 IBB:0

Batting_Peek,3: playerID:valdema01 yearID:1997 teamID:MON lgID:NL G:47 AB:19 R:0 H:2 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:valenfe01 yearID:1997 teamID:SDN lgID:NL G:13 AB:17 R:0 H:3 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:valenfe01 yearID:1997 teamID:SLN lgID:NL G:5 AB:5 R:1 H:1 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:wadete01 yearID:1997 teamID:ATL lgID:NL G:12 AB:12 R:0 H:3 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:germaes01 yearID:2003 teamID:OAK lgID:AL G:5 AB:4 R:0 H:1 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:whitega01 yearID:1997 teamID:CIN lgID:NL G:11 AB:9 R:0 H:1 DB:0 TP:0 HR:0 RBI:1 SB:0 IBB:0

Batting_Peek,3: playerID:greenga01 yearID:1991 teamID:TEX lgID:AL G:8 AB:20 R:0 H:3 DB:1 TP:0 HR:0 RBI:1 SB:0 IBB:0

291

Page 292: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Changing the Sort Order

� The results from the Hash and Sort were fairly boring; after all, it

would be more interesting to see who had the highest RBI.

� Go back into the Peek stage Input properties, and change the

Sort direction to Descending instead of Ascending.

• Change the Sort Direction for the

RBI by right-clicking on it:

• Save and compile the job

• Run the job

• Go back to the Director and view the job log

• Compare your output to the one on the next slide. Note that data is

still grouped and sorted by RBI, however, you can now see the

players with the highest RBI statistics.

292

Page 293: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Hash and Sort Partitioning Output

Batting_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14

Batting_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9

Batting_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15

Batting_Peek,0: playerID:palmera01 yearID:1999 teamID:TEX lgID:AL G:158 AB:565 R:96 H:183 DB:30 TP:1 HR:47 RBI:148 SB:2 IBB:14

Batting_Peek,0: playerID:ortizda01 yearID:2005 teamID:BOS lgID:AL G:159 AB:601 R:119 H:180 DB:40 TP:1 HR:47 RBI:148 SB:1 IBB:9

Batting_Peek,0: playerID:castivi02 yearID:1998 teamID:COL lgID:NL G:162 AB:645 R:108 H:206 DB:28 TP:4 HR:46 RBI:144 SB:5 IBB:7

Batting_Peek,0: playerID:teixema01 yearID:2005 teamID:TEX lgID:AL G:162 AB:644 R:112 H:194 DB:41 TP:3 HR:43 RBI:144 SB:4 IBB:5

Batting_Peek,0: playerID:ramirma02 yearID:2005 teamID:BOS lgID:AL G:152 AB:554 R:112 H:162 DB:30 TP:1 HR:45 RBI:144 SB:1 IBB:9

Batting_Peek,0: playerID:sweenmi01 yearID:2000 teamID:KCA lgID:AL G:159 AB:618 R:105 H:206 DB:30 TP:0 HR:29 RBI:144 SB:8 IBB:5

Batting_Peek,0: playerID:gonzaju03 yearID:1996 teamID:TEX lgID:AL G:134 AB:541 R:89 H:170 DB:33 TP:2 HR:47 RBI:144 SB:2 IBB:12

Batting_Peek,1: playerID:ramirma02 yearID:1999 teamID:CLE lgID:AL G:147 AB:522 R:131 H:174 DB:34 TP:3 HR:44 RBI:165 SB:2 IBB:9

Batting_Peek,1: playerID:thomafr04 yearID:2000 teamID:CHA lgID:AL G:159 AB:582 R:115 H:191 DB:44 TP:0 HR:43 RBI:143 SB:1 IBB:18

Batting_Peek,1: playerID:vaughmo01 yearID:1996 teamID:BOS lgID:AL G:161 AB:635 R:118 H:207 DB:29 TP:1 HR:44 RBI:143 SB:2 IBB:19

Batting_Peek,1: playerID:rodrial01 yearID:2002 teamID:TEX lgID:AL G:162 AB:624 R:125 H:187 DB:27 TP:2 HR:57 RBI:142 SB:9 IBB:12

Batting_Peek,1: playerID:willima04 yearID:1999 teamID:ARI lgID:NL G:154 AB:627 R:98 H:190 DB:37 TP:2 HR:35 RBI:142 SB:2 IBB:9

Batting_Peek,1: playerID:palmera01 yearID:1996 teamID:BAL lgID:AL G:162 AB:626 R:110 H:181 DB:40 TP:2 HR:39 RBI:142 SB:8 IBB:12

Batting_Peek,1: playerID:gonzalu01 yearID:2001 teamID:ARI lgID:NL G:162 AB:609 R:128 H:198 DB:36 TP:7 HR:57 RBI:142 SB:1 IBB:24

Batting_Peek,1: playerID:thomafr04 yearID:1996 teamID:CHA lgID:AL G:141 AB:527 R:110 H:184 DB:26 TP:0 HR:40 RBI:134 SB:1 IBB:26

Batting_Peek,1: playerID:delgaca01 yearID:1999 teamID:TOR lgID:AL G:152 AB:573 R:113 H:156 DB:39 TP:0 HR:44 RBI:134 SB:1 IBB:7

Batting_Peek,1: playerID:bellge02 yearID:1987 teamID:TOR lgID:AL G:156 AB:610 R:111 H:188 DB:32 TP:4 HR:47 RBI:134 SB:5 IBB:9

Batting_Peek,2: playerID:tejadmi01 yearID:2004 teamID:BAL lgID:AL G:162 AB:653 R:107 H:203 DB:40 TP:2 HR:34 RBI:150 SB:4 IBB:6

Batting_Peek,2: playerID:galaran01 yearID:1996 teamID:COL lgID:NL G:159 AB:626 R:119 H:190 DB:39 TP:3 HR:47 RBI:150 SB:18 IBB:3

Batting_Peek,2: playerID:griffke02 yearID:1997 teamID:SEA lgID:AL G:157 AB:608 R:125 H:185 DB:34 TP:3 HR:56 RBI:147 SB:15 IBB:23

Batting_Peek,2: playerID:heltoto01 yearID:2000 teamID:COL lgID:NL G:160 AB:580 R:138 H:216 DB:59 TP:2 HR:42 RBI:147 SB:5 IBB:22

Batting_Peek,2: playerID:mcgwima01 yearID:1998 teamID:SLN lgID:NL G:155 AB:509 R:130 H:152 DB:21 TP:0 HR:70 RBI:147 SB:1 IBB:28

Batting_Peek,2: playerID:mcgwima01 yearID:1999 teamID:SLN lgID:NL G:153 AB:521 R:118 H:145 DB:21 TP:1 HR:65 RBI:147 SB:0 IBB:21

Batting_Peek,2: playerID:griffke02 yearID:1998 teamID:SEA lgID:AL G:161 AB:633 R:120 H:180 DB:33 TP:3 HR:56 RBI:146 SB:20 IBB:11

Batting_Peek,2: playerID:heltoto01 yearID:2001 teamID:COL lgID:NL G:159 AB:587 R:132 H:197 DB:54 TP:2 HR:49 RBI:146 SB:7 IBB:15

Batting_Peek,2: playerID:martied01 yearID:2000 teamID:SEA lgID:AL G:153 AB:556 R:100 H:180 DB:31 TP:0 HR:37 RBI:145 SB:3 IBB:8

Batting_Peek,2: playerID:delgaca01 yearID:2003 teamID:TOR lgID:AL G:161 AB:570 R:117 H:172 DB:38 TP:1 HR:42 RBI:145 SB:0 IBB:23

Batting_Peek,3: playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37

Batting_Peek,3: playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10

Batting_Peek,3: playerID:galaran01 yearID:1997 teamID:COL lgID:NL G:154 AB:600 R:120 H:191 DB:31 TP:3 HR:41 RBI:140 SB:15 IBB:2

Batting_Peek,3: playerID:griffke02 yearID:1996 teamID:SEA lgID:AL G:140 AB:545 R:125 H:165 DB:26 TP:2 HR:49 RBI:140 SB:16 IBB:13

Batting_Peek,3: playerID:gonzaju03 yearID:2001 teamID:CLE lgID:AL G:140 AB:532 R:97 H:173 DB:34 TP:1 HR:35 RBI:140 SB:1 IBB:5

Batting_Peek,3: playerID:ortizda01 yearID:2004 teamID:BOS lgID:AL G:150 AB:582 R:94 H:175 DB:47 TP:3 HR:41 RBI:139 SB:0 IBB:8

Batting_Peek,3: playerID:dawsoan01 yearID:1987 teamID:CHN lgID:NL G:153 AB:621 R:90 H:178 DB:24 TP:2 HR:49 RBI:137 SB:11 IBB:7

Batting_Peek,3: playerID:bondsba01 yearID:2001 teamID:SFN lgID:NL G:153 AB:476 R:129 H:156 DB:32 TP:2 HR:73 RBI:137 SB:13 IBB:35

Batting_Peek,3: playerID:delgaca01 yearID:2000 teamID:TOR lgID:AL G:162 AB:569 R:115 H:196 DB:57 TP:1 HR:41 RBI:137 SB:0 IBB:18

Batting_Peek,3: playerID:giambja01 yearID:2000 teamID:OAK lgID:AL G:152 AB:510 R:108 H:170 DB:29 TP:1 HR:43 RBI:137 SB:2 IBB:6

293

Page 294: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 4B: Data Collection

294

Page 295: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 4B Objective

� Use collectors to process data sequentially

� View difference between SortMerge, Ordered, and

Roundrobin collectors

295

Page 296: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab4B – Collectors

� Open ‘lab4a’ and Save-As ‘lab4b’

� Edit the job and add

a second Peek stage:

� Go to the Advanced Stage

properties for the 2nd Peek

• Change the Execution Mode

to Sequential

• Click OK, save and compile

lab4b.

• Run lab4b and view the

results in the Director log.

296

Page 297: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Auto Collector Output

� Instead of 4 outputs, Sequential_Peek will only

produce 1 output in the Director log. This is because

the 2nd Peek stage was running sequentially.

� Output should look like the following:

� Note that in Auto mode, the Collector maintained the

sort order on RBI

• This suggests that the Framework decided to use the

SortMerge Collector

Sequential_Peek,0: playerID:ramirma02 yearID:1999 teamID:CLE lgID:AL G:147 AB:522 R:131 H:174 DB:34 TP:3 HR:44 RBI:165 SB:2 IBB:9

Sequential_Peek,0: playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37

Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14

Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9

Sequential_Peek,0: playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10

Sequential_Peek,0: playerID:tejadmi01 yearID:2004 teamID:BAL lgID:AL G:162 AB:653 R:107 H:203 DB:40 TP:2 HR:34 RBI:150 SB:4 IBB:6

Sequential_Peek,0: playerID:galaran01 yearID:1996 teamID:COL lgID:NL G:159 AB:626 R:119 H:190 DB:39 TP:3 HR:47 RBI:150 SB:18 IBB:3

Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15

Sequential_Peek,0: playerID:palmera01 yearID:1999 teamID:TEX lgID:AL G:158 AB:565 R:96 H:183 DB:30 TP:1 HR:47 RBI:148 SB:2 IBB:14

Sequential_Peek,0: playerID:ortizda01 yearID:2005 teamID:BOS lgID:AL G:159 AB:601 R:119 H:180 DB:40 TP:1 HR:47 RBI:148 SB:1 IBB:9

297

Page 298: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

SortMerge Collector

� Go back into the Sequential_Peek stage Input

properties, and change the Collector type to SortMerge

• Be sure to set the Sort

direction to Descending

• No need to click on Sort

• Save and compile the job

• Run the job

• Go to the Director and

view the job log

• Compare the output of

the Sequential_Peek stage to the output on the previous slide.

The output should be the same.

298

Page 299: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Ordered Collector

� Go back into the Sequential_Peek stage Input

properties, and change the Collector type to Ordered

• Save and compile the job

• Run the job

• Go back to the Director

and view the job log

• Compare your output to

the one on the next slide.

299

Page 300: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Ordered Collector Output

� Output from the Sequential_Peek should look like the

following:

� Ordered Collector takes all records from the 1st

partition, then the 2nd, then the 3rd, and finally the 4th.

• Compare this output with the output from partition 0 for the

Hash and Sort exercise in lab4a

• If the records were originally range partitioned, then the

resulting output would show up sorted.

Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14

Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9

Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15

Sequential_Peek,0: playerID:palmera01 yearID:1999 teamID:TEX lgID:AL G:158 AB:565 R:96 H:183 DB:30 TP:1 HR:47 RBI:148 SB:2 IBB:14

Sequential_Peek,0: playerID:ortizda01 yearID:2005 teamID:BOS lgID:AL G:159 AB:601 R:119 H:180 DB:40 TP:1 HR:47 RBI:148 SB:1 IBB:9

Sequential_Peek,0: playerID:castivi02 yearID:1998 teamID:COL lgID:NL G:162 AB:645 R:108 H:206 DB:28 TP:4 HR:46 RBI:144 SB:5 IBB:7

Sequential_Peek,0: playerID:teixema01 yearID:2005 teamID:TEX lgID:AL G:162 AB:644 R:112 H:194 DB:41 TP:3 HR:43 RBI:144 SB:4 IBB:5

Sequential_Peek,0: playerID:ramirma02 yearID:2005 teamID:BOS lgID:AL G:152 AB:554 R:112 H:162 DB:30 TP:1 HR:45 RBI:144 SB:1 IBB:9

Sequential_Peek,0: playerID:sweenmi01 yearID:2000 teamID:KCA lgID:AL G:159 AB:618 R:105 H:206 DB:30 TP:0 HR:29 RBI:144 SB:8 IBB:5

Sequential_Peek,0: playerID:gonzaju03 yearID:1996 teamID:TEX lgID:AL G:134 AB:541 R:89 H:170 DB:33 TP:2 HR:47 RBI:144 SB:2 IBB:12

300

Page 301: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Roundrobin Collector

� Go back into the Sequential_Peek stage Input

properties, and change the Collector type to

Roundrobin

• Save and compile the job

• Run the job

• Go back to the Director and view the job log

• Compare your output to the one below:

Sequential_Peek,0: playerID:sosasa01 yearID:1998 teamID:CHN lgID:NL G:159 AB:643 R:134 H:198 DB:20 TP:0 HR:66 RBI:158 SB:18 IBB:14

Sequential_Peek,0: playerID:ramirma02 yearID:1999 teamID:CLE lgID:AL G:147 AB:522 R:131 H:174 DB:34 TP:3 HR:44 RBI:165 SB:2 IBB:9

Sequential_Peek,0: playerID:tejadmi01 yearID:2004 teamID:BAL lgID:AL G:162 AB:653 R:107 H:203 DB:40 TP:2 HR:34 RBI:150 SB:4 IBB:6

Sequential_Peek,0: playerID:sosasa01 yearID:2001 teamID:CHN lgID:NL G:160 AB:577 R:146 H:189 DB:34 TP:5 HR:64 RBI:160 SB:0 IBB:37

Sequential_Peek,0: playerID:gonzaju03 yearID:1998 teamID:TEX lgID:AL G:154 AB:606 R:110 H:193 DB:50 TP:2 HR:45 RBI:157 SB:2 IBB:9

Sequential_Peek,0: playerID:thomafr04 yearID:2000 teamID:CHA lgID:AL G:159 AB:582 R:115 H:191 DB:44 TP:0 HR:43 RBI:143 SB:1 IBB:18

Sequential_Peek,0: playerID:galaran01 yearID:1996 teamID:COL lgID:NL G:159 AB:626 R:119 H:190 DB:39 TP:3 HR:47 RBI:150 SB:18 IBB:3

Sequential_Peek,0: playerID:belleal01 yearID:1998 teamID:CHA lgID:AL G:163 AB:609 R:113 H:200 DB:48 TP:2 HR:49 RBI:152 SB:6 IBB:10

Sequential_Peek,0: playerID:belleal01 yearID:1996 teamID:CLE lgID:AL G:158 AB:602 R:124 H:187 DB:38 TP:3 HR:48 RBI:148 SB:11 IBB:15

Sequential_Peek,0: playerID:vaughmo01 yearID:1996 teamID:BOS lgID:AL G:161 AB:635 R:118 H:207 DB:29 TP:1 HR:44 RBI:143 SB:2 IBB:19

301

Page 302: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 4C: Funnel

302

Page 303: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 4C Objective

� Create a job to illustrate Funnel stage operation

303

Page 304: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab4C – Building the Job

� Create the following flow which consists of

• 3 Row Generator stages

• 1 Funnel stage

• 1 Peek stage

� Enter the following

table definition under the Row Generator stage

properties Output Columns tab.

304

Page 305: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab4C – Saving the Table Definition

� Once entered, save the table definition as shown

below:

� This allows us to re-use the table definition in other

Row Generator stages.

305

Page 306: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab4C – Completing the Job

� Load the saved table definition into the other 2 Row Generator

stages

� Edit the Row Generator

properties such that

Generator_1 generates

100 rows, Generator_2

200 rows, and Generator_3

300 rows.

� Once your job looks like the one on the right, save the job as

lab4c.

• Compile the job

• Run the job

306

Page 307: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab4C – Results

� Verify that the record count going to the Peek stage

is 600 rows (100+200+300):

� Remember, links Input1, Input2, and Input3 get

combined in the Funnel stage, which outputs only 1

link while maintaining same number of partitions.

307

Page 308: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview (Page 10)

2. Parallel Framework Overview (Page 73)

3. Data Import and Export (Page 116)

4. Data Partitioning, Sorting, and Collection (Page 252)

5. Data Transformation and Manipulation (Page 309)

6. Data Combination (Page 364)

7. Custom Components: Wrappers (Page 420)

8. Custom Components: Buildops (Page 450)

9. Additional Topics (Page 477)

10. Glossary (Page 526)

308

Page 309: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Transformation and Manipulation

In this section we will discuss the following:

� Modify

� Switch

� Filter

� Transform

� Related Stages

309

Page 310: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Modify Stage

� Modify stage is useful and effective for light

transformations:

• Drop columns – permanently remove columns that are not

needed from the record structure

• Keep columns – specify which columns to keep (opposite of

drop columns)

• Null handling – specify alternative null representation

• Substring – obtain only a subset of bytes from a Char

column.

• Change data types – alter column data types. Data must be

compatible between data types.

o For example, a column of type Char[3] with a

value of ‘ABC’ cannot be changed to become

an Integer type.

310

Page 311: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Modify Stage Properties

� The Modify stage does not offer much support in terms of correct syntax to use for the transformations it supports.

� To successfully use the Modify stage, you will need to consult the user's guide for the correct syntax.

� In general, the format is as follows:

• DROP columnname [, columnname]

• KEEP columnname [, columnname]

• new_columnname [:new_type] = [explicit_conversion_function] old_columnname

311

Page 312: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Null Handling Using Modify

Source Field | Destination Field | Result
Not Nullable | Not Nullable      | Source value propagates to destination
Not Nullable | Nullable          | Source value propagates to destination
Nullable     | Not Nullable      | If the source value is not null, it propagates. If the source value is null, an error occurs unless the Modify stage is used to handle the null representation
Nullable     | Nullable          | Source value propagates to destination

312

Page 313: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Modify Stage Examples

� Example transformations available in the Modify

stage:

• playerID = substring[0,3] (playerID) will start at the first

character and grab the first 3 bytes.

• startDate = year_from_date (startDate) will only retain the

year value and discard month and day.

• salary = handle_null (salary,0000000.00) will populate

‘0000000.00’ into the salary column for any incoming salary

column containing a null.

• salary2 = string_from_decimal (salary) will copy the value

from the salary column into the salary2 column. If salary2

did not previously exist, it will create it.

313

Page 314: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Switch Stage

� Switch stage is useful for

splitting up records and

sending them down different

links based on a key value.

• Similar in behavior to

switch/case statement in C

• Must provide a

Selector field to perform

the switch operation

• Must specify case value

and corresponding link

number (starts at 0)
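For example (a hypothetical mapping; the exact property layout is in the stage editor): with Selector = lgID and the default User-defined Mapping mode, Case entries such as AL=0 and NL=1 would send records whose lgID is 'AL' down output link 0 and records whose lgID is 'NL' down output link 1.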

314

Page 315: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Switch Stage Properties

� Selector Mode

• User-defined Mapping (default) – user provides explicit

mapping for values to outputs

• Auto – can be used when there are as many distinct selector

values as output links.

• Hash – input rows are hashed on the selector column

modulo the number of output links and assigned to an output

link accordingly. In this case, the selector column must be of

a type that is convertible to Unsigned Integer and may not be

nullable.

315

Page 316: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Switch Stage Properties (Continued)

� If Not Found

• Fail (default) – an invalid selector value will cause the job to

fail

• Drop – drops the offending record containing an invalid

selector value

• Output – sends offending record containing an invalid

selector value to a reject link.

� Discard Value

• Specifies an integer value of the selector column, or the

value to which it was mapped using Case, that causes a row

to be dropped (not rejected).

• Optional
316

Page 317: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Filter Stage

� Filter stage acts like the WHERE clause in a SQL SELECT

statement

• Supports 1 input, and multiple output links, similar to Switch stage

• Can attach a reject link

• Valid WHERE clause operations:

o six comparison operators: =, <>, <, >, <=, >=

o true / false

o is null / is not null

o like 'abc' (the second operand must be a regular expression)

o between (for example, A between B and C is equivalent to B <= A and

A <= C)

o is true / is false / is not true / is not false

o and / or / not
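A sketch of what the Where clauses might look like, using the ERA example shown two slides ahead (link numbers assume the Output Link Ordering used there):
    Where = ERA < 3.25                      (Output link = 0)
    Where = ERA >= 3.25 and ERA <= 5.00     (Output link = 1)
    Where = ERA > 5.00                      (Output link = 2)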

317

Page 318: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Other Filter Stage Properties

� Output Rejects – set to False by default. When set

to True, values which do not meet one of the Filter

criteria will be sent to a reject link

� Output Row Only Once – set to False by default.

When set to True, records are only output to the first

successful WHERE clause, whereas when set to

False the record will be output to all successful

WHERE clauses.

� Nulls Value – Determines whether a null value is

treated as 'Greater Than' or 'Less Than' other values.

318

Page 319: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Filter Stage Example

� In this example, we filter on the

pitcher’s ERA statistics

• ERA values below 3.25 will be sent

down the first link

• ERA values between

3.25 and 5.00 will be

sent down the second

link

• ERA values greater

than 5.00 will be sent

down the third link

319

Page 320: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage

� Transformer stage provides an extensible interface for

defining data transformations

• Supports 1 input and

multiple outputs,

including reject

• Different user interface

from other stages

o Source to target

mapping is primary

interface

• Contains several

pre-built transformations

o Transformations can be

combined

• Supports Constraints – very similar to Filter stage functionality

(Screenshot: source columns mapped to target columns)

320

Page 321: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage Features

� As mentioned on the previous slide, the Transform

stage contains many pre-built transformations.

• To access these, do the following:

1. Double-click in the column derivation area to bring up the derivation editor

2. Right-click to access the menu.

3. If the cursor is at the beginning of the line when you right-click, you will get the following:

4. If the cursor is at the end of the line when you right-click, you will get the following:

select Function to access pre-built transforms
321

Page 322: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage Features (Continued)

� The Operand Menu provides easy access to

other features besides the many pre-built

transformations.

• DS Macro – returns job related information

• DS Routine – calls a function from a UNIX shared library, and may

be used anywhere an expression can be defined.

• Job Parameter – insert any pre-defined DataStage Job Parameter

• Input Column – provides a list of columns from the input link to

choose from

• Stage Variable – globally defined variable that can be derived

using any supported transformation, and then re-used or

referenced within any Derivation in the Transformer stage.

322

Page 323: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage Features (Continued)

• System Variable –Framework level information that can be

referenced as part of the derivation or constraint

o @FALSE The value is replaced with 0.

o @TRUE The value is replaced with 1.

o @INROWNUM Input row counter.

o @OUTROWNUM Output row counter (per link).

o @NUMPARTITIONS The total number of partitions for the stage.

o @PARTITIONNUM The partition number for the particular instance.

• String – enter a string value which will become a hardcoded value

assigned to the column

• () Parentheses – inserts a pair of parentheses into the derivation

field.

• If Then Else – inserts “If Then Else” into the derivation field.

323

Page 324: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage – Stage Variables

� Stage Variables offer the following advantages:

• Similar to global program variables

o Scope is limited to the Transformer

• Use to simplify derivations and constraints

• Use to avoid duplicate coding

• Retain values across reads

• Use to accumulate values and compare current values with

prior reads
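
A small sketch of the "retain values across reads" idea (all names are hypothetical; Stage Variables are evaluated top to bottom for each row):
    Stage Variable svIsNewKey:  If inLink.playerID <> svPrevKey Then 1 Else 0
    Stage Variable svPrevKey:   inLink.playerID
Because svPrevKey still holds the previous row's value when svIsNewKey is evaluated, svIsNewKey flags the first row of each new playerID group.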

324

Page 325: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage – Derivations

� Derivations can be applied to each output column or

Stage Variable.

• Specifies the value to be moved to a output column

• Every output column must have a derivation.

o This can include a 1 to 1 map of the input

value/column.

• An output column does not require an input column

o Can hard code specific values

o Can include derivations based on built-in or

user-defined functions

325

Page 326: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

User Defined Transformer Routines

� The Transformer stage provides an interface for incorporating user created functions written in C++

• External Function Type: This calls a function from a UNIX shared library and may be used anywhere an expression can be defined. Any external functions defined appear in the expression editor operand menu under Parallel Routines.

• External Before/After Routine Type: This calls a routine from a UNIX shared library, and can be specified in the Triggers page of a transformer stage Properties dialog box.

� Note: Functions must be compiled with a C++ compiler (not a C compiler).

326

Page 327: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage Sample Derivations

• Stage Variable derivation expands AL to “American League” and NL to “National League”, stores value to Stage Variable called league

• league is mapped to newly introduced league_name column on both outputs – transform defined only once.

• Constraint separates “AL” records from “NL” records

• yearID column is mapped to year_in_league column

• DownCase() makes all characters lower case
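
Spelled out, the entries described above might look roughly like this (a sketch only; the input link name inLink and the lower-cased column are assumptions, not taken from the actual job):
    Stage Variable league derivation:    If inLink.lgID = "AL" Then "American League" Else "National League"
    Constraint on the AL output link:    inLink.lgID = "AL"
    league_name derivation (both links): league
    year_in_league derivation:           inLink.yearID
    Lower-casing derivation:             DownCase(inLink.nameLast)   (column name hypothetical)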

Job design:

327

Page 328: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage Constraints

� Constraints are used to filter out records

• A constraint applies to the entire record.

• A constraint specifies a condition under which incoming rows

of data will be written to an output link

• If no constraint is specified, all records are passed through

the link

• Constraints are defined for each individual output link

o Checking the ‘Reject Row’ box will cause that link to carry
only those records that did not meet the conditions
specified in the constraints on the other links.

o No constraint is required for ‘Reject’ link.

328

Page 329: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transformer Stage Execution Order

� Within the Transformer stage, there is an order of

execution:

1. Stage Variables derivations are

executed first

2. Constraints are executed before

derivations

3. Column derivations in earlier links

are executed before later links

4. Derivations in higher columns are

executed before lower columns


329

Page 330: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transformer Null Handling

Source Field Destination Field Result

Nullable Not Nullable If source value is not null, source value

propagates. If source value is null, an

error occurs, unless the Modify stage is

used to handle null representation

� We had previously discussed null handling via the

Modify stage. Specifically, null handling should

occur for the following condition:

� The Transformer stage can also be used to perform

null handling.

330

Page 331: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transformer Null Handling (Continued)

� The Transformer provides several built-in null

handling functions:

• IsNotNull() – returns ‘true’ if expression or input

value does not evaluate to Null

• IsNull() – returns ‘true’ if expression or input

value does evaluate to Null

• NullToEmpty() – sets input value to an empty

string if it was Null on input

• NullToValue() – sets input value to a specific

value if it was Null on input

• NullToZero() – sets input value to zero if it was Null on input

• SetNull() – assign a Null to the target column
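
For instance (a sketch; inLink and salary are assumed names), a derivation that replaces a null salary with zero could be written either way:
    NullToValue(inLink.salary, 0)
    If IsNull(inLink.salary) Then 0 Else inLink.salary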

331

Page 332: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transformer Type Conversions

� Similar to the Modify stage, the Transformer also

offers a variety of built-in type conversion functions

• Some type conversions are handled

automatically by the framework, as

indicated by ‘d’ in the table.

• An ‘m’ indicates

manual conversion.

• Manual conversions

available in the

Transformer are

listed here.

332

Page 333: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Other Useful Transformer Functions

Date and Time Transformations

� Date, Time, and Timestamp

manipulations

String Transformations

� Trim off spaces, characters

� Compare values

� Pad characters

� Etc…

333

Page 334: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Transform Stage Usage Sample

� The best way to learn the Transformer is to use it

extensively.

• Use the Row Generator to generate data to test against

• Test the built-in transformations against the generated data

• Insert a Peek stage before and after the Transformer to

compare before and after results

o BeforePeek will show the records before they

are transformed

o AfterPeek will show the records as a result of

the transformations applied in the Transformer

stage.
334

Page 335: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages

� Change Capture – Compares two input data sets, denoted before and after, and outputs a single data set whose records represent the changes made to the before data set to obtain the after data

set. An extra column is put on the output dataset, containing a change code with values encoding the four actions: insert, delete, copy, and edit.

� Change Apply – Reads a record from the change data set (produced by Change Capture) and from the before data set, compares their key column values, and acts accordingly: If the before keys

come before the change keys in the specified sort order, the before record is copied to the output. The change record is retained for the next comparison. If the before keys are equal to the change keys, the behavior depends on the code in the change_code column of the change record:

• Insert: The change record is copied to the output; the stage retains the same before record for the next comparison. If key columns are not unique, and there is more than one consecutive insert with the same key, then Change Apply applies all the consecutive inserts before existing records. This record order may be different from the after data set given to Change Capture.

• Delete: The value columns of the before and change records are compared. If the value columns are the same or if the Check Value Columns on Delete is specified as False, the change and before records are both discarded; no record is transferred to the output. If the value columns are not the same, the before record is copied to the output and the stage retains the same change record for the next comparison. If key columns are not unique, the value columns ensure that the correct record is deleted. If more than one record with the same keys have matching value columns, the first-encountered record is deleted. This may cause different record ordering than in the after data set given to the Change Capture stage. A warning is issued and both change record and before record are discarded, i.e. no output record results.

• Edit: The change record is copied to the output; the before record is discarded. If key columns are not unique, then the first before record encountered with matching keys will be edited. This may be a different record from the one that was edited in the after data set given to the Change Capture stage. A warning is issued and the change record is copied to the output; but the stage retains the same before record for the next comparison.

• Copy: The change record is discarded. The before record is copied to the output

335

Page 336: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

� Difference – Takes 2 presorted datasets as inputs and outputs a single data set whose records represent the difference between them. The comparison is performed

based on a set of difference key columns. Two records are copies of one another if they have the same value for all difference keys. You can also optionally specify change values. If two records have identical key columns, you can compare the value columns to see if one is an edited copy of the other. The stage generates an extra column, DiffCode, which indicates the result of each record comparison.

• The Difference stage is similar, but not identical, to the Change Capture stage. The Change Capture stage is intended to be used in conjunction with the Change Apply stage. The Difference stage outputs the before and after rows to the output data set, plus a code indicating if there are differences. Usually, the before and after data will have the same column names, in which case the after data set effectively overwrites the before data set and so you only see one set of columns in the output. If your before and after data sets have different column names, columns from both data sets are output; note that any key and value columns must have the same name.

� Compare – Performs a column-by-column comparison of records in two presorted input data sets. You can restrict the comparison to specified key columns. The

Compare stage does not change the table definition, partitioning, or content of the records in either input data set. It transfers both data sets intact to a single output data set generated by the stage. The comparison results are also recorded in the output data set. It is recommended that you use runtime column propagation (RCP) in this stage to allow DataStage to define the output column schema. The stage outputs three columns:

• Result: Carries the code giving the result of the comparison.

• First: A subrecord containing the columns of the first input link.

• Second: A subrecord containing the columns of the second input link.

Relevant Stages

336

Page 337: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages

� Encode – use any command available from the Unix command line to encode / mask

data. The stage converts a data set from a sequence of records into a stream of raw binary data. An encoded data set is similar to an ordinary one, and can be written to a data set stage. You cannot use an encoded data set as an input to stages that perform column-based processing or re-order rows, but you can input it to stages such as Copy. You can view information about the data set in the data set viewer, but not the data itself. You cannot repartition an encoded data set.

� Decode – use any command available from the Unix command line to decode / unmask data. It converts a data stream of raw binary data into a data set. As the

input is always a single stream, you do not have to define meta data for the input link.

� Compress – use either Unix compress or GZip utility to compress data. It converts a data set from a sequence of records into a stream of raw binary data. A compressed

data set is similar to an ordinary data set and can be stored in a persistent form by a DataSet stage. However, a compressed data set cannot be processed by many stages until it is expanded. Stages that do not perform column-based processing or reorder the rows can operate on compressed data sets. For example, you can use the copy stage to create a copy of the compressed data set.

� Expand – use either Unix uncompress or GZip utility to de-compress data. It converts a previously compressed data set back into a sequence of records from a stream of

raw binary data.

337

Page 338: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Relevant Stages

� Surrogate Key – generates a unique key column for an existing data set. Can specify certain characteristics of the key sequence. The stage

generates sequentially incrementing unique integers from a given starting point. The existing columns of the data set are passed straight through the stage. Can be executed in parallel.

� Column Generator – generates additional column(s) of data and appends it onto an incoming record structure

� Head – outputs first N records in each partition. Can optionally select records from certain partitions or skip a certain number of records.

� Tail – outputs last N records in each partition. Can optionally select records from certain partitions.

� Sample – outputs a sample of the incoming data. Can be configured to perform either a percentage (random) or periodic sampling. Can

distribute samples to multiple output links.

338

Page 339: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 5A: Modify vs Transformer

339

Page 340: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 5A Objectives

� Learn more about the Modify and Transformer stage

� Use the Modify stage to perform date field

manipulations

� Use the Transformer stage to perform the same date

field manipulations

� Compare results by using the Compare stage

� Verify results are the same

340

Page 341: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – High Level Overview

� As a 1st step, lay out the following flow:

� Make sure to label the stages and links accordingly.

� Use the Master.ds dataset in the Dataset stage, which was

created in lab3b_master

• In the Output Columns tab, click on Load to load the Master table

definition (previously saved in lab3).

� Head and Copy stages will use their default properties

• Head will only keep first 10 records per partition

• Copy creates an identical copy of the data for each output link

341

Page 342: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Job Design

� In the Modify stage, enter the following specifications:

� This splits the date column into separate Year, Month, and Day
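
The specifications themselves appear in a screenshot in the original deck; a sketch of what they would roughly look like, using the date conversion functions from the Modify section (treat the exact spelling as an assumption and check the user's guide):
    debutYear:int32 = year_from_date(debut)
    debutMonth:int32 = month_from_date(debut)
    debutDay:int32 = month_day_from_date(debut)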

� In the Output Columns tab for Modify, click on Load to load the Master

table definition again.

Add the following 3 additional column definitions:
• debutYear – Integer
• debutMonth – Integer
• debutDay – Integer
Modify will create these as part of the transformations defined by the specifications.

Source to target mapping – tells Modify to ‘keep’ the debut column.

342

Page 343: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – SimpleTransform Mappings

� In the SimpleTransform stage, map the following:

Create a new Integer column and name it ‘debutID’

343

Page 344: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – SimpleTransform Transformation

� Next, double-click in the Derivation area next to

debutID:

� Enter the following Derivation for the target debutID

column:

• This will calculate a debutID field based on the Year, Month,

and Date that the baseball player played his first game

• Note: ‘fromModify’ in the derivation is the name of the input

link. If you used a different link name, then use that one.

Double-click here to access Derivations

fromModify.debutYear+fromModify.debutDay*fromModify.debutMonth

344

Page 345: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – ComplexTransform Mappings

� In the ComplexTransform stage, map the following:

Create a new Integer column and name it ‘debutID’

345

Page 346: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – ComplexTransform Stage Variable

� Next, right-click under Stage Variables and select ‘Insert New Stage

Variable’

• A new Stage Variable named StageVar should be created

� Double-click the Derivation field next to StageVar and enter the

following Derivation:

• Each of these functions can be accessed by right-clicking and selecting

‘Function…’ from the menu. Look under Date & Time category.

• Column names can be accessed by right-clicking and selecting ‘Input

Column…’ or typing in ‘toTransformer.’ and selecting from the list.

• You can keep it all on one line. Hit Enter when finished.

• Note: ‘toTransformer’ in the derivation is the name of the input link. If you

used a different link name, use that one.

YearFromDate(toTransformer.debut)

+MonthDayFromDate(toTransformer.debut)

*MonthFromDate(toTransformer.debut)

346

Page 347: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – ComplexTransform Transformation

� Click on ‘StageVar’ and drag it into the Derivation

field for debutID.

When finished, your Derivation for debutID should appear similar to the one on the left.

Stage Variables allow for the same derivation to be mapped to many different fields, but only calculated once.

In this example, we are only mapping it to debutID, but we could have mapped it to any other Integer column.

347

Page 348: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Save, Compile, Run

� Once finished editing the transformations, save the job

as ‘lab5a.’

� Compile the job

� Run the job and open the Director to view the log

• Compare the Peek output from the Modify and

SimpleTransformer combination against the Peek output from

the ComplexTransformer

• The debutID’s for a given playerID should be identical.

• You can limit the number of columns output from the Peek (via

Peek stage properties) to make it easier to read in the output log.

� Once the job runs correctly, save a copy as ‘lab5a_2.’

348

Page 349: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Compare the Outputs

� Instead of manually comparing the 2 outputs, there is

an easier way to do this.

� Modify lab5a to look like the following:

349

Page 350: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Compare Stage

� In the Compare stage properties, setup the following

options:

• We are assuming that all records will be unique in the

Master file

• We are comparing records based on playerID and debutID.

Any record with the same playerID and debutID value will be

compared.

• If a different record shows up, then the job will abort.

� For both input links, Hash and Sort on playerID and

debutID.

350

Page 351: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Save, Compile, Run

� Once finished editing, save the job, still keeping it as

‘lab5a.’

� Compile the job

� Run the job and open the Director to view the log

• Did the job run to completion or did it fail?

• Assuming that the job was assembled correctly, then the job

should have run to completion successfully with no errors or

warnings, implying that the Compare worked and all records

are identical

• In the Peek output, you will notice a new column called ‘result’.

A value of ‘0’ in ‘result’ indicates that the records were identical.

� Once the job runs correctly, save a copy as lab5a_2.

351

Page 352: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Using Full Data Volume

� Now let’s test against the full data volume by

removing the Head stage:

� Be sure to either retain the link labelled ‘toCopy’ or

rename the link going into the Copy stage to ‘toCopy’

• Not doing this will ‘break’ the source to target mapping on

the output of the Copy stage.

352

Page 353: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5a – Save, Compile, Run

� Once finished editing, save the job, still keeping it as

‘lab5a’

� Compile the job

� Run the job and open the Director to view the log

• Did the job run to completion or did it fail?

• The job should have run to completion successfully with no

errors or warnings, implying that the Compare worked and

all records are identical

• Final output should show 3817 records processed. This can

be seen from the Performance Monitor. It can be accessed

from the Director -> Tools -> New Monitor.

353

Page 354: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 5B: Filter vs Switch vs Transformer !

354

Page 355: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 5B Objective

� Separate records in the Master file into 3 groups:

• Records where birthYear <= 1965

• Records where birthYear between 1966 and 1975

• Records where birthYear >= 1976

� Use Filter, Switch, and Transformer stages to

accomplish the same task and achieve same results!

355

Page 356: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b – The BIG Picture!

Here's how we will achieve this!

Use the same Master.ds and table definition as lab5a.

When you are done, this is what your job will look like.

Note that stages can be resized!

356

Page 357: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b – Design Strategy

Here's the best way to approach building such a complex job:
1. Build little by little
2. Test progress along the way

In this example, you would first build #1, test it, then add #2, test it, then add #3, and test it. This way you minimize your debug efforts!


357

Page 358: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b – Filter Properties

� In the Filter stage properties, you will need to define 3

outputs – one for each birthYear range:

Note that output link numbering starts at ‘0’.

How do you figure out which link number corresponds to which output link??? Solution: Click the Output Link Ordering tab.

358

Page 359: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b – Switch Properties

� In the Switch stage properties, you will need to define

at least 2 outputs – optionally a 3rd:

• While cumbersome, you will need to

explicitly define which values flow down

which output link.

• What about values where

birthYear >= 1976 ? You have 2 options:

o Explicitly define the Case mappings

o Send all values where birthYear >= 1976

down a reject link as shown on the right

Note: Reject links do not allow mapping!

• Note that output link numbering here also

starts at ‘0’.
359

Page 360: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b – Transformer Properties

� In the Transformer you will

need to have 3 outputs links,

each with a unique

constraint:

• birthYear <= 1965

• birthYear >=1966 And

birthYear<=1975

• birthYear >= 1976

� Note that column names

need to be preceded by

input link names.

� These Constraint Derivations

could be entered by either

manually typing it in or using

the GUI interface.
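
Written out in full (assuming the input link is named toTransformer, as in lab5a; use your own link name if it differs), the three constraints might read:
    toTransformer.birthYear <= 1965
    toTransformer.birthYear >= 1966 And toTransformer.birthYear <= 1975
    toTransformer.birthYear >= 1976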

360

Page 361: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b – Save, Compile, Run

� As you complete each section, save the job ‘lab5b’

� Compile and run the job.

� For each date range, you should consistently see the

following record counts (also shown on next page)

• 1955 to 1965: 862

• 1966 to 1975: 1903

• 1976 to 1986: 1052

• Total records: 3817

� There should be no warnings or errors reported.

361

Page 362: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab5b - Results

Verify that your record count results match those shown on the right.

Also make sure the results are consistent for the Filter, Switch, and Transformer.

362

Page 363: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview

2. Parallel Framework Overview

3. Data Import and Export

4. Data Partitioning, Sorting, and Collection

5. Data Transformation and Manipulation

6. Data Combination

7. Custom Components: Wrappers

8. Custom Components: Buildops

9. Additional Topics

10. Glossary

363

(Section start pages: 10, 73, 116, 252, 309, 364, 420, 450, 477, 526)

Page 364: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Data Combination

In this section we will discuss the following:

� Join

• InnerJoin

• LeftOuterJoin

• RightOuterJoin

• FullOuterJoin

� Merge

� Lookup

• DataSet

• RDBMS

� Aggregator

364

Page 365: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Performing Joins

� There are 4 different types of joins explicitly supported by DataStage:

• InnerJoin (default)

• LeftOuterJoin

• RightOuterJoin

• FullOuterJoin

� To illustrate the functionality, we will use the following 2 sets of record

inputs. Note that all columns in this example are of character type.

Left Input:                    Right Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

365

Page 366: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

InnerJoin

� InnerJoin will result in records containing the columns from both

inputs where KeyField is an exact match

LastName KeyField FirstName

Clemens 456 Roger

Ryan 789 Nolan

Ryan 789 Ken

InnerJoin Output

Left Input:                    Right Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

366

Page 367: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

LeftOuterJoin

� LeftOuterJoin will result in all records from the left input and only the

records from the right input where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Ryan 789 Ken

Maddux 012

Clemens 456 Roger

LeftOuterJoin Output

What happened here? Because there was no match, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

Left Input:                    Right Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

367

Page 368: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

RightOuterJoin

� RightOuterJoin will result in all records from the right input and only the

records from the left input where KeyField is an exact match

LastName KeyField FirstName

123 Randy

Clemens 456 Roger

Ryan 789 Nolan

Ryan 789 Ken

RightOuterJoin Output

What happened here? Because there was no match, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

Left Input:                    Right Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

368

Page 369: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Note: A blank is populated where there’s no match. If the field was numeric type, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

FullOuterJoin

� FullOuterJoin will result in all

records from both inputs and

the records where KeyField

is an exact match

LastName leftRec_KeyField rightRec_KeyField FirstName

Ryan 789 789 Nolan

Ryan 789 789 Ken

Maddux 012

123 Randy

Clemens 456 456 Roger

FullOuterJoin Output

Left Input:                    Right Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

369

Page 370: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Join Stage Properties

� Join stage

• Must have at least 2 inputs, but supports more

• Must specify at least 1 join key

o Join key(s) must have same

name on both inputs

• All inputs must be pre hashed

and sorted by the join key(s).

• No reject capability

o Need to perform

post-processing to detect

failed matches (check for nulls, blanks, or 0’s) – applicable for

LeftOuterJoin, RightOuterJoin, and FullOuterJoin.

• Always use Link Ordering to differentiate between Left and Right

input Links!

o Label your links accordingly

370

Page 371: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Merge Stage Properties

� Merge combines a Master and an Update record based on matching key

field(s)

• Must have at least 2 inputs – 1 Master input and 1+ Update input(s)

• Can have multiple Updates, but only

1 Master input

o Update records are consumed

once matched with Master record

o Master records must be duplicate free

o Update records can contain duplicates

• Key needs to have same field name

• All inputs must be pre hashed and

sorted by the merge key(s)

• Supports optional reject record

processing – simply attach reject link

• Always use the Link Ordering tab to verify correct Master and Update order

o Label your links accordingly

371

Page 372: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Merge Stage

� To illustrate the Merge stage functionality, we will use the following 2 sets

of record inputs. Note that all columns in this example are of character

type.

Master Input:                  Update Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

372

Page 373: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Merge Stage Action – Keep Unmatched

� Merge with ‘Unmatched Masters Mode = Keep’

LastName KeyField FirstName

Ryan 789 Nolan

Ryan 789 Ken

Maddux 012

Clemens 456 Roger

Merge Output

Master Input:                  Update Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Merge Properties:

What happened here? Because there was no Update, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

KeyField FirstName

123 Randy

Optional Reject Output

373

Page 374: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Merge Stage Action – Drop Unmatched

� Merge with ‘Unmatched Masters Mode = Drop’

LastName KeyField FirstName

Ryan 789 Nolan

Ryan 789 Ken

Clemens 456 Roger

Merge Output

Master Input:                  Update Input:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

Merge Properties:

KeyField FirstName

123 Randy

Optional Reject Output

LastName KeyField

Maddux 012

Dropped Record

374

Page 375: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Stage Properties

� Lookup maps record(s) from the lookup table to an input

record with matching lookup key field(s)

• Must have at least 2 inputs – 1 Primary input and

1+ Lookup Table input(s)

• Can have multiple Lookup Tables,

but only 1 Primary input

o Lookup tables are expected to be

duplicate free, but duplicates are allowed

o Update records can contain duplicates

• Inputs do not need to be partitioned or

sorted

• Lookup Tables are pre-loaded into

shared memory

o Always make sure that your lookup table fits in available shared memory

• Uses interface very similar to that of the Transformer stage

375

Page 376: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Stage Properties (Continued)

� Lookup stage supports conditional lookups

• Derivations for conditional lookup entered similar to Transformer derivations:

• Supports various error handling modes:

o Continue – pass input records that fail the lookup and/or condition through to the

output.

o Drop – permanently drop records that fail the lookup and/or condition

o Fail – default option, causes entire job to fail if lookup and/or condition are not met

o Reject – output records that fail the lookup and/or condition to a reject link

Allow Duplicatesin Lookup Table

376

Page 377: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

How to Perform a Lookup

Step #1: Identify the lookup key(s). Lookup key(s) can be designated by checking the ‘key’ column.

Step #2: Map the input key to the corresponding lookup key. Field names do not need to match.

Step #3: Map the input columns from both the input and the lookup table to the output.

377

Page 378: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Table Sources

� Lookup tables can be from virtually any source

• The reference link going into the Lookup stage can be from a

larger flow, not just a data source such as flat files or parallel

Datasets

• Lookup Filesets

o Allows lookup tables to be persistent

o Must pre-define lookup key(s)

o Creates a persistent indexed lookup table

o Uses a .fs extension

• Sparse Lookups – RDBMS

o Database lookup – instead of loading the lookup table into

shared memory, the lookup table remains inside the database

o Good for situations where lookup table already resides inside

the database and is much larger than the primary input data
378

Page 379: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Performing Sparse Lookups

� Sparse Lookups

• Supported for the following Enterprise stages:

o Oracle, DB2, Sybase, and ODBC

• Lookup table source must be one of the

supported RDBMS stages above

o Specify ‘Lookup Type = Sparse’ in the

RDBMS stage

o Optionally specify your

own lookup SQL by

using the User Defined

SQL option instead of

Table Read Method

• Lookup stage still works

the same way as before

379

Page 380: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Example – Continue Mode (No Dups)

� Lookup with Lookup Failure Mode set to Continue and Duplicates to

false will result in all records from the Primary input and only the

records from the lookup table where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Maddux 012

Clemens 456 Roger

Lookup with Continue Mode Results

What happened here? Because the lookup failed, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

Primary Input:                 Lookup Table:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

380

Page 381: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Example – Continue Mode (With Dups)

� Lookup with Lookup Failure Mode set to Continue and Duplicates to

true will result in all records from the Primary input and only the records

from the lookup table where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Ryan 789 Ken

Maddux 012

Clemens 456 Roger

Lookup with Continue Mode Results

What happened here? Because the lookup failed, a blank is populated instead. If the field was numeric, a zero would have been inserted. If the field was nullable, then a null would have been inserted.

Primary Input:                 Lookup Table:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

381

Page 382: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Example – Drop Mode (No Dups)

� Lookup with Lookup Failure Mode set to Drop and Duplicates to false

will result in only records from the Primary input and the corresponding

records from the lookup table where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Clemens 456 Roger

Lookup with Drop Mode Results

Primary Input:                 Lookup Table:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

382

Page 383: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Example – Drop Mode (With Dups)

� Lookup with Lookup Failure Mode set to Drop and Duplicates to true

will result in only records from the Primary input and the corresponding

records from the lookup table where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Ryan 789 Ken

Clemens 456 Roger

Lookup with Drop Mode Results

Primary Input:                 Lookup Table:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

383

Page 384: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Example – Reject Mode (No Dups)

� Lookup with Lookup Failure Mode set to Reject and Duplicates to false

will result in only records from the Primary input and the corresponding

records from the lookup table where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Clemens 456 Roger

Lookup with Reject Mode Results

Primary Input:                 Lookup Table:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

LastName KeyField

Maddux 012

Reject Record

384

Page 385: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lookup Example – Reject Mode (With Dups)

� Lookup with Lookup Failure Mode set to Reject and Duplicates to true

will result in only records from the Primary input and the corresponding

records from the lookup table where KeyField is an exact match

LastName KeyField FirstName

Ryan 789 Nolan

Ryan 789 Ken

Clemens 456 Roger

Lookup with Reject Mode Results

Primary Input:                 Lookup Table:
LastName   KeyField            KeyField   FirstName
Ryan       789                 123        Randy
Maddux     012                 456        Roger
Clemens    456                 789        Nolan
                               789        Ken

LastName KeyField

Maddux 012

Reject Record

385

Page 386: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Aggregator Stage Properties

� Aggregator stage performs aggregations based on

user-defined grouping criteria

• Must specify at least 1 key

for grouping criteria

• Input data must be minimally

hash partitioned by the

grouping key.

• Optionally define a column or

columns for calculation

o Over 15 aggregation functions

available

o If aggregation function is not selected, all functions will be

performed.

386

Page 387: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Aggregator Stage Properties (Continued)

� Aggregation types

• Calculation – performs an aggregation function against one or

more selected columns

• Re-Calculation – Similar to Calculate but performs the specified

aggregation function(s) on a set of data that had already been

previously aggregated, using the Summary Output Column

property to produce a subrecord containing the summary data that

is then included with the data set. Select the column to be

aggregated, then specify the aggregation functions to perform

against it, and the output column to carry the result.

• Row Count – count the total number of unique records within each

group as defined by the grouping key criteria.

387

Page 388: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Aggregator Stage Properties (Continued)

� Aggregation Methods

• Hash – Default option. Use this mode when the number of

unique groups, as defined by grouping key(s), is relatively

small; generally, fewer than about 1000 groups per

megabyte of memory.

o Input data should be previously hash partitioned by the

grouping key(s)

o Memory intensive

• Sort – Use this mode when there are a large number of

unique groups as defined by grouping key(s), or if unsure

about the number of groups

o Input data should be previously hash partitioned and sorted by

the grouping key(s)

o Uses less memory, but more disk I/O

388

Page 389: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Aggregator Example

� Suppose we would like to find out, based on the data

in our baseball Salaries.ds dataset, the following:

• How many players are on each team each year

• What the average salary is per team each year.
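
In Aggregator terms, the key settings would be roughly the following (a sketch; the property labels are approximate, so check the stage editor):
    Grouping Keys = teamID, yearID
    Aggregation Type = Row Count                               (players per team per year)
    Aggregation Type = Calculation, Column = salary, using the mean/average function
                                                               (average salary per team per year)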

� The flow would look like the following

389

Page 390: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Aggregator Example (Continued)

Output Mapping:

Output Mapping:

Calculate the player count and average salary separately and join the results together afterwards.

Note: Data is being hash partitioned and sorted prior to the Copy. Why?

Default output data type is double

390

Page 391: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 6A: Join & Lookup

391

Page 392: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 6A Objectives

� Use Join stage to map a baseball player’s first and

last name to his corresponding Batting record(s)

� Repeat the above functionality using the Lookup

stage

� Repeat the above functionality using the Merge stage

392

Page 393: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab6a – Data Sources

� Review the table definitions for the data we will be working with:

Batting Data
Column Name   Description
playerID      Player ID code
yearID        Year
teamID        Team
lgID          League
G             Games
AB            At Bats
R             Runs
H             Hits
DB            Doubles
TP            Triples
HR            Homeruns
RBI           Runs Batted In
SB            Stolen Bases
IBB           Intentional walks

Master Data
Column Name   Description
playerID      A unique code assigned to each player
birthYear     Year player was born
birthMonth    Month player was born
birthDay      Day player was born
nameFirst     Player's first name
nameLast      Player's last name
debut         Date player made first major league appearance
finalGame     Date player made last major league appearance

These are the 2 columns you are interested in (nameFirst and nameLast).

We will leverage the playerID key that exists in both datasets to identify and map the correct nameFirst and nameLast columns.

Note that a given playerID value will likely appear in many records, based on how many years he played in the league. While the playerID will be the same, yearID should always be different.

393

Page 394: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab6a – Job Design

� The objective is to leverage the Join stage to map the player’s correct

first and last names to his batting record

� Build a job similar to the following

� Which Join type will you use to ensure that all records from your

Batting.ds file make it through to the output?

� We only care about picking up nameFirst and nameLast columns from

the Master data

• Only map these two columns on the output of the Join stage, and remember

to disable RCP for this stage so that other columns are not propagated

along.

Hash & Sort on Join key

Make sure this is the ‘left’ or primary input.

394

Page 395: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_join – Job Results

� Save the job as lab6a_join

� Compile and Run

� How many records did the Job Monitor report on the output of the Join?

• 25076 player batting records (4720 unique player batting records)

• 3817 master player records

� In the Director Job Log, what did the Peek stage report?

• Here’s an example of the output:

• Based on above record counts and Peek output, it’s obvious that we don’t

have master data for all players in the batting data.

395

Page 396: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_join – Design Update

� Next, let’s separate out Batting records for which no Master is

available

� Append onto the original flow a Filter stage to separate out

records where Master data is not available:

� What kind of Filter criteria can you use to accomplish this?

• If there’s no Master data match, what happens to the nameFirst

and nameLast columns?

• If there’s no match, wouldn’t nameFirst = nameLast?
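
One possibility implied by the questions above (a hint only, since this is the lab exercise): a Where clause such as nameFirst = nameLast would isolate the unmatched records, because both columns come back blank when the LeftOuterJoin finds no Master match.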

396

Page 397: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_join – Updated Job Results

� Save the job again as lab6a_join

� Compile and Run

� How many records did the Job Monitor report on each output of the

Filter?

• 19877 player batting records where there was a Master record match

• 5199 player batting records where there was no Master record match

397

Page 398: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_lookup – Overview

� Save lab6a_join as lab6a_lookup

� Next, replace the Join stage with the Lookup stage

• Make sure to have your links setup correctly

• Use the same lookup key as join key

• Make sure that the Fail condition is set to ‘Continue’ so that the job

does not fail when a lookup failure is encountered

398

Page 399: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_lookup – Job Results

� Save the job as lab6a_lookup

� Compile and Run

� How many records did the Job Monitor report on each output of

the Filter?

• Should be the same as lab6a_join

• 19877 player batting records where there was a Master record

match

• 5199 player batting records where there was no Master record

match

399

Page 400: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_lookup – Design Update

� Is there another way to achieve the same results without using

the Filter stage?

� Try creating the following flow to see if you can replicate the

same results:

• Save As ‘lab6a_lookup2’

• 19877 player batting records

where there was a Master

record match

• 5199 player batting records

where there was no Master

record match

400

Page 401: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab6a – What about the Merge?

� Could you have used the Merge stage to replicate the

same functionality in lab6a_join and lab6a_lookup?

• The Batting data contains multiple records with the same

playerID value. That is because a given player typically

stays in the league multiple years.

• The Master data contains unique records

• Merge expects its Master input to contain no duplicates,

where duplicate records are defined by the merge keys only

• Merge also consumes the Update record once a match

occurs against an incoming Master record.

• Answer: Yes! What if you used the Master data as the

Master input and the Batting data as the Update?

401

Page 402: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6a_merge – Design

� Save job lab6a_lookup2 as lab6a_merge

� Edit the job to reflect the following

� Save, Compile, and Run

� Your results should match:

• 19877 player batting records

where there was a Master

record match

• 5199 player batting records

where there was no Master record match

402

Page 403: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 6B: Aggregator

403

Page 404: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 6B Objectives

� Use the Aggregator stage to perform the following:

• Find the pitcher with the best ERA per team per year

• Find the pitcher(s) with the highest salary per team per year

o Note: Some pitchers may have the same salary

• Determine if it’s the same person!

404

Page 405: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab6b – Data Sources

� Review the table definitions for the data we will be working with:

Pitching Data
Column Name   Description
playerID      Player ID code
yearID        Year
teamID        Team
lgID          League
W             Wins
L             Losses
SHO           Shutouts
SV            Saves
SO            Strikeouts
ERA           Earned Run Average

Salaries Data
Column Name   Description
yearID        Year
teamID        Team
lgID          League
playerID      Player ID code
salary        Salary

We will leverage the playerID, yearID, lgID, and teamID keys that exist in both datasets to identify and map the correct salary column.

405

Page 406: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab6b – Job Design

� Here’s what the job will look like once you are finished building it!

• To ease development, you will build this job one part at a time and test the

results along the way.

406

Page 407: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

� Use the Aggregator stage to find the pitcher with the lowest ERA on

each team, each year.

• Use the Filter stage to eliminate records where ERA < 1 AND W < 5

o It’s not likely for a pitcher to have a legitimate season ERA less than 1.00 and

have won fewer than 5 games

• In the Aggregator stage, isolate the record with

the lowest ERA per team per year

o Group by teamID and yearID keys

o Calculate minimum value for ERA

lab6b_aggregator – Step 1

Should be <=, >=

407

Page 408: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 1 (Continued)

� Step 1 continued…

• Aggregator will produce 1 record per team per year, containing the lowest

ERA value for that year.

o Will not contain playerID

• playerID Lookup – Need to use Lookup, Join, or Merge to map playerID and

other relevant columns back to the output of the Aggregator

o For Example:

� Note: Be sure to disable

RCP if not mapping all

columns across

408

Page 409: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 1 (Continued)

� Save the job as ‘lab6b_aggregator’

� Compile

� Run the job

� Verify that your record counts match the following:

409

Page 410: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 2

� Use another Aggregator stage to find the pitcher with the highest salary on each team, each year. Extend the flow as shown below:

410

Page 411: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 2 (Continued)

� Salary Lookup – first, you will need to remove all records from the Salaries data which do not belong to a pitcher (the data contains both batter and pitcher records)

• Use the Pitching data

to identify the pitchers

• Perform a Lookup

against the Salaries

data and only the

pitcher salaries will be

returned!

411

Page 412: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 2 (Continued)

� Calculate Top Salary – Use the Aggregator stage to find the highest

paid pitcher on each team for each year

• Group by teamID and yearID keys

• Calculate maximum value for salary

� playerID Lookup – Need to use Lookup,

Join, or Merge to map playerID and

other relevant columns back to

the output of the Aggregator.

• For Example:

� Note: Be sure to disable RCP if

not mapping all columns across

412

Page 413: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 2 (Continued)

� Save the job, keeping it as ‘lab6b_aggregator’

� Compile and Run the job

� Verify that your record counts match the following:

NOTE:

It is likely that salary data was not available for all pitchers in the Pitching data set.

Also, some pitchers may have the same salary.

413

Page 414: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 3

� Finally, determine whether or not the pitchers with the best ERA records are also the ones who are being paid the most

• Extend the flow as shown below:

414

Page 415: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 3 (Continued)

� The Answer – use the Lookup stage to find all records where

best ERA = highest salary

• Lookup will send all

matching records where

the pitcher with the best

ERA also had the highest

salary

• Enable rejects and all

records which do not

match the above criteria

will flow down the reject

link instead

415

Page 416: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Step 3 (Continued)

� Calculate Top Salary – Use the Aggregator stage to find the highest

paid pitcher on each team for each year

• Group by teamID and yearID keys

• Calculate maximum value for salary

� playerID Lookup – Need to use Lookup,

Join, or Merge to map playerID and

other relevant columns back to

the output of the Aggregator.

• For Example:

� Note: Be sure to disable RCP if

not mapping all columns across

416

Page 417: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Final Results

� Save the job, keeping it as ‘lab6b_aggregator’

� Compile and Run the job

� Verify that your record counts match the following:

Answer:

Having the best ERA does not correlate to being the best paid pitcher on the team!

417

Page 418: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

lab6b_aggregator – Optimization

� A simple optimization that can be performed in this job is to hash-partition and sort the data only once, before Copy1, instead of doing it twice as before. Remember, the data needs to be hash-partitioned and sorted for the Aggregator stage to function properly when using the Sort mode.

When processing large volumes of data, eliminating unnecessary hashing and sorting will improve your performance!

418

Page 419: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview – Page 10

2. Parallel Framework Overview – Page 73

3. Data Import and Export – Page 116

4. Data Partitioning, Sorting, and Collection – Page 252

5. Data Transformation and Manipulation – Page 309

6. Data Combination – Page 364

7. Custom Components: Wrappers – Page 420

8. Custom Components: Buildops – Page 450

9. Additional Topics – Page 477

10. Glossary – Page 526

419

Page 420: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Wrappers

In this section we will discuss the following:

� Wrappers

• What is a wrapper

• Use case

• How to create

420

Page 421: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

What is a Wrapper?

� DataStage allows users to leverage existing applications within

a job by providing the means to call the executable from within a

‘Wrapper’.

� Wrappers can be:

• Any executable application (C, C++, COBOL, Java, etc…) which

supports standard input and standard output, or named pipes

• Executable scripts (Korn shell, awk, PERL, etc…)

• Unix commands

� DataStage treats wrappers as a black box

• Does not manage or know about what goes on within a wrapper.

� Wrappers support from 0 to many input and output links

• Should match the action of the executable being wrappered

421

Page 422: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

What is a Wrapper? (Continued)

� Wrappers are considered to be ‘external’ applications

• Data must be 1st exported from DataStage before it’s passed

onto the wrapped executable.

• Once processed by the wrapped executable, data must be

imported back into DataStage before further processing can

occur.

Example Wrapper (diagram callouts):

• Data is exported from DataStage to Unix

• Data is processed through Unix’s grep command

• Data is imported from Unix to DataStage

422

Page 423: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Why Wrapper?

� Wrappers allow existing executable functionality to be

redeployed as a stage within DataStage

• Re-use the logic in 1 or many different jobs

• Achieve higher scalability and better performance than

running it sequentially

o Some applications cannot or should not be executed in parallel.

o Some applications require the entire dataset in a single partition, which inhibits their ability to process in parallel

• Avoid re-hosting of complex logic by creating a Wrapper

instead of a complete DataStage job.

423

Page 424: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Wrapper Use Case

� An existing legacy COBOL application performs the following:

• Reads data in from a flat file

• Performs lookups against a database table

• Scores the data

• Writes the results out to a flat file

� Because this COBOL application does not need to be processed

sequentially and can support named pipes for the input, it becomes an

ideal candidate for becoming a Wrapper.

• The Wrapper will appear as a stage and can be used in any applicable job

(Diagram: an Input file feeding the COBOL Wrapper stage, which performs lookups against an RDBMS)

424

Page 425: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Wrapper – Step 1

� To create a Wrapper, 1st do the following:

• Stage Type Name will be

the name that shows up

on the palette

• Command is where the

name of the executable

is entered.

• Execution Mode is

parallel by default, but can be sequential.

‘grep’ is a Unix command for searching – in this example, search for any text containing the string “NL”
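For example, the Command field here might simply contain grep NL (an assumption based on the search string mentioned above, since the screenshot itself shows the exact entry); command-line options can also be defined separately on the Properties tab, as described in Step 5.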

425

Page 426: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Wrapper – Step 2

� Next, you should define the Input and Output Interfaces

• The Input Interface tells DataStage how to export the data in a format

that is digestible by the wrappered application – remember, the data is

being sent to the wrappered executable to be processed

• The Output Interface tells DataStage how to re-import the data that has

been processed by the wrappered executable. This action is very

similar to what happens when DataStage is reading in a flat file.

• For multiple Inputs and/or Outputs, define an interface for each

o Note: Link numbering starts at 0

426

Page 427: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Wrapper – Step 3

� For both Input and Output Interfaces, be sure to

specify the Stream properties

• Defines the characteristics of both input and output data

• Also supports the use of

named pipes for input/output

427

Page 428: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Wrapper – Step 4

� Next, define any

environment

variables and/or

exit codes which

may occur from

the wrappered

executable.

• For example, an

executable may

return a ‘1’ when

finished successfully

428

Page 429: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Wrapper – Step 5

� The Wrapper also supports command line arguments

and options to be defined

• This step is optional

• Use the Properties tab to enter this information

• Will see an example of this in the lab.

429

Page 430: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Wrapper – Step 6

� Finally, to create the Wrapper:

• 1st click on the Generate button
• 2nd click on OK

This will create the Wrapper and store it under the Category you specify.

430

Page 431: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Locating the Wrapper

� Once created, the Wrapper will be accessible from

the List View or the Palette

• Category name can be user-defined

or changed later

• Wrappers can be exported via the

Manager and re-imported into any

other DataStage project.

• Use Wrappers just like any other stage in a job

• Double-Click the Wrapper in the List View to change its

properties or definition

431

Page 432: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 7A: Simple Wrapper

432

Page 433: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 7A Objectives

� Create a simple Wrapper using the Unix ‘sort’

command

� Apply the Wrapper in a DataStage job

433

Page 434: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Unix Sort

� To learn how to use the

Unix sort utility, simply

type in ‘man sort’ at the

Unix command line to

bring up the online help

• It should look similar to the

screenshot to the right:

• Sort utility can take data

from standard input and

write the sorted data to

standard output

434

Page 435: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a New Wrapper

� To create a new Wrapper,

right-click on Stage Types,

select New Parallel Stage,

and click on Wrapped…

� Enter the following

information in the

Wrapper stage editor

435

Page 436: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining the Input Interface

� Click on the Wrapped � Interfaces � Input tabs

� Select the appropriate table definition

• Use the Batting

table definition we

created in Lab 3

� Specify ‘Standard

Input’ as the stream

property

436

Page 437: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining the Output Interface

� Click on the Wrapped � Interfaces � Output tabs

� Select the appropriate table definition

• Use the Batting

table definition we

created in Lab 3

� Specify ‘Standard

Output’ as the stream

property

437

Page 438: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Generating the Wrapper

� Click on Generate button and then the OK button at

the bottom of the Wrapper stage editor

� Look in the Repository View under the

Stage Types � Wrapper category

• Verify that the newly created UnixSort Wrapper

is there


438

Page 439: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Wrapper

� To test the newly created UnixSort Wrapper stage,

assemble the following job:

• Use Batting.ds dataset

• Use the Batting table definition created in Lab 3

• Use the Input Partitioning tab on the UnixSort stage to

specify Hash on playerID – do not click on the Sort box!

o Remember, you must hash on the sort key!

439

Page 440: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

UnixSort Test Output

� Save the job as ‘lab7a’

� Compile the job

� Run the job and view the results in the Director log

• The Peek output should reveal that all records are sorted

based on playerID

• By default, the Unix Sort utility will use the first column as the

column to perform the sort on.

440

Page 441: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 7B: Advanced Wrapper

441

Page 442: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 7B Objectives

� Create a Wrapper using the Unix ‘sort’ command

which supports user-defined options

� Apply Wrapper in a DataStage job

442

Page 443: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Advanced UnixSort Wrapper

� The Unix Sort utility supports user-defined options

• One option is to specify the column delimiter – it treats whitespace as the delimiter by default

• Another option is to specify the column to perform the sort

on

• Example:

sort -t , +1 -2

will look for ‘,’ as the column delimiter and perform a sort

using column #2 as the sort key
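For instance, given two comma-delimited records (hypothetical sample rows, assuming playerID is the first column and yearID the second):

aardsda01,2006,CHN
abadan01,2001,OAK

a plain sort would keep them in this (playerID) order, while sort -t , +1 -2 would reverse them, because it compares only the second field (2001 sorts before 2006).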

� Create a new Wrapper that allows you to specify a

column delimiter and a key to perform the sort on

443

Page 444: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining the AdvancedUnixSort Wrapper

� Right-click on the UnixSort Wrapper stage in the

Repository View and

select ‘Create Copy’

� Edit the copied

Wrapper and change

the name to

‘AdvancedUnixSort’

� Keep everything else

the same on the

General tab

444

Page 445: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

AdvancedUnixSort Wrapper Options

� Click on the Properties tab and define the following:

• t – used by sort to define column delimiter. Use ‘,’ as the default

value, as this is what DataStage uses when exporting data

• start – defines the start position for the sort key, based on column

number reference (+1 = end of 1st column)

• stop – defines the stop position for the sort key, based on column

number reference (-2 = end of 2nd column)

• Specify the Conversion values as shown above

445

Page 446: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Generating the Wrapper

� Click on Generate button and then the OK button at

the bottom of the Wrapper stage editor

� Look in the Repository View under the

Stage Types � Wrapper category

• Verify that the newly created

AdvancedUnixSort Wrapper is there


446

Page 447: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Advanced Wrapper

� To test the newly created AdvancedUnixSort

Wrapper stage, assemble the following job:

• Use Batting.ds dataset

• Use the Batting table definition created in Lab 3

• Edit the properties for the AdvancedUnixSort stage

o Specify the Column Delimiter to be ‘,’

o Set the End Position to ‘-3’ (i.e. teamID)

o Set the Start Position to ‘+2’

o Specify the Hash key to be teamID
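With these settings (an assumption based on the t, start, and stop properties defined on the previous page), the wrapped command would effectively run as sort -t , +2 -3, which sorts on the third comma-delimited column (teamID) rather than on the first.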

447

Page 448: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

AdvancedUnixSort Test Output

� Save the job as ‘lab7b’

� Compile the job

� Run the job and view the results in the Director log

• The Peek output should reveal that all records are now

sorted based on teamID

• By default, the Unix Sort utility would have used the first

column as the column to perform the sort on.

448

Page 449: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview

2. Parallel Framework Overview

3. Data Import and Export

4. Data Partitioning, Sorting, and Collection

5. Data Transformation and Manipulation

6. Data Combination

7. Custom Components: Wrappers

8. Custom Components: Buildops

9. Additional Topics

10. Glossary

449

Page 10

Page 73

Page 116

Page 252

Page 309

Page 364

Page 420

Page 450

Page 477

Page 526

Page 450: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Buildops

In this section we will discuss the following:

� Buildops

• What is a Buildop

• Use cases

• How to create

• Example

Note: Buildop is considered advanced functionality

within DataStage. In this section you will learn

the basics for how to create a simple Buildop.

450

Page 451: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

What is a Buildop?

� DataStage allows users to create a new stage using standard C/C++ – this stage is called a Buildop

� Buildops can be any ANSI compliant C/C++ code

• Code must be syntactically correct

• Code must be able to be compiled by a C/C++ compiler

• If code does not work outside of DataStage, it will not work within

DataStage!

� DataStage Framework treats Buildops as a native stage

• A stage created using a Buildop is native to the Framework, whereas a Wrapper calls an external executable, which is non-native.

• Does not need to export / import data like the Wrapper did

• Does not manage or know about what goes on within the custom

written code itself, but does manage the parallel execution of the

code

451

Page 452: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Why Buildop?

� Buildops allow users to extend the functionality

provided by DataStage out of the box

� Buildops offer a high performance means of

integrating custom logic into DataStage

� Once created, a Buildop can be reused in any job

and shared across projects

� Buildops only require the core business logic to be

written in C/C++

• DataStage will take care of creating the necessary

infrastructure to execute the business logic

452

Page 453: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Buildop Use Case

� A custom scoring algorithm does the following:

• Identifies customers who live within a certain area AND have

household income above some fixed amount

• Each record which meets the above criteria is identified by a

special value populated in the ‘status’ column

� Because this logic does not need to be processed sequentially,

it becomes an ideal candidate for becoming a Buildop.

(Diagram: Input file, Buildop stage, RDBMS)

453

Page 454: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Buildop vs Transformer

� The use case scenario described on the previous slide

could also have been easily implemented in the

transformer.

� Buildop Advantages:

• Use standard C/C++ code, allows existing logic to be re-used

• High performance – buildops are considered native to the

Framework, whereas the Transformer must generate code

• Supports multiple inputs and outputs

� Transformer Advantages:

• Simple graphical interface

• Pre-defined functions and derivations are easy to access

• No need to pre-define input and output interface

454

Page 455: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 1

� To create a Buildop, 1st do the following:

• Stage Type Name will be

the name that shows up

on the palette

• Operator is the reference name that the Framework will use – this is often kept the same as the Stage Name

• Execution Mode is

parallel by default, but can be sequential.

455

Page 456: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 2

� Next, you must define the Input and Output Interfaces

• The Input Interface describes to the Buildop the column(s) being

operated on. Note: Only specify the columns that will be used

within the Buildop. Any column defined must be referenced in the

code!

• The Output Interface describes to the Buildop the column(s) being

written out. Note: Only specify the columns that will be used within

the Buildop. Any column defined must be referenced in the code!

• For multiple Inputs and/or Outputs, define an interface for each

o Define Port Names in order to track inputs / outputs

o When there’s only 1 input/output, there’s no need to define Port Name

456

Page 457: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 2 (Continued)

� Auto Read/Write

• This defaults to True which means the stage will

automatically read and/or write records from/to the port. If set

to False, you must explicitly control the read and write

operations in the code.

• Once a record is written out, it can no longer be accessed

from within the Buildop

� RCP – Runtime Column Propagation

• False by default. Set this to True to force all columns not

defined in the Input Interface to propagate through and be

available downstream.

• If set to False, no columns other than those defined in the

Output Interface will show up downstream.

457

Page 458: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 3 (Optional)

� The Transfer tab allows users to define record transfer behavior

between input and output links.

• Useful when there’s more than 1 input or output link

� Auto Transfer

• Defaults to False, which means that you have to include code which

manages the transfer. Set to True to have the transfer carried out

automatically.

� Separate

• Defaults to False, which means the transfer will be combined with other

transfers to the same port. Set to True to specify that the transfer should be

separate from other transfers.

458

Page 459: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 4 (Optional)

� In the Build � Logic � Definitions tab, you can

define variables which will be used as part of the

business logic.

� Variables can be

standard C types

or Framework

data types

• Some example Framework data types include: APT_String, APT_Int32, APT_Date, APT_Decimal (see the sketch below)
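As a hedged illustration only (the variable names below are hypothetical and not taken from the original slides), the Definitions tab might contain declarations such as:

int recordCount;            // standard C type, e.g. to count qualifying records
double incomeThreshold;     // standard C type used by the business logic
APT_String statusFlag;      // Framework string type
APT_Int32 regionCode;       // Framework 32-bit integer type

Variables declared here are then available to the Pre-Loop, Per-Record, and Post-Loop code.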

459

Page 460: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 5 (Optional)

� In the Build � Logic � Pre-Loop or Post-Loop tab,

you can define logic using C/C++ to be executed

before and after

� Pre-Loop

• Logic that is processed before any records have been

processed

� Post-Loop

• Logic that is processed after all records have been

processed

460

Page 461: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 6

� The core C/C++ business logic is entered in the

Per-Record tab

• Use any standard

ANSI C/C++ code

• Leverage built-in

Framework function

and macro calls

� Per-Record processing

• Logic is executed against each record.

• Once a record has been written out, it cannot be recalled

• Does allow buffering of records and management of record

input and output flow – advanced topics.

Directly reference columns!

461

Page 462: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Buildop – Step 7

� Finally, to create the Buildop, click on the Generate

button to compile the logic into a stage

If there are no syntax errors or other violations in the Buildop definition, you should obtain an Operator Generation Succeeded status window similar to the one below:

462

Page 463: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Locating the Buildop

� Once created, the Buildop will be accessible from the List View or the Palette

• Category name can be user-defined or changed later

• Buildops can be exported via the Designer and re-imported into any other DataStage project.

• Use Buildops just like any other stage in a job

• Double-Click the Buildop in the List View to change its properties or definition

463

Page 464: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Buildop Usage Example

� Here’s an example of the Buildop in action:

(Screenshot: the Buildop’s Input Interface and Output Interface definitions, plus a sample output showing the Input Columns, the Output Columns, and the new column added by the Buildop)

464

Page 465: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 8A: Buildop

465

Page 466: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 8A Objectives

� Create a Buildop to perform the following:

• Derive the pitcher’s Win-Loss percentage based on his win-loss record for the season and populate the result into a new column

• Expand the lgID value to either ‘National League’ or ‘American League’ and populate the result into a new column

466

Page 467: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 8A – Overview

� Here’s the simple job we will be creating to test out

the Buildop:

� Overview:

• Use the Pitching.ds dataset

• Use the Pitching table definition

• Use the following formula to calculate Win-Loss Percentage (see the sketch after this list):

o WLPercent = ( Wins / (Wins + Losses) ) x 100

• If lgID = “AL”, then set leagueName = “American League”

• If lgID = “NL”, then set leagueName = “National League”
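Since Buildop code must also work outside of DataStage (as noted in the Buildop overview), the rules above can be prototyped as a small standalone C program before the equivalent logic is entered in the Per-Record tab. The sketch below is only an illustration of that logic with hypothetical test values; inside the Buildop, the input and output columns (W, L, lgID, WLPercent, leagueName) would be referenced directly rather than declared as local variables.

#include <stdio.h>
#include <string.h>

int main(void)
{
    /* Hypothetical test values; in the Buildop these come from the input columns */
    int W = 12, L = 8;
    const char *lgID = "NL";

    double WLPercent = 0.0;
    char leagueName[32] = "";

    /* Cast to double so the division is not truncated by integer arithmetic */
    if (W + L > 0)
        WLPercent = ((double)W / (double)(W + L)) * 100.0;

    /* Expand the league code into the full league name */
    if (strcmp(lgID, "AL") == 0)
        strcpy(leagueName, "American League");
    else if (strcmp(lgID, "NL") == 0)
        strcpy(leagueName, "National League");

    printf("WLPercent=%.1f leagueName=%s\n", WLPercent, leagueName);
    return 0;
}

In the Per-Record code itself, the same arithmetic and string comparison would simply assign to the WLPercent and leagueName output columns defined in Buildop_Output.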

467

Page 468: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a New Buildop

� To create a new Buildop,

right-click on Stage Types,

select New Parallel Stage,

and click on Build…

� Enter the following

information in the

Buildop stage editor

468

Page 469: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Input / Output Buildop Table Definitions

� Create an input and an output table definition for

Buildop

• Remember, only specify the columns that will be referenced

within the Buildop code itself.

• For the input, create the following and save it as

/Labs/Lab8/Buildop_Input

• For the output, create the following and save it as

/Labs/Lab8/Buildop_Output

469

Page 470: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining the Input Interface

� Click on the Build � Interfaces � Input tabs

� Select the Buildop_Input table definition you just

created

• Do not use the

Pitching table

definition!

• Set Auto Read to

True

• Set RCP to True

470

Page 471: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining the Output Interface

� Click on the Buildop � Interfaces � Output tabs

� Select the Buildop_Output table definition you just

created

• Do not use the

Pitching table

definition!

• Set Auto Write to

True

• Set RCP to True

471

Page 472: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining the Per-Record Logic

� Enter the following C code into the Per-Record section

472

Page 473: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Generating the Buildop

� When finished editing the

Per-Record section, click on

the Generate button

� If everything was entered

correctly, you should get a

similar success dialogue:

� Click Close and

then OK

473

Page 474: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Locating the Buildop

� Once successfully created, the buildop will be

accessible from the Repository View and the Palette

(under the Buildop category).

474

Page 475: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Buildop

� Test the newly created Buildop using the flow discussed

at the beginning of this lab:

� Save as ‘lab8a’

� Compile and run lab8a - a random sample output is

shown

here:

475

Page 476: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview – Page 10

2. Parallel Framework Overview – Page 73

3. Data Import and Export – Page 116

4. Data Partitioning, Sorting, and Collection – Page 252

5. Data Transformation and Manipulation – Page 309

6. Data Combination – Page 364

7. Custom Components: Wrappers – Page 420

8. Custom Components: Buildops – Page 450

9. Additional Topics – Page 477

10. Glossary – Page 526

476

Page 477: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Additional Topics

In this section we will provide a brief overview of:

� Job Report Generator

� Containers

• Local

• Shared

� Job Sequencer

� DataStage Designer

• Export

• Import

477

Page 478: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Job Report Generator

� Developers often neglect to

document the applications they

develop.

� Fortunately, DataStage jobs are, for

the most part, self documenting and

fairly easy to decipher.

� Another built-in feature – the Report

Generator – offers an automated

way to document DataStage jobs

478

Page 479: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Generating a Job Report

� Access the Generate Report option from the File menu in the Designer

• Make sure that the job is open in the Designer

• Specify a name for the report, then click on OK to create the report.

479

Page 480: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing the Job Report

� Once the report is

generated successfully,

a dialog will appear that will

let you view the report, or

open the Report Console

so that you can view all

reports.

� You can also open the Report Console by opening the Web

Console from the Start menu, and selecting the Reporting tab.

480

Page 481: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing the Job Report

The report is a hyperlinked document which allows you to access detailed information about the job.

481

Page 482: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Job Report Details

� For example, clicking on the SimpleTransform stage

will show the following documentation:

All derivations will be listed

482

Page 483: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Containers

� Containers are used in DataStage to visually simplify

jobs and create re-usable logic flows

• Containers can contain 1 or more stages and have

input/output links

• Local Containers are only accessible from within the job where they were created

o Local Container can be converted into a

Shared Container

o Local Containers can be ‘deconstructed’ back

into the original stages within the flow

• Shared Containers are accessible to any job within a project

o Shared Container can be converted into a

Local Container

o Shared Containers cannot be deconstructed

483

Page 484: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Container

� First, draw a line around the

specific stages that you

would like to place into a

container.

� Make sure that only the

stages you want are

selected!

• In this example, we are only

selecting the Transformer and

the Funnel

484

Page 485: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Container (Continued)

� Next, click on the Edit menu, select Construct

Container, and then either Local or Shared.

• You can also use the icons on the toolbar…

485

Page 486: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating a Container

� Once created, the job with a shared container will

look like the following:

The contents of the Container can be viewed in a separate window by right-clicking on the Container and selecting the ‘Properties’ option

486

Page 487: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Shared vs Local Containers

� The primary difference between Shared and Local

Containers is that Shared containers can be re-used

in other jobs.

487

Page 488: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Job Sequencer

� The Job Sequencer provides an interface for

managing the execution of multiple jobs

To create a Job Sequence, select it from the ‘New’ menu.

Next, drag and drop the jobs onto the canvas and link them as you would with any 2 stages.

In this example, lab5a_1 will execute 1st, and then lab5a_2, and then lab5a_3.

488

Page 489: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Sequencer Stages

� The Job Sequencer has a lot of built-in functionality to assist with job flow management

• Handle exceptions such as errors and warnings

• Send messages via email or pager

• Execute external applications or scripts

• Wait for file activity prior to executing job

o Useful for batch applications which are

dependent on arrival of input data

• Control execution based on completion and

condition of executed jobs

489

Page 490: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Job Sequencer Example

� Once a Job Sequence has been created, it behaves

just like any other DataStage job

• If the DataStage jobs use Job Parameters, you must pass in

the value for those parameters from within the Sequencer

o Can define Job Parameters for a Job Sequence

and pass those parameters into the interface

for each job being called.

• Need to Save the job, Compile, and Run it.

• Sequencer Job can be scheduled just like any other

DataStage job.

490

Page 491: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

DataStage Manager

� Use the DataStage Designer client to import or

export:

• Entire Project

• 1 or Many Jobs

• Shared Containers

• Buildops

• Wrappers

• Routines

• Table Definitions

• Executables

� Supports internal DSX format or XML for imports and

exports

491

Page 492: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Export Interface

� Specify name and

location for export

� Specify whole

project (backup) or

individual objects

� Append or Overwrite

existing DSX or XML

export files

� Note: Items should

not be open in the

Designer when

performing exports

492

Page 493: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

When to Export

� Use the Designer to perform job / project exports

• When upgrading DataStage, it’s considered a good practice to

1. Export the projects

2. Delete the projects

3. Perform the upgrade

4. Re-import the projects.

• Upgrades will proceed much faster

• Export jobs, containers, stages, etc… and check the DSX or

XML file into source control

• Export to a DSX or XML in order to migrate items between

DataStage servers

• Export the entire project as a means of creating a backup

493

Page 494: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Import Interface

� The import interface is simpler than that of the export

� Specify location of the DSX or XML

� Use the Perform Usage Analysis feature to ensure

nothing gets accidentally overwritten during import

� You can also select only specific items to import by

using the Import Selected option

494

Page 495: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9A: Job Report

495

Page 496: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9A Objectives

� Generate a Job Report:

• Open job ‘lab5a_1’

• Use the Job Report utility to generate a report

• Examine the results

496

Page 497: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9A – Overview

� Open the job ‘lab5a_1’

497

Page 498: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Generating a Job Report

� Access the Generate Report option from the File menu in the Designer

• Make sure that lab5a_1 is open in the Designer

1. Specify the location for the report to be generated and saved to.

2. Click on OK to create the report.

498

Page 499: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing Reports

3. After the report is generated, you should see the dialog box shown above. Click on the “Reporting Console” link.

499

Page 500: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing Reports

4. This should take you to the “Reporting” tab of the Information Server Web Console, shown above. Starting with the “Reports” option in the Navigation pane on the left, navigate to the folder containing the job report you just created.

500

Page 501: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing Reports

Your Web Console should now look something like this:

501

Page 502: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Viewing Reports

5. Select the report you just created, and click “View Report Result” in the pane on the right. You should now see a job report similar to the one shown on the left. Try clicking on the stage icons and see what happens.

502

Page 503: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9B: Shared Containers

503

Page 504: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9B Objectives

� Create a shared container using a subset of logic

from previously created job

� Edit the Shared Container to make it more generic

� Reuse Shared Container in a separate job

504

Page 505: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9B – Overview

� Open the job ‘lab6a_lookup’

� Left-click and drag your cursor around the stages as

shown below by the red box:

505

Page 506: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Creating the Shared Container

� Next, click on the Shared Container icon on the

toolbar

• You can also click on the Edit menu, select Construct

Container, and then select Shared.

• Save the Shared Container as ‘MasterLookup’

� Your flow should now look similar to this:

506

Page 507: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Modify the Job

� Modify your job by adding 2 copy stages as shown

below:

� This is to work around an issue with performance

statistics.

• The Peek stages will only report the number of records

output to the Director log

• Adding the Copy stage will display an accurate record count

507

Page 508: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Shared Container

� Save the job as ‘lab9b_1’

� Compile the job and run

� There should be the following output

• 19877 player batting records going to Copy1, where there

was a Master record match

• 5199 player batting records going to Copy2, where there

was no Master record match

508

Page 509: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Editing the Shared Container

� In order to make the Shared Container useful for other data

sources, we will need to edit the Input and Output Table

Definitions and leverage RCP

� Open the MasterLookup Shared Container:

• Edit the Input and Output Table Definitions and remove all columns

except for playerID, nameFirst and nameLast

• Make sure RCP is enabled everywhere

� Save the Shared Container and close the window

509

Page 510: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Shared Container Re-Use

� Create the following job flow using the Pitching.ds

dataset and Table Definition

• Be sure to have RCP enabled throughout your job

• Table Definitions on the output of the Shared Container are optional because of RCP

510

Page 511: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Testing the Shared Container

� Save the job as ‘lab9b_2’

� Compile the job and run

� There should be the following output

• 9691 pitching records going to Copy1, where there was a

Master record match

• 2226 pitching records going to Copy2, where there was no

Master record match

� You can also try processing the Salaries dataset

using the Shared Container created in this lab.

511

Page 512: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9C: Job Sequencer

512

Page 513: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9C Objectives

� Use the Job Sequencer to run jobs lab9b_1 and

lab9b_2 back to back

513

Page 514: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9C – Overview

� Create a Job Sequence by selecting File � New �

Job Sequence

� To create a Job Sequence, click on and select job lab9b_1 and drag it onto the canvas. Next, click on and drag lab9b_2

� Right-click on the Job_Activity_0 stage and drag the link to the Job_Activity_1 stage.

• This will run lab9b_1 first and then lab9b_2 next

514

Page 515: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Job Parameters

� Before the jobs can be run, you must specify the

values to be passed to the Job Parameters

• Both lab9b_1 and

lab9b_2 use

$APT_CONFIG_FILE

and $FILEPATH

515

Page 516: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Defining Job Sequencer Parameters

� Go to the Job Properties � Parameters tab and click

on Add Environment Variable

• Select $APT_CONFIG_FILE and $FILEPATH from the list

Click on OK when finished

516

Page 517: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Inserting Parameter Values

� Next, go back into the Job_Activity stage properties

and for each Parameter, click on Insert Parameter

Value to insert the

Parameters you just

defined

• Do this for both

stages

517

Page 518: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

� When finished, save the job as ‘lab9c’

� Compile the job but do not run it yet.

• First, make sure that both lab9b_1 and lab9b_2 are compiled

and ready to run

� Run lab9c and view the results in the Director log

• There should have been no errors

• The results from each individual job can be viewed from the

Director by selecting the log for that job

518

Page 519: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9D: Project Export

519

Page 520: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Lab 9D Objectives

� Use the DataStage Manager to export your entire

project

• This will provide you with a backup of the work you have

done this week

520

Page 521: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Export Your Entire Project

� Save all of your work.

� Close all open jobs.

� In the Designer, under Export menu, select “DataStage

Components.”

� In the Repository Export dialog, click on “Add.”

521

Page 522: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Export Your Entire Project

� In the “Select Items” dialog, click on the project, which is the top

level of the hierarchy.

� Click on OK.

� Now, you will probably have to wait a couple of minutes.

522

Page 523: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Export Your Entire Project

� When control returns to

the dialog, fill in the

“Export to file” path.

� Click “Export”

� A progress bar will

appear.

� Eventually, you will see

a dialog that will tell you

that “some read-only

objects were not

exported.”

523

Page 524: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Congratulations!

� You have successfully completed all of your labs!

� You have created a backup of your labs which you

can take with you and later import into your own

project elsewhere.

524

Page 525: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Agenda

1. DataStage Overview – Page 10

2. Parallel Framework Overview – Page 73

3. Data Import and Export – Page 116

4. Data Partitioning, Sorting, and Collection – Page 252

5. Data Transformation and Manipulation – Page 309

6. Data Combination – Page 364

7. Custom Components: Wrappers – Page 420

8. Custom Components: Buildops – Page 450

9. Additional Topics – Page 477

10. Glossary – Page 526

525

Page 526: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Glossary

� Administrator – DataStage client used to control project global settings and permissions.

� Collector – Gather records from all partitions and place them into a single partition. Forces sequential processing to occur.

� Compiler – Used by DataStage Designer to validate contents of a job and prepare it for execution.

� Configuration File – file used to describe to the Framework how many ways parallel a job should be run (a minimal sample file follows this entry).

• Node – virtual name for the processing node

• Fastname – hostname or ip address of the processing box

• Pool – virtual label used to group processing nodes and resources in the config file

• Resource Disk – designates where Parallel Datasets are to be written to

• Resource Scratchdisk – designates where DataStage should create temporary files
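As a hedged illustration only (the hostnames and paths below are hypothetical, not taken from the course environment), a two-node configuration file might look like this:

{
    node "node1"
    {
        fastname "etlserver1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
    node "node2"
    {
        fastname "etlserver1"
        pools ""
        resource disk "/data/datasets" {pools ""}
        resource scratchdisk "/data/scratch" {pools ""}
    }
}

With a file like this referenced by $APT_CONFIG_FILE, jobs would run two ways parallel on the host etlserver1.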

� Dataset – DataStage data storage mechanism which allows for data to be stored across multiple files on multiple disks. This is often used to spread out I/O and expedite file reads and writes.

� Designer – DataStage client used primarily to design, create, execute, and maintain jobs.

526

Page 527: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Glossary

� Director – DataStage client used to run, schedule, and monitor the execution of DataStage jobs and to view their runtime logs.

� Export – Process by which data is written out of DataStage to any supported target.

� Funnel – Stage used to gather many links, where each link contains many partitions, into a single parallel link. All input links must have the same layout.

� Generator – Stage used to create rows of data based on table definition and parameters provided. Often useful for testing applications where real data is not available.

� Grid – Large collection of computing resources which allow for MPP-style processing of data. Grid computing often allows for dynamic configuration of available computing resources.

� Import – Process by which data is read into DataStage and translated to DataStage internal format.

� Job – A collection of stages arranged in a logical manner to represent a particular business logic. Jobs must be first compiled before they can be executed.

527

Page 528: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Glossary

� Link - A conduit between 2 stages which enables data transfer from the upstream stage to the downstream stage.

� Manager – DataStage client used to import/export objects from the DataStage server repository. These include table definitions, jobs, and custom built stages.

� MPP – Massively Parallel Processing. Computing architecture where memory and disk are not shared across hardware processing nodes.

� Operator – Same as a stage. Operators are represented by stages in the Designer, but referenced directly by the Framework.

� Partition – Division of data into parts for the purpose of parallel processing.

� Parallelism – Concurrent processing of data

• Partitioned Parallelism – divide and conquer approach to processing data. Data is divided into partitions and processed concurrently. Data remains in the same partition throughout the entire life of the job.

• Pipelined Parallelism – parallel data processing similar to partitioned parallelism, except data does not have to remain within the same partition throughout the life of the job. This allows records to be processed across various partitions, helping eliminate potential bottlenecks.

528

Page 529: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Glossary

� Peek – Stage which allows users to view a subset of records (default 10 per

partition) as they pass through.

� Pipelining – The ability to process data and pass data between processes in

memory instead of having to first land data to disk.

� RCP – Runtime Column Propagation. Feature which allows columns to be

automatically propagated at runtime without user having to manually perform

source to target mapping at design time.

� RDBMS – Relational Database Management System. A database that is

organized and accessed according to the relationships between data values

� Reject – Record that is rejected by a stage because it does not meet a specific

condition.

� Scalability – From a DataStage perspective, it’s the ability for an application to process the same amount of data in less time as additional hardware resources are added to the computing platform.

� SMP – Symmetric Multi-Processing. Computing architecture where memory and disk are shared by all processors.

529

Page 530: DataStage Quality Stage Fundamentals

ValueCap Systems - Proprietary

Glossary

� Stage – A component in DataStage that performs a predetermined action against the data. For example, the Sort stage will sort all records based on a chosen column or set of columns.

� Table Definition – A schema containing field names and their associated data types and properties. Can also contain descriptions about the content of the field(s).

� Wrapper – An external application, command, or other independently executable object that can be called from within DataStage as a stage. Wrappers can accept many inputs and many outputs, but the inputs and outputs must be pre-defined.

530