Student Guide
SAP Data Services XI 3.0 –
Data Integrator
SAP Data Services – Data Integrator XI 3.0
CONTENTS
About this Course
Course introduction...................................................................................................xiii
Course description.....................................................................................................xiv
Course audience.........................................................................................................xiv
Prerequisites................................................................................................................xiv
Additional education.................................................................................................xiv
Level, delivery, and duration....................................................................................xv
Course success factors.................................................................................................xv
Course setup.................................................................................................................xv
Course materials..........................................................................................................xv
Learning process .........................................................................................................xv
Lesson 1
Describing Data Services
Lesson introduction.......................................................................................................1
Describing the purpose of Data Services....................................................................2
Describing Data Services benefits .......................................................................2
Understanding data integration processes.........................................................2
Understanding the Data Services packages.......................................................3
Describing Data Services architecture........................................................................5
Defining Data Services components ...................................................................5
Describing the Designer .......................................................................................6
Describing the repository .....................................................................................7
Describing the Job Server......................................................................................8
Describing the engines........................................................................................12
Describing the Access Server..............................................................................12
Describing the adapters.......................................................................................12
Describing the real-time services ......................................................................12
Describing the Address Server...........................................................................13
Describing the Global Parsing Options, dictionaries, and directories.........13
Describing the Management Console ..............................................................13
Defining other Data Services tools.....................................................................14
Defining Data Services objects...................................................................................16
Understanding Data Services objects ...............................................................16
Defining relationships between objects .............................................................17
Defining projects and jobs ..................................................................................18
Using work flows.................................................................................................18
Describing the object hierarchy .........................................................................19
Using the Data Services Designer interface.............................................................21
Describing the Designer window .....................................................................21
Using the Designer toolbar ................................................................................22
Using the Local Object Library ..........................................................................23
Using the project area .........................................................................................24
Using the tool palette ..........................................................................................26
Using the workspace ...........................................................................................27
Quiz: Describing Data Services .................................................................................28
Lesson summary..........................................................................................................29
Lesson 2
Defining Source and Target Metadata
Lesson introduction.....................................................................................................31
Using datastores...........................................................................................................32
Explaining datastores .........................................................................................32
Using adapters .....................................................................................................33
Creating a database datastore ...........................................................................33
Changing a datastore definition ........................................................................34
Importing metadata from data sources ............................................................35
Importing metadata by browsing .....................................................................36
Activity: Creating source and target datastores..............................................37
Using datastore and system configurations.............................................................41
Creating multiple configurations in a datastore .............................................41
Creating a system configuration .......................................................................44
Defining file formats for flat files..............................................................................46
Explaining file formats .......................................................................................46
Creating file formats ...........................................................................................46
Handling errors in file formats ..........................................................................50
Activity: Creating a file format for a flat file....................................................51
Defining file formats for Excel files...........................................................................53
Using Excel as a native data source ..................................................................53
Activity: Creating a file format for an Excel file..............................................55
Defining file formats for XML files...........................................................................57
Importing data from XML documents..............................................................57
Importing metadata from a DTD file................................................................57
Importing metadata from an XML schema......................................................60
Explaining nested data........................................................................................62
Unnesting data......................................................................................................64
Quiz: Defining source and target metadata.............................................................66
Lesson summary..........................................................................................................67
Lesson 3
Creating Batch Jobs
Lesson introduction.....................................................................................................69
Working with objects..................................................................................................70
Creating a project ................................................................................................70
Creating a job .......................................................................................................72
Adding, connecting, and deleting objects in the workspace ........................73
Creating a work flow ..........................................................................................73
Defining the order of execution in work flows ...............................................74
Creating a data flow....................................................................................................76
Using data flows ..................................................................................................76
Using data flows as steps in work flows ..........................................................76
Changing data flow properties .........................................................................77
Explaining source and target objects ................................................................78
Adding source and target objects .....................................................................79
Using the Query transform........................................................................................81
Describing the transform editor ........................................................................81
Explaining the Query transform .......................................................................82
Using target tables.......................................................................................................86
Setting target table options ................................................................................86
Using template tables .........................................................................................89
Executing the job..........................................................................................................93
Explaining job execution ....................................................................................93
Setting execution properties ..............................................................................93
Executing the job .................................................................................................94
Activity: Creating a basic data flow...................................................................96
Quiz: Creating batch jobs............................................................................................99
Lesson summary........................................................................................................100
Lesson 4
Troubleshooting Batch Jobs
Lesson introduction...................................................................................................101
Using descriptions and annotations........................................................................102
Using descriptions with objects.......................................................................102
Using annotations to describe objects ............................................................103
Validating and tracing jobs......................................................................................104
Validating jobs ...................................................................................................104
Tracing jobs ........................................................................................................105
Using log files ....................................................................................................108
Examining trace logs .........................................................................................108
Examining monitor logs ...................................................................................109
Examining error logs ........................................................................................109
Using the Monitor tab .......................................................................................110
Using the Log tab ..............................................................................................110
Determining the success of the job .................................................................111
Activity: Setting traces and adding annotations............................................112
Using View Data and the Interactive Debugger...................................................113
Using View Data with sources and targets ...................................................113
Using the Interactive Debugger ......................................................................115
Setting filters and breakpoints for a debug session ......................................117
Activity: Using the Interactive Debugger.......................................................119
Setting up auditing....................................................................................................121
Setting up auditing.............................................................................................121
Defining audit points.........................................................................................121
Defining audit labels..........................................................................................122
Defining audit rules...........................................................................................122
Defining audit actions.......................................................................................123
Choosing audit points.......................................................................................126
Activity: Using auditing in a data flow...........................................................127
Quiz: Troubleshooting batch jobs ...........................................................................128
Lesson summary........................................................................................................129
Lesson 5
Using Functions, Scripts, and Variables
Lesson introduction...................................................................................................131
Defining built-in functions.......................................................................................132
Defining functions .............................................................................................132
Listing the types of operations for functions ................................................132
Defining other types of functions ...................................................................134
Using functions in expressions................................................................................136
Defining functions in expressions ...................................................................136
Activity: Using the search_replace function...................................................139
Using the lookup function........................................................................................141
Using lookup tables ..........................................................................................141
Activity: Using the lookup_ext() function......................................................144
Using the decode function........................................................................................146
Explaining the decode function ......................................................................146
Activity: Using the decode function ...............................................................148
Using scripts, variables, and parameters................................................................150
Defining scripts ..................................................................................................150
Defining variables .............................................................................................150
Defining parameters .........................................................................................151
Combining scripts, variables, and parameters ..............................................151
Defining global versus local variables ...........................................................151
Setting global variables using job properties ................................................156
Defining substitution parameters....................................................................156
Using Data Services scripting language.................................................................159
Using basic syntax .............................................................................................159
Using syntax for column and table references in expressions ....................159
Using operators .................................................................................................160
Reviewing script examples ..............................................................................161
Using strings and variables ..............................................................................161
Using quotation marks .....................................................................................161
Using escape characters ....................................................................................162
Handling nulls, empty strings, and trailing blanks .....................................162
Scripting a custom function......................................................................................166
Creating a custom function ..............................................................................166
Importing a stored procedure as a function ..................................................169
Activity: Creating a custom function..............................................................170
Quiz: Using functions, scripts, and variables........................................................173
Lesson summary........................................................................................................174
Lesson 6
Using Platform Transforms
Lesson introduction...................................................................................................175
Describing platform transforms..............................................................................176
Explaining transforms ......................................................................................176
Describing platform transforms ......................................................................177
Using the Map Operation transform.......................................................................178
Describing map operations...............................................................................178
Explaining the Map Operation transform .....................................................179
Activity: Using the Map Operation transform...............................................180
Using the Validation transform...............................................................................181
Explaining the Validation transform ..............................................................181
Activity: Using the Validation transform.......................................................186
Using the Merge transform......................................................................................190
Explaining the Merge transform .....................................................................190
Activity: Using the Merge transform..............................................................191
Using the Case transform.........................................................................................194
Explaining the Case transform ........................................................................194
Activity: Using the Case transform.................................................................197
Using the SQL transform..........................................................................................199
Explaining the SQL transform .........................................................................199
Activity: Using the SQL transform..................................................................201
Quiz: Using platform transforms............................................................................203
Lesson summary........................................................................................................204
Lesson 7
Setting up Error Handling
Lesson introduction...................................................................................................205
Using recovery mechanisms....................................................................................206
Avoiding data recovery situations..................................................................206
Describing levels of data recovery strategies ................................................207
Configuring work flows and data flows ........................................................207
Using recovery mode ........................................................................................208
Recovering from partially-loaded data ..........................................................209
Recovering missing values or rows ................................................................209
Defining alternative work flows .....................................................................210
Using try/catch blocks and automatic recovery ..........................................212
Activity: Creating an alternative work flow ..................................................217
Quiz: Setting up error handling ..............................................................................220
Lesson summary........................................................................................................221
Lesson 8
Capturing Changes in Data
Lesson introduction...................................................................................................223
Updating data over time...........................................................................................224
Explaining Slowly Changing Dimensions (SCD) .........................................224
Updating changes to data ................................................................................226
Explaining history preservation and surrogate keys ...................................227
Comparing source-based and target-based CDC .........................................228
Using source-based CDC..........................................................................................229
Using source tables to identify changed data................................................229
Using CDC with timestamps............................................................................229
Managing overlaps.............................................................................................233
Activity: Using source-based CDC..................................................................234
Using target-based CDC...........................................................................................237
Using target tables to identify changed data .................................................237
Identifying history preserving transforms ....................................................238
Explaining the Table Comparison transform.................................................238
Explaining the History Preserving transform ...............................................241
Explaining the Key Generation transform .....................................................244
Activity: Using target-based CDC ..................................................................245
Quiz: Capturing changes in data ............................................................................247
Lesson summary........................................................................................................248
Lesson 9
Using Data Integrator Transforms
Lesson introduction...................................................................................................249
Describing Data Integrator transforms...................................................................250
Defining Data Integrator transforms ..............................................................250
Using the Pivot transform........................................................................................251
Explaining the Pivot transform .......................................................................251
Activity: Using the Pivot transform.................................................................254
Using the Hierarchy Flattening transform.............................................................255
Explaining the Hierarchy Flattening transform.............................................255
Activity: Using the Hierarchy Flattening transform.....................................257
Describing performance optimization....................................................................262
Describing push-down operations .................................................................262
Viewing SQL generated by a data flow .........................................................264
Caching data ......................................................................................................264
Slicing processes.................................................................................................265
Using the Data Transfer transform.........................................................................266
Explaining the Data Transfer transform.........................................................266
Activity: Using the Data Transfer transform..................................................267
Using the XML Pipeline transform.........................................................................269
Explaining the XML Pipeline transform.........................................................269
Activity: Using the XML Pipeline transform..................................................270
Quiz: Using Data Integrator transforms.................................................................273
Lesson summary........................................................................................................274
Answer Key
Quiz: Describing Data Services ...............................................................................277
Quiz: Defining source and target metadata...........................................................278
Quiz: Creating batch jobs..........................................................................................279
Quiz: Troubleshooting batch jobs ...........................................................................280
Quiz: Using functions, scripts, and variables........................................................281
Quiz: Using platform transforms............................................................................282
Quiz: Setting up error handling ..............................................................................283
Quiz: Capturing changes in data ............................................................................284
Quiz: Using Data Integrator transforms.................................................................285
Lesson 1
Describing Data Services
Lesson introduction
Data Services is a graphical interface for creating and staging jobs for data integration and data
quality purposes.
After completing this lesson, you will be able to:
• Describe the purpose of Data Services
• Describe Data Services architecture
• Define Data Services objects
• Use the Data Services Designer interface
Describing the purpose of Data Services
Introduction
BusinessObjects Data Services provides a graphical interface that allows you to easily create
jobs that extract data from heterogeneous sources, transform that data to meet the business
requirements of your organization, and load the data into a single location.
Note: Although Data Services can be used for both real-time and batch jobs, this course covers
batch jobs only.
After completing this unit, you will be able to:
• List the benefits of Data Services
• Describe data integration processes
• Describe the functionality available in Data Services packages
Describing Data Services benefits
The BusinessObjects Data Services platform enables you to perform enterprise-level data
integration and data quality functions. With Data Services, your enterprise can:
• Create a single infrastructure for data movement to enable faster and lower cost
implementation.
• Manage data as a corporate asset independent of any single system.
• Integrate data across many systems and re-use that data for many purposes.
• Improve performance.
• Reduce burden on enterprise systems.
• Prepackage data solutions for fast deployment and quick return on investment (ROI).
• Cleanse customer and operational data anywhere across the enterprise.
• Enhance customer and operational data by appending additional information to increase
the value of the data.
• Match and consolidate data at multiple levels within a single pass for individuals, households,
or corporations.
Understanding data integration processes
Data Services combines both batch and real-time data movement and management with
intelligent caching to provide a single data integration platform for information management
from any information source and for any information use. This unique combination allows you
to:
• Stage data in an operational datastore, data warehouse, or data mart.
• Update staged data in batch or real-time modes.
• Create a single environment for developing, testing, and deploying the entire data integration
platform.
• Manage a single metadata repository to capture the relationships between different extraction
and access methods and provide integrated lineage and impact analysis.
Data Services performs three key functions that can be combined to create a scalable,
high-performance data platform. It:
• Loads Enterprise Resource Planning (ERP) or enterprise application data into an operational
datastore (ODS) or analytical data warehouse, and updates in batch or real-time modes.
• Creates routing requests to a data warehouse or ERP system using complex rules.
• Applies transactions against ERP systems.
Data mapping and transformation can be defined using the Data Services Designer graphical
user interface. Data Services automatically generates the appropriate interface calls to access
the data in the source system.
For most ERP applications, Data Services generates SQL optimized for the specific target
database (Oracle, DB2, SQL Server, Informix, and so on). Automatically generated, optimized
code reduces the cost of maintaining data warehouses and enables you to build data solutions
quickly, meeting user requirements faster than other methods (for example, custom-coding,
direct-connect calls, or PL/SQL).
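As an illustration of dialect-specific code generation, the following sketch produces a row-limited SELECT for several target databases. The function and the dialect rules are simplified assumptions for teaching purposes, not Data Services' actual generated SQL:

```python
# Sketch: generate a row-limited SELECT for different target databases.
# The dialect handling below is a simplified, hypothetical illustration;
# Data Services' real code generation is far more extensive.

def generate_select(table, columns, limit, dialect):
    cols = ", ".join(columns)
    if dialect == "oracle":
        # Oracle (pre-12c) limits rows with ROWNUM
        return f"SELECT {cols} FROM {table} WHERE ROWNUM <= {limit}"
    if dialect == "db2":
        # DB2 uses the FETCH FIRST clause
        return f"SELECT {cols} FROM {table} FETCH FIRST {limit} ROWS ONLY"
    if dialect == "sqlserver":
        # SQL Server uses TOP
        return f"SELECT TOP {limit} {cols} FROM {table}"
    # Generic fallback for databases that support LIMIT
    return f"SELECT {cols} FROM {table} LIMIT {limit}"

print(generate_select("customers", ["id", "name"], 10, "oracle"))
```

The same logical mapping yields different physical SQL per target, which is the point of optimized, automatically generated code.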
Data Services can apply data changes in a variety of data formats, including any custom format
using a Data Services adapter. Enterprise users can apply data changes against multiple
back-office systems singularly or sequentially. By generating calls native to the system in
question, Data Services makes it unnecessary to develop and maintain customized code to
manage the process.
You can also design access intelligence into each transaction by adding flow logic that checks
values in a data warehouse or in the transaction itself before posting it to the target ERP system.
Understanding the Data Services packages

Data Services provides a wide range of functionality, depending on the package and options
selected:
• Data Integrator packages provide platform transforms for core functionality, and Data
Integrator transforms to enhance data integration projects.
• Data Quality packages provide platform transforms for core functionality, and Data Quality
transforms to parse, standardize, cleanse, enhance, match, and consolidate data.
• Data Services packages provide all of the functionality of both the Data Integrator and Data
Quality packages.
When your Data Services projects are based on enterprise applications such as SAP, PeopleSoft,
Oracle, JD Edwards, Salesforce.com, and Siebel, BusinessObjects Rapid Marts provide specialized
versions of Data Services functionality. Rapid Marts combine domain knowledge with data
integration best practices to deliver prebuilt data models, transformation logic, and data
extraction. Rapid Marts are packaged, powerful, and flexible data integration solutions that
help organizations:
• Jumpstart business intelligence deployments and accelerate time to value
• Deliver best-practice data warehousing solutions
4 BusinessObjects Data Integrator XI 3.0: Core Concepts-Learner's Guide
• Develop custom solutions to meet your unique requirements
Describing Data Services—Learner’s Guide 5
Describing Data Services architecture
Introduction

Data Services relies on several unique components to accomplish the data integration and data
quality activities required to manage your corporate data.
After completing this unit, you will be able to:
• Describe standard Data Services components
• Describe Data Services management tools
Defining Data Services components

Data Services includes the following standard components:
• Designer
• Repository
• Job Server
• Engines
• Access Server
• Adapters
• Real-time Services
• Address Server
• Global Parsing Options, Dictionaries, and Directories
• Management Console
This diagram illustrates the relationships between these components:
Describing the Designer

Data Services Designer is a Windows client application used to create, test, and manually
execute jobs that transform data and populate a data warehouse. Using Designer, you create
data management applications that consist of data mappings, transformations, and control
logic.
You can create objects that represent data sources, and then drag, drop, and configure them in
flow diagrams.
Designer allows you to manage metadata stored in a local repository. From the Designer, you
can also trigger the Job Server to run your jobs for initial application testing.
To log in to Designer
1. From the Start menu, click Programs ➤ BusinessObjects XI 3.0 ➤ BusinessObjects Data
Services ➤ Data Services Designer to launch Designer.
The path may be different, depending on how the product was installed.
2. In the BusinessObjects Data Services Repository Login dialog box, enter the connection
information for the local repository.
3. Click OK.
4. To verify the Job Server is running in Designer, hover the cursor over the Job Server icon in
the bottom right corner of the screen.
The details for the Job Server display in the status bar in the lower left portion of the screen.
Describing the repository

The Data Services repository is a set of tables that holds user-created and predefined system
objects, source and target metadata, and transformation rules. It is set up on an open
client/server platform to facilitate sharing metadata with other enterprise tools. Each repository
is stored on an existing Relational Database Management System (RDBMS).
There are three types of repositories:
• A local repository (known in Designer as the Local Object Library) is used by an application
designer to store definitions of source and target metadata and Data Services objects.
• A central repository (known in Designer as the Central Object Library) is an optional
component that can be used to support multi-user development. The Central Object Library
provides a shared library that allows developers to check objects in and out for development.
• A profiler repository is used to store information that is used to determine the quality of
data.
Each repository is associated with one or more Data Services Job Servers.

To create a local repository
1. From the Start menu, click Programs ➤ BusinessObjects XI 3.0 ➤ BusinessObjects Data
Services ➤ Data Services Repository Manager to launch the Repository Manager.
The path may be different, depending on how the product was installed.
2. In the BusinessObjects Data Services Repository Manager dialog box, enter the connection
information for the local repository.
3. Click Create.
You may need to confirm that you want to overwrite the existing repository, if it already
exists.
If you select the Show Details check box, you can see the SQL that is applied to create the
repository.
System messages confirm that the local repository is created.
4. To see the version of the repository, click Get Version.
The version displays in the pane at the bottom of the dialog box. Note that the version
number refers only to the last major point release number.
5. Click Close.
Describing the Job Server

Each repository is associated with at least one Data Services Job Server, which retrieves the job
from its associated repository and starts the data movement engine. The data movement engine
integrates data from multiple heterogeneous sources, performs complex data transformations,
and manages extractions and transactions from ERP systems and other sources. The Job Server
can move data in batch or real-time mode and uses distributed query optimization,
multithreading, in-memory caching, in-memory data transformations, and parallel processing
to deliver high data throughput and scalability.
While designing a job, you can run it from the Designer. In your production environment, the
Job Server runs jobs triggered by a scheduler or by a real-time service managed by the Data
Services Access Server. In production environments, you can balance job loads by creating a
Job Server Group (multiple Job Servers), which executes jobs according to overall system load.
Data Services provides distributed processing capabilities through the Server Groups. A Server
Group is a collection of Job Servers that each reside on different Data Services server computers.
Each Data Services server can contribute one, and only one, Job Server to a specific Server
Group. Each Job Server collects resource utilization information for its computer. This
information is utilized by Data Services to determine where a job, data flow or sub-data flow
(depending on the distribution level specified) should be executed.
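The load-balancing idea behind Server Groups can be sketched as follows. The server names and load values are hypothetical; Data Services bases its decision on the real resource-utilization information each Job Server collects for its computer:

```python
# Sketch: choose the least-loaded Job Server in a Server Group.
# The load metric (0.0-1.0) and the server names are invented for
# illustration; they are not Data Services data structures.

def pick_job_server(servers):
    """servers: dict mapping Job Server name -> current load (0.0-1.0)."""
    return min(servers, key=servers.get)

group = {"jobserver_a": 0.72, "jobserver_b": 0.31, "jobserver_c": 0.55}
print(pick_job_server(group))  # the server with the lowest load
```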
To verify the connection between repository and Job Server
1. From the Start menu, click Programs ➤ BusinessObjects XI 3.0 ➤ BusinessObjects Data
Services ➤ Data Services Server Manager to launch the Server Manager.
The path may be different, depending on how the product was installed.
2. In the BusinessObjects Data Services Server Manager dialog box, click Edit Job Server
Config.
3. In the Job Server Configuration Editor dialog box, select the Job Server.
4. Click Resync with Repository.
5. In the Job Server Properties dialog box, select the repository.
6. Click Resync.
A system message displays indicating that the Job Server will be resynchronized with the
selected repository.
7. Click OK to acknowledge the warning message.
8. In the Password field, enter the password for the repository.
9. Click Apply.
10. Click OK to close the Job Server Properties dialog box.
11. Click OK to close the Job Server Configuration Editor dialog box.
12. In the BusinessObjects Data Services Server Manager dialog box, click Restart to restart
the Job Server.
A system message displays indicating that the Job Server will be restarted.
13. Click OK.
Describing the engines

When Data Services jobs are executed, the Job Server starts Data Services engine processes to
perform data extraction, transformation, and movement. Data Services engine processes use
parallel processing and in-memory data transformations to deliver high data throughput and
scalability.
Describing the Access Server

The Access Server is a real-time, request-reply message broker that collects incoming XML
message requests, routes them to a real-time service, and delivers a message reply within a
user-specified time frame. The Access Server queues messages and sends them to the next
available real-time service across any number of computing resources. This approach provides
automatic scalability because the Access Server can initiate additional real-time services on
additional computing resources if traffic for a given real-time service is high.
You can configure multiple Access Servers.
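The Access Server's queue-and-dispatch behavior can be sketched in a few lines. This is a minimal illustration, assuming simple round-robin routing and omitting XML handling, timeouts, and scaling logic:

```python
# Sketch: a minimal request-reply broker in the spirit of the Access
# Server. Messages are queued and handed to the next available service.
# The Broker class and round-robin routing are illustrative assumptions.
from collections import deque

class Broker:
    def __init__(self, services):
        self.queue = deque()
        self.services = services  # callables standing in for real-time services
        self.next_service = 0

    def submit(self, message):
        # Incoming requests are queued until a service can take them
        self.queue.append(message)

    def dispatch(self):
        replies = []
        while self.queue:
            msg = self.queue.popleft()
            service = self.services[self.next_service % len(self.services)]
            self.next_service += 1
            replies.append(service(msg))  # request-reply: each message gets an answer
        return replies

broker = Broker([lambda m: m.upper(), lambda m: m[::-1]])
broker.submit("order1")
broker.submit("order2")
print(broker.dispatch())
```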
Describing the adapters

Adapters are additional Java-based programs that can be installed on the Job Server to provide
connectivity to other systems such as Salesforce.com or the Java Messaging Queue. There is
also a Software Development Kit (SDK) to allow customers to create adapters for custom
applications.
Describing the real-time services

The Data Services real-time client communicates with the Access Server when processing
real-time jobs. Real-time services are configured in the Data Services Management Console.
Describing the Address Server

The Address Server is used specifically for processing European addresses using the Data
Quality Global Address Cleanse transform. It provides access to detailed address line information
for most European countries.
Describing the Global Parsing Options, dictionaries, and directories

The Data Quality Global Parsing Options, dictionaries, and directories provide referential data
for the Data Cleanse and Address Cleanse transforms to use when parsing, standardizing, and
cleansing name and address data.
Global Parsing Options are packages that enhance the ability of Data Cleanse to accurately
process various forms of global data by including language-specific reference data and parsing
rules. Directories provide information on addresses from postal authorities; dictionary files
are used to identify, parse, and standardize data such as names, titles, and firm data. Dictionaries
also contain acronym, match standard, gender, capitalization, and address information.
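Dictionary-driven parsing can be illustrated with a small sketch. The dictionary entries and parsing rules below are invented examples for teaching purposes, not actual Data Quality reference data:

```python
# Sketch: dictionary-driven name standardization, loosely analogous to
# how Data Cleanse uses dictionary files. TITLE_DICT and GENDER_DICT are
# tiny invented stand-ins for real reference data.

TITLE_DICT = {"dr": "Dr.", "mr": "Mr.", "ms": "Ms."}
GENDER_DICT = {"mary": "F", "john": "M"}

def standardize_name(raw):
    parts = raw.lower().replace(".", "").split()
    title = TITLE_DICT.get(parts[0])          # identify a known title, if any
    rest = parts[1:] if title else parts
    first = rest[0].capitalize()              # standardize capitalization
    last = " ".join(p.capitalize() for p in rest[1:])
    return {"title": title, "first": first, "last": last,
            "gender": GENDER_DICT.get(rest[0])}

print(standardize_name("dr mary smith"))
```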
Describing the Management Console

The Data Services Management Console provides access to the following features:
• Administrator
• Auto Documentation
• Data Validation
• Impact and Lineage Analysis
• Operational Dashboard
• Data Quality Reports
Administrator

Administer Data Services resources, including:
• Scheduling, monitoring, and executing batch jobs
• Configuring, starting, and stopping real-time services
• Configuring Job Server, Access Server, and repository usage
• Configuring and managing adapters
• Managing users
• Publishing batch jobs and real-time services via web services
• Reporting on metadata
Auto Documentation

View, analyze, and print graphical representations of all objects as depicted in Data Services
Designer, including their relationships, properties, and more.
Data Validation

Evaluate the reliability of your target data based on the validation rules you create in your Data
Services batch jobs in order to quickly review, assess, and identify potential inconsistencies or
errors in source data.
Impact and Lineage Analysis

Analyze end-to-end impact and lineage for Data Services tables and columns, and Business
Objects Enterprise objects such as universes, business views, and reports.
Operational Dashboard

View dashboards of status and performance execution statistics of Data Services jobs for one
or more repositories over a given time period.
Data Quality Reports

Use data quality reports to view and export Crystal reports for batch and real-time jobs that
include statistics-generating transforms. Report types include job summaries, transform-specific
reports, and transform group reports.
To generate reports for Match, US Regulatory Address Cleanse, and Global Address Cleanse
transforms, you must enable the Generate report data option in the Transform Editor.
Defining other Data Services tools

There are also several tools to assist you in managing your Data Services installation.
Describing the Repository Manager

The Data Services Repository Manager allows you to create, upgrade, and check the versions
of local, central, and profiler repositories.
Describing the Server Manager

The Data Services Server Manager allows you to add, delete, or edit the properties of Job Servers.
It is automatically installed on each computer on which you install a Job Server.
Use the Server Manager to define links between Job Servers and repositories. You can link
multiple Job Servers on different machines to a single repository (for load balancing) or each
Job Server to multiple repositories (with one default) to support individual repositories (for
example, separating test and production environments).
Describing the License Manager

The License Manager displays the Data Services components for which you currently have a
license.
Describing the Metadata Integrator
The Metadata Integrator allows Data Services to seamlessly share metadata with Business
Objects Intelligence products. Run the Metadata Integrator to collect metadata into the Data
Services repository for Business Views and Universes used by Crystal Reports, Desktop
Intelligence documents, and Web Intelligence documents.
Defining Data Services objects
Introduction

Data Services provides you with a variety of objects to use when you are building your data
integration and data quality applications.
After completing this unit, you will be able to:
• Define the objects available in Data Services
• Explain relationships between objects
Understanding Data Services objects

In Data Services, all entities you add, define, modify, or work with are objects. Some of the
most frequently-used objects are:
• Projects
• Jobs
• Work flows
• Data flows
• Transforms
• Scripts
This diagram shows some common objects.
All objects have options, properties, and classes. Each can be modified to change the behavior
of the object.
Options

Options control the object. For example, to set up a connection to a database, the database name
is an option for the connection.
Properties

Properties describe the object. For example, the name and creation date describe what the object
is used for and when it became active. Attributes are properties used to locate and organize
objects.
Classes

Classes define how an object can be used. Every object is either re-usable or single-use.
Single-use objects

Single-use objects appear only as components of other objects. They operate only in the context
in which they were created.
Note: You cannot copy single-use objects.
Re-usable objects

A re-usable object has a single definition and all calls to the object refer to that definition. If
you change the definition of the object in one place, and then save the object, the change is
reflected to all other calls to the object.
Most objects created in Data Services are available for re-use. After you define and save a
re-usable object, Data Services stores the definition in the repository. You can then re-use the
definition as often as necessary by creating calls to it.
For example, a data flow within a project is a re-usable object. Multiple jobs, such as a weekly
load job and a daily load job, can call the same data flow. If this data flow is changed, both jobs
call the new version of the data flow.
You can edit re-usable objects at any time independent of the current open project. For example,
if you open a new project, you can open a data flow and edit it. However, the changes you
make to the data flow are not stored until you save them.
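The single-definition behavior can be sketched as follows. The names and functions here are illustrative only, not the Data Services API; the point is that callers store a reference to the definition and resolve it at run time:

```python
# Sketch: one shared definition, many calls. Both jobs reference the same
# data-flow definition by name, so editing it once changes what every
# caller runs. The repository dict and function names are hypothetical.

repository = {}

def define(name, steps):
    repository[name] = steps   # single definition, stored once

def run_job(flow_names):
    # Each call resolves the name against the current definition
    return [repository[name] for name in flow_names]

define("DF_LoadCustomers", ["extract", "validate", "load"])
weekly_job = ["DF_LoadCustomers"]   # weekly load job calls the data flow
daily_job = ["DF_LoadCustomers"]    # daily load job calls the same data flow

define("DF_LoadCustomers", ["extract", "validate", "dedupe", "load"])  # edit once
print(run_job(weekly_job))  # both jobs now call the new version
print(run_job(daily_job))
```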
Defining relationships between objects

Jobs are composed of work flows and/or data flows:
• A work flow is the incorporation of several data flows into a sequence.
• A data flow is the process by which source data is transformed into target data.
A work flow orders data flows and the operations that support them. It also defines the
interdependencies between data flows.
For example, if one target table depends on values from other tables, you can use the work
flow to specify the order in which you want Data Services to populate the tables. You can also
use work flows to define strategies for handling errors that occur during project execution, or
to define conditions for running sections of a project.
This diagram illustrates a typical work flow.
A data flow defines the basic task that Data Services accomplishes, which involves moving
data from one or more sources to one or more target tables or files. You define data flows by
identifying the sources from which to extract data, the transformations the data should undergo,
and targets.
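A data flow's source-to-target movement can be sketched like this. The transform functions are invented examples of the kind of logic a Query transform might express:

```python
# Sketch: a data flow as source -> transforms -> target. The source rows
# and transform functions are hypothetical illustrations.

source_rows = [
    {"id": 1, "name": " alice ", "amount": "100"},
    {"id": 2, "name": "BOB",     "amount": "250"},
]

def trim_and_case(row):
    # Standardize the name column
    return {**row, "name": row["name"].strip().title()}

def to_int(row):
    # Convert the amount column to an integer
    return {**row, "amount": int(row["amount"])}

def run_data_flow(rows, transforms):
    target = []
    for row in rows:
        for t in transforms:   # data moves through the transforms in order
            row = t(row)
        target.append(row)
    return target

target_table = run_data_flow(source_rows, [trim_and_case, to_int])
print(target_table)
```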
Defining projects and jobs

A project is the highest-level object in Designer. Projects provide a way to organize the other
objects you create in Designer. Only one project can be open and visible in the project area at
a time.
A job is the smallest unit of work that you can schedule independently for execution. A project
is a single-use object that allows you to group jobs. For example, you can use a project to group
jobs that have schedules that depend on one another or that you want to monitor together.
Projects have the following characteristics:
• Projects are listed in the Local Object Library.
• Only one project can be open at a time.
• Projects cannot be shared among multiple users.
The objects in a project appear hierarchically in the project area. If a plus sign (+) appears next
to an object, you can expand it to view the lower-level objects contained in the object. Data
Services displays the contents as both names and icons in the project area hierarchy and in the
workspace.
Note: Jobs must be associated with a project before they can be executed in the project area of
Designer.
Using work flows

Jobs with data flows can be developed without using work flows. However, you should consider
nesting data flows inside of work flows by default. This practice can provide various benefits.
Always using work flows makes jobs more adaptable to additional development and/or
specification changes. For instance, if a job initially consists of four data flows that are to run
sequentially, they could be set up without work flows. But what if specification changes require
that they be merged into another job instead? The developer would have to replicate their
sequence correctly in the other job. If these had been initially added to a work flow, the developer
could then have simply copied that work flow into the correct position within the new job.
There would be no need to learn, copy, and verify the previous sequence. The change can be
made more quickly with greater accuracy.
Even if there is one data flow per work flow, there are benefits to adaptability. Initially, it may
have been decided that recovery units are not important; the expectation being that if the job
fails, the whole process could simply be rerun. However, as data volumes tend to increase, it
may be determined that a full reprocessing is too time consuming. The job may then be changed
to incorporate work flows to benefit from recovery units to bypass reprocessing of successful
steps. However, these changes can be complex and can consume more time than allotted for
in a project plan. It also opens up the possibility that units of recovery are not properly defined.
Setting these up during initial development when the nature of the processing is being most
fully analyzed is preferred.
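The recovery-unit idea can be sketched as follows. The checkpoint set stands in for the recovery state Data Services keeps between runs, and the step names are hypothetical:

```python
# Sketch: recovery units that bypass reprocessing of successful steps on
# a rerun. In a real system the "completed" state would be persisted;
# here it is a module-level set for illustration.

completed = set()

def run_with_recovery(steps):
    executed = []
    for name, step in steps:
        if name in completed:
            continue              # skip steps that already succeeded
        step()
        completed.add(name)       # checkpoint the successful step
        executed.append(name)
    return executed

steps = [("WF_Extract", lambda: None),
         ("WF_Transform", lambda: None),
         ("WF_Load", lambda: None)]

first_run = run_with_recovery(steps)   # all three steps execute
rerun = run_with_recovery(steps)       # nothing reruns: all checkpointed
print(first_run, rerun)
```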
Describing the object hierarchy

In the repository, objects are grouped hierarchically from a project, to jobs, to optional work
flows, to data flows. In jobs, work flows define a sequence of processing steps, and data flows
move data from source tables to target tables.
This illustration shows the hierarchical relationships for the key object types within Data
Services:
This course focuses on creating batch jobs using database datastores and file formats.
Using the Data Services Designer interface
Introduction

The Data Services Designer interface allows you to plan and organize your data integration
and data quality jobs in a visual way. Most of the components of Data Services can be
programmed through this interface.
After completing this unit, you will be able to:
• Explain how Designer is used
• Describe key areas in the Designer window
Describing the Designer window

The Data Services Designer interface consists of a single application window and several
embedded supporting windows. The application window contains the menu bar, toolbar, Local
Object Library, project area, tool palette, and workspace.
Tip: You can access the Data Services Technical Manuals for reference or help through the Designer
interface Help menu. These manuals are also accessible by going through Start ➤ Programs ➤
BusinessObjects XI 3.0 ➤ BusinessObjects Data Services ➤ Data Services Documentation ➤
Technical Manuals.
Using the Designer toolbar

In addition to many of the standard Windows toolbar buttons, Data Services provides the
following unique toolbar buttons:
Button Tool Description
Close All Windows Closes all open windows in the workspace.
Local Object Library Opens and closes the Local Object Library window.
Central Object Library Opens and closes the Central Object Library window.
Variables Opens and closes the Variables and Parameters window.
Project Area Opens and closes the project area.
Output Opens and closes the Output window.
View Enabled
Descriptions
Enables the system-level setting for viewing object
descriptions in the workspace.
Validate Current View
Validates the object definition open in the active tab of
the workspace. Other objects included in the definition
are also validated.
Validate All Objects in
View
Validates all object definitions open in the workspace.
Objects included in the definition are also validated.
Audit Opens the Audit window. You can collect audit statistics
on the data that flows out of any Data Services object.
View Where Used
Opens the Output window, which lists parent objects
(such as jobs) of the object currently open in the
workspace (such as a data flow).
Back Moves back in the list of active workspace windows.
Forward Moves forward in the list of active workspace windows.
Data Services Management Console
Opens and closes the Data Services Management Console,
which provides access to Administrator, Auto Documentation,
Data Validation, Impact and Lineage Analysis, Operational
Dashboard, and Data Quality Reports.
Assess and Monitor Opens Data Insight, which allows you to assess and
monitor the quality of your data.
Contents Opens the Data Services Technical Manuals.
Using the Local Object Library

The Local Object Library gives you access to the object types listed in the table below. The table
shows the tab on which the object type appears in the Local Object Library and describes the
Data Services context in which you can use each type of object.
Tab Description
Projects are sets of jobs available at a given time.
Jobs are executable work flows. There are two job types: batch jobs and real-time
jobs.
Work flows order data flows and the operations that support data flows, defining
the interdependencies between them.
Data flows describe how to process a task.
Transforms operate on data, producing output data sets from the sources you specify.
The Local Object Library lists platform, Data Integrator, and Data Quality
transforms.
Datastores represent connections to databases and applications used in your project.
Under each datastore is a list of the tables, documents, and functions imported into
Data Services.
Formats describe the structure of a flat file, Excel file, XML file, or XML message.
Custom functions are functions written in the Data Services Scripting Language.
You can import objects to and export objects from your Local Object Library as a file. Importing
objects from a file overwrites existing objects with the same names in the destination Local
Object Library.
Whole repositories can be exported in either .atl or .xml format. Using the .xml file format can
make repository content easier for you to read. It also allows you to export Data Services to
other products.
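Because an .xml export is plain text, it can be inspected with standard XML tools. The element and attribute names in this sketch are hypothetical; check an actual export for the real schema:

```python
# Sketch: listing objects from an XML repository export. The <Repository>,
# <DataFlow>, and <Job> element names and the "name" attribute are invented
# placeholders, not the actual Data Services export schema.
import xml.etree.ElementTree as ET

export = """<Repository>
  <DataFlow name="DF_LoadCustomers"/>
  <Job name="Job_Weekly"/>
</Repository>"""

root = ET.fromstring(export)
objects = [(child.tag, child.get("name")) for child in root]
print(objects)
```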
To import a repository from a file
1. On any tab of the Local Object Library, right-click the white space and select Repository
➤ Import from File from the menu.
The Open Import File dialog box displays.
2. Browse to the destination for the file.
3. Click Open.
A warning message displays to let you know that it takes a long time to create new versions
of existing objects.
4. Click OK.
You must restart Data Services after the import process completes.

To export a repository to a file
1. On any tab of the Local Object Library, right-click the white space and select Repository
➤ Export To File.
The Write Repository Export File dialog box displays.
2. Browse to the destination for the export file.
3. In the File name field, enter the name of the export file.
4. In the Save as type list, select the file type for your export file.
5. Click Save.
The repository is exported to the file.
Using the project area

The project area provides a hierarchical view of the objects used in each project. Tabs on the
bottom of the project area support different tasks. Tabs include:
Tab Description
Create, view, and manage projects.
This provides a hierarchical view of all objects used in each project.
View the status of currently executing jobs.
Selecting a specific job execution displays its status, including which steps are
complete and which steps are executing. These tasks can also be done using the
Data Services Management Console.
View the history of complete jobs.
Logs can also be viewed with the Data Services Management Console.
To change the docked position of the project area
1. Right-click the border of the project area.
2. From the menu, select Floating.
3. Click and drag the project area to dock and undock at any edge within Designer.
When you drag the project area away from a window edge, it stays undocked. When you
position the project area where one of the directional arrows highlights a portion of the
window, this signifies a placement option. The project area does not dock inside the
workspace area.
4. To switch between the last docked and undocked locations, double-click the gray border.

To change the undocked position of the project area
1. Right-click the border of the project area.
2. From the menu, select Floating to remove the check mark and clear the docking option.
3. Click and drag the project area to any location on your screen.

To lock and unlock the project area
1. Click the pin icon on the border to unlock the project area.
The project area hides.
2. Move the mouse over the docked pane.
The project area re-appears.
3. Click the pin icon to lock the pane in place again.

To hide/show the project area
1. Right-click the border of the project area.
2. From the menu, select Hide.
The project area disappears from the Designer window.
3. To show the project area, click Project Area in the toolbar.
Using the tool palette

The tool palette is a separate window that appears by default on the right edge of the Designer
workspace. You can move the tool palette anywhere on your screen or dock it on any edge of
the Designer window.
The icons in the tool palette allow you to create new objects in the workspace. The icons are
disabled when they are invalid entries to the diagram open in the workspace.
To show the name of each icon, hold the cursor over the icon until the tool tip for the icon
appears.
When you create an object from the tool palette, you are creating a new definition of an object.
If a new object is re-usable, it is automatically available in the Local Object Library after you
create it.
For example, if you select the data flow icon from the tool palette and define a new data flow
called DF1, you can later drag that existing data flow from the Local Object Library and add it
to another data flow called DF2.
The tool palette contains these objects:
Icon Tool Description Available in
Pointer
Returns the tool pointer to a selection
pointer for selecting and moving objects
in a diagram.
All objects
Work flow Creates a new work flow. Jobs and work flows
Data flow Creates a new data flow. Jobs and work flows
R/3 data flow Creates a new data flow with the SAP
licensed extension only.
SAP licensed extension
Query
transform
Creates a query to define column
mappings and row selections.
Data flows
Template table Creates a new table for a target. Data flows
Template XML Creates a new XML file for a target. Data flows
Data transport Creates a data transport flow for the SAP
licensed extension.
SAP licensed
extension
Script Creates a new script object. Jobs and work flows
Conditional Creates a new conditional object. Jobs and work flows
While Loop
Repeats a sequence of steps in a work flow
as long as a condition is true.
Work flows
Try
Creates a new try object that tries an
alternate work flow if an error occurs in a
job.
Jobs and work flows
Catch Creates a new catch object that catches
errors in a job.
Jobs and work flows
Annotation Creates an annotation used to describe
objects.
Jobs, work flows, and
data flows
Using the workspace
When you open a job or any object within a job hierarchy, the workspace becomes active with
your selection. The workspace provides a place to manipulate objects and graphically assemble
data movement processes.
These processes are represented by icons that you drag and drop into a workspace to create a
diagram. This diagram is a visual representation of an entire data movement application or
some part of a data movement application.
You specify the flow of data by connecting objects in the workspace from left to right in the
order you want the data to be moved.
28 BusinessObjects Data Integrator XI 3.0: Core Concepts—Learner’s Guide
Quiz: Describing Data Services
1. List two benefits of using Data Services.
2. Which of these objects is single-use?
a. Job
b. Project
c. Data flow
d. Work flow
3. Place these objects in order by their hierarchy: data flows, jobs, projects, and work flows.
4. Which tool do you use to associate a job server with a repository?
5. Which tool allows you to create a repository?
6. What is the purpose of the Access Server?
Lesson summary
After completing this lesson, you are now able to:
• Describe the purpose of Data Services
• Describe Data Services architecture
• Define Data Services objects
• Use the Data Services Designer interface
Defining Source and Target Metadata—Learner’s Guide 31
Lesson 2
Defining Source and Target Metadata
Lesson introduction
To define data movement requirements in Data Services, you must import source and target
metadata.
After completing this lesson, you will be able to:
• Use datastores
• Use datastore and system configurations
• Define file formats for flat files
• Define file formats for Excel files
• Define file formats for XML files
Using datastores
Introduction
Datastores represent connections between Data Services and databases or applications.
After completing this unit, you will be able to:
• Explain datastores
• Create a database datastore
• Change a datastore definition
• Import metadata
Explaining datastores
A datastore provides a connection or multiple connections to data sources such as a database.
Through the datastore connection, Data Services can import the metadata that describes the
data from the data source.
Data Services uses these datastores to read data from source tables or load data to target tables.
Each source or target must be defined individually and the datastore options available depend
on which Relational Database Management System (RDBMS) or application is used for the
datastore. Database datastores can be created for the following sources:
• IBM DB2, Microsoft SQL Server, Oracle, Sybase, and Teradata databases (using native
connections)
• Other databases (through ODBC)
• A simple memory storage mechanism using a memory datastore
• IMS, VSAM, and various additional legacy systems using BusinessObjects Data Services
Mainframe Interfaces such as Attunity and IBM Connectors
The specific information that a datastore contains depends on the connection. When your
database or application changes, you must make corresponding changes in the datastore
information in Data Services. Data Services does not automatically detect structural changes
to the datastore.
There are three kinds of datastores:
• Database datastores: provide a simple way to import metadata directly from an RDBMS.
• Application datastores: let users easily import metadata from most Enterprise Resource
Planning (ERP) systems.
• Adapter datastores: can provide access to an application’s data and metadata or just metadata.
For example, if the data source is SQL-compatible, the adapter might be designed to access
metadata, while Data Services extracts data from or loads data directly to the application.
Using adapters
Adapters provide access to a third-party application's data and metadata. Depending on the
adapter implementation, adapters can provide:
• Application metadata browsing
• Application metadata importing into the Data Services repository
For batch and real-time data movement between Data Services and applications, Business
Objects offers an Adapter Software Development Kit (SDK) to develop your own custom
adapters. You can also buy Data Services prepackaged adapters to access application data and
metadata in any application.
For more information on these adapters, see Chapter 5 in the Data Services Designer Guide.
You can use the Data Mart Accelerator for Crystal Reports adapter to import metadata from
BusinessObjects Enterprise. See the documentation folder under Adapters located in your Data Services installation for more information on the Data Mart Accelerator for Crystal Reports.
Creating a database datastore
You need to create at least one datastore for each database or file system with which you are
exchanging data. To create a datastore, you must have appropriate access privileges to the
database or file system that the datastore describes. If you do not have access, ask your database
administrator to create an account for you.
To create a database datastore
1. On the Datastores tab of the Local Object Library, right-click the white space and select New from the menu.
The Create New Datastore dialog box displays.
2. In the Datastore name field, enter the name of the new datastore.
The name can contain any alphanumeric characters or underscores (_). It cannot contain
spaces.
3. In the Datastore Type drop-down list, ensure that the default value of Database is selected.
4. In the Database type drop-down list, select the RDBMS for the data source.
5. Enter the other connection details, as required.
The values you select for the datastore type and database type determine the options available
when you create a database datastore. The entries that you must make to create a datastore
depend on the selections you make for these two options. Note that if you are using MySQL,
any ODBC connection provides access to all of the available MySQL schemas.
6. Leave the Enable automatic data transfer check box selected.
7. Click OK.
Changing a datastore definition
Like all Data Services objects, datastores are defined by both options and properties:
• Options control the operation of objects. These include the database server name, database
name, user name, and password for the specific database.
The Edit Datastore dialog box allows you to edit all connection properties except datastore
name and datastore type for adapter and application datastores. For database datastores,
you can edit all connection properties except datastore name, datastore type, database type,
and database version.
• Properties document the object. For example, the name of the datastore and the date on
which it is created are datastore properties. Properties are descriptive of the object and do
not affect its operation.
The Properties dialog box contains these tabs:
• General: contains the name and description of the datastore, if available. The datastore name appears on the object in the Local Object Library and in calls to the object. You cannot change the name of a datastore after creation.
• Attributes: includes the date you created the datastore. This value cannot be changed.
• Class Attributes: includes overall datastore information such as description and date created.
To change datastore options
1. On the Datastores tab of the Local Object Library, right-click the datastore name and select
Edit from the menu.
The Edit Datastore dialog box displays the connection information.
2. Change the database server name, database name, username, and password options, as
required.
3. Click OK.
The changes take effect immediately.
To change datastore properties
1. On the Datastores tab of the Local Object Library, right-click the datastore name and select
Properties from the menu.
The Properties dialog box lists the datastore’s description, attributes, and class attributes.
2. Change the datastore properties, as required.
3. Click OK.
Importing metadata from data sources
Data Services determines and stores a specific set of metadata information for tables. You can
import metadata by name, searching, and browsing. After importing metadata, you can edit
column names, descriptions, and datatypes. The edits are propagated to all objects that call
these objects.
Metadata Description
Table name The name of the table as it appears in the database.
Table description The description of the table.
Column name The name of the table column.
Column description The description of the column.
Column datatype
The datatype for each column.
If a column is defined as an unsupported datatype (see datatypes listed
below) Data Services converts the datatype to one that is supported. In
some cases, if Data Services cannot convert the datatype, it ignores the
column entirely.
The following datatypes are supported: BLOB, CLOB, date, datetime,
decimal, double, int, interval, long, numeric, real, time, timestamp, and
varchar.
Primary key column The column that comprises the primary key for the table. After a table has been added to a data flow diagram, this column is indicated in the column list by a key icon next to the column name.
Table attribute Information Data Services records about the table such as the date created
and date modified if these values are available.
Owner name Name of the table owner.
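The datatype handling described in the table above (convert unsupported types where possible, otherwise ignore the column) can be sketched as a simple lookup. This is an illustrative Python sketch only: the SUPPORTED set comes from the list above, while the CONVERSIONS mapping and the import_columns helper are hypothetical stand-ins, not actual Data Services behavior.

```python
# Illustrative sketch of the datatype rule described above.
# The real conversion logic is internal to Data Services.
SUPPORTED = {"blob", "clob", "date", "datetime", "decimal", "double",
             "int", "interval", "long", "numeric", "real", "time",
             "timestamp", "varchar"}

# Example conversions for a few unsupported source types (assumed, not official).
CONVERSIONS = {"smallint": "int", "float": "double", "char": "varchar"}

def import_columns(columns):
    """Keep supported types, convert known ones, and ignore the rest."""
    imported = {}
    for name, dtype in columns.items():
        dtype = dtype.lower()
        if dtype in SUPPORTED:
            imported[name] = dtype
        elif dtype in CONVERSIONS:
            imported[name] = CONVERSIONS[dtype]
        # otherwise: the column is ignored entirely
    return imported

print(import_columns({"ID": "smallint", "NAME": "varchar", "GEO": "geometry"}))
# GEO is dropped because it cannot be converted; smallint becomes int
```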
You can also import stored procedures from DB2, MS SQL Server, Oracle, and Sybase databases
and stored functions and packages from Oracle. You can use these functions and procedures
in the extraction specifications you give Data Services.
Information that is imported for functions includes:
• Function parameters
• Return type
• Name
• Owner
Imported functions and procedures appear in the Function branch of each datastore tree on
the Datastores tab of the Local Object Library.
You can configure imported functions and procedures through the Function Wizard and the
Smart Editor in a category identified by the datastore name.
Importing metadata by browsing
The easiest way to import metadata is by browsing. Note that functions cannot be imported
using this method.
For more information on importing by searching and importing by name, see “Ways of importing
metadata”, Chapter 5 in the Data Services Designer Guide.
To import metadata by browsing
1. On the Datastores tab of the Local Object Library, right-click the datastore and select Open
from the menu.
The items available to import appear in the workspace.
2. Navigate to and select the tables for which you want to import metadata.
You can hold down the Ctrl or Shift keys and click to select multiple tables.
3. Right-click the selected items and select Import from the menu.
The workspace contains columns that indicate whether the table has already been imported into Data Services (Imported) and if the table schema has changed since it was imported (Changed). To verify whether the repository contains the most recent metadata for an object, right-click the object and select Reconcile.
4. In the Local Object Library, expand the datastore to display the list of imported objects,
organized into Functions, Tables, and Template Tables.
5. To view data for an imported table, right-click the table and select View Data from the
menu.
Activity: Creating source and target datastores
You have been hired as a Data Services designer for Alpha Acquisitions. Alpha has recently
acquired Beta Businesses, an organization that develops and sells software products and related
services.
In an effort to consolidate and organize the data, and simplify the reporting process for the
growing company, the Omega data warehouse is being constructed to merge the data for both
organizations, and a separate data mart is being developed for reporting on Human Resources
data. You also have access to a database for staging purposes called Delta. To start the
development process, you must create datastores and import the metadata for all of these data
sources.
Objective
• Create datastores and import metadata for the Alpha Acquisitions, Beta Businesses, Delta,
HR Data Mart, and Omega databases.
Instructions
1. In your Local Object Library, create a new source datastore for the Alpha Acquisitions data
with the following options:
Field Value
Datastore name Alpha
Datastore type Database
Database type MySQL
Database version MySQL 5.0
Data source alpha
User name alpha
Password alpha
2. Import the metadata for the following source tables:
• alpha.category
• alpha.city
• alpha.country
• alpha.customer
• alpha.department
• alpha.employee
• alpha.hr_comp_update
• alpha.last_run
• alpha.order_details
• alpha.orders
• alpha.product
• alpha.region
3. View the data for the category table and confirm that there are four records.
4. Create a second source datastore for the Beta Businesses data with the following options:
Field Value
Datastore name Beta
Datastore type Database
Database type MySQL
Database version MySQL 5.0
Data source beta
User name beta
Password beta
5. Import the metadata for the following source tables:
• beta.addrcodes
• beta.categories
• beta.city
• beta.country
• beta.customers
• beta.employees
• beta.orderdetails
• beta.orders
• beta.products
• beta.region
• beta.shippers
• beta.suppliers
• beta.usa_customers
6. View the data for the usa_customers table and confirm that Jane Hartley from Planview Inc.
is the first customer record.
7. Create a datastore for the Delta staging database with the following options:
Field Value
Datastore name Delta
Datastore type Database
Database type MySQL
Database version MySQL 5.0
Data source delta
User name delta
Password delta
You do not need to import any metadata.
8. Create a target datastore for the HR data mart with the following options:
Field Value
Datastore name HR_datamart
Datastore type Database
Database type MySQL
Database version MySQL 5.0
Data source hr_datamart
User name hruser
Password hruser
9. Import the metadata for the following target tables:
• hr_datamart.emp_dept
• hr_datamart.employee
• hr_datamart.hr_comp_update
• hr_datamart.recovery_status
10. Create a target datastore for the Omega data warehouse with the following options:
Field Value
Datastore name Omega
Datastore type Database
Database type MySQL
Database version MySQL 5.0
Data source omega
User name omega
Password omega
11. Import the metadata for the following target tables:
• omega.emp_dim
• omega.product_dim
• omega.product_target
• omega.time_dim
Using datastore and system configurations
Introduction
Data Services supports multiple datastore configurations, which allow you to change your
datastores depending on the environment in which you are working.
After completing this unit, you will be able to:
• Create multiple configurations in a datastore
• Create a system configuration
Creating multiple configurations in a datastore
A configuration is a property of a datastore that refers to a set of configurable options (such as
database connection name, database type, user name, password, and locale) and their values.
When you create a datastore, you can specify one datastore configuration at a time and specify
one as the default. Data Services uses the default configuration to import metadata and execute
jobs. You can create additional datastore configurations using the Advanced option in the
datastore editor. You can combine multiple configurations into a system configuration that is
selectable when executing or scheduling a job. Multiple configurations and system configurations
make portability of your job much easier (for example, different connections for development,
test, and production environments).
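The relationship between datastore configurations and system configurations can be pictured with plain data structures. This Python sketch is purely illustrative: the dictionaries, the SC_ names, and the resolve helper are assumptions for demonstration, not Data Services internals.

```python
# Illustrative sketch: each datastore holds several named configurations,
# and a system configuration picks one configuration per datastore.
datastore_configs = {
    "Alpha": {
        "Dev":  {"database_type": "MySQL", "data_source": "alpha_dev"},
        "Prod": {"database_type": "MySQL", "data_source": "alpha_prod"},
    },
}

# SC_ prefix mirrors the naming convention suggested later in this unit.
system_configs = {
    "SC_Development": {"Alpha": "Dev"},
    "SC_Production": {"Alpha": "Prod"},
}

def resolve(system_config_name):
    """Return the concrete connection settings each datastore would use."""
    choice = system_configs[system_config_name]
    return {ds: datastore_configs[ds][cfg] for ds, cfg in choice.items()}

print(resolve("SC_Production")["Alpha"]["data_source"])  # alpha_prod
```

Selecting a system configuration when running a job then amounts to resolving one concrete connection per datastore, which is what makes jobs portable across environments.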
When you add a new configuration, Data Services modifies the language of data flows that
contain table targets and SQL transforms in the datastore based on what you defined in the
new configuration.
To create multiple datastore configurations in an existing datastore
1. On the Datastores tab of the Local Object Library, right-click a datastore and select Edit
from the menu.
The Edit Datastore dialog box displays.
2. Click Advanced >>.
A grid of additional datastore properties and the multiple configuration controls displays.
3. Click Edit next to the Configurations count at the bottom of the dialog box.
The Configurations for Datastore dialog box displays. The default configuration displays.
Each subsequent configuration displays as an additional column.
4. Double-click the header for the default configuration to change the name, and then click
outside of the header to commit the change.
5. Click Create New Configuration in the toolbar.
The Create New Configuration dialog box displays.
6. In the Name field, enter the name for your new configuration.
Do not include spaces when assigning names for your datastore configurations.
7. Select the database type and version.
8. Click OK.
A second configuration is added to the Configurations for Datastore window.
9. Adjust the other properties of the new configuration to correspond with the existing
configuration, as required.
If a property does not apply to a configuration, the cell does not accept input. Cells that
correspond to a group header also do not accept input, and are marked with hatched gray
lines.
10. If required, click Create New Alias to create an alias for the configuration, enter a value for
the alias at the bottom of the page, and click OK to return to the Edit Datastore dialog box.
11. Click OK to complete the datastore configuration.
12. Click OK to close the Edit Datastore dialog box.
Creating a system configuration
System configurations define a set of datastore configurations that you want to use together
when running a job. In many organizations, a Data Services designer defines the required
datastore and system configurations, and a system administrator determines which system
configuration to use when scheduling or starting a job in the Administrator.
When designing jobs, determine and create datastore configurations and system configurations
depending on your business environment and rules. Create datastore configurations for the
datastores in your repository before you create the system configurations for them.
Data Services maintains system configurations separately. You cannot check in or check out
system configurations. However, you can export system configurations to a separate flat file
which you can later import. By maintaining system configurations in a separate file, you avoid
modifying your datastore each time you import or export a job, or each time you check in and
check out the datastore.
You cannot define a system configuration if your repository does not contain at least one
datastore with multiple configurations.
To create a system configuration
1. From the Tools menu, select System Configurations.
The System Configuration Editor dialog box displays columns for each datastore.
2. In the Configuration name column, enter the system configuration name.
Use the SC_ prefix in the system configuration name so that you can easily identify this file
as a system configuration, particularly when exporting.
3. In the drop-down list for each datastore column, select the appropriate datastore
configuration that you want to use when you run a job using this system configuration.
4. Click OK.
Defining file formats for flat files
Introduction
File formats are connections to flat files in the same way that datastores are connections to databases.
After completing this unit, you will be able to:
• Explain file formats
• Create a file format for a flat file
Explaining file formats
A file format is a generic description that can be used to describe one file or multiple data files
if they share the same format. It is a set of properties describing the structure of a flat file (ASCII).
File formats are used to connect to source or target data when the data is stored in a flat file.
The Local Object Library stores file format templates that you use to define specific file formats
as sources and targets in data flows.
File format objects can describe files in:
• Delimited format — delimiter characters such as commas or tabs separate each field.
• Fixed width format — the fixed column width is specified by the user.
• SAP R/3 format — this is used with the predefined Transport_Format or with a custom
SAP R/3 format.
Creating file formats Use the file format editor to set properties for file format templates and source and target file
formats. The file format editor has three work areas:
• Property Value: Edit file format property values. Expand and collapse the property groups
by clicking the leading plus or minus.
• Column Attributes: Edit and define columns or fields in the file. Field-specific formats override the default format set in the Property Values work area.
• Data Preview: View how the settings affect sample data.
The properties and appearance of the work areas vary with the format of the file.
Date formats
In the Property Values work area, you can override default date formats for files at the field level. The following date format codes can be used:
Code Description
DD 2-digit day of the month
MM 2-digit month
MONTH Full name of the month
MON 3-character name of the month
YY 2-digit year
YYYY 4-digit year
HH24 2-digit hour of the day (0-23)
MI 2-digit minute (0-59)
SS 2-digit second (0-59)
FF Up to 9-digit sub-seconds
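To see how these codes behave, here is a hedged Python sketch that translates a Data Services-style format string into strptime directives and parses a value. The code-to-directive mapping is an assumed rough correspondence for illustration only; Data Services performs its own parsing internally.

```python
from datetime import datetime

# Assumed rough correspondence between the format codes above and
# Python strptime directives, for illustration only.
DS_TO_STRPTIME = {
    "DD": "%d", "MM": "%m", "MONTH": "%B", "MON": "%b",
    "YY": "%y", "YYYY": "%Y", "HH24": "%H", "MI": "%M", "SS": "%S",
}

def parse_with_ds_format(value, ds_format):
    """Translate a Data Services-style format string and parse the value."""
    fmt = ds_format
    # Naive replacement: longer codes first, so YYYY is not clobbered by YY
    # and MONTH is not clobbered by MON or MM.
    for code in sorted(DS_TO_STRPTIME, key=len, reverse=True):
        fmt = fmt.replace(code, DS_TO_STRPTIME[code])
    return datetime.strptime(value, fmt)

print(parse_with_ds_format("21-dec-2006", "DD-MON-YYYY").date())  # 2006-12-21
```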
To create a new file format
1. On the Formats tab of the Local Object Library, right-click Flat Files and select New from
the menu to open the File Format Editor.
To make sure your file format definition works properly, it is important to finish inputting
the values for the file properties before moving on to the Column Attributes work area.
2. In the Type field, specify the file type:
• Delimited: select this file type if the file uses a character sequence to separate columns.
• Fixed width: select this file type if the file uses specified widths for each column.
If a fixed-width file format uses a multi-byte code page, then no data is displayed in the
Data Preview section of the file format editor for its files.
3. In the Name field, enter a name that describes this file format template.
Once the name has been created, it cannot be changed. If an error is made, the file format
must be deleted and a new format created.
4. Specify the location information of the data file including Location, Root directory, and File
name.
The Group File Read can read multiple flat files with identical formats through a single file
format. By substituting a wild card character or list of file names for the single file name,
multiple files can be read.
5. Click Yes to overwrite the existing schema.
This happens automatically when you open a file.
6. Complete the other properties to describe files that this template represents. Overwrite the existing schema as required.
7. For source files, specify the structure of each column in the Column Attributes work area
as follows:
Column Description
Field Name Enter the name of the column.
Data Type Select the appropriate datatype from the drop-down list.
Field Size For columns with a datatype of varchar, specify the length of
the field.
Precision For columns with a datatype of decimal or numeric, specify
the precision of the field.
Scale For columns with a datatype of decimal or numeric, specify
the scale of the field.
Format
For columns with any datatype but varchar, select a format
for the field, if desired. This information overrides the default
format set in the Property Values work area for that datatype.
You do not need to specify columns for files used as targets. If you do specify columns and
they do not match the output schema from the preceding transform, Data Services writes
to the target file using the transform’s output schema.
For a decimal or real datatype, if you only specify a source column format and the column
names and datatypes in the target schema do not match those in the source schema, Data
Services cannot use the source column format specified. Instead, it defaults to the format
used by the code page on the computer where the Job Server is installed.
8. Click Save & Close to save the file format and close the file format editor.
9. In the Local Object Library, right-click the file format and select View Data from the menu
to see the data.
To create a file format from an existing file format
1. On the Formats tab of the Local Object Library, right-click an existing file format and select
Replicate.
The File Format Editor opens, displaying the schema of the copied file format.
2. In the Name field, enter a unique name for the replicated file format.
Data Services does not allow you to save the replicated file with the same name as the
original (or any other existing File Format object). After it is saved, you cannot modify the
name again.
3. Edit the other properties as desired.
4. Click Save & Close to save the file format and close the file format editor.
To read multiple flat files with identical formats through a single file format
1. On the Formats tab of the Local Object Library, right-click an existing file format and select
Edit from the menu.
The format must be based on one single file that shares the same schema as the other files.
2. In the location field of the format wizard, enter one of the following:
• Root directory (optional to avoid retyping)
• List of file names, separated by commas
• File name containing a wild character (*)
When you use the wild card (*) to refer to several files, Data Services reads one file, closes it, and then proceeds to read the next one. For example, if you specify the file name revenue*.txt, Data Services reads all flat files whose names start with revenue.
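The revenue*.txt example above can be mimicked with Python's glob module to show which file names such a pattern matches. The file names and the temporary directory here are invented for illustration.

```python
import glob
import os
import tempfile

# Sketch of the wildcard matching described above: create a few sample
# files, then see which ones a revenue*.txt pattern would pick up.
with tempfile.TemporaryDirectory() as root:
    for name in ["revenue_jan.txt", "revenue_feb.txt", "expenses.txt"]:
        open(os.path.join(root, name), "w").close()

    matched = sorted(os.path.basename(p)
                     for p in glob.glob(os.path.join(root, "revenue*.txt")))

print(matched)  # ['revenue_feb.txt', 'revenue_jan.txt']
```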
Handling errors in file formats
One of the features available in the File Format Editor is error handling. When you enable
error handling for a file format, Data Services:
• Checks for the two types of flat-file source errors:
○ Datatype conversion errors. For example, a field might be defined in the File Format
Editor as having a datatype of integer but the data encountered is actually varchar.
○ Row-format errors. For example, in the case of a fixed-width file, Data Services identifies
a row that does not match the expected width value.
• Stops processing the source file after reaching a specified number of invalid rows.
• Logs errors to the Data Services error log. You can limit the number of log entries allowed
without stopping the job.
You can choose to write rows with errors to an error file, which is a semicolon-delimited text
file that you create on the same machine as the Job Server.
Entries in an error file have this syntax:
source file path and name; row number in source file; Data Services error; column number where the error occurred; all columns from the invalid row
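Because the error file is plain semicolon-delimited text, it is straightforward to post-process. This minimal Python sketch parses one entry with the layout described above; the parse_error_line helper and the sample line are invented for illustration.

```python
# Minimal sketch of reading one error-file entry with the layout above:
# path; row number; error; column number; all columns from the invalid row.
def parse_error_line(line):
    path, row_num, error, col_num, *row_data = line.rstrip("\n").split(";")
    return {
        "source_file": path,
        "row_number": int(row_num),
        "error": error,
        "column_number": int(col_num),
        "row_data": row_data,  # variable length: the whole invalid row
    }

# Sample line invented for illustration; real error text comes from Data Services.
sample = "d:/data/orders.txt;23;data conversion error;2;11196;ABC;21-dec-2006"
entry = parse_error_line(sample)
print(entry["row_number"], entry["error"])  # 23 data conversion error
```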
To enable flat file error handling in the File Format Editor
1. On the Formats tab of the Local Object Library, right-click the file format and select Edit
from the menu.
2. Under the Error handling section, in the Capture data conversion errors drop-down list,
select Yes.
3. In the Capture row format errors drop-down list, select Yes.
4. In the Write error rows to file drop-down list, select Yes.
You can also specify the maximum warnings to log and the maximum errors before a job
is stopped.
5. In the Error file root directory field, click the folder icon to browse to the directory in which
you have stored the error handling text file you created.
6. In the Error file name field, enter the name for the text file you created to capture the flat
file error logs in that directory.
7. Click Save & Close.
Activity: Creating a file format for a flat file
In addition to the main databases for source information, records for some of the orders for
Alpha Acquisitions are stored in flat files.
Objective
• Create a file format for the orders flat files so you can use them as source objects.
Instructions
1. In the Local Object Library, create a new delimited file format called Orders_Format for the orders_12_21_06.txt flat file in the Activity_Source folder.
The path depends on where the folder has been copied from the Learner Resource CD.
2. Adjust the format so that it reflects the source file.
Consider the following:
• The column delimiter is a semicolon (;).
• The row delimiter is {Windows new line}.
• The date format is dd-mon-yyyy.
• The row header should be skipped.
3. In the Column Attributes pane, adjust the datatypes for the columns based on their content.
Column Datatype
ORDERID int
EMPLOYEEID varchar(15)
ORDERDATE date
CUSTOMERID int
COMPANYNAME varchar(50)
CITY varchar(50)
COUNTRY varchar(50)
4. Save your changes and view the data to confirm that order 11196 was placed on December
21, 2006.
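The settings from this activity (semicolon delimiter, skipped header row, dd-mon-yyyy dates, and the column datatypes above) can be mirrored in plain Python to sanity-check a sample of the file. The sample row below is invented except for order 11196 and its date, which the activity itself states.

```python
import csv
import io
from datetime import datetime

# Sketch of the Orders_Format settings applied by hand: semicolon
# delimiter, one header row to skip, and dd-mon-yyyy dates.
# Sample data invented for illustration, apart from order 11196's date.
sample = io.StringIO(
    "ORDERID;EMPLOYEEID;ORDERDATE;CUSTOMERID;COMPANYNAME;CITY;COUNTRY\n"
    "11196;EMP42;21-dec-2006;7;Example Co.;Boston;USA\n"
)

reader = csv.reader(sample, delimiter=";")
next(reader)  # skip the row header
orders = [
    {
        "ORDERID": int(row[0]),
        "EMPLOYEEID": row[1],
        "ORDERDATE": datetime.strptime(row[2], "%d-%b-%Y").date(),
        "CUSTOMERID": int(row[3]),
        "COMPANYNAME": row[4],
        "CITY": row[5],
        "COUNTRY": row[6],
    }
    for row in reader
]
print(orders[0]["ORDERID"], orders[0]["ORDERDATE"])  # 11196 2006-12-21
```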
Defining file formats for Excel files
Introduction
You can create file formats for Excel files in the same way that you would for flat files.
After completing this unit, you will be able to:
• Create a file format for an Excel file
Using Excel as a native data source
It is possible to connect to Excel workbooks natively as a source, with no ODBC connection setup and configuration needed. You can select specific data in the workbook using custom ranges or auto-detect, and you can specify variables for file and sheet names for more flexibility.
As with file formats and datastores, these Excel formats show up as sources in impact and
lineage analysis reports.
To import and configure an Excel source
1. On the Formats tab of the Local Object Library, right-click Excel Workbooks and select New from the menu.
The Import Excel Workbook dialog box displays.
2. In the Format name field, enter a name for the format.
The name may contain underscores but not spaces.
3. On the Format tab, click the drop-down button beside the Directory field and select <Select
folder...>.
4. Navigate to and select a new directory, and then click OK.
5. Click the drop-down button beside the File name field and select <Select file...>.
6. Navigate to and select an Excel file, and then click OK.
7. Do one of the following:
• To reference a named range for the Excel file, select the Named range radio button and
enter a value in the field provided.
• To reference an entire worksheet, select the Worksheet radio button and then select the
All fields radio button.
• To reference a custom range, select the Worksheet radio button and the Custom range
radio button, click the ellipses (...) button, select the cells, and close the Excel file by
clicking X in the top right corner of the worksheet.
8. If required, select the Extend range checkbox.
The Extend range checkbox provides a means to extend the spreadsheet in the event that
additional rows of data are added at a later time. If this checkbox is checked, at execution
time, Data Services searches row by row until a null value row is reached. All rows above
the null value row are included.
9. If applicable, select the Use first row values as column names option.
If this option is selected, field names are based on the first row of the imported Excel sheet.
10. Click Import schema.
The schema is displayed at the top of the dialog box.
11. Specify the structure of each column as follows:
Column Description
Field Name Enter the name of the column.
Data Type Select the appropriate datatype from the drop-down list.
Field Size For columns with a datatype of varchar, specify the length of the field.
Precision For columns with a datatype of decimal or numeric, specify the
precision of the field.
Scale For columns with a datatype of decimal or numeric, specify the scale
of the field.
Description If desired, enter a description of the column.
12. If required, on the Data Access tab, enter any changes that are required.
The Data Access tab provides options to retrieve the file via FTP or execute a custom
application (such as unzipping a file) before reading the file.
13. Click OK.
The newly imported file format appears in the Local Object Library with the other Excel
workbooks. The sheet is now available to be selected for use as a native data source.
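The Extend range behavior described in step 8 (scan row by row until a null value row, keep only the rows above it) can be sketched as a minimal Python analogy. This is not Data Services code; the sheet contents are hypothetical:

```python
# Sketch of the Extend range scan: read rows downward until the first row
# whose cells are all empty, and keep only the rows above it.
def extend_range(rows):
    kept = []
    for row in rows:
        if all(cell in (None, "") for cell in row):  # null value row: stop
            break
        kept.append(row)
    return kept

sheet = [
    ["EmployeeID", "Emp_Salary"],
    ["2Lis5", 50000],
    [None, None],        # first all-null row ends the range
    ["ignored", 123],    # rows below the null row are excluded
]
print(extend_range(sheet))  # keeps only the first two rows
```

The point of the scan is that rows added later are picked up automatically at execution time, as long as they appear above the first empty row.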
Activity: Creating a file format for an Excel file
Compensation information for Alpha Acquisitions is stored in an Excel spreadsheet. To use this information in data flows, you must create a file format.
Objective
• Create a file format to enable you to use the compensation spreadsheet as a source object
Instructions
1. In the Local Object Library, create a new file format for an Excel Workbook called Comp_HR.
2. Navigate to the Comp_HR.xls file in the Activity_Source folder. The path depends on where
the folder has been copied from the Learner Resource CD.
3. Select a custom range for the Comp_HR worksheet and select all cells that contain data.
4. Specify that you want to be able to extend the range.
5. Use the first row for the column names.
6. Import the schema and adjust the datatypes for the columns as follows:
Column Datatype
EmployeeID varchar(10)
Emp_Salary int
Emp_Bonus int
Emp_VacationDays int
date_updated datetime
7. Save your changes and view the data to confirm that employee 2Lis5 has 16 vacation days accrued.
Defining file formats for XML files
Introduction
Data Services allows you to import and export metadata for XML documents that you can use
as sources or targets in jobs.
After completing this unit, you will be able to:
• Import data from XML documents
• Explain nested data
Importing data from XML documents
XML documents are hierarchical, and the set of properties describing their structure is stored in separate format files. These format files describe the data contained in the XML document and the relationships among the data elements: the schema. The format of an XML file or message (.xml) can be specified using either a document type definition (.dtd) or an XML Schema (.xsd).
Data flows can read and write data to messages or files based on a specified DTD format or
XML Schema. You can use the same DTD format or XML Schema to describe multiple XML
sources or targets.
Data Services uses Nested Relational Data Modeling (NRDM) to structure imported metadata
from format documents, such as .xsd or .dtd files, into an internal schema to use for hierarchical
documents.
Importing metadata from a DTD file
For example, for an XML document that contains the information to place a sales order, such as the order header, customer, and line items, the corresponding DTD includes the order structure and the relationship between the data elements.
You can import metadata from either an existing XML file (with a reference to a DTD) or a
DTD file. If you import the metadata from an XML file, Data Services automatically retrieves
the DTD for that XML file.
When importing a DTD format, Data Services reads the defined elements and attributes, and
ignores other parts, such as text and comments, from the file definition. This allows you to
modify imported XML data and edit the datatype as needed.
To import a DTD format
1. On the Formats tab of the Local Object Library, right-click DTDs, and select New.
The Import DTD Format dialog box appears.
2. In the DTD definition name field, enter the name you want to give the imported DTD
format.
3. Beside the File name field, click Browse, locate the file path that specifies the DTD you want
to import, and open the DTD.
4. In the File type area, select a file type.
The default file type is DTD. Use the XML option if the DTD file is embedded within the XML data.
5. In the Root element name field, select the name of the primary node of the XML that the
DTD format is defining.
Data Services only imports elements of the format that belong to this node or any sub-nodes.
This option is not available when you select the XML file option type.
6. In the Circular level field, specify the number of levels the DTD has, if applicable.
If the DTD format contains recursive elements, for example, element A contains B and
element B contains A, this value must match the number of recursive levels in the DTD
format’s content. Otherwise, the job that uses this DTD format will fail.
7. In the Default varchar size field, set the varchar size to import strings into Data Services.
The default varchar size is 1024.
8. Click OK.
After you import the DTD format, you can view the DTD format’s column properties, and
edit the nested table and column attributes in the DTD - XML Format editor. For more
information on DTD attributes, see Chapter 2 in the Data Services Reference Guide.
To edit column attributes of nested schemas
1. On the Formats tab of the Local Object Library, expand DTDs and double-click the DTD
name to open it in the workspace.
2. In the workspace, right-click a nested column or column and select Properties.
3. In the Column Properties window, click the Attributes tab.
4. To change an attribute, click the attribute name and enter the appropriate value in the Value
field.
5. Click OK.
Importing metadata from an XML schema
For an XML document that contains, for example, information to place a sales order, such as
order header, customer, and line items, the corresponding XML schema includes the order
structure and the relationship between the data as shown:
When importing an XML Schema, Data Services reads the defined elements and attributes,
and imports:
• Document structure
• Table and column names
• Datatype of each column
• Nested table and column attributes
Note: While XML Schemas make a distinction between elements and attributes, Data Services
imports and converts them all to nested table and column attributes. For more information on Data
Services attributes, see Chapter 2 in the Data Services Reference Guide.
To import an XML schema
1. On the Formats tab of the Local Object Library, right-click XML Schemas, and select New.
The Import XML Schema Format editor appears.
2. In the Format name field, enter the name you want to give the format.
3. In the File name/URL field, enter the file name or URL of the source file, or click
Browse, locate the file path that specifies the XML Schema you want to import, and open
the file.
4. In the Root element name drop-down list, select the name of the primary node you want
to import.
Data Services only imports elements of the XML Schema that belong to this node or any
subnodes. If the root element name is not unique within the XML Schema, select a namespace
to identify the imported XML Schema.
5. In the Circular level field, specify the number of levels the XML Schema has, if applicable.
If the XML Schema contains recursive elements, for example, element A contains B and
element B contains A, this value must match the number of recursive levels in the XML
Schema’s content. Otherwise, the job that uses this XML Schema will fail.
6. In the Default varchar size field, set the varchar size to import strings into Data Services.
The default varchar size is 1024.
7. Click OK.
After you import an XML Schema, you can view the XML schema’s column properties, and
edit the nested table and column attributes in the workspace.
Explaining nested data
Sales orders are often presented using nested data. For example, the line items in a sales order
are related to a single header and are represented using a nested schema. Each row of the sales
order data set contains a nested line item schema as shown:
Using the nested data method can be more concise (no repeated information), and can scale to
present a deeper level of hierarchical complexity.
To expand on the example above, columns inside a nested schema can also contain columns.
There is a unique instance of each nested schema for each row at each level of the relationship
as shown:
Generalizing further with nested data, each row at each level can have any number of columns
containing nested schemas.
Data Services maps nested data to a separate schema implicitly related to a single row and
column of the parent schema. This mechanism is called Nested Relational Data Modeling
(NRDM). NRDM provides a way to view and manipulate hierarchical relationships within
data flow sources, targets, and transforms.
In Data Services, you can see the structure of nested data in the input and output schemas of
sources, targets, and transforms in data flows.
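One way to picture an NRDM row is as a record whose columns can themselves hold a nested table. This Python sketch is an analogy only, not Data Services code, and the column and field names are illustrative:

```python
# One sales-order row carrying a nested line-item schema: the header values
# appear once, and only the nested schema repeats per line item.
order_row = {
    "OrderNo": 9999,
    "CustName": "Alpha Acquisitions",
    "LineItems": [                 # nested schema: one small table per row
        {"Item": 1001, "Qty": 2},
        {"Item": 1002, "Qty": 5},
    ],
}
print(len(order_row["LineItems"]))  # 2
```

Because the header is stored once rather than repeated on every line item, the nested form avoids the duplication a flat join would introduce.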
Unnesting data
Loading a data set that contains nested schemas into a relational target requires that the nested
rows be unnested.
For example, a sales order may use a nested schema to define the relationship between the
order header and the order line items. To load the data into relational schemas, the multi-level
must be unnested.
Unnesting a schema produces a cross-product of the top-level schema (parent) and the nested
schema (child).
You can also load different columns from different nesting levels into different schemas. For
example, a sales order can be flattened so that the order number is maintained separately with
each line-item and the header and line-item information are loaded into separate schemas.
Data Services allows you to unnest any number of nested schemas at any depth. No matter
how many levels are involved, the result of unnesting schemas is a cross product of the parent
and child schemas.
When more than one level of unnesting occurs, the inner-most child is unnested first, then the
result—the cross product of the parent and the inner-most child—is then unnested from its
parent, and so on to the top-level schema.
Keep in mind that unnesting all schemas to create a cross product of all data might not produce
the results you intend. For example, if an order includes multiple customer values such as
ship-to and bill-to addresses, flattening a sales order by unnesting customer and line-item
schemas produces rows of data that might not be useful for processing the order.
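The cross product that unnesting produces can be sketched as follows. This is an analogy only, not Data Services code, and the field names are illustrative:

```python
# Unnesting: each parent row is joined with each of its nested child rows,
# producing the cross product of the parent and child schemas.
def unnest(parent_rows, nested_key):
    flat = []
    for parent in parent_rows:
        # header columns: everything except the nested schema itself
        header = {k: v for k, v in parent.items() if k != nested_key}
        for child in parent[nested_key]:
            flat.append({**header, **child})  # header repeated per child row
    return flat

orders = [{"OrderNo": 9999, "LineItems": [{"Item": 1001}, {"Item": 1002}]}]
for row in unnest(orders, "LineItems"):
    print(row)
# Each output row repeats the order header alongside one line item.
```

With two nested schemas (for example, customer and line items), applying this cross product to both would multiply the rows together, which illustrates why unnesting everything at once may not produce useful results.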
Quiz: Defining source and target metadata
1. What is the difference between a datastore and a database?
2. What are the two methods in which metadata can be manipulated in Data Services objects? What does each of these do?
3. Which of the following is NOT a datastore type?
a. Database
b. Application
c. Adapter
d. File Format
4. What is the difference between a repository and a datastore?
Lesson summary
After completing this lesson, you are now able to:
• Use datastores
• Use datastore and system configurations
• Define file formats for flat files
• Define file formats for Excel files
• Define file formats for XML files
Creating Batch Jobs—Learner’s Guide 69
Lesson 3
Creating Batch Jobs
Lesson introduction
Once metadata has been imported for your datastores, you can create data flows to define data
movement requirements.
After completing this lesson, you will be able to:
• Work with objects
• Create a data flow
• Use the Query transform
• Use target tables
• Execute the job
Working with objects
Introduction
Data flows define how information is moved from source to target. These data flows are
organized into executable jobs, which are grouped into projects.
After completing this unit, you will be able to:
• Create a project
• Create a job
• Add, connect, and delete objects in the workspace
• Create a work flow
Creating a project
A project is a single-use object that allows you to group jobs. It is the highest level of organization
offered by Data Services. Opening a project makes one group of objects easily accessible in the
user interface. Only one project can be open at a time.
A project is used solely for organizational purposes. For example, you can use a project to
group jobs that have schedules that depend on one another or that you want to monitor together.
The objects in a project appear hierarchically in the project area in Designer. If a plus sign (+)
appears next to an object, you can expand it to view the lower-level objects.
The objects in the project area also display in the workspace, where you can drill down into additional levels.
To create a new project
1. From the Project menu, select New ➤ Project.
You can also right-click the white space on the Projects tab of the Local Object Library and
select New from the menu.
The Project - New dialog box displays.
2. Enter a unique name in the Project name field.
The name can include alphanumeric characters and underscores (_). It cannot contain blank
spaces.
3. Click Create.
The new project appears in the project area. As you add jobs and other lower-level objects
to the project, they also appear in the project area.
To open an existing project
1. From the Project menu, select Open.
The Project - Open dialog box displays.
2. Select the name of an existing project from the list.
3. Click Open.
If another project is already open, Data Services closes that project and opens the new one
in the project area.
To save a project
1. From the Project menu, select Save All.
The Save all changes dialog box lists the jobs, work flows, and data flows that you edited
since the last save.
2. Deselect any listed object to avoid saving it.
3. Click OK.
You are also prompted to save all changes made in a job when you execute the job or exit
the Designer.
Creating a job
A job is the only executable object in Data Services. When you are developing your data flows,
you can manually execute and test jobs directly in Data Services. In production, you can schedule
batch jobs and set up real-time jobs as services that execute a process when Data Services
receives a message request.
A job is made up of steps that are executed together. Each step is represented by an object icon
that you place in the workspace to create a job diagram. A job diagram is made up of two or
more objects connected together. You can include any of the following objects in a job definition:
• Work flows
• Scripts
• Conditionals
• While loops
• Try/catch blocks
• Data flows
○ Source objects
○ Target objects
○ Transforms
If a job becomes complex, you can organize its content into individual work flows, and then
create a single job that calls those work flows.
Tip: It is recommended that you follow consistent naming conventions to facilitate object
identification across all systems in your enterprise.
To create a job in the project area
1. In the project area, right-click the project name and select New Batch Job from the menu.
A new batch job is created in the project area.
2. Edit the name of the job.
The name can include alphanumeric characters and underscores (_). It cannot contain blank
spaces.
Data Services opens a new workspace for you to define the job.
3. Click the cursor outside of the job name or press Enter to commit the changes.
You can also create a job and related objects from the Local Object Library. When you create
a job in the Local Object Library, you must associate the job and all related objects to a project
before you can execute the job.
Adding, connecting, and deleting objects in the workspace
After creating a job, you can add objects to the job workspace area using either the Local Object
Library or the tool palette.
To add objects from the Local Object Library to the workspace
1. In the Local Object Library, click the tab for the type of object you want to add.
2. Click and drag the selected object onto the workspace.
To add objects from the tool palette to the workspace
• In the tool palette, click the desired object, move the cursor to the workspace, and then click
the workspace to add the object.
Creating a work flow
A work flow is an optional object that defines the decision-making process for executing other
objects.
For example, elements in a work flow can determine the path of execution based on a value
set by a previous job or can indicate an alternative path if something goes wrong in the primary
path. Ultimately, the purpose of a work flow is to prepare for executing data flows and to set
the state of the system after the data flows are complete.
Note: In essence, jobs are just work flows that can be executed. Almost all of the features documented
for work flows also apply to jobs.
Work flows can contain data flows, conditionals, while loops, try/catch blocks, and scripts.
They can also call other work flows, and you can nest calls to any depth. A work flow can even
call itself.
To create a work flow
1. Open the job or work flow to which you want to add the work flow.
2. Select the Work Flow icon in the tool palette.
3. Click the workspace where you want to place the work flow.
4. Enter a unique name for the work flow.
5. Click the cursor outside of the work flow name or press Enter to commit the changes.
To connect objects in the workspace area
• Click and drag from the triangle or square of an object to the triangle or square of the next
object in the flow to connect the objects.
To disconnect objects in the workspace area
• Select the connecting line between the objects and press Delete.
Defining the order of execution in work flows
The connections you make between the icons in the workspace determine the order in which
work flows execute, unless the jobs containing those work flows execute in parallel. Steps in a
work flow execute in a sequence from left to right. You must connect the objects in a work flow
when there is a dependency between the steps.
To execute more complex work flows in parallel, you can define each sequence as a separate
work flow, and then call each of the work flows from another work flow, as in this example:
First, you must define Work Flow A:
Next, define Work Flow B:
Finally, create Work Flow C to call Work Flows A and B:
You can specify a job to execute a particular work flow or data flow once only. If you specify
that it should be executed only once, Data Services only executes the first occurrence of the
work flow or data flow, and skips subsequent occurrences in the job. You might use this feature
when developing complex jobs with multiple paths, such as jobs with try/catch blocks or
conditionals, and you want to ensure that Data Services only executes a particular work flow
or data flow one time.
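The pattern of Work Flow C calling the unconnected Work Flows A and B so that they run in parallel can be pictured with this Python analogy. The engine handles this internally in Data Services; the function names are illustrative only:

```python
# Analogy: unconnected work flows called from a parent may run in parallel;
# connected work flows run in sequence from left to right.
from concurrent.futures import ThreadPoolExecutor

def work_flow_a():
    return "A done"

def work_flow_b():
    return "B done"

def work_flow_c():
    # A and B have no connection between them, so they can execute
    # concurrently; C waits for both before finishing.
    with ThreadPoolExecutor() as pool:
        a = pool.submit(work_flow_a)
        b = pool.submit(work_flow_b)
        return [a.result(), b.result()]

print(work_flow_c())  # ['A done', 'B done']
```

Connecting A to B in the workspace would correspond to calling them sequentially instead, since a connection expresses a dependency between the steps.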
Creating a data flow
Introduction
Data flows contain the source, transform, and target objects that represent the key activities in
data integration and data quality processes.
After completing this unit, you will be able to:
• Create a data flow
• Explain source and target objects
• Add source and target objects to a data flow
Using data flows
Data flows determine how information is extracted from sources, transformed, and loaded into
targets. The lines connecting objects in a data flow represent the flow of data through data
integration and data quality processes.
Each icon you place in the data flow diagram becomes a step in the data flow. The objects that
you can use as steps in a data flow are:
• Source and target objects
• Transforms
The connections you make between the icons determine the order in which Data Services
completes the steps.
Using data flows as steps in work flows
Each step in a data flow, up to the target definition, produces an intermediate result. For example, the result of a SQL statement containing a WHERE clause flows to the next step in the data flow. The intermediate result consists of a set of rows from the previous operation and
the schema in which the rows are arranged. This result is called a data set. This data set may,
in turn, be further filtered and directed into yet another data set.
Data flows are closed operations, even when they are steps in a work flow. Any data set created
within a data flow is not available to other steps in the work flow.
A work flow does not operate on data sets and cannot provide more data to a data flow;
however, a work flow can:
• Call data flows to perform data movement operations.
• Define the conditions appropriate to run data flows.
• Pass parameters to and from data flows.
To create a new data flow
1. Open the job or work flow in which you want to add the data flow.
2. Select the Data Flow icon in the tool palette.
3. Click the workspace where you want to add the data flow.
4. Enter a unique name for your data flow.
Data flow names can include alphanumeric characters and underscores (_). They cannot
contain blank spaces.
5. Click the cursor outside of the data flow or press Enter to commit the changes.
6. Double-click the data flow to open the data flow workspace.
Changing data flow properties
You can specify the following advanced data properties for a data flow:
Execute only once
When you specify that a data flow should only execute once, a batch job will never re-execute that data flow after the data flow completes successfully, even if the data flow is contained in a work flow that is a recovery unit that re-executes. You should not select this option if the parent work flow is a recovery unit.
Use database links
Database links are communication paths between one database server and another. Database links allow local users to access data on a remote database, which can be on the local or a remote computer of the same or different database type. For more information see “Database link support for push-down operations across datastores” in the Data Services Performance Optimization Guide.
Degree of parallelism
Degree of parallelism (DOP) is a property of a data flow that defines how many times each transform within a data flow replicates to process a parallel subset of data. For more information see “Degree of parallelism” in the Data Services Performance Optimization Guide.
Cache type
You can cache data to improve performance of operations such as joins, groups, sorts, filtering, lookups, and table comparisons. Select one of the following values:
• In Memory: Choose this value if your data flow processes a small amount of data that can fit in the available memory.
• Pageable: Choose this value if you want to return only a subset of data at a time to limit the resources required. This is the default.
For more information, see “Tuning Caches” in the Data Services Performance Optimization Guide.
To change data flow properties
1. Right-click the data flow and select Properties from the menu.
The Properties window opens for the data flow.
2. Change the properties of the data flow as required.
3. Click OK.
For more information about how Data Integrator processes data flows with multiple properties, see “Data Flow” in the Data Services Reference Guide.
Explaining source and target objects
A data flow directly reads data from source objects and loads data to target objects.
Object Description Type
Table: A file formatted with columns and rows as used in relational databases. Source and target.
Template table: A template table that has been created and saved in another data flow (used in development). Source and target.
File: A delimited or fixed-width flat file. Source and target.
Document: A file with an application-specific format (not readable by an SQL or XML parser). Source and target.
XML file: A file formatted with XML tags. Source and target.
XML message: A source in real-time jobs. Source only.
XML template file: An XML file whose format is based on the preceding transform output (used in development, primarily for debugging data flows). Target only.
Transform: A pre-built set of operations that can create new data, such as the Date Generation transform. Source only.
Adding source and target objects
Before you can add source and target objects to a data flow, you must first create the datastore and import the table metadata for any databases, or create the file format for flat files.
To add a source or target object to a data flow
1. In the workspace, open the data flow in which you want to place the object.
2. Do one of the following:
• To add a database table, in the Datastores tab of the Local Object Library, select the table.
• To add a flat file, in the Formats tab of the Local Object Library, select the file format.
3. Click and drag the object to the workspace.
A pop-up menu appears for the source or target object.
4. Select Make Source or Make Target from the menu, depending on whether the object is a
source or target object.
5. Add and connect objects in the data flow as appropriate.
Using the Query transform
Introduction
The Query transform is the most commonly used transform and is included in most data flows. It enables you to select data from a source and filter or reformat it as it moves to the target.
After completing this unit, you will be able to:
• Describe the transform editor
• Use the Query transform
Describing the transform editor
The transform editor is a graphical interface for defining the properties of transforms. The
workspace can contain these areas:
• Input schema area
• Output schema area
• Parameters area
The input schema area displays the schema of the input data set. For source objects and some
transforms, this area is not available.
The output schema area displays the schema of the output data set, including any functions.
For template tables, the output schema can be defined based on your preferences.
For any data that needs to move from source to target, a relationship must be defined between
the input and output schemas. To create this relationship, you must map each input column
to the corresponding output column.
Below the input and output schema areas is the parameters area. The options available in this area differ based on which transform or object you are modifying. The I icon indicates tabs containing user-defined entries.
Explaining the Query transform
The Query transform is used so frequently that it is included in the tool palette with other standard objects. It retrieves a data set that satisfies conditions that you specify, similar to a
SQL SELECT statement.
The Query transform can perform the following operations:
• Filter the data extracted from sources.
• Join data from multiple sources.
• Map columns from input to output schemas.
• Perform transformations and functions on the data.
• Perform data nesting and unnesting.
• Add new columns, nested schemas, and function results to the output schema.
• Assign primary keys to output columns.
For example, you could use the Query transform to select a subset of the data in a table to show
only those records from a specific region.
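The region example can be pictured as the SQL-SELECT-like behavior the Query transform provides: filter the source rows (the Where tab) and map a subset of input columns to the output schema (the Mapping tab). This Python sketch is an analogy only; the column names and region values are illustrative:

```python
# Rough analogy: SELECT CustomerID, Name FROM source WHERE Region = 'EMEA'
source = [
    {"CustomerID": 1, "Name": "Ada", "Region": "EMEA"},
    {"CustomerID": 2, "Name": "Bo",  "Region": "APAC"},
]

output = [
    {"CustomerID": r["CustomerID"], "Name": r["Name"]}  # column mappings
    for r in source
    if r["Region"] == "EMEA"                            # Where condition
]
print(output)  # [{'CustomerID': 1, 'Name': 'Ada'}]
```

The dropped Region column mirrors the way only mapped columns appear in the Query transform's output schema.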
The next section gives a brief description of the function, data input requirements, options, and data output results for the Query transform. For more information on the Query transform, see “Transforms,” Chapter 5 in the Data Services Reference Guide.
Input/Output
The data input is a data set from one or more sources with rows flagged with a NORMAL
operation code.
The NORMAL operation code creates a new row in the target. All the rows in a data set are
flagged as NORMAL when they are extracted by a source table or file. If a row is flagged as
NORMAL when loaded into a target table or file, it is inserted as a new row in the target.
The data output is a data set based on the conditions you specify and using the schema specified
in the output schema area.
Note: When working with nested data from an XML file, you can use the Query transform to
unnest the data using the right-click menu for the output schema, which provides options for
unnesting.
Options

The input schema area displays all schemas input to the Query transform as a hierarchical tree.
Each input schema can contain multiple columns.
The output schema area displays the schema output from the Query transform as a hierarchical
tree. The output schema can contain multiple columns and functions.
Icons preceding columns combine the following indicators (the icon graphics are not reproduced here):
• Primary key: the column is a primary key.
• Simple mapping: the mapping is either a single column or an expression with no input column.
• Complex mapping: the mapping involves a transformation or a merge between two source columns.
• Incorrect mapping: the column mapping is invalid. Data Integrator does not perform a complete validation during design, so not all incorrect mappings will necessarily be flagged.
The parameters area of the Query transform includes the following tabs:
Tab Description
Mapping Specify how the selected output column is derived.
Select Select only distinct rows (discarding any duplicate rows).
From Specify the input schemas used in the current output schema.
Outer Join Specify an inner table and an outer table for joins that you want
treated as outer joins.
Where Set conditions that determine which rows are output.
Group By Specify a list of columns for which you want to combine output.
For each unique set of values in the group by list, Data Services
combines or aggregates the values in the remaining columns.
Order By Specify the columns you want used to sort the output data set.
Search/Replace Search for and replace a specific word or item in the input schema
or the output schema.
Advanced Create separate sub data flows to process any of the following
resource-intensive query clauses: DISTINCT, GROUP BY, JOIN, and
ORDER BY. For more information, see "Distributed Data Flow
execution" in the Data Services Designer Guide.
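The Group By behavior (one output row per unique set of group-by values, with the remaining columns aggregated) can be sketched as follows; the column names and sample rows are invented:

```python
from collections import defaultdict

# Invented sample rows: aggregate SALES per REGION, as a Group By on REGION would.
rows = [
    {"REGION": "WEST", "SALES": 100},
    {"REGION": "EAST", "SALES": 40},
    {"REGION": "WEST", "SALES": 60},
]

totals = defaultdict(int)
for r in rows:
    totals[r["REGION"]] += r["SALES"]   # combine values in the remaining column

# One output row per unique group-by value.
grouped = [{"REGION": k, "SALES": v} for k, v in sorted(totals.items())]
print(grouped)
```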
To map input columns to output columns

• In the transform editor, do any of the following:
• Drag and drop a single column from the input schema area into the output schema area.
• Drag a single input column over the corresponding output column, release the cursor,
and select Remap Column from the menu.
• Select multiple input columns (using Ctrl+click or Shift+click) and drag onto Query
output schema for automatic mapping.
• Select the output column and manually enter the mapping on the Mapping tab in the
parameters area. You can either type the column name in the parameters area or click
and drag the column from the input schema pane.
Using target tables
Introduction

The target object for your data flow can be either a physical table or file, or a template table.
After completing this unit, you will be able to:
• Set target table options
• Use template tables
Setting target table options

When your target object is a physical table in a database, the target table editor opens in the
workspace with different tabs where you can set database type properties, table loading options,
and tuning techniques for loading a job.
Note: Most of the tabs in the target table editor focus on migration or performance-tuning techniques,
which are outside the scope of this course.
You can set the following table loading options in the Options tab of the target table editor:
• Rows per commit: Specifies the transaction size in number of rows.
• Column comparison: Specifies how the input columns are mapped to output columns. There are two options: Compare_by_position disregards the column names and maps source columns to target columns by position; Compare_by_name maps source columns to target columns by name. Validation errors occur if the datatypes of the columns do not match.
• Delete data from table before loading: Sends a TRUNCATE statement to clear the contents of the table before loading during batch jobs. Defaults to not selected.
• Number of loaders: Specifies the number of loaders (to a maximum of five) and the number of rows per commit that each loader receives during parallel loading. For example, if you choose a Rows per commit of 1000 and set the number of loaders to three, the first 1000 rows are sent to the first loader, the second 1000 rows to the second loader, the third 1000 rows to the third loader, and the next 1000 rows back to the first loader.
• Use overflow file: Writes rows that cannot be loaded to the overflow file for recovery purposes. Options are enabled for the file name and file format. The overflow format can include the data rejected and the operation being performed (write_data) or the SQL command used to produce the rejected operation (write_sql).
• Ignore columns with value: Specifies a value that might appear in a source column that you do not want updated in the target table. When this value appears in the source column, the corresponding target column is not updated during auto correct loading. You can enter spaces.
• Ignore columns with null: Ensures that NULL source columns are not updated in the target table during auto correct loading.
• Use input keys: Enables Data Integrator to use the primary keys from the source table. By default, Data Integrator uses the primary key of the target table.
• Update key columns: Updates key column values when it loads data to the target.
• Auto correct load: Ensures that the same row is not duplicated in a target table, which is particularly useful for data recovery operations. When Auto correct load is selected, Data Integrator reads a row from the source and checks if a row exists in the target table with the same values in the primary key. If a matching row does not exist, it inserts the new row regardless of other options. If a matching row exists, it updates the row depending on the values of Ignore columns with value and Ignore columns with null.
• Include in transaction: Indicates that this target is included in the transaction processed by a batch or real-time job. This option allows you to commit data to multiple tables as part of the same transaction. If loading fails for any one of the tables, no data is committed to any of the tables. Transactional loading can require rows to be buffered to ensure the correct load order; if the data being buffered is larger than the virtual memory available, Data Integrator reports a memory error. The tables must be from the same datastore. If you choose to enable transactional loading, these options are not available: Rows per commit, Use overflow file and the overflow file specification, Number of loaders, Enable partitioning, and Delete data from table before loading. Data Integrator also does not parameterize SQL or push operations to the database if transactional loading is enabled.
• Transaction order: Indicates where this table falls in the loading order of the tables being loaded. By default, there is no ordering; all loaders have a transaction order of zero. If you specify orders among the tables, the loading operations are applied according to the order. Tables with the same transaction order are loaded together. Tables with a transaction order of zero are loaded at the discretion of the data flow process.
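The Auto correct load rules above amount to a conditional upsert. A simplified sketch, with the target table modeled as a dict keyed by primary key and the two Ignore options reduced to function arguments (all names here are illustrative, not Data Services APIs):

```python
def auto_correct_load(target, row, key, ignore_value=None):
    """Upsert one source row into target (a dict keyed by primary key),
    mimicking the Auto correct load rules described above (simplified)."""
    pk = row[key]
    if pk not in target:
        target[pk] = dict(row)          # no match: insert regardless of other options
        return
    for col, val in row.items():
        if val is None:                 # Ignore columns with null
            continue
        if ignore_value is not None and val == ignore_value:
            continue                    # Ignore columns with value
        target[pk][col] = val           # otherwise update the matching row

target = {1: {"id": 1, "city": "Boston", "phone": "555-0100"}}
auto_correct_load(target, {"id": 1, "city": None, "phone": "555-0199"}, key="id")
print(target[1])  # phone updated, NULL city left untouched
```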
See the Data Services Performance Optimization Guide and "Description of objects" in the Data
Services Reference Guide for more information.

To access the target table editor
1. In a data flow, double-click the target table.
The target table editor opens in the workspace.
2. Change the values as required.
Changes are automatically committed.
3. Click Back to return to the data flow.
Using template tables

During the initial design of an application, you might find it convenient to use template tables
to represent database tables. Template tables are particularly useful in early application
development when you are designing and testing a project.
With template tables, you do not have to initially create a new table in your RDBMS and import
the metadata into Data Services. Instead, Data Services automatically creates the table in the
database with the schema defined by the data flow when you execute a job.
After creating a template table as a target in one data flow, you can use it as a source in other
data flows. Although a template table can be used as a source table in multiple data flows, it
can be used only as a target in one data flow.
You can modify the schema of the template table in the data flow where the table is used as a
target. Any changes are automatically applied to any other instances of the template table.
After a template table is created in the database, you can convert the template table in the
repository to a regular table. You must convert template tables so that you can use the new
table in expressions, functions, and transform options. After a template table is converted, you
can no longer alter the schema.
To create a template table

1. Open a data flow in the workspace.
2. In the tool palette, click the Template Table icon and click the workspace to add a new
template table to the data flow.
The Create Template dialog box displays.
3. In the Table name field, enter the name for the template table.
4. In the In datastore drop-down list, select the datastore for the template table.
5. Click OK.
You also can create a new template table in the Local Object Library Datastore tab by
expanding a datastore and right-clicking Templates.
To convert a template table into a regular table from the Local Object Library
1. On the Datastores tab of the Local Object Library, expand the branch for the datastore to
view the template table.
2. Right-click a template table you want to convert and select Import Table from the menu.
Data Services converts the template table in the repository into a regular table by importing
it from the database.
3. To update the icon in all data flows, from the View menu, select Refresh.
On the Datastores tab of the Local Object Library, the table is listed under Tables rather than
Template Tables.

To convert a template table into a regular table from a data flow
1. Open the data flow containing the template table.
2. Right-click the template table you want to convert and select Import Table from the menu.
(Screenshot: the right-click menu for a template table in a data flow, showing the Import Table command.)
Executing the job
Introduction

Once you have created a data flow, you can execute the job in Data Services to see how the
data moves from source to target.
After completing this unit, you will be able to:
• Understand job execution
• Execute the job
Explaining job execution

After you create your project, jobs, and associated data flows, you can then execute the job.
You can run jobs two ways:
• Immediate jobs
Data Services initiates both batch and real-time jobs and runs them immediately from within
the Designer. For these jobs, both the Designer and designated Job Server (where the job
executes, usually on the same machine) must be running. You will likely run immediate
jobs only during the development cycle.
• Scheduled jobs
Batch jobs can be scheduled using the Data Services Management Console or a third-party
scheduler. The Job Server must be running.
If a job has syntax errors, it does not execute.
Setting execution properties

When you execute a job, the following options are available in the Execution Properties window:
Option Description
Print all trace messages Records all trace messages in the log.
Disable data validation statistics collection: Does not collect data validation statistics for this
specific job execution.
Enable auditing Collects audit statistics for this specific job execution.
Enable recovery
Enables the automatic recovery feature. When enabled, Data
Services saves the results from completed steps and allows
you to resume failed jobs.
Recover from last failed execution: Resumes a failed job. Data Services retrieves the results
from any steps that were previously executed successfully and re-executes any other steps.
This option is a run-time property. It is not available when a job has not yet been executed
or when recovery mode was disabled during the previous run.
Collect statistics for optimization Collects statistics that the Data Services optimizer will use
to choose an optimal cache type (in-memory or pageable).
Collect statistics for monitoring Displays cache statistics in the Performance Monitor in
Administrator.
Use collected statistics Optimizes Data Services to use the cache statistics collected
on a previous execution of the job.
System configuration: Specifies the system configuration to use when executing this job. A
system configuration defines a set of datastore configurations, which define the datastore
connections. If a system configuration is not specified, Data Services uses the default datastore
configuration for each datastore. This option is a run-time property that is only available if
there are system configurations defined in the repository.
Job Server or Server Group: Specifies the Job Server or server group to execute this job.
Distribution level
Allows a job to be distributed to multiple Job Servers for
processing. The options are:
• Job - The entire job will execute on one server.
• Data flow - Each data flow within the job will execute
on a separate server.
• Sub-data flow - Each sub-data flow (which can be a separate
transform or function) within a data flow will execute
on a separate Job Server.
Executing the job

Immediate or on-demand tasks are initiated from the Designer. Both the Designer and Job
Server must be running for the job to execute.
To execute a job as an immediate task

1. In the project area, right-click the job name and select Execute from the menu.
Data Services prompts you to save any objects that have not been saved.
2. Click OK.
The Execution Properties dialog box displays.
3. Select the required job execution parameters.
4. Click OK.
Activity: Creating a basic data flow

After analyzing the source data, you have determined that the structure of the customer data
for Beta Businesses is the appropriate structure for the customer data in the Omega data
warehouse. You must therefore change the structure of the Alpha Acquisitions customer data
to match, in preparation for merging customer data from both datastores at a later date.
Objective

• Use the Query transform to change the schema of the Alpha Acquisitions Customer table
and move the data into the Delta staging database.
Instructions

1. Create a new project called Omega.
2. In the Omega project, create a new batch job called Alpha_Customers_Job with a new data
flow called Alpha_Customers_DF.
3. In the workspace for Alpha_Customers_DF, add the customer table from the Alpha datastore
as the source object.
4. Create a new template table called alpha_customers in the Delta datastore as the target
object.
5. Add the Query transform to the workspace between the source and target.
6. Connect the objects from source to transform to target.
7. In the transform editor for the Query transform, create the following output columns:
Name Data type Content type
CustomerID int
Firm varchar(50) Firm
ContactName varchar(50) Name
Title varchar(30) Title
Address1 varchar(50) Address
City varchar(50) Locality
Region varchar(25) Region
PostalCode varchar(25) Postcode
Country varchar(50) Country
Phone varchar(25) Phone
Fax varchar(25) Phone
8. Map the columns as follows:
Schema In Schema Out
CUSTOMERID CustomerID
COMPANYNAME Firm
CONTACTNAME ContactName
CONTACTTITLE Title
ADDRESS Address1
CITY City
REGIONID Region
POSTALCODE PostalCode
COUNTRYID Country
PHONE Phone
FAX Fax

9. Set the CustomerID column as the Primary Key.
10. Execute Alpha_Customers_Job with the default execution properties and save all objects
you have created.
11. Return to the data flow workspace and view data for the target table to confirm that 25
records were loaded.
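The renames in steps 7 and 8 amount to applying a Schema In to Schema Out dictionary. A sketch for checking your mapping outside the Designer (only a subset of the columns is shown, and the sample row is invented):

```python
# Schema In -> Schema Out mapping from step 8 (subset shown for brevity).
RENAME = {
    "CUSTOMERID": "CustomerID",
    "COMPANYNAME": "Firm",
    "CONTACTNAME": "ContactName",
    "POSTALCODE": "PostalCode",
}

def remap(row):
    """Rename input columns to their output-schema names."""
    return {RENAME[k]: v for k, v in row.items() if k in RENAME}

src = {"CUSTOMERID": 7, "COMPANYNAME": "Alpha", "CONTACTNAME": "A. Chan", "POSTALCODE": "02134"}
print(remap(src))
```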
A solution file called SOLUTION_Basic.atl is included on your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may overwrite the results in your target table.
Quiz: Creating batch jobs

1. Does a job have to be part of a project to be executed in the Designer?
2. How do you add a new template table?
3. Name the objects contained within a project.
4. What factors might you consider when determining whether to run work flows or data
flows serially or in parallel?
Lesson summary

After completing this lesson, you are now able to:
• Work with objects
• Create a data flow
• Use the Query transform
• Use target tables
• Execute the job
Troubleshooting Batch Jobs—Learner’s Guide 101
Lesson 4
Troubleshooting Batch Jobs
Lesson introduction

To document decisions and troubleshoot any issues that arise when executing your jobs, you
can validate and add annotations to jobs, work flows, and data flows, set trace options, and
debug your jobs. You can also set up audit rules to ensure the correct data is loaded to the
target.
After completing this lesson, you will be able to:
• Use descriptions and annotations
• Validate and trace jobs
• Use View Data and the Interactive Debugger
• Use auditing in data flows
Using descriptions and annotations
Introduction

Descriptions and annotations are a convenient way to add comments to objects and workspace
diagrams.
After completing this unit, you will be able to:
• Use descriptions with objects
• Use annotations to describe flows
Using descriptions with objects

A description is associated with a particular object. When you import or export a repository
object, you also import or export its description.
Designer determines when to show object descriptions based on a system-level setting and an
object-level setting. Both settings must be activated to view the description for a particular
object.
Note: The system-level setting is unique to your setup.
There are three requirements for displaying descriptions:
• A description has been entered into the properties of the object.
• The description is enabled on the properties of that object.
• The global View Enabled Object Descriptions option is enabled.

To show object descriptions at the system level
• From the View menu, select Enabled Descriptions.
This is a global setting.
To add a description to an object
1. In the project area or the workspace, right-click an object and select Properties from the
menu.
The Properties dialog box displays.
2. In the Description text box, enter your comments.
3. Click OK.
If you are modifying the description of a re-usable object, Data Services provides a warning
message that all instances of the re-usable object will be affected by the change.
4. Click Yes.
The description for the object displays in the Local Object Library.
To display a description in the workspace

• In the workspace, right-click the object and select Enable Object
Description from the menu.
The description displays in the workspace under the object.
Using annotations to describe objects

An annotation is an object in the workspace that describes a flow, part of a flow, or a diagram.
An annotation is associated with the object where it appears. When you import or export a job,
work flow, or data flow that includes annotations, you also import or export associated
annotations.
To add an annotation to the workspace

1. In the workspace, from the tool palette, click the Annotation icon and then click the
workspace.
An annotation appears on the diagram.
2. Double-click the annotation.
3. Add text to the annotation.
4. Click the cursor outside of the annotation to commit the changes.
You can resize and move the annotation by clicking and dragging.
You cannot hide annotations that you have added to the workspace. However, you can
move them out of the way or delete them.
Validating and tracing jobs
Introduction

It is a good idea to validate your jobs when you are ready for job execution to ensure there are
no errors. You can also select and set specific trace properties, which allow you to use the
various log files to help you read job execution status or troubleshoot job errors.
After completing this unit, you will be able to:
• Validate jobs
• Trace jobs
• Use log files
• Determine the success of a job
Validating jobs

As a best practice, you want to validate your work as you build objects so that you are not
confronted with too many warnings and errors at one time. You can validate your objects as
you create a job or you can automatically validate all your jobs before executing them.
To validate jobs automatically before job execution

1. From the Tools menu, select Options.
The Options dialog box displays.
2. In the Category pane, expand the Designer branch and click General.
3. Select the Perform complete validation before job execution option.
4. Click OK.
To validate objects on demand
1. From the Validation menu, select Validate ➤ Current View or All Objects in View.
The Output dialog box displays.
2. To navigate to the object where an error occurred, right-click the validation error message
and select Go To Error from the menu.
Tracing jobs

Use trace properties to select the information that Data Services monitors and writes to the
trace log file during a job. Data Services writes trace messages to the trace log associated with
the current Job Server and writes error messages to the error log associated with the current
Job Server.
The following trace options are available.
Trace Description
Row Writes a message when a transform imports or exports a row.
Session Writes a message when the job description is read from the
repository, when the job is optimized, and when the job runs.
Work flow
Writes a message when the work flow description is read from
the repository, when the work flow is optimized, when the work
flow runs, and when the work flow ends.
Data flow Writes a message when the data flow starts and when the data
flow successfully finishes or terminates due to error.
Transform Writes a message when a transform starts and completes or
terminates.
Custom Transform Writes a message when a custom transform starts and completes
successfully.
Custom Function Writes a message of all user invocations of the AE_LogMessage
function from custom C code.
SQL Functions Writes data retrieved before SQL functions:
• Every row retrieved by the named query before the SQL is
submitted in the key_generation function.
• Every row retrieved by the named query before the SQL is
submitted in the lookup function (but only if
PRE_LOAD_CACHE is not specified).
• When mail is sent using the mail_to function.
SQL Transforms Writes a message (using the Table Comparison transform) about
whether a row exists in the target table that corresponds to an
input row from the source table.
SQL Readers Writes the SQL query block that a script, query transform, or
SQL function submits to the system. Also writes the SQL results.
SQL Loaders Writes a message when the bulk loader starts, submits a warning
message, or completes successfully or unsuccessfully.
Memory Source Writes a message for every row retrieved from the memory
table.
Memory Target Writes a message for every row inserted into the memory table.
Optimized Data Flow For Business Objects consulting and technical support use.
Tables Writes a message when a table is created or dropped.
Scripts and Script Functions Writes a message when a script is called, a function is called by
a script, and a script successfully completes.
Trace Parallel Execution Writes messages describing how data in a data flow is parallel
processed.
Access Server Communication Writes messages exchanged between the Access Server and a
service provider.
Stored Procedure Writes a message when a stored procedure starts and finishes,
and includes key values.
Audit Data Writes a message that collects a statistic at an audit point and
determines if an audit rule passes or fails.

To set trace options
1. From the project area, right-click the job name and do one of the following:
• To set trace options for a single instance of the job, select Execute from the menu.
• To set trace options for every execution of the job, select Properties from the menu.
Save all files.
Depending on which option you selected, the Execution Properties dialog box or the Properties dialog box displays.
2. Click the Trace tab.
3. Under the name column, click a trace object name.
The Value drop-down list is enabled when you click a trace object name.
4. From the Value drop-down list, select Yes to turn the trace on.
5. Click OK.
Using log files

As a job executes, Data Services produces three log files. You can view these from the project
area. The log files are, by default, also set to display automatically in the workspace when you
execute a job.
You can click the Trace, Monitor, and Error icons to view the following log files, which are
created during job execution.
Examining trace logs

Use the trace logs to determine where an execution failed, whether the execution steps occur
in the order you expect, and which parts of the execution are the most time consuming.
Examining monitor logs

Use monitor logs to quantify the activities of the components of the job. The monitor log lists
the time spent in a given component of a job and the number of data rows that streamed
through the component.
Examining error logs

Use the error logs to determine how an execution failed. If the execution completed without
error, the error log is blank.
Using the Monitor tab

The Monitor tab lists the trace logs of all current or most recent executions of a job.
The traffic-light icons in the Monitor tab indicate the following:
• Green light indicates that the job is running.
You can right-click and select Kill Job to stop a job that is still running.
• Red light indicates that the job has stopped.
You can right-click and select Properties to add a description for a specific trace log. This
description is saved with the log which can be accessed later from the Log tab.
• Red cross indicates that the job encountered an error.
Using the Log tab

You can also select the Log tab to view a job’s log history.
You may find these job log indicators (the indicator icons are not reproduced here):
• The job executed successfully on this explicitly selected Job Server.
• The job encountered an error on this explicitly selected Job Server.
• The job executed successfully by a server group; the Job Server listed executed the job.
• The job encountered an error while being executed by a server group; the Job Server listed executed the job.
To view log files from the project area

1. In the project area, click the Log tab.
2. Select the job for which you want to view the logs.
3. In the workspace, in the Filter drop-down list, select the type of log you want to view.
4. In the list of logs, double-click the log to view details.
5. To copy log content from an open log, select one or more lines and use the key commands
[Ctrl+C].
Determining the success of the job

The best measure of the success of a job is the state of the target data. Always examine your
data to make sure the data movement operation produced the results you expect. Be sure that:
• Data was not converted to incompatible types or truncated.
• Data was not duplicated in the target.
• Data was not lost between updates of the target.
• Generated keys have been properly incremented.
• Updated values were handled properly.
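Several of these checks can be expressed as simple assertions against the loaded rows. A minimal sketch with invented sample data (a real check would query the target table):

```python
def check_target(rows, key):
    """Sanity checks on loaded target rows, per the list above (simplified)."""
    keys = [r[key] for r in rows]
    # Data was not duplicated in the target.
    assert len(keys) == len(set(keys)), "data duplicated in the target"
    # No keys were lost or nulled out between updates.
    assert all(v is not None for v in keys), "lost or null keys"
    # Generated keys have been properly incremented (strictly increasing here).
    assert keys == sorted(keys), "generated keys not incremented in order"
    return True

target_rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 3, "name": "c"}]
print(check_target(target_rows, "id"))  # True
```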
If a job fails to execute:
1. Check the Job Server icon in the status bar.
2. Verify that the Job Service is running.
3. Check that the port number in Designer matches the number specified in Server Manager.
4. Use the Server Manager resync button to reset the port number in the Local Object Library.
Activity: Setting traces and adding annotations

You will be sharing your jobs with other developers during the project, so you want to make
sure that you identify the purpose of the job you just created. You also want to ensure that the
job is handling the movement of each row appropriately.
Objectives

• Add an annotation to a job so that other designers who reference this information will be
able to identify its purpose.
• Execute the job in trace mode to determine when a transform imports and exports from
source to target.
Instructions

1. Open the workspace for Alpha_Customers_Job.
2. Add an annotation to the workspace beside the data flow with an explanation of the purpose
of the job.
3. Save all objects you have created.
4. Execute Alpha_Customers_Job and enable the Trace rows option on the Trace tab of the
Execution Properties dialog box.
Note that tracing rows produces an entry in the log for each row, indicating how the data
flow handles it.
Troubleshooting Batch Jobs—Learner’s Guide 113
Using View Data and the Interactive Debugger
Introduction
You can debug jobs in Data Services using the View Data and Interactive Debugger features.
With View Data, you can view samples of source and target data for your jobs. Using the
Interactive Debugger, you can examine what happens to the data after each transform or object
in the flow.
After completing this unit, you will be able to:
• Use View Data with sources and targets
• Use the Interactive Debugger
• Set filters and breakpoints for a debug session
Using View Data with sources and targets
With the View Data feature, you can check the status of data at any point after you import the
metadata for a data source, and before or after you process your data flows. You can check the
data when you design and test jobs to ensure that your design returns the results you expect.
View Data allows you to see source data before you execute a job. Using data details you can:
• Create higher quality job designs.
• Scan and analyze imported table and file data from the Local Object Library.
• See the data for those same objects within existing jobs.
• Refer back to the source data after you execute the job.
View Data also allows you to check your target data before executing your job, then look at
the changed data after the job executes. In a data flow, you can use one or more View Data
panels to compare data between transforms and within source and target objects.
View Data displays your data in the rows and columns of a data grid. The path for the selected
object displays at the top of the pane. The number of rows displayed is determined by a
combination of several conditions:
• Sample size: the number of rows sampled in memory. The default sample size is 1000 rows
for imported sources, targets, and transforms.
• Filtering: the filtering options that are selected.
• Sorting: the sort options that are selected.
If your original data set is smaller, or if you use filters, the number of returned rows could be
less than the default.
Keep in mind that you can have only two View Data windows open at any time. If you already
have two windows open and try to open a third, you are prompted to select which to close.
To use View Data in source and target tables
• On the Datastore tab of the Local Object Library, right-click a table and select View Data
from the menu.
The View Data dialog box displays.
To open a View Data pane in a data flow workspace
1. In the data flow workspace, click the magnifying glass button on a data flow object.
A large View Data pane appears beneath the current workspace area.
2. To compare data, click the magnifying glass button for another object.
A second pane appears below the workspace area, and the first pane area shrinks to
accommodate it.
When both panes are filled and you click another View Data button, a small menu appears
containing window placement icons. The black area in each icon indicates the pane you
want to replace with a new set of data. When you select a menu option, the data from the
latest selected object replaces the data in the corresponding pane.
Using the Interactive Debugger
Designer includes an Interactive Debugger that allows you to troubleshoot your jobs by placing
filters and breakpoints on lines in a data flow diagram. This enables you to examine and modify
data row by row during a debug mode job execution.
The Interactive Debugger can also be used without filters and breakpoints. Running the job in
debug mode and then navigating to the data flow while remaining in debug mode enables you
to drill into each step of the data flow and view the data.
When you execute a job in debug mode, Designer displays several additional windows that
make up the Interactive Debugger: Call stack, Trace, Variables, and View Data panes.
The left View Data pane shows the data in the source table, and the right pane shows the rows
that have been passed to the query up to the breakpoint.
To start the Interactive Debugger
1. In the project area, right-click the job and select Start debug from the menu.
The Debug Properties dialog box displays.
2. Set properties for the execution.
You can specify many of the same properties as you can when executing a job without
debugging. In addition, you can specify the number of rows to sample in the Data sample
rate field.
3. Click OK.
The debug mode begins.
While in debug mode, all other Designer features are set to read-only. A Debug icon is visible
in the task bar while the debug is in progress.
4. If you have set breakpoints, in the Interactive Debugger toolbar, click Get next row to move
to the next breakpoint.
5. To exit the debug mode, from the Debug menu, select Stop Debug.
Setting filters and breakpoints for a debug session
You can set filters and breakpoints on lines in a data flow diagram before you start a debugging
session that allow you to examine and modify data row-by-row during a debug mode job
execution.
A debug filter functions the same as a simple Query transform with a WHERE clause. You can
use a filter if you want to reduce a data set in a debug job execution. The debug filter does not
support complex expressions.
A breakpoint is the location where a debug job execution pauses and returns control to you.
A breakpoint can be based on a condition, or it can be set to break after a specific number of
rows.
You can place a filter or breakpoint on the line between a source and a transform or two
transforms. If you set a filter and a breakpoint on the same line, Data Services applies the filter
first, which means that the breakpoint applies to the filtered rows only.
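That ordering can be illustrated with a small Python simulation (not the debugger's actual implementation): the breakpoint's row count advances only for rows that pass the filter.

```python
def debug_rows(rows, filter_cond, break_after):
    """Simulate a debug line with a filter and a break-after-N breakpoint."""
    passed = 0
    for row in rows:
        if not filter_cond(row):   # the filter drops the row first...
            continue
        passed += 1
        if passed == break_after:  # ...so the breakpoint sees filtered rows only
            return row, passed
    return None, passed

rows = [{"CountryID": 1}, {"CountryID": 2}, {"CountryID": 1}]
row, n = debug_rows(rows, lambda r: r["CountryID"] == 1, break_after=2)
print(row, n)  # -> {'CountryID': 1} 2
```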
To set filters and breakpoints
1. In the data flow workspace, right-click the line that connects two objects and select Set
Filter/Breakpoint from the menu.
2. In the Breakpoint window, in the Filter or Breakpoint section, select the Set check box.
3. In the Column drop-down list, select the column to which the filter or breakpoint applies.
4. In the Operator drop-down list, select the operator for the expression.
5. In the Value field, enter the value to complete the expression.
Conditions for filters and breakpoints do not use delimiters for strings.
6. If you are using multiple conditions, repeat step 3 to step 5 for all conditions and select the
appropriate operator from the Concatenate all conditions using drop-down list.
7. Click OK.
Activity: Using the Interactive Debugger
To ensure that your job is processing the data correctly, you want to run the job in debug mode.
To minimize the data you have to review in the Interactive Debugger, you will set the debug
process to show only records from the USA (represented by a CountryID value of 1). Once you
have confirmed that the structure appears correct, you will run another debug session with all
records, breaking after every row.
Objectives
• View the data in debug mode with a filter to limit records to those with a CountryID of 1
(USA).
• View the data in debug mode with a breakpoint to stop the debug process after each row.
Instructions
1. In the workspace for Alpha_Customers_DF, add a filter between the source and the Query
transform to filter the records so that only customers from the USA are included in the
debug session.
2. Execute Alpha_Customers_DF in debug mode.
3. Return to the data flow workspace and view data for the target table.
Note that only five rows were returned.
4. Remove the filter and add a breakpoint to break the debug session after every row.
5. Execute Alpha_Customers_DF in debug mode again.
6. Discard the first row, and then step through the rest of the records.
7. Exit the debugger, return to the data flow workspace, and view data for the target table.
Note that only 24 of the 25 rows were returned.
Setting up auditing
Introduction
You can collect audit statistics on the data that flows out of any Data Services object, such as a
source, transform, or target. If a transform has multiple outputs (such as the Validation or
Case transforms), you can audit each output independently.
After completing this unit, you will be able to:
• Define audit points and rules
• Explain guidelines for choosing audit points
Setting up auditing
When you audit data flows, you:
1. Define audit points to collect run-time statistics about the data that flows out of objects.
These audit statistics are stored in the Data Services repository.
2. Define rules with these audit statistics to ensure that the data extracted from sources,
processed by transforms, and loaded into targets is what you expect.
3. Generate a run-time notification that includes the audit rule that failed and the values of
the audit statistics at the time of failure.
4. Display the audit statistics after the job execution to help identify the object in the data flow
that might have produced incorrect data.
Defining audit points
An audit point represents the object in a data flow where you collect statistics. You can audit
a source, a transform, or a target in a data flow.
When you define audit points on objects in a data flow, you specify an audit function. An audit
function represents the audit statistic that Data Services collects for a table, output schema, or
column. You can choose from these audit functions:
• Count (table or output schema): collects two statistics, a good count for rows that were
successfully processed and an error count for rows that generated some type of error if you
enabled error handling. The datatype for this function is integer.
• Sum (column): the sum of the numeric values in the column. This function includes only
the good rows, and applies only to columns with a datatype of integer, decimal, double,
or real.
• Average (column): the average of the numeric values in the column. This function includes
only the good rows, and applies only to columns with a datatype of integer, decimal,
double, or real.
• Checksum (column): detects errors in the values in the column by using a checksum value.
This function applies only to columns with a datatype of varchar.
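The audit statistics above can be sketched in Python. This is only an illustration of what gets collected, not how Data Services computes it; in particular, the product's checksum algorithm is internal, so zlib.crc32 stands in for it here.

```python
import zlib

def audit_stats(good_rows, error_rows, column):
    """Illustrative audit statistics for one audit point (not the real engine)."""
    values = [row[column] for row in good_rows]
    return {
        "Count": len(good_rows),                # good rows that were processed
        "CountError": len(error_rows),          # rows that generated errors
        "Sum": sum(values),                     # good rows only, numeric columns
        "Average": sum(values) / len(values),   # good rows only, numeric columns
        "Checksum": zlib.crc32(repr(values).encode()),  # stand-in checksum
    }

good = [{"AMOUNT": 10}, {"AMOUNT": 30}]
stats = audit_stats(good, error_rows=[], column="AMOUNT")
print(stats["Count"], stats["Sum"], stats["Average"])  # -> 2 40 20.0
```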
Defining audit labels
An audit label represents the unique name in the data flow that Data Services generates for
the audit statistics collected for each audit function that you define. You use these labels to
define audit rules for the data flow.
If the audit point is on a table or output schema, these two labels are generated for the Count
audit function:
$Count_objectname
$CountError_objectname
If the audit point is on a column, the audit label is generated with this format:
$auditfunction_objectname
Note: An audit label can become invalid if you delete or rename an object that had an audit point
defined on it. Invalid labels are listed as a separate node on the Labels tab. To resolve the issue, you
must re-create the labels and then delete the invalid items.
Defining audit rules
Use auditing rules if you want to compare audit statistics for one object against another object.
For example, you can use an audit rule if you want to verify that the count of rows from the
source table is equal to the rows in the target table.
An audit rule is a Boolean expression which consists of a left-hand-side (LHS), a Boolean
operator, and a right-hand-side (RHS):
• The LHS can be a single audit label, multiple audit labels that form an expression with one
or more mathematical operators, or a function with audit labels as parameters.
• The RHS can be a single audit label, multiple audit labels that form an expression with one
or more mathematical operators, a function with audit labels as parameters, or a constant.
These are examples of audit rules:
$Count_CUSTOMER = $Count_CUSTDW
$Sum_ORDER_US + $Sum_ORDER_EUROPE = $Sum_ORDER_DW
round($Avg_ORDER_TOTAL) >= 10000
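A rule of this LHS/operator/RHS shape can be sketched as a tiny evaluator in Python (the label values below are hypothetical; Data Services evaluates rules internally against the repository statistics):

```python
# Hypothetical audit statistics, keyed by audit label.
labels = {"$Count_CUSTOMER": 25, "$Count_CUSTDW": 25, "$Avg_ORDER_TOTAL": 12500.0}

def check_rule(lhs, op, rhs):
    """Resolve audit labels to values, then apply the Boolean operator."""
    left = labels.get(lhs, lhs)    # an unresolved operand is treated as a constant
    right = labels.get(rhs, rhs)
    if op == "=":
        return left == right
    if op == ">=":
        return left >= right
    raise ValueError("unsupported operator: " + op)

print(check_rule("$Count_CUSTOMER", "=", "$Count_CUSTDW"))  # -> True
print(check_rule("$Avg_ORDER_TOTAL", ">=", 10000))          # -> True
```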
Defining audit actions
You can choose any combination of the actions listed for notification of an audit failure:
• Email to list: Data Services sends a notification of which audit rule failed to the email
addresses that you list in this option. Use a comma to separate the list of email addresses.
You can specify a variable for the email list.
This option uses the smtp_to function to send email. Therefore, you must define the server
and sender for the Simple Mail Transfer Protocol (SMTP) in the Data Services Server Manager.
• Script: Data Services executes the custom script that you create in this option.
• Raise exception: When an audit rule fails, the Error Log shows the rule that failed. The job
stops at the first audit rule that fails. This is an example of a message in the Error Log:
Audit rule failed <($Checksum_ODS_CUSTOMER = $Count_CUST_DIM)> for <Data Flow
Demo_DF>.
This action is the default. If you clear this action and an audit rule fails, the job completes
successfully and the audit does not write messages to the job log.
If you choose all three actions, Data Services executes them in the order presented.
You can see the audit status in one of these places:
Where you can view the audit information depends on the action on failure:
• Raise an exception: job Error Log, Metadata Reports
• Email to list: email message, Metadata Reports
• Script: wherever the custom script sends the audit messages, Metadata Reports
To define audit points and rules in a data flow
1. On the Data Flow tab of the Local Object Library, right-click a data flow and select Audit
from the menu.
The Audit dialog box displays a list of the objects you can audit, along with any audit
functions and labels defined for those objects.
2. On the Label tab, right-click the object you want to audit and select Properties from the
menu.
The Schema Properties dialog box displays.
3. In the Audit tab of the Schema Properties dialog box, in the Audit function drop-down
list, select the audit function you want to use against this data object type.
The audit functions displayed in the drop-down menu depend on the data object type that
you have selected.
Default values are assigned to the audit labels; you can change them if required.
4. Click OK.
5. Repeat step 2 to step 4 for all audit points.
6. On the Rule tab, under Auditing Rules, click Add.
The expression editor activates and the Custom options become available for use. The
expression editor contains three drop-down lists where you specify the audit labels for the
objects you want to audit and choose the Boolean expression to use between these labels.
7. In the left-hand-side drop-down list in the expression editor, select the audit label for the
object you want to audit.
8. In the operator drop-down list in the expression editor, select a Boolean operator.
9. In the right-hand-side drop-down list in the expression editor, select the audit label for the
second object you want to audit.
If you want to compare audit statistics for one or more objects against statistics for multiple
other objects or a constant, select the Custom radio button, and click the ellipsis button
beside Functions. This opens up the full-size smart editor where you can drag different
functions and labels to use for auditing.
10. Repeat step 7 to step 9 for all audit rules.
11. Under Action on Failure, select the action you want.
12. Click Close.
To trace audit data
1. In the project area, right-click the job and select Execute from the menu.
2. In the Execution Properties window, click the Trace tab.
3. Select Trace Audit Data.
4. In the Value drop-down list, select Yes.
5. Click OK.
The job executes and the job log displays the Audit messages based on the audit function
that is used for the audit object.
Choosing audit points
When you choose audit points, consider the following:
• The Data Services optimizer cannot push down operations after the audit point. Therefore,
if the performance of a query that is pushed to the database server is more important than
gathering audit statistics from the source, define the first audit point on the query or later
in the data flow.
For example, suppose your data flow has a source, a Query transform, and a target, and the
Query has a WHERE clause that is pushed to the database server that significantly reduces
the amount of data that returns to Data Services. Define the first audit point on the Query,
rather than on the source, to obtain audit statistics on the results.
• If a pushdown_sql function is after an audit point, Data Services cannot execute it.
• The auditing feature is disabled when you run a job with the debugger.
• If you use the CHECKSUM audit function in a job that normally executes in parallel, Data
Services disables the Degrees of Parallelism (DOP) for the whole data flow. The order of
rows is important for the result of CHECKSUM, and DOP processes the rows in a different
order than in the source. For more information on DOP, see “Using Parallel Execution” and
“Maximizing the number of push-down operations” in the Data Services Performance
Optimization Guide.
Activity: Using auditing in a data flow
Using the audit logs, you must ensure that all records from the Customer table in the Alpha
database are being moved to the Delta staging database.
Objectives
• Add audit points to the source and target tables.
• Create an audit rule to ensure that the count of both tables is the same.
• Execute the job with auditing enabled.
Instructions
1. In the Local Object Library, set up auditing for Alpha_Customers_DF by adding an audit
point to count the total number of records in the source table.
2. Add another audit point to count the total number of records in the target table.
3. Construct an audit rule that states that, if the count from both tables is not the same, the
audit must raise an exception in the log.
4. Execute Alpha_Customers_Job. Ensure that the Enable auditing option is selected on the
Parameters tab of the Execution Properties dialog box, and that the Trace Audit Data option
is enabled on the Trace tab.
Note that the audit rule passes validation.
A solution file called SOLUTION_Audit.atl is included on your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may override the results in your target table.
Quiz: Troubleshooting batch jobs
1. List some reasons why a job might fail to execute.
2. Explain the View Data feature.
3. What must you define in order to audit a data flow?
4. True or false? The auditing feature is disabled when you run a job with the debugger.
Lesson summary
After completing this lesson, you are now able to:
• Use descriptions and annotations
• Validate and trace jobs
• Use View Data and the Interactive Debugger
• Use auditing in data flows
Lesson 5
Using Functions, Scripts, and Variables
Lesson introduction
Data Services gives you the ability to perform complex operations using functions and to extend
the flexibility and re-usability of objects by writing scripts, custom functions, and expressions
using Data Services scripting language and variables.
After completing this lesson, you will be able to:
• Define built-in functions
• Use functions in expressions
• Use the lookup function
• Use the decode function
• Use variables and parameters
• Use Data Services scripting language
• Script a custom function
Defining built-in functions
Introduction
Data Services supports built-in and custom functions.
After completing this unit, you will be able to:
• Define functions
• List the types of operations available for functions
• Describe other types of functions
Defining functions
Functions take input values and produce a return value, operating on the individual values
passed to them. Input values can be parameters passed into a data flow, values from a
column of data, or variables defined inside a script.
You can use functions in expressions that include scripts and conditional statements.
Note: Data Services does not support functions that include tables as input or output parameters,
except functions imported from SAP R/3.
Listing the types of operations for functions
Functions are grouped into these categories:
• Aggregate functions perform calculations on numeric values: avg, count, count_distinct,
max, min, sum.
• Conversion functions convert values to specific datatypes: cast, interval_to_char,
julian_to_date, load_to_xml, long_to_varchar, num_to_interval, to_char, to_date,
to_decimal, to_decimal_ext, varchar_to_long.
• Custom functions perform operations defined by the user.
• Database functions perform operations specific to databases: key_generation, sql,
total_rows.
• Date functions perform calculations and conversions on date values: add_months,
concat_date_time, date_diff, date_part, day_in_month, day_in_week, day_in_year,
fiscal_day, isweekend, julian, last_date, month, quarter, sysdate, systime,
week_in_month, week_in_year, year.
• Environment functions perform operations specific to your Data Services environment:
get_env, get_error_filename, get_monitor_filename, get_trace_filename, is_set_env,
set_env.
• Lookup functions look up data in other tables: lookup, lookup_ext, lookup_seq.
• Math functions perform complex mathematical operations on numeric values: abs, ceil,
floor, ln, log, mod, power, rand, rand_ext, round, sqrt, trunc.
• Miscellaneous functions perform various operations: base64_decode, base64_encode,
current_configuration, current_system_configuration, dataflow_name,
datastore_field_value, db_database_name, db_owner, db_type, db_version, decode,
file_exists, gen_row_num, gen_row_num_by_group, get_domain_description,
get_file_attribute, greatest, host_name, ifthenelse, is_group_changed, isempty,
job_name, least, nvl, previous_row_value, pushdown_sql, raise_exception,
raise_exception_ext, repository_name, sleep, system_user_name, table_attribute,
truncate_table, wait_for_file, workflow_name.
• String functions perform operations on alphanumeric strings of data: ascii, chr,
double_metaphone, index, init_cap, length, literal, lower, lpad, lpad_ext, ltrim,
ltrim_blanks, ltrim_blanks_ext, match_pattern, match_regex, print, replace_substr,
replace_substr_ext, rpad, rpad_ext, rtrim, rtrim_blanks, rtrim_blanks_ext,
search_replace, soundex, substr, upper, word, word_ext.
• System functions perform system operations: exec, mail_to, smtp_to.
• Validation functions validate specific types of values: is_valid_date, is_valid_datetime,
is_valid_decimal, is_valid_double, is_valid_int, is_valid_real, is_valid_time.
Defining other types of functions
In addition to built-in functions, you can also use these functions:
• Database and application functions:
These functions are specific to your RDBMS. You can import the metadata for database and
application functions and use them in Data Services applications. At run time, Data Services
passes the appropriate information to the database or application from which the function
was imported.
The metadata for a function includes the input, output, and their datatypes. If there are
restrictions on data passed to the function, such as requiring uppercase values or limiting
data to a specific range, you must enforce these restrictions in the input. You can either test
the data before extraction or include logic in the data flow that calls the function.
You can import stored procedures from DB2, Microsoft SQL Server, Oracle, and Sybase
databases, stored packages from Oracle, and stored functions from SQL Server. For more
information on importing functions, see “Custom
Datastores”, in Chapter 5, in the Data Services Reference Guide.
• Custom functions:
These are functions that you define. You can create your own functions by writing script
functions in Data Services scripting language.
Using functions in expressions
Introduction
Functions can be used in expressions to map return values as new columns, which allows
columns that are not in the initial input data set to be specified in the output data set.
After completing this unit, you will be able to:
• Use functions in expressions
Defining functions in expressions
Functions are typically used to add columns based on some other value (lookup function) or
generated key fields. You can use functions in:
• Transforms: The Query, Case, and SQL transforms support functions.
• Scripts: These are single-use objects used to call functions and assign values to variables in
a work flow.
• Conditionals: These are single-use objects used to implement branch logic in a work flow.
• Other custom functions: These are functions that you create as required.
Before you use a function, you need to know if the function’s operation makes sense in the
expression you are creating. For example, the max function cannot be used in a script or
conditional where there is no collection of values on which to operate.
You can add existing functions in an expression by using the Smart Editor or the Function
wizard. The Smart Editor offers you many options, including variables, datatypes, keyboard
shortcuts, and so on. The Function wizard allows you to define parameters for an existing
function and is recommended for defining complex functions.
To use the Smart Editor
1. Open the object in which you want to use an expression.
2. Click the ellipsis (...) button.
The Smart Editor displays.
3. Click the Functions tab and expand a function category.
4. Click and drag the specific function onto the workspace.
5. Enter the input parameters based on the syntax of your formula.
6. Click OK.
To use the Function wizard
1. Open the object in which you want to use an expression.
2. Click Functions.
The Select Function dialog box opens.
3. In the Function list, select a category.
4. In the Function name list, select a specific function.
The functions shown depend on the object you are using. Clicking each function separately
also displays a description of the function below the list boxes.
5. Click Next.
The Define Input Parameter(s) dialog box displays. The options available depend on the
selected function.
6. Click the drop-down arrow next to the input parameters.
The Input Parameter dialog box displays.
7. Double-click to select the source object and column for the function.
8. Repeat steps 6 and 7 for all other input parameters.
9. Click Finish.
Activity: Using the search_replace function
When evaluating the customer data for Alpha Acquisitions, you discover a data entry error
where the contact title of Account Manager has been entered as Accounting Manager. You
want to clean up this data before it is moved to the data warehouse.
Objective
• Use the search_replace function in an expression to change the contact title from Accounting
Manager to Account Manager.
Instructions
1. In the Alpha_Customers_DF workspace, open the transform editor for the Query transform.
2. On the Mapping tab, delete the existing expression for the Title column.
3. Using the Function wizard, create a new expression for the Title column using the
search_replace function (under String functions) to replace the full string of "Accounting
Manager" with "Account Manager".
Note: Be aware that the search_replace function can react unpredictably if you use the
external table option.
4. Execute Alpha_Customers_Job with the default execution properties and save all objects
you have created.
5. Return to the data flow workspace and view data for the target table.
Note that the titles for the affected contacts have been changed.
A solution file called SOLUTION_SearchReplace.atl is included on your resource CD. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
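Outside the product, the effect of this full-string search_replace can be sketched in Python (an analogy, not the function's actual syntax): only exact matches of the search string are replaced.

```python
def fix_title(title):
    # Full-string match, as in the activity: substrings and other titles
    # are left untouched.
    return "Account Manager" if title == "Accounting Manager" else title

titles = ["Accounting Manager", "Account Manager", "Owner"]
print([fix_title(t) for t in titles])
# -> ['Account Manager', 'Account Manager', 'Owner']
```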
Using the lookup function
Introduction
Lookup functions allow you to look up values in other tables to populate columns.
After completing this unit, you will be able to:
• Use the lookup function to look up values in another table
Using lookup tables
Lookup functions allow you to use values from the source table to look up values in other
tables to generate the data that populates the target table.
Lookups enable you to store re-usable values in memory to speed up the process. Lookups are
useful for values that rarely change.
The lookup, lookup_seq, and lookup_ext functions all provide a specialized type of join, similar
to an SQL outer join. While an SQL outer join may return multiple matches for a single record
in the outer table, lookup functions always return exactly the same number of records that are
in the source table.
While all lookup functions return one row for each row in the source, they differ in how they
choose which of several matching rows to return:
• Lookup does not provide additional options for the lookup expression.
• Lookup_ext allows you to specify an Order by column and Return policy (Min, Max) to
return the record with the highest/lowest value in a given field (for example, a surrogate
key).
142 BusinessObjects Data Integrator XI 3.0: Core Concepts—Learner’s Guide
SAP Data Services – Data Integrator XI 3.0
• Lookup_seq searches in matching records to return a field from the record where the sequence
column (for example, effective_date) is closest to but not greater than a specified sequence
value (for example, a transaction date).
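The "closest to but not greater than" rule used by lookup_seq can be sketched in Python (illustrative only; the price-history data and names below are invented):

```python
# Hypothetical price history for one product: (effective_date, price).
history = [
    ("2007-01-01", 10.0),
    ("2007-06-01", 12.0),
    ("2008-01-01", 15.0),
]

def lookup_seq(rows, seq_value):
    """Return the value from the row whose sequence column is closest to,
    but not greater than, seq_value (ISO dates compare correctly as strings)."""
    candidates = [(date, value) for date, value in rows if date <= seq_value]
    return max(candidates)[1] if candidates else None

# A transaction dated 2007-09-15 picks up the price effective 2007-06-01.
print(lookup_seq(history, "2007-09-15"))  # 12.0
```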
lookup_ext
The lookup_ext function is recommended for lookup operations because of its enhanced options.
You can use this function to retrieve a value in a table or file based on the values in a different
source table or file. This function also extends functionality by allowing you to:
• Return multiple columns from a single lookup.
• Choose from more operators to specify a lookup condition.
• Specify a return policy for your lookup.
• Perform multiple (including recursive) lookups.
• Call lookup_ext in scripts and custom functions. This also lets you re-use the lookups
packaged inside scripts.
• Define custom SQL using the SQL_override parameter to populate the lookup cache,
narrowing large quantities of data to only the sections relevant for your lookup(s).
• Use lookup_ext to dynamically execute SQL.
• Call lookup_ext, using the Function wizard, in the query output mapping to return multiple
columns in a Query transform.
• Design jobs to use lookup_ext without having to hard code the name of the translation file
at design time.
• Use lookup_ext with memory datastore tables.
Tip: Use this function to the right of the Query transform instead of to the right of a column
mapping. This allows you to select multiple output columns and go back to edit the function
in the Function wizard instead of manually editing the function’s complex syntax.
Feature Details
Syntax
lookup_ext ([translate_table, cache_spec, return_policy],
[return_column_list], [default_value_list], [condition_list],
[orderby_column_list], [output_variable_list], [sql_override])
Return value Returns any type of value; the return type is that of the first
lookup column in return_column_list.
Where
The function arguments are defined as follows:
• translate_table represents the table, file, or memory datastore that
contains the value you are looking up (result_column_list).
• cache_spec represents the caching method the lookup_ext operation
uses.
• return_policy specifies whether the return columns should be
obtained from the smallest or the largest row based on values in
the order by columns.
Using Functions, Scripts, and Variables—Learner’s Guide 143
SAP Data Services – Data Integrator XI 3.0
Feature Details
• return_column_list is a comma-separated list containing the names
of output columns in the translate_table.
• default_value_list is a comma-separated list containing the default
expressions for the output columns. When no rows match the lookup
condition, the default values are returned for the output column.
• condition_list is a list of triplets that specify lookup conditions. Each
triplet contains a compare_column, a compare operator
(<, <=, >, >=, =, IS, IS NOT), and a compare expression.
• orderby_column_list is a comma-separated list of column names
from the translate_table.
• output_variable_list is a comma-separated list of output variables.
• sql_override is available in the Function wizard. It must contain a
valid, single-quoted SQL SELECT statement or a $variable of type
varchar to populate the lookup cache when the cache specification
is PRE_LOAD_CACHE.
Example
lookup(ds.owner.emp, empname, 'no body', 'NO_CACHE', empno, 1);
lookup_ext([ds.owner.emp, 'NO_CACHE', 'MAX'], [empname], ['no body'],
[empno, '=', 1]);
These expressions both retrieve the name of an employee whose empno
is equal to 1.
To create a lookup expression
1. Open the Query transform.
The Query transform should have at least one main source table and one lookup table, and
it must be connected to a single target object.
2. Select the output schema column for which the lookup function is being performed.
3. In the Mapping tab, click Functions.
The Select Function window opens.
4. In the Function list, select Lookup Functions.
5. In the Function name list, select lookup_ext.
6. Click Next.
The Lookup_ext - Select Parameters dialog box displays.
7. In the Translate table drop-down list, select the lookup table.
8. Change the caching specification, if required.
9. Under Condition, in the Table column drop-down list, select the key in the lookup table
that corresponds to the source table.
10. In the Op. drop-down list, select an operator.
11. Enter the other logical join from the source table in the Expression column.
You can click and drag the column from the Available parameters pane to the Expression
column. For a direct lookup, click and drag the key from the Input Schema (source table)
that corresponds to the lookup table.
12. Under Output parameters, in the Table column drop-down list, select the column with the
value that will be returned by the lookup function.
13. Specify default values and order by parameters, if required.
14. Click Finish.
Activity: Using the lookup_ext() function
In the Alpha Acquisitions database, the country for a customer is stored in a separate table and
referenced with a foreign key. To speed up access to information in the data warehouse, this
lookup should be eliminated.
Objective
• Use the lookup_ext function to swap the ID for the country in the Customers table for Alpha
Acquisitions with the actual value from the Countries table.
Instructions
1. In the Alpha_Customers_DF workspace, open the transform editor for the Query transform.
2. On the Mapping tab, delete the current expression for the Country column.
3. Use the Functions wizard to create a new lookup expression using the lookup_ext function
with the following parameters:
Field/Option            Value
Translate table         Alpha.alpha.country
Condition
  Table column          COUNTRYID
  Op.                   =
  Expression            customer.COUNTRYID
Output parameters
  Table column          COUNTRYNAME
The following code is generated:
lookup_ext([Alpha.alpha.country,'PRE_LOAD_CACHE','MAX'],
[COUNTRYNAME],[NULL],[COUNTRYID,'=',customer.COUNTRYID]) SET
("run_as_separate_process"='no')
4. Execute Alpha_Customers_Job with the default execution properties and save all objects
you have created.
5. Return to the data flow workspace and view data for the target table after the lookup
expression is added.
A solution file called SOLUTION_LookupFunction.atl is included on your resource CD. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may overwrite the results in your target table.
Using the decode function
Introduction
You can use the decode function as an alternative to nested if/then/else conditions.
After completing this unit, you will be able to:
• Use the decode function
Explaining the decode function
You can use the decode function to return an expression based on the first condition in the
specified list of conditions and expressions that evaluates to TRUE. It provides an alternate
way to write nested ifthenelse functions.
Use this function to apply multiple conditions when you map columns or select columns in a
query. For example, you can use this function to put customers into different groupings.
The syntax of the decode function uses the following format:
decode(condition_and_expression_list, default_expression)
The elements of the syntax break down as follows:
Element Description
Return value
expression or default_expression: returns the value associated with the first
condition that evaluates to TRUE. The data type of the return value is the
data type of the first expression in the condition_and_expression_list.
Note: If the data type of any subsequent expression or the
default_expression is not convertible to the data type of the first
expression, Data Integrator produces an error at validation. If the data
types are convertible but do not match, a warning appears at validation.
Where
condition_and_expression_list
Represents a comma-separated list of one or
more pairs that specify a variable number of
conditions. Each pair contains one condition
and one expression separated by a comma.
You must specify at least one condition and
expression pair:
• The condition evaluates to TRUE or FALSE.
Element Description
• The expression is the value that the function
returns if the condition evaluates to TRUE.
default_expression
Represents an expression that the function
returns if none of the conditions in
condition_and_expression_list evaluate to
TRUE.
Note: You must specify a default_expression.
The decode function provides an easier way to write nested ifthenelse functions. In nested
ifthenelse functions, you must write nested conditions and ensure that the parentheses are in
the correct places, as in this example:
ifthenelse((EMPNO = 1),'111',
ifthenelse((EMPNO = 2),'222',
ifthenelse((EMPNO = 3),'333',
ifthenelse((EMPNO = 4),'444',
'NO_ID'))))
In the decode function, you list the conditions as in this example:
decode((EMPNO = 1),'111',
(EMPNO = 2),'222',
(EMPNO = 3),'333',
(EMPNO = 4),'444',
'NO_ID')
Therefore, decode is less prone to error than nested ifthenelse functions.
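The first-TRUE-condition-wins behavior can be mimicked in Python (a simplification for illustration: real decode stops evaluating once a condition matches, while this sketch receives already-evaluated conditions):

```python
# A Python sketch of decode's evaluation order: conditions are tested
# in sequence and the expression paired with the first TRUE one wins.
def decode(*args):
    """decode(cond1, expr1, cond2, expr2, ..., default_expr)"""
    pairs, default = args[:-1], args[-1]
    for cond, expr in zip(pairs[::2], pairs[1::2]):
        if cond:
            return expr
    return default

empno = 3
print(decode(empno == 1, '111',
             empno == 2, '222',
             empno == 3, '333',
             empno == 4, '444',
             'NO_ID'))  # 333
```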
To improve performance, Data Services pushes this function to the database server when
possible. Thus, the database server, rather than Data Integrator, evaluates the decode function.
To configure the decode function
1. Open the Query transform.
2. Select the output schema column for which the decode function is being performed.
3. In the Mapping tab, click Functions.
The Select Function window opens.
4. In the Function list, select Miscellaneous Functions.
5. In the Function name list, select decode.
6. Click Next.
The Define Input Parameter(s) dialog box displays.
7. In the Conditional expression field, select or enter the IF clause in the case logic.
8. In the Case expression field, select or enter the THEN clause.
9. In the Default expression field, select or enter the ELSE clause.
10. Click Finish.
11. If required, add any additional THEN clauses in the mapping expression.
Activity: Using the decode function
You need to calculate the total value of all orders, including their discounts, for reporting
purposes.
Objective
• Use the sum and decode functions to calculate the total value of orders in the Order_Details
table.
Instructions
1. In the Omega project, create a new batch job called Alpha_Order_Sum_Job with a data flow
called Alpha_Order_Sum_DF.
2. In the Alpha_Order_Sum_DF workspace, add the Order_Details and Product tables from
the Alpha datastore as the source objects.
3. Add a new template table to the Delta datastore called order_sum as the target object.
4. Add a Query transform and connect all objects.
5. In the transform editor for the Query transform, on the WHERE tab, propose a join between
the two source tables.
6. Map the ORDERID column from the input schema to the output schema.
7. Create a new output column called TOTAL_VALUE with a data type of decimal(10,2).
8. On the Mapping tab of the new output column, use the Function wizard or the Smart Editor
to construct an expression to calculate the total value of the orders using the decode and
sum functions.
The discount and order total can be multiplied to determine the total after discount. The
decode function allows you to avoid multiplying an order with zero discount by zero.
Consider the following:
• The expression must specify that if the value in the DISCOUNT column is not zero
(Conditional expression), then the total value of the order is calculated by multiplying the
QUANTITY from the order_details table by the COST from the product table, and then
multiplying that value by the DISCOUNT (Case expression).
• Otherwise, the total value of the order is calculated by simply multiplying the QUANTITY
from the order_details table by the COST from the product table (Default expression).
• Once these values are calculated for each order, a sum must be calculated for the entire
collection of orders.
Tip: You can use the Function wizard to construct the decode portion of the mapping, and
then use the Smart Editor or the main window in the Mapping tab to wrap the sum function
around the expression.
The expression should be:
sum(decode(order_details.DISCOUNT <> 0, (order_details.QUANTITY * product.COST)
* order_details.DISCOUNT, order_details.QUANTITY * product.COST))
9. On the GROUP BY tab, add the order_details.ORDERID column.
10. Execute Alpha_Order_Sum_Job with the default execution properties and save all objects you
have created.
11. Return to the data flow workspace and view data for the target table after the decode
expression is added to confirm that order 11146 has a total value of $204,000.
A solution file called SOLUTION_DecodeFunction.atl is included on your resource CD. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may overwrite the results in your target table.
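The decode-plus-sum logic from this activity can be checked against a couple of hypothetical rows (the values below are invented, not the activity's actual data):

```python
# Hypothetical sample rows standing in for order_details joined to product.
rows = [
    {"QUANTITY": 10, "COST": 100.0, "DISCOUNT": 0.0},
    {"QUANTITY": 5,  "COST": 200.0, "DISCOUNT": 0.9},
]

def line_value(r):
    # decode(DISCOUNT <> 0, QUANTITY * COST * DISCOUNT, QUANTITY * COST)
    if r["DISCOUNT"] != 0:
        return r["QUANTITY"] * r["COST"] * r["DISCOUNT"]
    return r["QUANTITY"] * r["COST"]

# sum(...) wraps the per-row decode expression, as in the mapping.
total = sum(line_value(r) for r in rows)
print(total)  # 1900.0
```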
Using scripts, variables, and parameters
Introduction
With the Data Services scripting language, you can assign values to variables, call functions,
and use standard string and mathematical operators to transform data and manage work flow.
After completing this unit, you will be able to:
• Describe the purpose of scripts, variables, and parameters
• Explain the differences between global and local variables
• Set global variable values using properties
• Describe the purpose of substitution parameters
Defining scripts
To apply decision-making and branch logic to work flows, you will use a combination of scripts,
variables, and parameters to calculate and pass information between the objects in your jobs.
A script is a single-use object that is used to call functions and assign values in a work flow.
Typically, a script is executed before data flows for initialization steps and used in conjunction
with conditionals to determine execution paths. A script may also be used after work flows or data flows to record execution information such as time, or a change in the number of rows in
a data set.
Use a script when you want to calculate values that will be passed on to other parts of the work
flow. Use scripts to assign values to variables and execute functions.
A script can contain these statements:
• Function calls
• If statements
• While statements
• Assignment statements
• Operators
Defining variables
A variable is a common component in scripts that acts as a placeholder to represent values that
have the potential to change each time a job is executed. To make them easy to identify in an
expression, variable names start with a dollar sign ($). They can be of any datatype supported
by Data Services.
You can use variables in expressions in scripts or transforms to facilitate decision making or
data manipulation (using arithmetic or character substitution). A variable can be used in a
LOOP or IF statement to check a variable's value to decide which step to perform.
Note that variables can be used to enable the same expression to be used for multiple output
files. Variables can be used as file names for:
• Flat file sources and targets
• XML file sources and targets
• XML message targets (executed in the Designer in test mode)
• Document file sources and targets (in an SAP R/3 environment)
• Document message sources and targets (SAP R/3 environment)
In addition to scripts, you can also use variables in a catch or a conditional. A catch is part of
a serial sequence called a try/catch block. The try/catch block allows you to specify alternative
work flows if errors occur while Data Services is executing a job. A conditional is a single-use
object available in work flows that allows you to branch the execution logic based on the results
of an expression. The conditional takes the form of an if/then/else statement.
Defining parameters
A parameter is another type of placeholder that calls a variable. This call allows the value from
the variable in a job or work flow to be passed to the parameter in a dependent work flow or
data flow. Parameters are most commonly used in WHERE clauses.
Combining scripts, variables, and parameters
To illustrate how scripts, variables, and parameters are used together, consider an example
where you start with a job, work flow, and data flow. You want the data flow to update only
those records that have been created since the last time the job executed.
To accomplish this, you would start by creating a variable for the update time at the work flow
level, and a parameter at the data flow level that calls the variable.
Next, you would create a script within the work flow that executes before the data flow runs.
The script contains an expression that determines the most recent update time for the source
table.
The script then assigns that update time value to the variable, which identifies what that value
is used for and allows it to be re-used in other expressions.
Finally, in the data flow, you create an expression that uses the parameter to call the variable
and find out the update time. This allows the data flow to compare the update time to the
creation date of the records and identify which rows to extract from the source.
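The update-time pattern described above can be sketched in Python (all names here are hypothetical; in Data Services the wiring is done through the Variables and Parameters dialog rather than through function arguments):

```python
from datetime import datetime

# Hypothetical source rows with creation timestamps.
source_rows = [
    {"id": 1, "created": datetime(2008, 1, 10)},
    {"id": 2, "created": datetime(2008, 3, 5)},
]

# Script step: determine the most recent update time (hard-coded here,
# where a real job would query the target) and assign it to the variable.
G_last_update = datetime(2008, 2, 1)

# Data flow step: the parameter carries the variable's value into the
# WHERE-style filter, so only newer records are extracted.
def extract_new_rows(rows, p_last_update):
    return [r for r in rows if r["created"] > p_last_update]

new_rows = extract_new_rows(source_rows, G_last_update)
print([r["id"] for r in new_rows])  # [2]
```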
Defining global versus local variables
There are two types of variables: local and global.
Local variables are restricted to the job or work flow in which they are created. You must use
parameters to pass local variables to the work flows and data flows in the object.
Global variables are also restricted to the job in which they are created. However, they do not
require parameters to be passed to work flows and data flows in that job. Instead, you can
reference the global variable directly in expressions in any object in that job.
Global variables can simplify your work. You can set values for global variables in script objects
or using external job, execution, or schedule properties. For example, during production, you
can change values for default global variables at run time from a job's schedule without having
to open a job in the Designer.
Whether you use global variables or local variables and parameters depends on how and where
you need to use the variables. If you need to use the variable at multiple levels of a specific job,
it is recommended that you create a global variable.
However, there are implications to using global variables in work flows and data flows that
are re-used in other jobs. A local variable is included as part of the definition of the work flow
or data flow, and so it is portable between jobs. Because a global variable is part of the definition
of the job to which the work flow or data flow belongs, it is not included when the object is
re-used.
The following table summarizes the type of variables and parameters you can create for each
type of object.
Object      Type             Used by
Job         Global variable  Any object in the job.
Job         Local variable   A script or conditional in the job.
Work flow   Local variable   This work flow, or passed down to other work
                             flows or data flows using a parameter.
Work flow   Parameter        Parent objects, to pass local variables. Work
                             flows may also return variables or parameters
                             to parent objects.
Data flow   Parameter        A WHERE clause, column mapping, or function
                             in the data flow. Data flows cannot return
                             output values.
To ensure consistency across projects and minimize troubleshooting errors, it is a best practice
to use a consistent naming convention for your variables and parameters. Keep in mind that
names can include any alpha or numeric character or underscores (_), but cannot contain blank
spaces. To differentiate between the types of objects, start all names with a dollar sign ($), and
use the following prefixes:
Type Naming convention
Global variable $G_
Local variable $L_
Parameter $P_
To define a global variable, local variable, or parameter
1. Select the object in the project area.
For a global variable, the object must be a job. For a local variable, it can be a job or a work
flow. For a parameter, it can be a work flow or a data flow.
2. From the Tools menu, select Variables.
The Variables and Parameters dialog box displays.
3. On the Definitions tab, right-click the type of variable or parameter and select Insert from
the menu.
4. Right-click the new variable or parameter and select Properties from the menu.
The Properties dialog box displays. The properties differ depending on the type of variable
or parameter.
5. In the Name field, enter a unique name for the variable or parameter.
6. In the Data type drop-down list, select the datatype for the variable or parameter.
7. For parameters, in the Parameter type drop-down list, select whether the parameter is for
input, output, or both.
For most applications, parameters are used for input.
8. Click OK.
You can create a relationship between a local variable and the parameter by specifying the
name of the local variable as the value in the properties for the parameter on the Calls
tab.
To define the relationship between a local variable and a parameter
1. Select the dependent object in the project area.
2. From the Tools menu, select Variables to open the Variables and Parameters dialog box.
3. Click the Calls tab.
Any parameters that exist in dependent objects display on the Calls tab.
4. Right-click the parameter and select Properties from the menu.
The Parameter Value dialog box displays.
5. In the Value field, enter the name of the local variable you want the parameter to call or a
constant value.
If you enter a variable, it must be of the same datatype as the parameter.
6. Click OK.
Setting global variables using job properties
In addition to setting a variable inside a job using a script, you can also set and maintain global
variable values outside a job using properties. Values set outside a job are processed the same
way as those set in a script. However, if you set a value for the same variable both inside and
outside a job, the value from the script overrides the value from the property.
Values for global variables can be set as a job property or as an execution or schedule property.
All values defined as job properties are shown in the Properties window. By setting values
outside a job, you can rely on the Properties window for viewing values that have been set for global variables and easily edit values when testing or scheduling a job.
To set a global variable value as a job property
1. Right-click a job in the Local Object Library or project area and select Properties from the
menu.
The Properties dialog box displays.
2. Click the Global Variable tab.
All global variables for the job are listed.
3. In the Value column for the global variable, enter a constant value or an expression, as
required.
4. Click OK.
You can also view and edit these default values in the Execution Properties dialog of the
Designer. This allows you to override job property values at run time.
Data Services saves values in the repository as job properties.
Defining substitution parameters
Substitution parameters provide a way to define parameters that have a constant value in one
environment but may need to change in certain situations. When a change is needed, you make
it in one location and it affects all jobs. You can override the parameter for particular
job executions.
The typical use case is for file locations (directory files or source/target/error files) that are
constant in one environment, but will change when a job is migrated to another environment
(like migrating a job from test to production).
As with variables and parameters, the name can include any alpha or numeric character or
underscores (_), but cannot contain blank spaces. Follow the same naming convention and
always begin the name for a substitution parameter with double dollar signs ($$) and an S_
prefix to differentiate from out-of-the-box substitution parameters.
Note: When exporting a job (to a file or a repository), the substitution parameter configurations
(values) are not exported with them. You need to export substitution parameters via a separate
command to a text file and use this text file to import into another repository.
To create a substitution parameter configuration
1. From the Tools menu, select Substitution Parameter Configurations.
The Substitution Parameter Editor dialog box displays all pre-defined substitution
parameters:
2. Double-click the header for the default configuration to change the name, and then click
outside of the header to commit the change.
3. Do any of the following:
• To add a new configuration, click Create New Substitution Parameter Configuration
to add a new column, enter a name for the new configuration in the header, and click
outside of the header to commit the change. Enter the values of the substitution parameters
as required for the new configuration.
• To add a new substitution parameter, in the Substitution Parameter column of the last
line, enter the name and value for the substitution parameter.
4. Click OK.
To add a substitution parameter configuration to a system configuration
1. From the Tools menu, select System Configurations.
The System Configuration Editor dialog box displays any existing system configurations:
2. For an existing system configuration, in the Substitution Parameter drop-down list, select
the substitution parameter configuration.
3. Click OK.
Using Data Services scripting language
Introduction
With Data Services scripting language, you can assign values to variables, call functions, and
use standard string and mathematical operators. The syntax can be used in both expressions
(such as WHERE clauses) and scripts.
After completing this unit, you will be able to:
• Explain language syntax
• Use strings and variables in Data Services scripting language
Using basic syntax
Expressions are a combination of constants, operators, functions, and variables that evaluate
to a value of a given datatype. Expressions can be used inside script statements or added to
data flow objects.
Data Services scripting language follows these basic syntax rules when you are creating an
expression:
• Each statement ends with a semicolon (;).
• Variable names start with a dollar sign ($).
• String values are enclosed in single quotation marks (').
• Comments start with a pound sign (#).
• Function calls always specify parameters, even if they do not use parameters.
• Square brackets substitute the value of the expression. For example:
Print('The value of the start date is:[sysdate()+5]');
• Curly brackets quote the value of the expression in single quotation marks. For example:
$StartDate = sql('demo_target', 'SELECT ExtractHigh FROM Job_Execution_Status
WHERE JobName = {$JobName}');
Using syntax for column and table references in expressions
Because expressions can be used inside data flow objects, they often contain column names.
The Data Services scripting language recognizes column and table names without special
syntax. For example, you can indicate the start_date column as the input to a function in the
Mapping tab of a query as:
to_char(start_date, 'dd.mm.yyyy')
The column start_date must be in the input schema of the query.
If there is more than one column with the same name in the input schema of a query, indicate
which column is included in an expression by qualifying the column name with the table name.
For example, indicate the column start_date in the table status as:
status.start_date
Column and table names as part of SQL strings may require special syntax based on the RDBMS
that the SQL is evaluated by. For example, select all rows from the LAST_NAME column of
the CUSTOMER table as:
sql('oracle_ds','select CUSTOMER.LAST_NAME from CUSTOMER')
Using operators
The operators you can use in expressions are listed in the following table in order of precedence.
Note that when operations are pushed to a RDBMS to perform, the precedence is determined
by the rules of the RDBMS.
Operator Description
+ Addition
- Subtraction
* Multiplication
/ Division
= Comparison, equals
< Comparison, is less than
<= Comparison, is less than or equal to
> Comparison, is greater than
>= Comparison, is greater than or equal to
!= Comparison, is not equal to
|| Concatenate
AND Logical AND
OR Logical OR
Operator Description
NOT Logical NOT
IS NULL Comparison, is a NULL value
IS NOT NULL Comparison, is not a NULL value
Reviewing script examples
Example 1
$language = 'E';
$start_date = '1994.01.01';
$end_date = '1998.01.31';
Example 2
$start_time_str = sql('tutorial_ds', 'select to_char(start_time,\'YYYY-MM-DD
HH24:MI:SS\')');
$end_time_str = sql('tutorial_ds', 'select to_char(max(last_update),\'YYYY-MM-DD
HH24:MI:SS\')');
$start_time = to_date($start_time_str, 'YYYY-MM-DD HH24:MI:SS');
$end_time = to_date($end_time_str, 'YYYY-MM-DD HH24:MI:SS');
Example 3
$end_time_str = sql('tutorial_ds', 'select to_char(end_time,\'YYYY-MM-DD
HH24:MI:SS\')');
if (($end_time_str IS NULL) or ($end_time_str = '')) $recovery_needed = 1;
else $recovery_needed = 0;
Using strings and variables
Special care must be given to the handling of strings. Quotation marks, escape characters, and
trailing blanks can all have an adverse effect on your script if used incorrectly.
Using quotation marks
The type of quotation marks to use in strings depends on whether you are using identifiers or
constants. An identifier is the name of the object (for example, table, column, data flow, or
function). A constant is a fixed value used in computation. There are two types of constants:
• String constants (for example, 'Hello' or '2007.01.23')
• Numeric constants (for example, 2.14)
Identifiers need quotation marks if they contain special (non-alphanumeric) characters. For
example, you need double quotation marks for the following identifier because it contains blanks:
"compute large numbers"
Use single quotes for string constants.
Using escape characters
If a constant contains a single quote (') or backslash (\) or another special character used by
the Data Services scripting language, then those characters must be preceded by an escape
character to be evaluated properly in a string. Data Services uses the backslash (\) as the escape
character.
Character Example
Single quote (') 'World\'s Books'
Backslash (\) 'C:\\temp'
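As an illustration of this escaping rule, the same substitution can be mimicked with a small Python helper (a hypothetical sketch for the classroom, not part of Data Services; note that backslashes must be escaped before quotes so the quote escapes are not doubled):

```python
def escape_ds_constant(text: str) -> str:
    """Escape backslashes and single quotes the way the Data Services
    scripting language expects inside a single-quoted string constant."""
    # Escape backslashes first, then single quotes.
    return text.replace("\\", "\\\\").replace("'", "\\'")

print(escape_ds_constant("World's Books"))  # World\'s Books
print(escape_ds_constant("C:\\temp"))       # C:\\temp
```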
Handling nulls, empty strings, and trailing blanks
To conform to the ANSI VARCHAR standard when dealing with NULLS, empty strings, and
trailing blanks, Data Services:
• Treats an empty string as a zero length varchar value, instead of as a NULL value.
• Returns a value of FALSE when you use the operators Equal (=) and Not Equal (<>) to
compare to a NULL value.
• Provides IS NULL and IS NOT NULL operators to test for NULL values.
• Treats trailing blanks as regular characters when reading from all sources, instead of trimming
them.
• Ignores trailing blanks in comparisons in transforms (Query and Table Comparison) and
functions (decode, ifthenelse, lookup, lookup_ext, lookup_seq).
NULL values
To represent NULL values in expressions, type the word NULL. For example, you can check
whether a column (COLX) is null or not with the following expressions:
COLX IS NULL
COLX IS NOT NULL
Data Services does not check for NULL values in data columns. Use the function nvl to remove
NULL values. For more information on the NVL function, see “Functions and Procedures”,
Chapter 6 in the Data Services Reference Guide.
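The behavior of nvl can be pictured with a short Python analogy, modeling NULL as None (a sketch for illustration only, not the Data Services implementation):

```python
def nvl(value, substitute):
    """Analogue of the Data Services nvl function: return the substitute
    when the incoming value is NULL (modeled here as Python's None)."""
    return substitute if value is None else value

print(nvl(None, 0))      # 0
print(nvl("Smith", ""))  # Smith
```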
NULL values and empty strings
Data Services uses the following two rules with empty strings:
• When you assign an empty string to a variable, Data Services treats the value of the variable
as a zero-length string.
An error results if you assign an empty string to a variable that is not a varchar. To assign
a NULL value to a variable of any type, use the NULL constant.
• As a constant (''), Data Services treats the empty string as a varchar value of zero length.
Use the NULL constant for the NULL value.
Data Services uses the following three rules with NULLS and empty strings in conditionals:
Rule 1
The Equals (=) and Is Not Equal to (<>) comparison operators against a NULL value always
evaluate to FALSE. This FALSE result includes comparing a variable that has a value of NULL
against a NULL constant.
The following table shows the comparison results for the variable assignments $var1 = NULL
and $var2 = NULL:
Condition Translates to Returns
If (NULL = NULL) NULL is equal to NULL FALSE
If (NULL != NULL) NULL is not equal to NULL FALSE
If (NULL = '') NULL is equal to empty string FALSE
If (NULL != '') NULL is not equal to empty string FALSE
If ('bbb' = NULL) bbb is equal to NULL FALSE
If ('bbb' != NULL) bbb is not equal to NULL FALSE
If ('bbb' = '') bbb is equal to empty string FALSE
If ('bbb' != '') bbb is not equal to empty string TRUE
If ($var1 = NULL) NULL is equal to NULL FALSE
If ($var1 != NULL) NULL is not equal to NULL FALSE
If ($var1 = '') NULL is equal to empty string FALSE
If ($var1 != '') NULL is not equal to empty string FALSE
If ($var1 = $var2) NULL is equal to NULL FALSE
If ($var1 != $var2) NULL is not equal to NULL FALSE
The following table shows the comparison results for the variable assignments $var1 = ''
and $var2 = '':
Condition Translates to Returns
If ($var1 = NULL) Empty string is equal to NULL FALSE
If ($var1 != NULL) Empty string is not equal to NULL FALSE
If ($var1 = '') Empty string is equal to empty string TRUE
If ($var1 != '') Empty string is not equal to empty string FALSE
If ($var1 = $var2) Empty string is equal to empty string TRUE
If ($var1 != $var2) Empty string is not equal to empty string FALSE
Rule 2
Use the IS NULL and IS NOT NULL operators to test for the presence of NULL values. For example,
assuming a variable assignment $var1 = NULL;
Condition Translates to Returns
If ('bbb' IS NULL) bbb is NULL FALSE
If ('bbb' IS NOT NULL) bbb is not NULL TRUE
If ('' IS NULL) Empty string is NULL FALSE
If ('' IS NOT NULL) Empty string is not NULL TRUE
If ($var1 IS NULL) NULL is NULL TRUE
If ($var1 IS NOT NULL) NULL is not NULL FALSE
Rule 3
When comparing two variables, always test for NULL. In this scenario, you are not testing a
variable with a value of NULL against a NULL constant (as in the first rule). Either test each
variable and branch accordingly or test in the conditional as shown in the second row of the
following table.
Condition Recommendation
If ($var1 = $var2) Do not compare without explicitly testing for
NULLS. This logic is not recommended because any relational comparison to a NULL
value returns FALSE.
If ( (($var1 IS NULL) AND ($var2 IS
NULL)) OR ($var1 = $var2))
Executes the TRUE branch if both $var1 and
$var2 are NULL, or if neither is NULL but they
are equal to each other.
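The recommended pattern can be sketched in Python, again modeling NULL as None (an illustrative analogy, not Data Services code):

```python
def null_safe_equal(a, b):
    """Sketch of Rule 3: treat two NULLs (None) as matching; otherwise any
    relational comparison involving NULL yields FALSE, so fall through to
    a plain equality test only when neither side is NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False  # relational comparison to NULL returns FALSE
    return a == b

print(null_safe_equal(None, None))  # True
print(null_safe_equal(None, ""))    # False
print(null_safe_equal("", ""))      # True
```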
Scripting a custom function
Introduction
If the built-in functions that are provided by Data Services do not meet your requirements, you
can create your own custom functions using the Data Services scripting language.
After completing this unit, you will be able to:
• Create a custom function
• Import a stored procedure to use as a custom function
Creating a custom function
You can create your own functions by writing script functions in the Data Services scripting
language using the Smart Editor. Saved custom functions appear in the Function wizard and
the Smart Editor under the Custom Functions category, and are also displayed on the Custom
Functions tab of the Local Object Library. You can edit and delete custom functions from the
Local Object Library.
Consider these guidelines when you create your own functions:
• Functions can call other functions.
• Functions cannot call themselves.
• Functions cannot participate in a cycle of recursive calls. For example, function A cannot
call function B if function B calls function A.
• Functions return a value.
• Functions can have parameters for input, output, or both. However, data flows cannot pass
parameters of type output or input/output.
Before creating a custom function, you must know the input, output, and return values and
their datatypes. The return value is predefined to be Return.
To create a custom function
1. On the Custom Functions tab of the Local Object Library, right-click the white space and
select New from the menu.
The Custom Function dialog box displays.
2. In the Function name field, enter a unique name for the new function.
3. In the Description field, enter a description.
4. Click Next.
The Smart Editor enables you to define the return type, parameter list, and any variables to
be used in the function.
5. On the Variables tab, expand the Parameters branch.
6. Right-click Return and select Properties from the menu.
The Return value Properties dialog box displays.
7. In the Data type drop-down list, select the datatype you want to return for the custom
function.
By default, the return datatype is set to integer.
8. Click OK.
9. To define a new variable or parameter for your custom function, in the Variables tab,
right-click the appropriate branch and select Insert from the menu.
10. In the Name field, enter a unique name for the variable or parameter.
11. In the Data type drop-down list, select the datatype for the variable or parameter.
12. For a parameter, in the Parameter type drop-down list, select whether the parameter is for
input, output, or both.
Data Services data flows cannot pass variable parameters of type output and input/output.
13. Click OK.
14. Repeat step 9 to step 13 for each variable or parameter required in your function.
When adding subsequent variables or parameters, the right-click menu will include options
to Insert Above or Insert Below. Use these menu commands to create, delete, or edit variables
or parameters.
15. In the main area of the Smart Editor, enter the expression for your function.
Your expression must include the Return parameter.
16. Click Validate to check the syntax of your function.
If your function contains syntax errors, Data Services displays a list of those errors in an
embedded pane below the editor. To see where the error occurs in the text, double-click an
error. The Smart Editor redraws to show the location of the error.
17. Click OK.
To edit a custom function
1. On the Custom Functions tab of the Local Object Library, right-click the custom function
and select Edit from the menu.
2. In the Smart Editor, change the expression as required.
3. Click OK.
To delete a custom function
1. On the Custom Functions tab of the Local Object Library, right-click the custom function
and select Delete from the menu.
2. Click OK to confirm the deletion.
Importing a stored procedure as a function
If you are using Microsoft SQL Server, you can use stored procedures to insert, update, and
delete data in your tables. To use stored procedures in Data Services, you must import them
as custom functions.
To import a stored procedure
1. On the Datastores tab of the Local Object Library, expand the datastore that contains the
stored procedure.
2. Right-click Functions and select Import By Name from the menu.
The Import By Name dialog box displays.
3. In the Type drop-down list, select Function.
4. In the Name field, enter the name of the stored procedure.
5. Click OK.
Activity: Creating a custom function
The Marketing department would like to send special offers to customers who have placed a
specified number of orders. This requires creating a custom function that can be called in a
real-time job as a customer's order is entered into the system.
Objectives
• Create a custom function to accept the input parameters of the Customer ID and the number
of orders required to receive a special offer, check the Orders table, and return a value of
1 or 0.
• Create a batch job using the custom function to create an initial list of customers who have
placed five or more orders, and are therefore eligible to receive the special offer.
Instructions
1. In the Local Object Library, create a new custom function called CF_MarketingOffer.
2. In the Smart Editor for the function, create a parameter called $P_CustomerID with a datatype
of varchar(10) and a type of input.
3. Create a second parameter called $P_Orders with a datatype of int and a type of input.
4. Define the custom function as a conditional clause that specifies that, if the number of rows
in the Orders table for the Customer ID is greater than or equal to the $P_Orders value, then
the function should return a 1; otherwise, it should return 0.
The syntax should be as follows:
if ((sql('alpha', 'select count(*) from orders where customerid =
[$P_CustomerID]')) >= $P_Orders)
Return 1;
else Return 0;
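To make the conditional concrete, the same logic can be sketched in Python with an in-memory sqlite3 database standing in for the alpha datastore (the table contents, connection, and function name here are illustrative assumptions, not the course environment):

```python
import sqlite3

def cf_marketing_offer(conn, customer_id, orders_needed):
    """Analogue of the CF_MarketingOffer logic: return 1 when the customer
    has placed at least orders_needed orders, otherwise 0."""
    row = conn.execute(
        "select count(*) from orders where customerid = ?", (customer_id,)
    ).fetchone()
    return 1 if row[0] >= orders_needed else 0

# Hypothetical sample data: C001 has six orders, C002 has one.
conn = sqlite3.connect(":memory:")
conn.execute("create table orders (orderid integer, customerid text)")
conn.executemany("insert into orders values (?, ?)",
                 [(i, "C001") for i in range(6)] + [(10, "C002")])
print(cf_marketing_offer(conn, "C001", 5))  # 1
print(cf_marketing_offer(conn, "C002", 5))  # 0
```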
5. In the Omega project, create a new batch job called Alpha_Marketing_Offer_Job with a data
flow called Alpha_Marketing_Offer_DF.
6. Create a new global variable for the job called $G_Num_to_Qual with a datatype of int.
7. In the job workspace, to the left of the data flow, create a new script called CheckOrders and
create an expression in the script to define the global variable as five orders to qualify.
The expression should be:
$G_Num_to_Qual = 5;
8. Connect the script to the data flow.
9. In the data flow workspace, add the Customer table from the Alpha datastore as the source
object.
10. Add a template table to the Delta datastore called offer_mailing_list as the target object.
11. Add two Query transforms and connect all objects.
12. In the transform editor for the first Query transform, map the following columns:
Schema In Schema Out
CONTACTNAME CONTACTNAME
ADDRESS ADDRESS
CITY CITY
POSTALCODE POSTALCODE
13. Create a new output column called OFFER_STATUS with a datatype of int.
14. On the Mapping table, map the new output column to the custom function using the Function
wizard. Specify the CUSTOMERID column for $P_CustomerID and the global variable for
$P_Orders.
The expression should be as follows:
CF_MarketingOffer(customer.CUSTOMERID, $G_Num_to_Qual)
15. In the transform editor for the second Query transform, map the following columns:
Schema In Schema Out
CONTACTNAME CONTACTNAME
ADDRESS ADDRESS
CITY CITY
POSTALCODE POSTALCODE
16. On the WHERE tab, create an expression to select only those records where the
OFFER_STATUS value is 1.
The expression should be:
Query.OFFER_STATUS = 1
17. Execute Alpha_Marketing_Offer_Job with the default execution properties and save all
objects you have created.
18. Return to the data flow workspace and view data for the target table.
You should have one output record for contact Lev M. Melton in Quebec.
A solution file called SOLUTION_CustomFunction.atl is included in your resource CD. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
Quiz: Using functions, scripts, and variables
1. Describe the differences between a function and a transform.
2. Why are functions used in expressions?
3. What does a lookup function do? How do the different variations of the lookup function
differ?
4. What value would the Lookup_ext function return if multiple matching records were found
on the translate table?
5. Explain the differences between a variable and a parameter.
6. When would you use a global variable instead of a local variable?
7. What is the recommended naming convention for variables in Data Services?
8. Which object would you use to define a value that is constant in one environment, but may
change when a job is migrated to another environment?
a. Global variable
b. Local variable
c. Parameter
d. Substitution parameter
Lesson summary
After completing this lesson, you are now able to:
• Define built-in functions
• Use functions in expressions
• Use the lookup function
• Use the decode function
• Use variables and parameters
• Use Data Services scripting language
• Script a custom function
Lesson 6
Using Platform Transforms
Lesson introduction
A transform enables you to control how data sets change in a data flow.
After completing this lesson, you will be able to:
• Describe platform transforms
• Use the Map Operation transform
• Use the Validation transform
• Use the Merge transform
• Use the Case transform
• Use the SQL transform
Describing platform transforms
Introduction
Transforms are optional objects in a data flow that allow you to transform your data as it moves
from source to target.
After completing this unit, you will be able to:
• Explain transforms
• Describe the platform transforms available in Data Services
• Add a transform to a data flow
• Describe the Transform Editor window
Explaining transforms
Transforms are objects in data flows that operate on input data sets by changing them or by
generating one or more new data sets. The Query transform is the most commonly used
transform.
Transforms are added as components to your data flow in the same way as source and target
objects. Each transform provides different options that you can specify based on the transform's
function. You can choose to edit the input data, output data, and parameters in a transform.
Some transforms, such as the Date Generation and SQL transforms, can be used as source
objects, in which case they do not have input options.
Transforms are often used in combination to create the output data set. For example, the Table
Comparison, History Preserve, and Key Generation transforms are used for slowly changing
dimensions.
Transforms are similar to functions in that they can produce the same or similar values during
processing. However, transforms and functions operate on a different scale:
• Functions operate on single values, such as values in specific columns in a data set.
• Transforms operate on data sets by creating, updating, and deleting rows of data.
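The difference in scale can be pictured with a toy Python sketch (illustrative only, not Data Services code): a function touches one value at a time, while a transform produces a whole new data set.

```python
def initcap(value):
    """Function: operates on a single value."""
    return value.capitalize()

def transform(rows):
    """Transform: operates on the whole data set, row by row."""
    return [dict(r, NAME=initcap(r["NAME"])) for r in rows]

print(transform([{"NAME": "alpha"}, {"NAME": "omega"}]))
# [{'NAME': 'Alpha'}, {'NAME': 'Omega'}]
```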
Describing platform transforms
The following platform transforms are available on the Transforms tab of the Local Object
Library:
Icon Transform Description
Case Divides the data from an input data set into multiple output
data sets based on IF-THEN-ELSE branch logic.
Map Operation Allows conversions between operation codes.
Merge Unifies rows from two or more input data sets into a single
output data set.
Query Retrieves a data set that satisfies conditions that you specify.
A query transform is similar to a SQL SELECT statement.
Row Generation Generates a column filled with integers starting at zero and
incrementing by one to the end value you specify.
SQL Performs the indicated SQL query operation.
Validation
Allows you to specify validation criteria for an input data set.
Data that fails validation can be filtered out or replaced. You
can have one validation rule per column.
Using the Map Operation transform
Introduction
The Map Operation transform enables you to change the operation code for records.
After completing this unit, you will be able to:
• Describe map operations
• Use the Map Operation transform
Describing map operations
Data Services maintains operation codes that describe the status of each row in each data set
described by the inputs to and outputs from objects in data flows. The operation codes indicate
how each row in the data set would be applied to a target table if the data set were loaded into
a target. The operation codes are as follows:
Operation Code Description
NORMAL Creates a new row in the target. All rows in a data set are flagged as
NORMAL when they are extracted from a source table or file. If a row is
flagged as NORMAL when loaded into a target table or file, it is inserted
as a new row in the target. Most transforms operate only on rows flagged
as NORMAL.
INSERT Creates a new row in the target. Only History Preserving and Key
Generation transforms can accept data sets with rows flagged as INSERT
as input.
DELETE Is ignored by the target. Rows flagged as DELETE are not loaded. Only
the History Preserving transform, with the Preserve delete row(s) as
update row(s) option selected, can accept data sets with rows flagged as
DELETE.
UPDATE Overwrites an existing row in the target table. Only History Preserving
and Key Generation transforms can accept data sets with rows flagged as
UPDATE as input.
Explaining the Map Operation transform
The Map Operation transform allows you to change operation codes on data sets to produce
the desired output. For example, if a row in the input data set has been updated in some previous
operation in the data flow, you can use this transform to map the UPDATE operation to an
INSERT. The result could be to convert UPDATE rows to INSERT rows to preserve the existing
row in the target.
Data Services can push Map Operation transforms to the source database.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Map Operation transform. For more information on the Map
Operation transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs
Input for the Map Operation transform is a data set with rows flagged with any operation
codes. It can contain hierarchical data.
Use caution when using columns of datatype real in this transform, because comparison results
are unpredictable for this datatype.
Output for the Map Operation transform is a data set with rows flagged as specified by the
mapping operations.
Options
The Map Operation transform enables you to set the Output row type option to indicate the
new operations desired for the input data set. Choose from the following operation codes:
INSERT, UPDATE, DELETE, NORMAL, or DISCARD.
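The effect of these settings can be sketched in Python, representing each row as an (operation code, data) pair (a classroom analogy, not the Data Services engine):

```python
def map_operation(rows, mapping):
    """Sketch of the Map Operation transform: re-flag each row's operation
    code per the mapping; rows mapped to DISCARD are dropped entirely."""
    out = []
    for op, data in rows:
        new_op = mapping.get(op, op)  # unmapped codes pass through unchanged
        if new_op != "DISCARD":
            out.append((new_op, data))
    return out

rows = [("NORMAL", {"id": 1}), ("UPDATE", {"id": 2}), ("DELETE", {"id": 3})]
print(map_operation(rows, {"UPDATE": "INSERT", "DELETE": "DISCARD"}))
# [('NORMAL', {'id': 1}), ('INSERT', {'id': 2})]
```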
Activity: Using the Map Operation transform
End users of employee reports have requested that employee records in the data mart contain
only current employees.
Objective
• Use the Map Operation transform to remove any employee records that have a value in the
discharge_date column.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Current_Job with a
data flow called Alpha_Employees_Current_DF.
2. In the data flow workspace, add the Employee table from the Alpha datastore as the source
object.
3. Add the Employee table from the HR_datamart datastore as the target object.
4. Add the Query transform to the workspace and connect all objects.
5. In the transform editor for the Query transform, map all columns from the input schema to
the same column in the output schema.
6. On the WHERE tab, create an expression to select only those rows where discharge_date is
not empty.
The expression should be:
employee.discharge_date is not null
7. In the data flow workspace, disconnect the Query transform from the target table.
8. Add a Map Operation transform between the Query transform and the target table and
connect it to both.
9. In the transform editor for the Map Operation transform, change the settings so that rows
with an input operation code of NORMAL have an output operation code of DELETE.
10. Execute Alpha_Employees_Current_Job with the default execution properties and save all
objects you have created.
11. Return to the data flow workspace and view data for both the source and target tables.
Note that two rows were filtered from the target table.
A solution file called SOLUTION_MapOperation.atl is included in your resource CD. To check
the solution, import the file and open it to view the data flow design and mapping logic. Do
not execute the solution job, as this may override the results in your target table.
Using the Validation transform
Introduction
The Validation transform enables you to create validation rules and move data into target
objects based on whether they pass or fail validation.
After completing this unit, you will be able to:
• Use the Validation transform
Explaining the Validation transform
Use the Validation transform in your data flows when you want to ensure that the data at any
stage in the data flow meets your criteria.
For example, you can set the transform to ensure that all values:
• Are within a specific range
• Have the same format
• Do not contain NULL values
The Validation transform allows you to define a re-usable business rule to validate each record
and column. The Validation transform qualifies a data set based on rules for input schema
columns. It filters out or replaces data that fails your criteria. The available outputs are pass
and fail. You can have one validation rule per column.
For example, if you want to load only sales records for October 2007, you would set up a
validation rule that states: Sales Date is between 10/1/2007 and 10/31/2007. Data Services looks
at this date field in each record to validate whether the data meets this requirement. If it does
not, you can choose to pass the record into a Fail table, correct it in the Pass table, or do both.
Your validation rule consists of a condition and an action on failure:
• Use the condition to describe what you want for your valid data.
For example, specify the condition IS NOT NULL if you do not want any NULLS in data
passed to the specified target.
• Use the Action on Failure area to describe what happens to invalid or failed data.
Continuing with the example above, for any NULL values, you may want to select the Send
to Fail option to send all NULL values to a specified FAILED target table.
You can also create a custom Validation function and select it when you create a validation
rule. For more information on creating custom Validation functions, see "Validation
Transform", Chapter 12 in the Data Services Reference Guide.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Validation transform. For more information on the Validation
transform, see "Transforms", Chapter 5 in the Data Services Reference Guide.
Input/Output
Only one source is allowed as a data input for the Validation transform.
The Validation transform outputs up to two different data sets based on whether the records
pass or fail the validation condition you specify. You can load pass and fail data into multiple
targets.
The Pass output schema is identical to the input schema. Data Services adds the following two
columns to the Fail output schemas:
• The DI_ERRORACTION column indicates where failed data was sent in this way:
○ The letter B is used for sent to both Pass and Fail outputs.
○ The letter F is used for sent only to the Fail output.
If you choose to send failed data to the Pass output, Data Services does not track the results.
You may want to substitute a value for failed data that you send to the Pass output because
Data Services does not add columns to the Pass output.
• The DI_ERRORCOLUMNS column displays all error messages for columns with failed
rules. The names of input columns associated with each message are separated by colons.
For example, “<ValidationTransformName> failed rule(s): c1:c2”.
If a row has conditions set for multiple columns and the Pass, Fail, and Both actions are
specified for the row, then the precedence order is Fail, Both, Pass. For example, if one
column’s action is Send to Fail and the column fails, then the whole row is sent only to the
Fail output. Other actions for other validation columns in the row are ignored.
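The pass/fail split and the added error columns can be sketched in Python, with rows as dictionaries (an illustrative analogy; the column names mirror the description above, but the condition and data are hypothetical):

```python
def validate(rows, column, condition, action="F"):
    """Sketch of a single-column validation rule: rows satisfying the
    condition go to the Pass output; failed rows gain DI_ERRORACTION
    ('F' = Fail output only, 'B' = both outputs) and DI_ERRORCOLUMNS."""
    pass_out, fail_out = [], []
    for row in rows:
        if condition(row[column]):
            pass_out.append(row)
        else:
            fail_out.append(dict(row, DI_ERRORACTION=action,
                                 DI_ERRORCOLUMNS="Validation failed rule(s): " + column))
            if action == "B":
                pass_out.append(row)  # Pass output gets no extra columns
    return pass_out, fail_out

rows = [{"fax": "555-0100"}, {"fax": None}]
good, bad = validate(rows, "fax", lambda v: v is not None)
print(len(good), len(bad))  # 1 1
```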
Options
When you use the Validation transform, you select a column in the input schema and create a
validation rule in the Validation transform editor. The Validation transform offers several
options for creating this validation rule:
Option Description
Enable Validation Turn the validation rule on and off for the
column.
Do not validate when NULL Send all NULL values to the Pass output
automatically. Data Services will not apply the
validation rule on this column when an
incoming value for it is NULL.
Condition Define the condition for the validation rule:
• Operator: select an operator for a Boolean
expression (for example, =, <, >) and enter
the associated value.
• In: specify a list of possible values for a
column.
• Between/and: specify a range of values for
a column.
• Match pattern: enter a pattern of upper and
lowercase alphanumeric characters to
ensure the format of the column is correct.
• Custom validation function: select a
function from a list for validation purposes.
Data Services supports Validation functions
that take one parameter and return an
integer datatype. If a return value is not a
zero, then Data Services processes it as
TRUE.
• Exists in table: specify that a column’s value
must exist in a column in another table. This
option also uses the LOOKUP_EXT
function. You can define the NOT NULL
constraint for the column in the LOOKUP
table to ensure the Exists in table condition
executes properly.
• Custom condition: create more complex
expressions using the function and smart
editors.
Data Services converts substitute values in the
condition to a corresponding column datatype:
integer, decimal, varchar, date, datetime,
timestamp, or time. The Validation transform
requires that you enter some values in specific
formats:
• date (YYYY.MM.DD)
• datetime (YYYY.MM.DD HH24:MI:SS)
• time (HH24:MI:SS)
• timestamp (YYYY.MM.DD HH24:MI:SS.FF)
If, for example, you specify a date as
12-01-2004, Data Services produces an error
because you must enter this date as 2004.12.01.
Action on Fail Define where a record is loaded if it fails the
validation rule:
• Send to Fail
• Send to Pass
• Send to Both
If you choose Send to Pass or Send to Both,
you can choose to substitute a value or
expression for the failed values that are sent
to the Pass output.
To create a validation rule
1. Open the data flow workspace.
2. Add your source object to the workspace.
3. On the Transforms tab of the Local Object Library, click and drag the Validation transform
to the workspace to the right of your source object.
4. Add your target objects to the workspace.
You will require one target object for records that pass validation, and an optional target
object for records that fail validation, depending on the options you select.
5. Connect the source object to the transform.
6. Double-click the Validation transform to open the transform editor.
7. In the input schema area, click to select an input schema column.
8. In the parameters area, select the Enable Validation option.
9. In the Condition area, select a condition type and enter any associated value required.
All conditions must be Boolean expressions.
10. On the Properties tab, enter a name and description for the validation rule.
11. On the Action On Failure tab, select an action.
12. If desired, select the For pass, substitute with option and enter a substitute value or
expression for the failed value that is sent to the Pass output.
This option is only available if you select Send to Pass or Send to Both.
13. Click Back to return to the data flow workspace.
14. Click and drag from the transform to the target object.
15. Release the mouse and select the appropriate label for that object from the pop-up menu.
16. Repeat step 14 and step 15 for all target objects.
Activity: Using the Validation transform
Order data is stored in multiple formats with different structures and different information.
You will use the Validation transform to validate order data from flat file sources and the alpha
orders table before merging it.
Objectives
• Join the data in the Orders flat files with that in the Order_Shippers flat files.
• Use the Validation transform to create a new column on the target table, named
order_assigned_to, so that orders taken by employees who are no longer with the company
are assigned to a default current employee.
• Create a column to hold the employee ID of the employee who originally made the sale.
• Replace null values in the shipper fax column with a value of 'No Fax' and send those rows
to a separate table for follow up.
Instructions
1. Create a file format called Order_Shippers_Format for the flat file
Order_Shippers_04_20_07.txt. Use the structure of the text file to determine the appropriate
settings.
2. In the Column Attributes pane, adjust the datatypes for the columns based on their content:
Column Datatype
ORDERID int
SHIPPERNAME varchar(50)
SHIPPERADDRESS varchar(50)
SHIPPERCITY varchar(50)
SHIPPERCOUNTRY int
SHIPPERPHONE varchar(20)
SHIPPERFAX varchar(20)
SHIPPERREGION int
SHIPPERPOSTALCODE varchar(15)
3. In the Omega project, create a new batch job called Alpha_Orders_Validated_Job and two
data flows, one named Alpha_Orders_Files_DF, and the second named Alpha_Orders_DB_DF.
4. Add the file formats Orders_Format and Order_Shippers_Format as source objects to the
Alpha_Orders_Files_DF data flow workspace.
5. Edit the source objects so that the Orders_Format source is using all three related orders
flat files and the Order_Shippers_Format source is using all three order shippers files.
Tip: You can use a wildcard to replace the dates in the file names.
6. Add a Query transform to the workspace and connect it to the two source objects.
7. In the transform editor for the Query transform, create a WHERE clause to join the data on
the OrderID values.
The expression should be as follows:
Order_Shippers_Format.ORDERID = Orders_Format.ORDERID
8. Add the following mappings in the Query transform:
Schema Out Mapping
ORDERID Orders_Format.ORDERID
CUSTOMERID Orders_Format.CUSTOMERID
ORDERDATE Orders_Format.ORDERDATE
SHIPPERNAME Order_Shippers_Format.SHIPPERNAME
SHIPPERADDRESS Order_Shippers_Format.SHIPPERADDRESS
SHIPPERCITY Order_Shippers_Format.SHIPPERCITY
SHIPPERCOUNTRY Order_Shippers_Format.SHIPPERCOUNTRY
SHIPPERPHONE Order_Shippers_Format.SHIPPERPHONE
SHIPPERFAX Order_Shippers_Format.SHIPPERFAX
SHIPPERREGION Order_Shippers_Format.SHIPPERREGION
SHIPPERPOSTALCODE Order_Shippers_Format.SHIPPERPOSTALCODE
9. Insert a new output column above ORDERDATE called ORDER_TAKEN_BY with a datatype
of varchar(15) and map it to Orders_Format.EMPLOYEEID.
10. Insert a new output column above ORDERDATE called ORDER_ASSIGNED_TO with a datatype
of varchar(15) and map it to Orders_Format.EMPLOYEEID.
11. Add a Validation transform to the right of the Query transform and connect the transforms.
12. In the transform editor for the Validation transform, enable validation for the
ORDER_ASSIGNED_TO column to verify the value in the column exists in the EMPLOYEEID
column of the Employee table in the HR_datamart datastore.
The expression should be as follows:
HR_datamart.hr_datamart.employee.EMPLOYEEID
13. Set the action on failure for the Order_Assigned_To column to send to both pass and fail.
For pass, substitute '3Cla5' to assign it to the default employee.
14. Enable validation for the SHIPPERFAX column to send NULL values to both pass and fail,
substituting 'No Fax' for pass.
15. Add two target tables in the Delta datastore as targets, one called Orders_Files_Work and
one called Orders_Files_No_Fax.
16. Connect the pass output from the Validation transform to Orders_Files_Work and the fail
output to Orders_Files_No_Fax.
17. In the Alpha_Orders_DB_DF workspace, add the Orders table from the Alpha datastore as
the source object.
18. Add a Query transform to the workspace and connect it to the source.
19. In the transform editor for the Query transform, define the following mappings:
Column Mapping
ORDERID Orders.ORDERID
CUSTOMERID Orders.CUSTOMERID
ORDERDATE Orders.ORDERDATE
SHIPPERNAME Orders.SHIPPERNAME
SHIPPERADDRESS Orders.SHIPPERADDRESS
SHIPPERCITY Orders.SHIPPERCITYID
SHIPPERCOUNTRYID Orders.SHIPPERCOUNTRY
SHIPPERPHONE Orders.SHIPPERPHONE
SHIPPERFAX Orders.SHIPPERFAX
SHIPPERREGION Orders.SHIPPERREGION
SHIPPERPOSTALCODE Orders.SHIPPERPOSTALCODE
20. Insert a new output column above ORDERDATE called ORDER_TAKEN_BY with a data type
of varchar(10) and map it to Orders.EMPLOYEEID.
21. Insert a new output column above ORDERDATE called ORDER_ASSIGNED_TO with a data
type of varchar(10) and map it to Orders.EMPLOYEEID.
22. Add a Validation transform to the right of the Query transform and connect the transforms.
23. Enable validation for Order_Assigned_To to verify the column value exists in the
EMPLOYEEID column of the Employee table in the HR_datamart datastore.
24. Set the action on failure for the Order_Assigned_To column to send to both pass and fail.
For pass, substitute '3Cla5' to assign it to the default employee.
25. Enable validation for the ShipperFax column to send NULL values to both pass and fail,
substituting 'No Fax' for pass.
26. Add two target tables in the Delta datastore as targets, one named Orders_DB_Work and
one named Orders_DB_No_Fax.
27. Connect the pass output from the Validation transform to Orders_DB_Work and the fail
output to Orders_DB_No_Fax.
28. Execute Alpha_Orders_Validated_Job with the default execution properties and save all
objects you have created.
29. View the data in the target tables to view the differences between passing and failing records.
A solution file called SOLUTION_Validation.atl is included on your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may overwrite the results in your target table.
Using the Merge transform
Introduction
The Merge transform allows you to combine multiple sources with the same schema into a
single target.
After completing this unit, you will be able to:
• Use the Merge transform
Explaining the Merge transform
The Merge transform combines incoming data sets with the same schema structure to produce
a single output data set with the same schema as the input data sets.
For example, you could use the Merge transform to combine two sets of address data:
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Merge transform. For more information on the Merge transform see
“Transforms” Chapter 5 in the Data Services Reference Guide.
Input/Output
The Merge transform performs a union of the sources. All sources must have the same schema
as shown in the diagram below, including:
• Number of columns
• Column names
• Column datatypes
If the input data set contains hierarchical data, the names and datatypes must match at every
level of the hierarchy.
The output data has the same schema as the source data. The output data set contains a row
for every row in the source data sets. The transform does not strip out duplicate rows. If columns
in the input set contain nested schemas, the nested data is passed through without change.
Tip: If you want to merge tables that do not have the same schema, you can add the Query
transform to one of the tables before the Merge transform to redefine the schema to match the
other table.
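As a rough illustration of these semantics (a Python sketch, not Data Services code; the sample rows are invented), the Merge transform behaves like a UNION ALL that insists on identical schemas:

```python
# The Merge transform concatenates rows from all inputs: duplicates are NOT
# removed, and every input must share one schema (names, count, datatypes).

def merge(*inputs):
    schemas = {tuple(row.keys()) for rows in inputs for row in rows}
    if len(schemas) > 1:                 # column names/count must match
        raise ValueError("all inputs must have the same schema")
    return [row for rows in inputs for row in rows]

us = [{"NAME": "Ann", "COUNTRY": "US"}]
ca = [{"NAME": "Bob", "COUNTRY": "CA"}, {"NAME": "Ann", "COUNTRY": "US"}]
merged = merge(us, ca)   # 3 rows: the duplicate 'Ann' row is kept
```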
Options
The Merge transform does not offer any options.
Activity: Using the Merge transform
The Orders data has now been validated, but the output is from two different sources: flat files
and database tables. The next step in the process is to modify the structure of those data sets
so that they match, and then merge them into a single data set.
Objectives
• Use the Query transforms to modify any column names and data types and to perform
lookups for any columns that reference other tables.
• Use the Merge transform to merge the validated orders data.
Instructions
1. In the Omega project, create a new batch job called Alpha_Orders_Merged_Job with a data
flow called Alpha_Orders_Merged_DF.
2. In the data flow workspace, add the orders_file_work and orders_db_work tables from the
Delta datastore as the source objects.
3. Add two Query transforms to the data flow, connecting each source object to its own Query
transform.
4. In the transform editor for the Query transform connected to the orders_files_work table,
map all columns from input to output.
5. Change the datatype for the following columns as specified:
Column Type
ORDER_TAKEN_BY varchar(15)
ORDER_ASSIGNED_TO varchar(15)
ORDERDATE datetime
SHIPPERADDRESS varchar(100)
SHIPPERCOUNTRY varchar(50)
SHIPPERREGION varchar(50)
SHIPPERPOSTALCODE varchar(50)
6. For the SHIPPERCOUNTRY column, change the mapping to perform a lookup of
CountryName from the Country table in the Alpha datastore.
The expression should be as follows:
lookup_ext([Alpha.alpha.country,'PRE_LOAD_CACHE','MAX'],
[COUNTRYNAME],[NULL],[COUNTRYID,'=',orders_file_work.SHIPPERCOUNTRY]) SET
("run_as_separate_process"='no')
7. For the SHIPPERREGION column, change the mapping to perform a lookup of RegionName
from the Region table in the Alpha datastore.
The expression should be as follows:
lookup_ext([Alpha.alpha.region,'PRE_LOAD_CACHE','MAX'],
[REGIONNAME],[NULL],[REGIONID,'=',orders_file_work.SHIPPERREGION]) SET
("run_as_separate_process"='no')
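The lookup_ext calls above, with PRE_LOAD_CACHE, effectively load the lookup table into memory once and probe it per input row, returning NULL when there is no match. A rough Python equivalent (the country rows are invented sample data):

```python
# Rough equivalent of lookup_ext([...country,'PRE_LOAD_CACHE','MAX'],
# [COUNTRYNAME],[NULL],[COUNTRYID,'=',<input column>]): cache the lookup
# table once, then return the matched name or NULL (None) per row.
# The country rows below are invented sample data.

country = [{"COUNTRYID": 1, "COUNTRYNAME": "Germany"},
           {"COUNTRYID": 2, "COUNTRYNAME": "France"}]

cache = {r["COUNTRYID"]: r["COUNTRYNAME"] for r in country}  # PRE_LOAD_CACHE

def lookup_country(shipper_country_id):
    return cache.get(shipper_country_id)   # NULL default when no match

names = [lookup_country(i) for i in (2, 1, 99)]
# → ['France', 'Germany', None]
```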
8. In the transform editor for the Query transform connected to the orders_db_work table,
map all columns from input to output.
9. Change the datatype for the following columns as specified:
Column Type
SHIPPERCOUNTRY varchar(50)
SHIPPERREGION varchar(50) 10. For the SHIPPERCITY column, change the mapping to perform a lookup of CityName from
the City table in the Alpha datastore.
The expression should be as follows:
lookup_ext([Alpha.alpha.city,'PRE_LOAD_CACHE','MAX'],
[CITYNAME],[NULL],[CITYID,'=',orders_db_work.SHIPPERCITYID]) SET
("run_as_separate_process"='no')
11. For the SHIPPERCOUNTRY column, change the mapping to perform a lookup of
CountryName from the Country table in the Alpha datastore.
The expression should be as follows:
lookup_ext([Alpha.alpha.country,'PRE_LOAD_CACHE','MAX'],
[COUNTRYNAME],[NULL],[COUNTRYID,'=',orders_db_work.SHIPPERCOUNTRYID]) SET
("run_as_separate_process"='no')
12. For the SHIPPERREGIONID column, change the mapping to perform a lookup of
RegionName from the Region table in the Alpha datastore.
The expression should be as follows:
lookup_ext([Alpha.alpha.region,'PRE_LOAD_CACHE','MAX'],
[REGIONNAME],[NULL],[REGIONID,'=',orders_db_work.SHIPPERREGIONID]) SET
("run_as_separate_process"='no')
13. Add a Merge transform to the data flow and connect both Query transforms to the Merge
transform.
14. Add a template table called Orders_Merged in the Delta datastore as the target table and
connect it to the Merge transform.
15. Execute Alpha_Orders_Merged_Job with the default execution properties and save all objects
you have created.
16. View the data in the target table.
Note that the SHIPPERCITY, SHIPPERCOUNTRY, and SHIPPERREGION columns for the
363 records in the template table now consistently contain names rather than ID values.
A solution file called SOLUTION_Merge.atl is included on your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may overwrite the results in your target table.
Using the Case transform
Introduction
The Case transform supports separating data from a source into multiple targets based on
branch logic.
After completing this unit, you will be able to:
• Use the Case transform
Explaining the Case transform
You use the Case transform to simplify branch logic in data flows by consolidating case or
decision-making logic into one transform. The transform allows you to split a data set into
smaller sets based on logical branches.
For example, you can use the Case transform to read a table that contains sales revenue facts
for different regions and separate the regions into their own tables for more efficient data access:
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Case transform. For more information on the Case transform, see
“Transforms” Chapter 5 in the Data Services Reference Guide.
Input/Output
Only one data flow source is allowed as a data input for the Case transform. Depending on the
data, only one of multiple branches is executed per row. The input and output schemas are
identical when using the Case transform.
The connections between the Case transform and objects used for a particular case must be
labeled. Each output label in the Case transform must be used at least once.
You connect the output of the Case transform with another object in the workspace. Each label
represents a case expression (WHERE clause).
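The routing behavior can be modeled in a few lines of Python (a simplified sketch, not Data Services code; the labels and conditions are illustrative):

```python
# Simplified model of a Case transform: each row goes to the output(s) whose
# expression is true. With "Row can be TRUE for one case only", the first
# matching label wins; unmatched rows go to an optional default label.

def case_split(rows, cases, true_for_one_only=True, default_label=None):
    targets = {label: [] for label, _ in cases}
    if default_label:
        targets[default_label] = []
    for row in rows:
        matched = False
        for label, cond in cases:
            if cond(row):
                targets[label].append(row)
                matched = True
                if true_for_one_only:
                    break
        if not matched and default_label:
            targets[default_label].append(row)
    return targets

rows = [{"REGIONID": 1}, {"REGIONID": 2}, {"REGIONID": 1}]
out = case_split(rows, [("East", lambda r: r["REGIONID"] == 1),
                        ("West", lambda r: r["REGIONID"] == 2)])
# out["East"] receives two rows, out["West"] one
```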
Options
The Case transform offers several options:
Option Description
Label
Define the name of the connection that
describes where data will go if the
corresponding Case condition is true.
Expression Define the Case expression for the
corresponding label.
Produce default option with label
Specify that the transform must use the
expression in this label when all other Case
expressions evaluate to false.
Row can be TRUE for one case only
Specify that the transform passes each row to
the first case whose expression returns true.
To create a case statement
1. Open the data flow workspace.
2. Add your source object to the workspace.
3. On the Transforms tab of the Local Object Library, click and drag the Case transform to the
workspace to the right of your source object.
4. Add your target objects to the workspace.
You will require one target object for each possible condition in the case statement.
5. Connect the source object to the transform.
6. Double-click the Case transform to open the transform editor.
7. In the parameters area of the transform editor, click Add to add a new expression.
8. In the Label field, enter a label for the expression.
9. Click and drag an input schema column to the Expression pane at the bottom of the window.
10. Enter the rest of the expression to define the condition.
For example, to specify that you want all Customers with a RegionID of 1, create the following
statement: Customer.RegionID = 1
11. Repeat step 7 to step 10 for all expressions.
12. To direct records that do not meet any defined conditions to a separate target object, select
the Produce default option with label option and enter the label name in the associated
field.
13. To direct records that meet multiple conditions to only one target, select the Row can be
TRUE for one case only option.
In this case, records are placed in the target associated with the first condition that evaluates
as true.
14. Click Back to return to the data flow workspace.
15. Connect the transform to the target object.
16. Release the mouse and select the appropriate label for that object from the pop-up menu.
17. Repeat step 15 and step 16 for all target objects.
Activity: Using the Case transform
Once the orders have been validated and merged, the resulting data set must be split out by
quarter for reporting purposes.
Objective
• Use the Case transform to create separate tables for orders occurring in fiscal quarters 3 and
4 for the year 2007 and quarter 1 of 2008.
Instructions
1. In the Omega project, create a new batch job called Alpha_Orders_By_Quarter_Job with a
data flow named Alpha_Orders_By_Quarter_DF.
2. In the data flow workspace, add the Orders_Merged table from the Delta datastore as the
source object.
3. Add a Query transform to the data flow and connect it to the source table.
4. In the transform editor for the Query transform, map all columns from input to output.
5. Add the following two output columns:
Column Type Mapping
ORDERQUARTER int quarter(orders_merged.ORDERDATE)
ORDERYEAR varchar(4) to_char(orders_merged.ORDERDATE, 'YYYY')
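The quarter() and to_char(..., 'YYYY') mappings can be mirrored in Python to sanity-check which Case label a given order date should reach (the sample date is an assumption):

```python
from datetime import date

# Mirror of quarter(ORDERDATE) and to_char(ORDERDATE, 'YYYY') as used in
# the Query transform above, for checking Case label routing (sample date).

def order_quarter(d: date) -> int:
    return (d.month - 1) // 3 + 1

def order_year(d: date) -> str:
    return f"{d.year:04d}"

d = date(2007, 2, 14)
print(order_quarter(d), order_year(d))   # a Q1 2007 order → label Q12007
```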
6. Add a Case transform to the data flow and connect it to the Query transform.
7. In the transform editor for the Case transform, create the following labels and associated
expressions:
Label Expression
Q42006 Query.ORDERYEAR = '2006' and
Query.ORDERQUARTER = 4
Q12007 Query.ORDERYEAR = '2007' and
Query.ORDERQUARTER = 1
Q22007 Query.ORDERYEAR = '2007' and
Query.ORDERQUARTER = 2
Q32007 Query.ORDERYEAR = '2007' and
Query.ORDERQUARTER = 3
Q42007 Query.ORDERYEAR = '2007' and
Query.ORDERQUARTER = 4
8. Choose the settings to not produce a default output set for the Case transform and to specify
that rows can be true for one case only.
9. Add five template tables in the Delta datastore called Orders_Q4_2006, Orders_Q1_2007,
Orders_Q2_2007, Orders_Q3_2007, and Orders_Q4_2007.
10. Connect the output from the Case transform to the target tables selecting the corresponding
labels.
11. Execute Alpha_Orders_By_Quarter_Job with the default execution properties and save all
objects you have created.
12. View the data in the target tables and confirm that there are 103 orders that were placed in
Q1 of 2007.
A solution file called SOLUTION_Case.atl is included on your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may overwrite the results in your target table.
Using the SQL transform
Introduction
The SQL transform allows you to submit SQL commands to generate data to be moved into
target objects.
After completing this unit, you will be able to:
• Use the SQL transform
Explaining the SQL transform
Use this transform to perform standard SQL operations when other built-in transforms cannot
perform them.
The SQL transform can be used to extract data using general SELECT statements as well as
stored procedures and views.
You can use the SQL transform as a replacement for the Merge transform when you are dealing
with database tables only. The SQL transform performs more efficiently because the merge is
pushed down to the database. However, you cannot use this functionality if your source objects
include file formats.
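The push-down described above amounts to letting the database execute a single UNION ALL itself. A small sqlite3 sketch of the idea (table names and rows are invented for illustration):

```python
import sqlite3

# When all sources are database tables, the merge can be expressed as one
# UNION ALL executed by the database, which is what makes the SQL transform
# more efficient than the Merge transform here. Sample tables/rows invented.

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders_a (ORDERID INT, CUSTOMERID INT)")
con.execute("CREATE TABLE orders_b (ORDERID INT, CUSTOMERID INT)")
con.executemany("INSERT INTO orders_a VALUES (?, ?)", [(1, 10), (2, 20)])
con.executemany("INSERT INTO orders_b VALUES (?, ?)", [(3, 30)])

rows = con.execute(
    "SELECT ORDERID, CUSTOMERID FROM orders_a "
    "UNION ALL "
    "SELECT ORDERID, CUSTOMERID FROM orders_b"
).fetchall()
# 3 rows total; the database performs the merge, not the engine
```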
The next section gives a brief description of the function, data input requirements, options, and
data output results for the SQL transform. For more information on the SQL transform see
“Transforms” Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs
There is no input data set for the SQL transform.
There are two ways of defining the output schema for a SQL transform if the SQL submitted
is expected to return a result set:
• Automatic — After you type the SQL statement, click Update schema to execute a select
statement against the database that obtains column information returned by the select
statement and populates the output schema.
• Manual — Output columns must be defined in the output portion of the SQL transform if
the SQL operation is returning a data set. The number of columns defined in the output of
the SQL transform must equal the number of columns returned by the SQL query, but the
column names and data types of the output columns do not need to match the column names
or data types in the SQL query.
Options
The SQL transform has the following options:
Option Description
Datastore Specify the datastore for the tables referred to in the SQL statement.
Database type Specify the type of database for the datastore where there are
multiple datastore configurations.
Join rank
Indicate the weight of the output data set if the data set is used in
a join. The highest ranked source is accessed first to construct the
join.
Array fetch size Indicate the number of rows retrieved in a single request to a source
database. The default value is 1000.
Cache
Hold the output from this transform in memory for use in
subsequent transforms. Use this only if the data set is small enough
to fit in memory.
SQL text Enter the text of the SQL query.
To create a SQL statement
1. Open the data flow workspace.
2. On the Transforms tab of the Local Object Library, click and drag the SQL transform to the
workspace.
3. Add your target object to the workspace.
4. Connect the transform to the target object.
5. Double-click the SQL transform to open the transform editor.
6. In the parameters area, select the source datastore from the Datastore drop-down list.
7. If there is more than one datastore configuration, select the appropriate configuration from
the Database type drop-down list.
8. Change the other available options, if required.
9. In the SQL text area, enter the SQL statement.
For example, to copy the entire contents of a table into the target object, you would use the
following statement: Select * from Customers.
10. Click Update Schema to update the output schema with the appropriate values.
If required, you can change the names and datatypes of these columns. You can also create
the output columns manually.
11. Click Back to return to the data flow workspace.
12. Click and drag from the transform to the target object.
Activity: Using the SQL transform
The contents of the Employee and Department tables must be merged, which can be done using
the SQL transform as a shortcut.
Objective
• Use the SQL transform to select employee and department data.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Dept_Job with a data
flow called Alpha_Employees_Dept_DF.
2. In the data flow workspace, add the SQL transform as the source object.
3. Add the Emp_Dept table from the HR_datamart datastore as the target object, and connect
the transform to it.
4. In the transform editor for the SQL transform, specify the appropriate datastore name and
database type for the Alpha datastore.
5. Create a SQL statement to select the last name and first name for the employee from the
Employee table and the department in which the employee belongs by looking up the value
in the Department table based on the Department ID.
The expression should be as follows:
select employee.EMPLOYEEID, employee.LASTNAME, employee.FIRSTNAME,
department.DEPARTMENTNAME from Alpha.employee, Alpha.department where
employee.DEPARTMENTID = department.DEPARTMENTID
6. Update the output schema based on your SQL statement.
7. Set the EMPLOYEEID column as the primary key.
8. Execute Alpha_Employees_Dept_Job with the default execution properties and save all objects you
have created.
9. Return to the data flow workspace and view data for the target table.
You should have 40 rows in your target table: 8 employees in the Employee table have
department IDs that are not defined in the Department table, so the join excludes those rows.
A solution file called SOLUTION_SQL.atl is included on your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may overwrite the results in your target table.
Quiz: Using platform transforms
1. What would you use to change a row type from NORMAL to INSERT?
2. What is the Case transform used for?
3. Name the transform that you would use to combine incoming data sets to produce a single
output data set with the same schema as the input data sets.
4. A validation rule consists of a condition and an action on failure. When can you use the
action on failure options in the validation rule?
5. When would you use the Merge transform versus the SQL transform to merge records?
Lesson summary
After completing this lesson, you are now able to:
• Describe platform transforms
• Use the Map Operation transform
• Use the Validation transform
• Use the Merge transform
• Use the Case transform
• Use the SQL transform
Lesson 7
Setting up Error Handling
Lesson introduction
For sophisticated error handling, you can use recoverable work flows and try/catch blocks to
recover data.
After completing this lesson, you will be able to:
• Set up recoverable work flows
Using recovery mechanisms
Introduction
If a Data Services job does not complete properly, you must resolve the problems that prevented
the successful execution of the job.
After completing this unit, you will be able to:
• Explain how to avoid data recovery situations
• Explain the levels of data recovery strategies
• Recover a failed job using automatic recovery
• Recover missing values and rows
• Define alternative work flows
Avoiding data recovery situations
The best solution to data recovery situations is obviously not to get into them in the first place.
Some of those situations are unavoidable, such as server failures. Others, however, can easily
be sidestepped by constructing your jobs so that they take into account the issues that frequently
cause them to fail.
One example is when an external file is required to run a job. In this situation, you could use
the wait_for_file function or a while loop and the file_exists function to check that the file exists
in a specified location before executing the job.
While loops
The while loop is a single-use object that you can use in a work flow. The while loop repeats a
sequence of steps as long as a condition is true.
Typically, the steps done during the while loop result in a change in the condition so that the
condition is eventually no longer satisfied and the work flow exits from the while loop. If the
condition does not change, the while loop does not end.
For example, you might want a work flow to wait until the system writes a particular file. You
can use a while loop to check for the existence of the file using the file_exists function. As long
as the file does not exist, you can have the work flow go into sleep mode for a particular length
of time before checking again.
Because the system might never write the file, you must add another check to the loop, such
as a counter, to ensure that the while loop eventually exits. In other words, change the while
loop to check for the existence of the file and the value of the counter. As long as the file does
not exist and the counter is less than a particular value, repeat the while loop. In each iteration
of the loop, put the work flow in sleep mode and then increment the counter.
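The loop described above can be sketched in Python (the function name, path, and limits are illustrative; Data Services provides its own file_exists and sleep functions):

```python
import os
import time

# Sketch of the pattern above: wait for a file to appear, but cap the number
# of checks with a counter so the loop always exits (path/limits illustrative).

def wait_for_file(path, interval_sec=1.0, max_checks=10):
    checks = 0
    while not os.path.exists(path) and checks < max_checks:
        time.sleep(interval_sec)   # "sleep mode" between checks
        checks += 1                # counter guarantees the loop eventually exits
    return os.path.exists(path)
```

The return value tells the caller whether to proceed with the job or handle the missing file as an error.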
Describing levels of data recovery strategies
When a job fails to complete successfully during execution, some data flows may not have
completed. When this happens, some tables may have been loaded, partially loaded, or altered.
You need to design your data movement jobs so that you can recover your data by rerunning
the job and retrieving all the data without introducing duplicate or missing data.
There are different levels of data recovery and recovery strategies. You can:
• Recover your entire database: Use your standard RDBMS services to restore a crashed
database. This option is outside the scope of this course.
• Recover a partially-loaded job: Use automatic recovery.
• Recover from partially-loaded tables: Use the Table Comparison transform, do a full
replacement of the target, use the auto-correct load feature, or include a preload SQL command
to avoid loading duplicate rows when recovering from partially-loaded tables.
• Recover missing values or rows: Use the Validation transform or the Query transform with
WHERE clauses to identify missing values, and use overflow files to manage rows that
could not be inserted.
• Define alternative work flows: Use conditionals, try/catch blocks, and scripts to ensure all
exceptions are managed in a work flow.
Depending on the relationships between data flows in your application, you may use a
combination of these techniques to recover from exceptions.
Note: It is important to note that some recovery mechanisms are for use in production systems and
are not supported in development environments.
Configuring work flows and data flows
In some cases, steps in a work flow depend on each other and must be executed together. When
there is a dependency like this, you should designate the work flow as a recovery unit. This
requires the entire work flow to complete successfully. If the work flow does not complete
successfully, Data Services executes the entire work flow during recovery, including the steps
that executed successfully in prior work flow runs.
Conversely, you may need to specify that a work flow or data flow should only execute once.
When this setting is enabled, the job never re-executes that object. It is not recommended to
mark a work flow or data flow as “Execute only once” if the parent work flow is a recovery
unit.
To specify a work flow as a recovery unit
1. In the project area or on the Work Flows tab of the Local Object Library, right-click the work
flow and select Properties from the menu.
The Properties dialog box displays.
2. On the General tab, select the Recover as a unit check box.
3. Click OK.
To specify that an object executes only once
1. In the project area or on the appropriate tab of the Local Object Library, right-click the work
flow or data flow and select Properties from the menu.
The Properties dialog box displays.
2. On the General tab, select the Execute only once check box.
3. Click OK.
Using recovery mode
If a job with automated recovery enabled fails during execution, you can execute the job again
in recovery mode. During recovery mode, Data Services retrieves the results for
successfully-completed steps and reruns uncompleted or failed steps under the same conditions
as the original job.
In recovery mode, Data Services executes the steps or recovery units that did not complete
successfully in a previous execution. This includes steps that failed and steps that generated
an exception but completed successfully, such as those in a try/catch block. As in normal job
execution, Data Services executes the steps in parallel if they are not connected in the work
flow diagrams and in serial if they are connected.
For example, suppose a daily update job running overnight successfully loads dimension tables
in a warehouse. However, while the job is running, the database log overflows and stops the
job from loading fact tables. The next day, you truncate the log file and run the job again in
recovery mode. The recovery job does not reload the dimension tables, because the original
job, even though it failed, loaded the dimension tables successfully.
To ensure that the fact tables are loaded with the data that corresponds properly to the data
already loaded in the dimension tables, ensure the following:
• Your recovery job must use the same extraction criteria that your original job used when
loading the dimension tables.
If your recovery job uses new extraction criteria, such as basing data extraction on the current
system date, the data in the fact tables will not correspond to the data previously extracted
into the dimension tables.
If your recovery job uses new values, the job execution may follow a completely different
path through conditional steps or try/catch blocks.
• Your recovery job must follow the exact execution path that the original job followed. Data
Services records any external inputs to the original job so that your recovery job can use
these stored values and follow the same execution path.
To enable automatic recovery in a job

1. In the project area, right-click the job and select Execute from the menu.
The Execution Properties dialog box displays.
2. On the Parameters tab, select the Enable recovery check box.
If this check box is not selected, Data Services does not record the results from the steps
during the job and cannot recover the job if it fails.
3. Click OK.

To recover from last execution
1. In the project area, right-click the job that failed and select Execute from the menu.
The Execution Properties dialog box displays.
2. On the Parameters tab, select the Recover from last execution check box.
This option is not available when a job has not yet been executed, the previous job run
succeeded, or recovery mode was disabled during the previous run.
3. Click OK.
Recovering from partially-loaded data

Executing a failed job again may result in duplication of rows that were loaded successfully
during the first job run.
Within your recoverable work flow, you can use several methods to ensure that you do not
insert duplicate rows:
• Include the Table Comparison transform (available in Data Integrator packages only) in
your data flow when you have tables with more rows and fewer fields, such as fact tables.
• Change the target table options to completely replace the target table during each execution.
This technique can be optimal when the changes to the target table are numerous compared
to the size of the table.
• Change the target table options to use the auto-correct load feature when you have tables
with fewer rows and more fields, such as dimension tables. The auto-correct load checks
the target table for existing rows before adding new rows to the table. Using the auto-correct
load option, however, can slow jobs executed in non-recovery mode. Consider this technique
when the target table is large and the changes to the table are relatively few.
• Include a SQL command to execute before the table loads. Preload SQL commands can
remove partial database updates that occur during incomplete execution of a step in a job.
Typically, the preload SQL command deletes rows based on a variable that is set before the
partial insertion step began.
For more information on preloading SQL commands, see “Using preload SQL to allow
re-executable Data Flows”, Chapter 18 in the Data Services Designer Guide.
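The preload-SQL idea can be sketched as follows: before re-running a load, delete any rows written by the failed attempt, identified by a batch or load id that was set before the insertion step began. The table and column names here are invented for illustration, and the sketch only builds the statement text.

```python
# Hypothetical sketch of a preload SQL command that removes a partial
# load. In Data Services this statement would run before the table loads;
# here we just construct it.

def preload_sql(table, load_id_column, load_id):
    """Build a preload statement that deletes rows from the failed run."""
    return f"DELETE FROM {table} WHERE {load_id_column} = {load_id}"

stmt = preload_sql("sales_fact", "load_id", 42)
```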
Recovering missing values or rows
Missing values that are introduced into the target data during data integration and data quality
processes can be managed using the Validation or Query transforms.
Missing rows are rows that cannot be inserted into the target table. For example, rows may be
missing in instances where a primary key constraint is violated. Overflow files help you process
this type of data problem.
When you specify an overflow file and Data Services cannot load a row into a table, Data
Services writes the row to the overflow file instead. The trace log indicates the data flow in
which the load failed and the location of the file.
You can use the overflow information to identify invalid data in your source or problems
introduced in the data movement. Every new run will overwrite the existing overflow file.
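The overflow-file behavior can be modeled with a minimal sketch. The in-memory dictionary stands in for a target table with a primary key constraint, and the overflow list stands in for the file; the real feature writes failed rows to a file on disk.

```python
# Conceptual model of overflow handling: rows that cannot be loaded are
# diverted instead of failing the whole data flow.

def load_with_overflow(rows, target, key):
    overflow = []
    for row in rows:
        if row[key] in target:      # e.g. primary key violation
            overflow.append(row)    # divert the row instead of aborting
        else:
            target[row[key]] = row
    return overflow

target = {1: {"id": 1, "name": "existing"}}
rows = [{"id": 1, "name": "dup"}, {"id": 2, "name": "new"}]
overflow = load_with_overflow(rows, target, "id")
```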
To use an overflow file in a job

1. Open the target table editor for the target table in your data flow.
2. On the Options tab, under Error handling, select the Use overflow file check box.
3. In the File name field, enter or browse to the full path and file name for the file.
When you specify an overflow file, give a full path name to ensure that Data Services creates
a unique file when more than one file is created in the same job.
4. In the File format drop-down list, select what you want Data Services to write to the file
about the rows that failed to load:
• If you select Write data, you can use Data Services to specify the format of the
error-causing records in the overflow file.
• If you select Write sql, you can use the commands to load the target manually when the
target is accessible.
Defining alternative work flows

You can set up your jobs to use alternative work flows that cover all possible exceptions and
have recovery mechanisms built in. This technique allows you to automate the process of
recovering your results.
Alternative work flows consist of several components:
1. A script to determine if recovery is required.
This script reads the value in a status table and populates a global variable with the same
value. The initial value in the table is set to indicate that recovery is not required.
2. A conditional that calls the appropriate work flow based on whether recovery is required.
The conditional contains an If/Then/Else statement to specify that work flows that do not
require recovery are processed one way, and those that do require recovery are processed
another way.
3. A work flow with a try/catch block to execute a data flow without recovery.
The data flow where recovery is not required is set up without the auto correct load option
set. This ensures that, wherever possible, the data flow is executed in a less resource-intensive
mode.
4. A script in the catch object to update the status table.
The script specifies that recovery is required if any exceptions are generated.

5. A work flow to execute a data flow with recovery and a script to update the status table.
The data flow is set up for more resource-intensive processing that will resolve the exceptions.
The script updates the status table to indicate that recovery is not required.
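The five components above can be sketched as control logic. This is an illustrative model only: an in-memory flag stands in for the recovery status table, and the load functions stand in for the two versions of the data flow.

```python
# Sketch of the alternative work flow: read the flag, try the cheap load
# first; on exception, flag recovery so the next run takes the
# auto-correct path and then resets the flag.

status_table = {"recovery_flag": 0}

def run_job(normal_load, recovery_load):
    recovery_needed = status_table["recovery_flag"]   # GetStatus script
    if recovery_needed == 0:
        try:
            normal_load()                             # no auto-correct
        except Exception:
            status_table["recovery_flag"] = 1         # catch script
            raise
    else:
        recovery_load()                               # auto-correct path
        status_table["recovery_flag"] = 0             # Pass script

def failing_load():
    raise RuntimeError("primary key violation")

def auto_correct_load():
    pass  # resolves duplicates, so it succeeds

try:
    run_job(failing_load, auto_correct_load)
except RuntimeError:
    pass
flag_after_failure = status_table["recovery_flag"]
run_job(failing_load, auto_correct_load)
flag_after_recovery = status_table["recovery_flag"]
```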
Conditionals

Conditionals are single-use objects used to implement conditional logic in a work flow. When
you define a conditional, you must specify a condition and two logical branches:
• If: A Boolean expression that evaluates to TRUE or FALSE. You can use functions, variables,
and standard operators to construct the expression.
• Then: Work flow element to execute if the IF expression evaluates to TRUE.
• Else: Work flow element to execute if the IF expression evaluates to FALSE.
Both the Then and Else branches of the conditional can contain any object that you can have in
a work flow, including other work flows, data flows, nested conditionals, try/catch blocks,
scripts, and so on.
Try/Catch Blocks

A try/catch block allows you to specify alternative work flows if errors occur during job
execution. Try/catch blocks catch classes of errors, apply solutions that you provide, and
continue execution.
For each catch in the try/catch block, you will specify:
• One exception or group of exceptions handled by the catch. To handle more than one
exception or group of exceptions, add more catches to the try/catch block.
• The work flow to execute if the indicated exception occurs. Use an existing work flow or
define a work flow in the catch editor.
If an exception is thrown during the execution of a try/catch block, and if no catch is looking
for that exception, then the exception is handled by normal error logic.
Using try/catch blocks and automatic recovery

Data Services does not save the result of a try/catch block for re-use during recovery. If an
exception is thrown inside a try/catch block, during recovery Data Services executes the step
that threw the exception and subsequent steps.
Because the execution path through the try/catch block might be different in the recovered
job, using variables set in the try/catch block could alter the results during automatic recovery.
For example, suppose you create a job that defines the value of variable $I within a try/catch
block. If an exception occurs, you set an alternate value for $I. Subsequent steps are based on
the new value of $I.
During the first job execution, the first work flow contains an error that generates an exception,
which is caught. However, the job fails in the subsequent work flow.
You fix the error and run the job in recovery mode. During the recovery execution, the first
work flow no longer generates the exception. Thus the value of variable $I is different, and the
job selects a different subsequent work flow, producing different results.
To ensure proper results with automatic recovery when a job contains a try/catch block, do
not use values set inside the try/catch block or reference output variables from a try/catch
block in any subsequent steps.
To create an alternative work flow

1. Create a job.
2. Add a global variable to your job called $G_recovery_needed with a datatype of int.
The purpose of this global variable is to store a flag that indicates whether or not recovery
is needed. This flag is based on the value in a recovery status table, which contains a flag of
1 or 0, depending on whether recovery is needed.
3. In the job workspace, add a work flow using the tool palette.
4. In the work flow workspace, add a script called GetStatus using the tool palette.
5. In the script workspace, construct an expression to update the value of the
$G_recovery_needed global variable to the same value as is in the recovery status table.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
$G_recovery_needed = sql('DEMO_Target', 'select recovery_flag from
recovery_status');
6. Return to the work flow workspace.
7. Add a conditional to the workspace using the tool palette and connect it to the script.
8. Open the conditional.
The transform editor for the conditional allows you to specify the IF expression and
Then/Else branches.
9. In the IF field, enter the expression that evaluates whether recovery is required.
The following is an example of the expression:
$G_recovery_needed = 0
This means the objects in the Then pane will run if recovery is not required. If recovery is
needed, the objects in the Else pane will run.
10. Add a try object to the Then pane of the transform editor using the tool palette.
11. In the Local Object Library, click and drag a work flow or data flow to the Then pane after
the try object.
12. Add a catch object to the Then pane after the work flow or data flow using the tool palette.
13. Connect the objects in the Then pane.
14. Open the workspace for the catch object.
All exception types are listed in the Available exceptions pane.
15. To change which exceptions act as triggers, expand the tree in the Available exceptions
pane, select the appropriate exceptions, and click Set to move them to the Trigger on these
exceptions pane.
By default, Data Services catches all exceptions.
16. Add a script called Fail to the lower pane using the tool palette.
This object will be executed if there are any exceptions. If desired, you can add a data flow
here instead of a script.
17. In the script workspace, construct an expression to update the flag in the recovery status table
to 1, indicating that recovery is needed.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
sql('DEMO_Target','update recovery_status set recovery_flag = 1');
18. Return to the conditional workspace.
19. Connect the objects in the Then pane.
20. In the Local Object Library, click and drag the work flow or data flow that represents the
recovery process to the Else pane.
This combination means that if recovery is not needed, then the first object will be executed;
if recovery is required, the second object will be executed.
21. Add a script called Pass to the lower pane using the tool palette.
22. In the script workspace, construct an expression to update the flag in the recovery status
table to 0, indicating that recovery is not needed.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
sql('DEMO_Target','update recovery_status set recovery_flag = 0');
23. Return to the conditional workspace.
24. Connect the objects in the Else pane.
25. Validate and save all objects.
26. Execute the job.
The first time this job is executed, the job succeeds because the recovery_flag value in the
status table is set to 0 and the target table is empty, so there is no primary key constraint.
27. Execute the job again.
The second time this job is executed, the job fails because the target table already contains
records, so there is a primary key exception.
28. Check the contents of the status table.
The recovery_flag field now contains a value of 1.
29. Execute the job again.
The third time this job is executed, the version of the data flow with the Auto correct load
option selected runs because the recovery_flag value in the status table is set to 1. The job
succeeds because the auto correct load feature checks for existing values before trying to
insert new rows.
30. Check the contents of the status table again.
The recovery_flag field contains a value of 0.
Activity: Creating an alternative work flow

With the influx of new employees resulting from Alpha's acquisition of new companies, the
Employee Department information needs to be updated regularly. Because this information is
used for payroll, it is critical that no records are lost if a job is interrupted, so you need to set
up the job in such a way that exceptions will always be managed. This involves setting up a
conditional that will try to run a less resource-intensive update of the table first; if that generates
an exception, the conditional then tries a version of the same data flow that is configured to
auto correct the load.
Objective

• Set up a try/catch block with a conditional to catch exceptions.
Instructions

1. Delete all of the data from the Emp_Dept table in the HR_datamart datastore.
a. From the Start menu, click Programs ➤ MySQL ➤ MySQL Server 5.0 ➤ MySQL
Command Line Client .
b. Enter a password of root and press Enter.
c. At the mysql prompt, enter delete from hr_datamart.emp_dept; and press Enter.
The system confirms that 40 rows were deleted.
2. In the Local Object Library, replicate Alpha_Employees_Dept_DF and rename the new
version Alpha_Employees_Dept_AC_DF.
3. In the target table editor for the Emp_Dept table in Alpha_Employees_Dept_DF, ensure
that the Delete data from table before loading and Auto correct load options are not
selected.
4. In the target table editor for the Emp_Dept table in Alpha_Employees_Dept_AC_DF, ensure
that the Delete data from table before loading option is not selected.
5. Select the Auto correct load option.
6. In the Omega project, create a new batch job called Alpha_Employees_Dept_Recovery_Job.
7. Add a global variable called $G_Recovery_Needed with a datatype of int to your job.
8. Add a work flow to your job called Alpha_Employees_Dept_Recovery_WF.
9. In the work flow workspace, add a script called GetStatus and construct an expression to
update the value of the $G_Recovery_Needed global variable to the same value as in the
recovery_flag column in the recovery_status table in the HR datamart.
The expression should be:
$G_Recovery_Needed = sql('hr_datamart', 'select recovery_flag from
recovery_status');
10. In the work flow workspace, add a conditional called Alpha_Employees_Dept_Con and
connect it to the script.
11. In the editor for the conditional, enter an IF expression that states that recovery is not
required.
The expression should be:
$G_Recovery_Needed = 0
12. In the Then pane, create a new try object called Alpha_Employees_Dept_Try.
13. Add Alpha_Employees_Dept_DF and connect it to the try object.
14. Create a new catch object called Alpha_Employees_Dept_Catch, and connect it to
Alpha_Employees_Dept_DF.
15. In the editor for the catch object, add a script called Recovery_Fail to the lower pane and
construct an expression to update the flag in the recovery status table to 1, indicating that
recovery is needed.
The expression should be:
sql('hr_datamart','update recovery_status set recovery_flag = 1');
16. In the conditional workspace, add Alpha_Employees_Dept_AC_DF to the Else pane.
17. Add a script called Recovery_Pass to the Else pane next to Alpha_Employees_Dept_AC_DF
and connect the objects.
18. In the script, construct an expression to update the flag in the recovery status table to 0,
indicating that recovery is not needed.
The expression should be:
sql('hr_datamart','update recovery_status set recovery_flag = 0');
19. Execute Alpha_Employees_Dept_Recovery_Job for the first time with the default execution
properties and save all objects you have created.
In the log, note that Alpha_Employees_Dept_DF executed.
20. Execute Alpha_Employees_Dept_Recovery_Job again.
In the log, note that the job fails.
21. Execute Alpha_Employees_Dept_Recovery_Job for a third time.
In the log, note that, this time, Alpha_Employees_Dept_AC_DF executed.
A solution file called SOLUTION_Recovery.atl is included in your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may override the results in your target table.
Quiz: Setting up error handling

1. List the different strategies you can use to avoid duplicate rows of data when re-loading a
job.
2. True or false? You can only run a job in recovery mode after the initial run of the job has
been set to run with automatic recovery enabled.
3. What are the two scripts in a manual recovery work flow used for?
4. Which of the following types of exception can you NOT catch using a try/catch block?
a. Database access errors
b. Syntax errors
c. System exception errors
d. Execution errors
e. File access errors
Lesson summary
After completing this lesson, you are now able to:
• Set up recoverable work flows
Lesson 8
Capturing Changes in Data
Lesson introduction

The design of your data warehouse must take into account how you are going to handle changes
in your target system when the respective data in your source system changes. Data Integrator
transforms provide you with a mechanism to do this.
After completing this lesson, you will be able to:
• Update data over time
• Use source-based CDC
• Use target-based CDC
Updating data over time
Introduction

Data Integrator transforms provide support for updating changing data in your data warehouse.
After completing this unit, you will be able to:
• Describe the options for updating changes to data
• Explain the purpose of Changed Data Capture (CDC)
• Explain the role of surrogate keys in managing changes to data
• Define the differences between source-based and target-based CDC
Explaining Slowly Changing Dimensions (SCD)

SCDs are dimensions that have data that changes over time. The following methods of handling
SCDs are available:
• Type 1 (No history preservation): a natural consequence of normalization.
• Type 2 (Unlimited history preservation and new rows):
○ New rows generated for significant changes.
○ Requires use of a unique key. The key relates to facts/time.
○ Optional Effective_Date field.
• Type 3 (Limited history preservation):
○ Two states of data are preserved: current and old.
○ New fields are generated to store history data.
○ Requires an Effective_Date field.
Because SCD Type 2 resolves most of the issues related to slowly changing dimensions, it is
explored last.
SCD Type 1

For an SCD Type 1 change, you find and update the appropriate attributes on a specific
dimensional record. For example, to update a record in the SALES_PERSON_DIMENSION
table to show a change to an individual’s SALES_PERSON_NAME field, you simply update
one record in the SALES_PERSON_DIMENSION table. This action would update or correct
that record for all fact records across time. In a dimensional model, facts have no meaning until
you link them with their dimensions. If you change a dimensional attribute without
appropriately accounting for the time dimension, the change becomes global across all fact
records.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
This is the same table after the salesperson’s name has been changed:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Smith, John B Northwest
However, suppose a salesperson transfers to a new sales team. Updating the salesperson’s
dimensional record would update all previous facts so that the salesperson would appear to
have always belonged to the new sales team. This may cause issues in terms of reporting sales
numbers for both teams. If you want to preserve an accurate history of who was on which sales
team, Type 1 is not appropriate.
SCD Type 3

To implement a Type 3 change, you change the dimension structure so that it renames the
existing attribute and adds two attributes, one to record the new value and one to record the
date of the change.
A Type 3 implementation has three disadvantages:
• You can preserve only one change per attribute, such as old and new or first and last.
• Each Type 3 change requires a minimum of one additional field per attribute and another
additional field if you want to record the date of the change.
• Although the dimension’s structure contains all the data needed, the SQL code required to
extract the information can be complex. Extracting a specific value is not difficult, but if you
want to obtain a value for a specific point in time or multiple attributes with separate old
and new values, the SQL statements become long and have multiple conditions.
In summary, SCD Type 3 can store a change in data, but can neither accommodate multiple
changes, nor adequately serve the need for summary reporting.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
This is the same table after the new dimensions have been added and the salesperson’s sales
team has been changed:
SALES_PERSON_ID NAME OLD_TEAM NEW_TEAM EFF_TO_DATE
000120 Doe, John B Northwest Northeast Oct_31_2004

SCD Type 2

With a Type 2 change, you do not need to make structural changes to the
SALES_PERSON_DIMENSION table. Instead, you add a record.
This is the data before the change:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
After you implement the Type 2 change, two records appear, as in the following table:
SALES_PERSON_KEY SALES_PERSON_ID NAME SALES_TEAM
15 000120 Doe, John B Northwest
133 000120 Doe, John B Southeast
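The Type 2 change above can be sketched as an insert rather than an update: the old row stays as-is and a new row with a new surrogate key records the new team. Column names follow the tables above; the function is illustrative only.

```python
# Illustrative SCD Type 2 change: append a new dimension row with a new
# surrogate key; the existing row (and its history) is left untouched.

def scd2_change(dimension, person_id, new_team, next_key):
    """Insert a new row for the changed attribute."""
    current = [r for r in dimension
               if r["SALES_PERSON_ID"] == person_id][-1]
    dimension.append({
        "SALES_PERSON_KEY": next_key,      # new surrogate key
        "SALES_PERSON_ID": person_id,      # same natural key
        "NAME": current["NAME"],
        "SALES_TEAM": new_team,
    })

dim = [{"SALES_PERSON_KEY": 15, "SALES_PERSON_ID": "000120",
        "NAME": "Doe, John B", "SALES_TEAM": "Northwest"}]
scd2_change(dim, "000120", "Southeast", next_key=133)
```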
Updating changes to data

When you have a large amount of data to update regularly and a small amount of system down
time for scheduled maintenance on a data warehouse, you must choose the most appropriate
method for updating your data over time, also known as “delta load”. You can choose to do a
full refresh of your data or you can choose to extract only new or modified data and update
the target system:
• Full refresh: Full refresh is easy to implement and easy to manage. This method ensures
that no data is overlooked or left out due to technical or programming errors. For an
environment with a manageable amount of source data, full refresh is an easy method you
can use to perform a delta load to a target system.
• Capturing only changes: After an initial load is complete, you can choose to extract only
new or modified data and update the target system. Identifying and loading only changed
data is called Changed Data Capture (CDC). CDC is recommended for large tables. If the
tables that you are working with are small, you may want to consider reloading the entire
table instead. The benefits of using CDC instead of doing a full refresh are that it:
○ Improves performance, because the job takes less time to process with less data to extract,
transform, and load.
○ Allows change history to be tracked by the target system so that data can be correctly analyzed
over time. For example, if a sales person is assigned a new sales region, simply updating
the customer record to reflect the new region negatively affects any analysis by region
over time because the purchases made by that customer before the move are attributed
to the new region.
Explaining history preservation and surrogate keys

History preservation allows the data warehouse or data mart to maintain the history of data
in dimension tables so you can analyze it over time.
For example, if a customer moves from one sales region to another, simply updating the
customer record to reflect the new region would give you misleading results in an analysis by
region over time, because all purchases made by the customer before the move would incorrectly
be attributed to the new region.
The solution to this involves introducing a new record for the same customer that reflects the
new sales region so that you can preserve the previous record. In this way, accurate reporting
is available for both sales regions. To support this, Data Services is set up to treat all changes
to records as INSERT rows by default.
However, you also need to manage the primary key constraint issues in your target tables that
arise when you have more than one record in your dimension tables for a single entity, such
as a customer or an employee.
For example, with your sales records, the Sales Rep ID is usually the primary key and is used
to link that record to all of the rep's sales orders. If you try to add a new record with the same
primary key, it will throw an exception. On the other hand, if you assign a new Sales Rep ID
to the new record for that rep, you will compromise your ability to report accurately on the
rep's total sales.
To address this issue, you will create a surrogate key, which is a new column in the target table
that becomes the new primary key for the records. At the same time, you will change the
properties of the former primary key so that it is simply a data column.
When a new record is inserted for the same rep, a unique surrogate key is assigned allowing
you to continue to use the Sales Rep ID to maintain the link to the rep’s orders.
You can create surrogate keys either by using the gen_row_num or key_generation functions
in the Query transform to create a new output column that automatically increments whenever
a new record is inserted, or by using the Key Generation transform, which serves the same
purpose.
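The surrogate-key idea can be sketched as follows: the natural key (Sales Rep ID) becomes an ordinary column, while a generated key serves as the primary key, so two rows for the same rep no longer collide. This only mimics conceptually what key_generation or the Key Generation transform does; the list stands in for the target table.

```python
# Sketch of surrogate key generation: assign the next key value and
# insert, allowing repeated natural keys.

def insert_with_surrogate(target, row):
    """Assign the next surrogate key and append the row; duplicates of
    the natural key are fine because it is no longer the primary key."""
    next_key = max((r["SALES_PERSON_KEY"] for r in target), default=0) + 1
    target.append({"SALES_PERSON_KEY": next_key, **row})
    return next_key

target = []
k1 = insert_with_surrogate(target, {"SALES_PERSON_ID": "000120",
                                    "SALES_TEAM": "Northwest"})
k2 = insert_with_surrogate(target, {"SALES_PERSON_ID": "000120",
                                    "SALES_TEAM": "Southeast"})
```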
Comparing source-based and target-based CDC

Setting up a full CDC solution within Data Services may not be required. Many databases now
have CDC support built into them, such as Oracle, SQL Server, and DB2. Alternatively, you
could combine surrogate keys with the Map Operation transform to change all UPDATE row
types to INSERT row types to capture changes.
However, if you do want to set up a full CDC solution, there are two general incremental CDC
methods to choose from: source-based and target-based CDC.
Source-based CDC evaluates the source tables to determine what has changed and only extracts
changed rows to load into the target tables.
Target-based CDC extracts all the data from the source, compares the source and target rows
using table comparison, and then loads only the changed rows into the target.
Source-based CDC is almost always preferable to target-based CDC for performance reasons.
However, some source systems do not provide enough information to make use of the
source-based CDC techniques. You will usually use a combination of the two techniques.
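Target-based CDC can be sketched as a row comparison. This is a conceptual model of what the Table Comparison transform does, not its actual implementation: every source row is read, compared to the target, and only new or changed rows are kept.

```python
# Conceptual target-based CDC: compare each source row to the target and
# classify it as an INSERT, an UPDATE, or unchanged (discarded).

def changed_rows(source, target, key="id"):
    target_by_key = {r[key]: r for r in target}
    inserts, updates = [], []
    for row in source:
        existing = target_by_key.get(row[key])
        if existing is None:
            inserts.append(row)       # not in target: INSERT
        elif existing != row:
            updates.append(row)       # differs from target: UPDATE
        # identical rows are not reloaded
    return inserts, updates

source = [{"id": 1, "region": "West"}, {"id": 2, "region": "East"}]
target = [{"id": 1, "region": "North"}]
inserts, updates = changed_rows(source, target)
```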
Using source-based CDC
Introduction

Source-based CDC is the preferred method because it improves performance by extracting the
fewest rows.
After completing this unit, you will be able to:
• Define the methods of performing source-based CDC
• Explain how to use timestamps in source-based CDC
• Manage issues related to using timestamps for source-based CDC
Using source tables to identify changed data

Source-based CDC, sometimes also referred to as incremental extraction, extracts only the
changed rows from the source. To use source-based CDC, your source data must have some
indication of the change. There are two methods:
• Timestamps: You can use the timestamps in your source data to determine what rows have
been added or changed since the last time data was extracted from the source. To support
this type of source-based CDC, your database tables must have at least an update timestamp;
it is preferable to have a create timestamp as well.
• Change logs: You can also use the information captured by the RDBMS in the log files for
the audit trail to determine what data has been changed.
Log-based data is more complex and is outside the scope of this course. For more information
on using logs for CDC, see “Techniques for Capturing Data”, in the Data Services Designer Guide.
Using CDC with timestamps

Timestamp-based CDC is an ideal solution to track changes if:
• There are date and time fields in the tables being updated.
• You are updating a large table that has a small percentage of changes between extracts and
an index on the date and time fields.
• You are not concerned about capturing intermediate results of each transaction between
extracts (for example, if a customer changes regions twice in the same day).
It is not recommended that you use timestamp-based CDC if:
• You have a large table in which a large percentage of rows changes between extracts and there is
no index on the timestamps.
• You need to capture physical row deletes.
• You need to capture multiple events occurring on the same row between extracts.
Some systems have timestamps with dates and times, some with just the dates, and some with
monotonically-generated increasing numbers. You can treat dates and generated numbers in
the same manner. It is important to note that for timestamps based on real time, time zones
230 BusinessObjects Data Integrator XI 3.0: Core Concepts—Learner’s Guide
can become important. If you keep track of timestamps using the nomenclature of the source
system (that is, using the source time or source-generated number), you can treat both temporal
(specific time) and logical (time relative to another time or event) timestamps in the same way.
The basic technique for using timestamps is to add a column to your source and target tables
that tracks the timestamps of rows loaded in a job. When the job executes, this column is updated
along with the rest of the data. The next job then reads the latest timestamp from the target
table and selects only the rows in the source table for which the timestamp is later.
This example illustrates the technique. Assume that the last load occurred at 2:00 PM on January
1, 2008. At that time, the source table had only one row (key=1) with a timestamp earlier than
the previous load. Data Services loads this row into the target table with the original timestamp
of 1:10 PM on January 1, 2008. After 2:00 PM, more rows are added to the source table:
At 3:00 PM on January 1, 2008, the job runs again. The job:
1. Reads the Last_Update field from the target table (01/01/2008 01:10 PM).
2. Selects rows from the source table that have timestamps that are later than the value of
Last_Update. The SQL command to select these rows is:
SELECT * FROM Source WHERE Last_Update > '01/01/2008 01:10 pm'
This operation returns the second and third rows (key=2 and key=3).
3. Loads these new rows into the target table.
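The steps above can be sketched in Python. This is only an illustrative model of the selection logic (the row and column names are made up for the example, not part of Data Services):

```python
from datetime import datetime

def select_changed_rows(source_rows, last_update):
    """Keep only source rows whose timestamp is later than the last load."""
    return [row for row in source_rows if row["last_update"] > last_update]

source = [
    {"key": 1, "last_update": datetime(2008, 1, 1, 13, 10)},
    {"key": 2, "last_update": datetime(2008, 1, 1, 14, 12)},
    {"key": 3, "last_update": datetime(2008, 1, 1, 14, 35)},
]

# Latest timestamp already loaded into the target (read back before each run).
last_loaded = datetime(2008, 1, 1, 13, 10)
delta = select_changed_rows(source, last_loaded)
# Only the rows with key=2 and key=3 are selected, matching the SELECT above.
```

In the real job, the same filter is expressed as the WHERE clause shown above, with the last-loaded timestamp read from the target table into a global variable.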
For timestamped CDC, you must create a work flow that contains the following:
• A script that reads the target table and sets the value of a global variable to the latest
timestamp.
• A data flow that uses the global variable in a WHERE clause to filter the data.
The data flow contains a source table, a query, and a target table. The query extracts only those
rows that have timestamps later than the last update.
To set up a timestamp-based CDC delta job
1. In the Variables and Parameters dialog box, add a global variable called $G_Last_Update
with a datatype of datetime to your job.
The purpose of this global variable is to store a string conversion of the timestamp for the
last time the job executed.
2. In the job workspace, add a script called GetTimestamp using the tool palette.
3. In the script workspace, construct an expression to do the following:
• Select the last time the job was executed from the last update column in the table.
• Assign the actual timestamp value to the $G_Last_Update global variable.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
$G_Last_Update = sql('DEMO_Target','select max(last_update) from employee_dim');
4. Return to the job workspace.
5. Add a data flow to the right of the script using the tool palette.
6. In the data flow workspace, add the source, Query transform, and target objects and connect
them.
The target table for CDC cannot be a template table.
7. In the Query transform, add the columns from the input schema to the output schema as
required.
8. If required, in the output schema, right-click the primary key (if it is not already set to the
surrogate key) and clear the Primary Key option in the menu.
9. Right-click the surrogate key column and select the Primary Key option in the menu.
10. On the Mapping tab for the surrogate key column, construct an expression to use the
key_generation function to generate new keys based on that column in the target table,
incrementing by 1.
The script content depends on the RDBMS on which the status table resides. The following
is an example of the expression:
key_generation('DEMO_Target.demo_target.employee_dim', 'Emp_Surr_Key', 1)
11. On the WHERE tab, construct an expression to select only those records with a timestamp
that is later than the $G_Last_Update global variable.
The following is an example of the expression:
employee_dim.last_update > $G_Last_Update
12. Connect the GetTimestamp script to the data flow.
13. Validate and save all objects.
14. Execute the job.
Managing overlaps
Unless source data is rigorously isolated during the extraction process (which typically is not
practical), there is a window of time when changes can be lost between two extraction runs.
This overlap period affects source-based CDC because this kind of data capture relies on a
static timestamp to determine changed data.
For example, suppose a table has 10,000 rows. If a change is made to one of the rows after it
was loaded but before the job ends, the second update can be lost.
There are three techniques for handling this situation:
• Overlap avoidance
• Overlap reconciliation
• Presampling
For more information see “Source-based and target-based CDC” in “Techniques for Capturing
Changed Data” in the Data Services Designer Guide.
Overlap avoidance
In some cases, it is possible to set up a system where there is no possibility of an overlap. You
can avoid overlaps if there is a processing interval where no updates are occurring on the target
system.
For example, if you can guarantee the data extraction from the source system does not last
more than one hour, you can run a job at 1:00 AM every night that selects only the data updated
the previous day until midnight. While this regular job does not give you up-to-the-minute
updates, it guarantees that you never have an overlap and greatly simplifies timestamp
management.
Overlap reconciliation
Overlap reconciliation requires a special extraction process that re-applies changes that could
have occurred during the overlap period. This extraction can be executed separately from the
regular extraction. For example, if the highest timestamp loaded from the previous job was
01/01/2008 10:30 PM and the overlap period is one hour, overlap reconciliation re-applies the
data updated between 9:30 PM and 10:30 PM on January 1, 2008.
The overlap period is usually equal to the maximum possible extraction time. If it can take up
to N hours to extract the data from the source system, an overlap period of N (or N plus a small
increment) hours is recommended. For example, if it takes at most two hours to run the job,
an overlap period of at least two hours is recommended.
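The reconciliation window can be computed mechanically. A minimal sketch (the function name is illustrative, not a Data Services API):

```python
from datetime import datetime, timedelta

def reconciliation_window(last_loaded, overlap_hours):
    """Window of changes to re-apply: from (last load - overlap) up to the last load."""
    return last_loaded - timedelta(hours=overlap_hours), last_loaded

# Highest timestamp loaded by the previous job, with a one-hour overlap period.
start, end = reconciliation_window(datetime(2008, 1, 1, 22, 30), overlap_hours=1)
# The window runs from 9:30 PM to 10:30 PM on January 1, 2008, as in the example.
```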
Presampling
Presampling is an extension of the basic timestamp processing technique. The main difference
is that the status table contains both a start and an end timestamp, instead of the last update
timestamp. The start timestamp for presampling is the same as the end timestamp of the
previous job. The end timestamp for presampling is established at the beginning of the job. It
is the most recent timestamp from the source table, commonly set as the system date.
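The presampling window logic can be sketched as follows. This is a simplified model, not Data Services code; the names are illustrative:

```python
from datetime import datetime

def presample_window(previous_end, now):
    """This run's start is the previous run's end; its end is fixed at job start."""
    return previous_end, now

def select_rows(rows, start, end):
    # Select rows stamped inside the half-open window (start, end].
    return [r for r in rows if start < r["last_update"] <= end]

rows = [
    {"key": 1, "last_update": datetime(2008, 1, 1, 21, 0)},   # already loaded
    {"key": 2, "last_update": datetime(2008, 1, 1, 23, 15)},  # inside the window
]
start, end = presample_window(datetime(2008, 1, 1, 22, 30),
                              datetime(2008, 1, 2, 1, 0))
delta = select_rows(rows, start, end)
# Only key=2 falls inside the window; rows stamped after 'end' wait for the next run.
```

Fixing the end timestamp at job start is what prevents the overlap: rows that arrive during the extraction fall into the next run's window instead of being lost.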
Activity: Using source-based CDC
You need to set up a job to update employee records in the Omega data warehouse whenever
they change. The employee records include timestamps to indicate when they were last updated,
so you can use source-based CDC.
Objective
• Use timestamps to enable changed data capture for employee records.
Instructions
1. In the Omega project, create a new batch job called Alpha_Employees_Dim_Job.
2. Add a global variable called $G_LastUpdate with a datatype of datetime to your job.
3. In the job workspace, add a script called GetTimestamp and construct an expression to do
the following:
• Select the last time the job was executed from the last update column in the employee
dimension table.
• If the last update column is NULL, assign a value of January 1, 1901 to the $G_LastUpdate
global variable. When the job executes for the first time for the initial load, this ensures
that all records are processed.
• If the last update column is not NULL, assign the actual timestamp value to the
$G_LastUpdate global variable.
The expression should be:
$G_LastUpdate = sql('omega','select max(LAST_UPDATE) from omega.emp_dim');
if ($G_LastUpdate is null) $G_LastUpdate = to_date('1901.01.01','YYYY.MM.DD');
else print('Last update was ' || $G_LastUpdate);
4. In the job workspace, add a data flow called Alpha_Employees_Dim_DF and connect it to the
script.
5. Add the Employee table from the Alpha datastore as the source object and the Emp_Dim
table from the Omega datastore as the target object.
6. Add the Query transform and connect the objects.
7. In the transform editor for the Query transform, map the columns as follows:
Schema In Schema Out
EMPLOYEEID EMPLOYEEID
LASTNAME LASTNAME
FIRSTNAME FIRSTNAME
BIRTHDATE BIRTHDATE
HIREDATE HIREDATE
ADDRESS ADDRESS
PHONE PHONE
EMAIL EMAIL
REPORTSTO REPORTSTO
LastUpdate LAST_UPDATE
discharge_date DISCHARGE_DATE
8. Create a mapping expression for the SURR_KEY column that generates new keys based on
the Emp_Dim target table, incrementing by 1.
The expression should be:
key_generation('Omega.omega.emp_dim', 'SURR_KEY', 1)
9. Create a mapping expression for the CITY column to look up the city name from the City
table in the Alpha datastore based on the city ID.
The expression should be:
lookup_ext([Alpha.alpha.city,'PRE_LOAD_CACHE','MAX'],
[CITYNAME],[NULL],[CITYID,'=',employee.CITYID]) SET
("run_as_separate_process"='no')
10. Create a mapping expression for the REGION column to look up the region name from the
Region table in the Alpha datastore based on the region ID.
The expression should be:
lookup_ext([Alpha.alpha.region,'PRE_LOAD_CACHE','MAX'],
[REGIONNAME],[NULL],[REGIONID,'=',employee.REGIONID]) SET
("run_as_separate_process"='no')
11. Create a mapping expression for the COUNTRY column to look up the country name from
the Country table in the Alpha datastore based on the country ID.
The expression should be:
lookup_ext([Alpha.alpha.country,'PRE_LOAD_CACHE','MAX'],
[COUNTRYNAME],[NULL],[COUNTRYID,'=',employee.COUNTRYID]) SET
("run_as_separate_process"='no')
12. Create a mapping expression for the DEPARTMENT column to look up the department
name from the Department table in the Alpha datastore based on the department ID.
The expression should be:
lookup_ext([Alpha.alpha.department,'PRE_LOAD_CACHE','MAX'],
[DEPARTMENTNAME],[NULL],[DEPARTMENTID,'=',employee.DEPARTMENTID]) SET
("run_as_separate_process"='no')
13. On the WHERE tab, construct an expression to select only those records with a timestamp
that is later than the $G_LastUpdate global variable.
The expression should be:
employee.LastUpdate > $G_LastUpdate
14. Execute Alpha_Employees_Dim_Job with the default execution properties and save all
objects you have created.
According to the log, the last update for the table was on 2007.11.07.
15. Return to the data flow workspace and view data for the target table. Sort the records by
the LAST_UPDATE column.
A solution file called SOLUTION_SourceCDC.atl is included in your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may override the results in your target table.
Using target-based CDC
Introduction
Target-based CDC compares the source to the target to determine which records have changed.
After completing this unit, you will be able to:
• Define the Data Integrator transforms involved in target-based CDC
Using target tables to identify changed data
Source-based CDC evaluates the source tables to determine what has changed and only extracts
changed rows to load into the target tables. Target-based CDC, by contrast, extracts all the data
from the source, compares the source and target rows, and then loads only the changed rows
into the target with new surrogate keys.
Source-based changed-data capture is almost always preferable to target-based capture for
performance reasons; however, some source systems do not provide enough information to
make use of the source-based CDC techniques. Target-based CDC allows you to use the
technique when source-based change information is limited.
You can preserve history by creating a data flow that contains the following:
• A source table contains the rows to be evaluated.
• A Query transform maps columns from the source.
• A Table Comparison transform compares the data in the source table with the data in the
target table to determine what has changed. It generates a list of INSERT and UPDATE rows
based on those changes. This circumvents the default behavior in Data Services of treating
all changes as INSERT rows.
• A History Preserving transform converts certain UPDATE rows to INSERT rows based on
the columns in which values have changed. This produces a second row in the target instead
of overwriting the first row.
• A Key Generation transform generates new keys for the updated rows that are now flagged
as INSERT.
• A target table receives the rows. The target table cannot be a template table.
Identifying history preserving transforms
Data Services supports history preservation with three Data Integrator transforms:

Transform: Description

History Preserving: Converts rows flagged as UPDATE to UPDATE plus INSERT, so that the
original values are preserved in the target. You specify the column in which to look for
updated data.

Key Generation: Generates new keys for source data, starting from a value based on existing
keys in the table you specify.

Table Comparison: Compares two data sets and produces the difference between them as a
data set with rows flagged as INSERT and UPDATE.
Explaining the Table Comparison transform
The Table Comparison transform allows you to detect and forward changes that have occurred
since the last time a target was updated. This transform compares two data sets and produces
the difference between them as a data set with rows flagged as INSERT or UPDATE.
For example, the transform compares the input and comparison tables and determines that
row 10 has a new address, row 40 has a name change, and row 50 is a new record. The output
includes all three records, flagged as appropriate:
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Table Comparison transform. For more information on the Table
Comparison transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The transform compares two data sets, one from the input to the transform (input data set),
and one from a database table specified in the transform (the comparison table). The transform
selects rows from the comparison table based on the primary key values from the input data
set. The transform compares columns that exist in the schemas for both inputs.
The input data set must be flagged as NORMAL.
The output data set contains only the rows that make up the difference between the tables. The
schema of the output data set is the same as the schema of the comparison table. No DELETE
operations are produced.
If a column has a date datatype in one table and a datetime datatype in the other, the transform
compares only the date section of the data. The columns can also be time and datetime datatypes,
in which case Data Integrator only compares the time section of the data.
For each row in the input data set, there are three possible outcomes from the transform:
• An INSERT row is added: The primary key value from the input data set does not match
a value in the comparison table. The transform produces an INSERT row with the values
from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the
transform adds these columns to the output schema and fills them with NULL values.
• An UPDATE row is added: The primary key value from the input data set matches a value
in the comparison table, and values in the non-key compare columns differ in the
corresponding rows from the input data set and the comparison table.
The transform produces an UPDATE row with the values from the input data set row.
If there are columns in the comparison table that are not present in the input data set, the
transform adds these columns to the output schema and fills them with values from the
comparison table.
• The row is ignored: The primary key value from the input data set matches a value in the
comparison table, but the comparison does not indicate any changes to the row values.
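The three outcomes can be modeled in a few lines of Python. This is a conceptual sketch of the comparison logic only (the function and column names are illustrative):

```python
def table_comparison(input_rows, comparison, key, compare_cols):
    """Flag each input row as INSERT or UPDATE; unchanged rows are dropped.
    `comparison` maps primary-key value -> the comparison table's row."""
    out = []
    for row in input_rows:
        existing = comparison.get(row[key])
        if existing is None:
            out.append(("INSERT", row))                     # new primary key
        elif any(row[c] != existing[c] for c in compare_cols):
            out.append(("UPDATE", row))                     # key matches, data differs
        # identical rows produce no output (the row is ignored)
    return out

comparison = {10: {"id": 10, "name": "Ann", "addr": "Old St"},
              40: {"id": 40, "name": "Bob", "addr": "Main St"}}
inp = [{"id": 10, "name": "Ann", "addr": "New St"},   # changed address -> UPDATE
       {"id": 40, "name": "Bob", "addr": "Main St"},  # unchanged -> ignored
       {"id": 50, "name": "Cal", "addr": "Elm St"}]   # new key -> INSERT
flags = [op for op, _ in table_comparison(inp, comparison, "id", ["name", "addr"])]
```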
Options
The Table Comparison transform offers several options:

Option: Description

Table name: Specifies the fully qualified name of the comparison table. This table must already
be imported into the repository. Table name is represented as datastore.owner.table, where
datastore is the name of the datastore Data Services uses to access the comparison table and
owner depends on the database type associated with the table.

Generated key column: Specifies a column in the comparison table. When there is more than
one row in the comparison table with a given primary key value, this transform compares the
row with the largest generated key value of these rows and ignores the other rows. This is
optional.

Input contains duplicate keys: Provides support for input rows with duplicate primary key
values.

Detect deleted row(s) from comparison table: Flags the transform to identify rows that have
been deleted from the source.
Comparison method: Allows you to select the method for accessing the comparison table. You
can select from Row-by-row select, Cached comparison table, and Sorted input.

Input primary key column(s): Specifies the columns in the input data set that uniquely identify
each row. These columns must be present in the comparison table with the same column names
and datatypes.

Compare columns: Improves performance by comparing only the subset of columns you drag
into this box from the input schema. If no columns are listed, all columns in the input data set
that are also in the comparison table are used as compare columns. This is optional.
Explaining the History Preserving transform
The History Preserving transform ignores everything but rows flagged as UPDATE. For these
rows, it compares the values of specified columns and, if the values have changed, flags the
row as INSERT. This produces a second row in the target instead of overwriting the first row.
For example, a target table that contains employee information is updated periodically from a
source table. In this case, the Table Comparison transform has flagged the name change for
row 40 as an update. However, the History Preserving transform is set up to preserve history
on the LastName column, so the output changes the operation code for that record from
UPDATE to INSERT.
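The conversion rule can be sketched as follows. This is a simplified model of the behavior described above, not the transform's implementation; names are illustrative:

```python
def history_preserving(rows, compare_cols, before):
    """Convert UPDATE rows whose history-preserved columns changed into INSERT
    rows, so the old row survives in the target. `before` maps primary key ->
    the target's current version of the row."""
    out = []
    for op, row in rows:
        if op == "UPDATE" and any(row[c] != before[row["id"]][c] for c in compare_cols):
            op = "INSERT"   # preserve history: add a second row instead of overwriting
        out.append((op, row))
    return out

# Row 40's name changed, and LastName is a history-preserved compare column.
before = {40: {"id": 40, "LastName": "Smith"}}
rows = [("UPDATE", {"id": 40, "LastName": "Jones"})]
out = history_preserving(rows, ["LastName"], before)
# The UPDATE becomes an INSERT: the old "Smith" row is kept in the target.
```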
The next section gives a brief description of the function, data input requirements, options, and
data output results for the History Preserving transform. For more information on the History
Preserving transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The input data set is the result of a comparison between two versions of the same data in which
rows with changed data from the newer version are flagged as UPDATE rows and new data
from the newer version are flagged as INSERT rows.
The output data set contains rows flagged as INSERT or UPDATE.
Options
The History Preserving transform offers these options:

Option: Description

Valid from: Specifies a date or datetime column from the source schema. Specify a Valid from
date column if the target uses an effective date to track changes in data.

Valid to: Specifies a date value in the format YYYY.MM.DD. The Valid to date cannot be the
same as the Valid from date.
Column: Specifies a column from the source schema that identifies the current valid row from
a set of rows with the same primary key. The flag column indicates whether a row is the most
current data in the target for a given primary key.

Set value: Defines an expression that outputs a value with the same datatype as the flag column.
This value is used to update the current flag column in the new row added to the target to
preserve the history of an existing row.

Reset value: Defines an expression that outputs a value with the same datatype as the flag
column. This value is used to update the current flag column in an existing row in the target
that included changes in one or more of the compare columns.
Preserve delete row(s) as update row(s): Converts DELETE rows to UPDATE rows in the
target. If you previously set effective date values (Valid from and Valid to), sets the Valid to
value to the execution date. This option is used to maintain slowly changing dimensions by
feeding a complete data set first through the Table Comparison transform with its Detect
deleted row(s) from comparison table option selected.

Compare columns: Lists the column or columns in the input data set that are to be compared
for changes.
• If the values in the specified compare columns in each version match, the transform flags
the row as UPDATE. The row from the before version is updated. The date and flag
information is also updated.
• If the values in each version do not match, the row from the latest version is flagged as
INSERT when output from the transform. This adds a new row to the warehouse with the
values from the new row.
Updates to non-history-preserving columns update all versions of the row if the update is
performed on the natural key (for example, Customer), but only update the latest version if
the update is on the generated key (for example, GKey).
Explaining the Key Generation transform
The Key Generation transform generates new keys before inserting the data set into the target
in the same way as the key_generation function does. When it is necessary to generate artificial
keys in a table, this transform looks up the maximum existing key value from a table and uses
it as the starting value to generate new keys. The transform expects the generated key column
to be part of the input schema.
For example, suppose the History Preserving transform produces rows to add to a warehouse,
and these rows have the same primary key as rows that already exist in the warehouse. In this
case, you can add a generated key to the warehouse table to distinguish these two rows that
have the same primary key.
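The key-assignment behavior can be sketched in Python. This is a conceptual model under the assumption that only INSERT rows receive generated keys; the names are illustrative:

```python
def key_generation(target_max_key, rows, key_col="SURR_KEY", increment=1):
    """Fill the generated key column of INSERT rows, starting just above the
    largest key already present in the key source table."""
    next_key = target_max_key
    for op, row in rows:
        if op == "INSERT":
            next_key += increment
            row[key_col] = next_key
    return rows

rows = [("INSERT", {"id": 50, "SURR_KEY": None}),
        ("UPDATE", {"id": 40, "SURR_KEY": 7}),
        ("INSERT", {"id": 60, "SURR_KEY": None})]
out = key_generation(100, rows)
# The two INSERT rows receive keys 101 and 102; the UPDATE row keeps its key.
```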
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Key Generation transform. For more information on the Key
Generation transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Input/output
The input data set is the result of a comparison between two versions of the same data in which
changed data from the newer version are flagged as UPDATE rows and new data from the
newer version are flagged as INSERT rows.
The output data set is a duplicate of the input data set, with the addition of key values in the
generated key column for input rows flagged as INSERT.
Options
The Key Generation transform offers these options:

Option: Description

Table name: Specifies the fully qualified name of the source table from which the maximum
existing key is determined (key source table). This table must already be imported into the
repository. Table name is represented as datastore.owner.table, where datastore is the name
of the datastore Data Services uses to access the key source table and owner depends on the
database type associated with the table.

Generated key column: Specifies the column in the key source table containing the existing
key values. A column with the same name must exist in the input data set; the new keys are
inserted in this column.

Increment value: Indicates the interval between generated key values.
Activity: Using target-based CDC
You need to set up a job to update product records in the Omega data warehouse whenever
they change. The product records do not include timestamps to indicate when they were last
updated, so you must use target-based CDC to extract all records from the source and compare
them to the target.
Objective
• Use target-based CDC to preserve history for the Product dimension.
Instructions
1. In the Omega project, create a new batch job called Alpha_Product_Dim_Job with a data
flow called Alpha_Product_Dim_DF.
2. Add the Product table from the Alpha datastore as the source object and the Prod_Dim table
from the Omega datastore as the target object.
3. Add the Query, Table Comparison, History Preserving, and Key Generation transforms.
4. Connect the source table to the Query transform and the Query transform to the target table
to set up the schema prior to configuring the rest of the transforms.
5. In the transform editor for the Query transform, map the columns as follows:
Schema In Schema Out
PRODUCTID PRODUCTID
PRODUCTNAME PRODUCTNAME
CATEGORYID CATEGORYID
COST COST
6. Until the key can be generated, specify a mapping expression for the SURR_KEY column
to populate it with NULL.
7. Specify a mapping expression for the EFFECTIVE_DATE column to indicate the current
date as sysdate( ).
8. Delete the link from the Query transform to the target table.
9. Connect the transforms in the following order: Query, Table Comparison, History Preserving,
and Key Generation.
10. Connect the Key Generation transform to the target table.
11. In the transform editor for the Table Comparison transform, use the Prod_Dim table in the
Omega datastore as the comparison table and set Surr_Key as the generated key column.
12. Set the input primary key column to PRODUCTID, and compare the PRODUCTNAME,
CATEGORYID, and COST columns.
13. Do not configure the History Preserving transform.
14. In the transform editor for the Key Generation transform, set up key generation based on
the Surr_Key column of the Prod_Dim table in the Omega datastore, incrementing by 1.
15. In the workspace, before executing the job, display the data in both the source and target
tables.
Note that the OmegaSoft product has been added in the source, but has not yet been updated
in the target.
16. Execute Alpha_Product_Dim_Job with the default execution properties and save all objects
you have created.
17. Return to the data flow workspace and view data for the target table.
Note that the new records were added for product IDs 2, 3, 6, 8, and 13, and that OmegaSoft
has been added to the target.
A solution file called SOLUTION_TargetCDC.atl is included in your resource CD. To check the
solution, import the file and open it to view the data flow design and mapping logic. Do not
execute the solution job, as this may override the results in your target table.
Quiz: Capturing changes in data
1. What are the two most important reasons for using CDC?
2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?
3. What is the difference between an initial load and a delta load?
4. What transforms do you typically use for target-based CDC?
Lesson summary
After completing this lesson, you are now able to:
• Update data over time
• Use source-based CDC
• Use target-based CDC
Using Data Integrator Transforms—Learner’s Guide 249
Lesson 9
Using Data Integrator Transforms
Lesson introduction
Data Integrator transforms are used to enhance your data integration projects beyond the core
functionality of the platform transforms.
After completing this lesson, you will be able to:
• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform
Describing Data Integrator transforms
Introduction
Data Integrator transforms perform key operations on data sets to manipulate their structure
as they are passed from source to target.
After completing this unit, you will be able to:
• Describe Data Integrator transforms available in Data Services
Defining Data Integrator transforms
The following transforms are available in the Data Integrator branch of the Transforms tab in
the Local Object Library:
Transform: Description

Data Transfer: Allows a data flow to split its processing into two sub-data flows and push
down resource-consuming operations to the database server.

Date Generation: Generates a column filled with date values based on the start and end dates
and increment you specify.

Effective Date: Generates an additional effective-to column based on the primary key's
effective date.

Hierarchy Flattening: Flattens hierarchical data into relational tables so that it can participate
in a star schema. Hierarchy flattening can be both vertical and horizontal.

Map CDC Operation: Sorts input data, maps output data, and resolves before and after versions
for UPDATE rows. While commonly used to support Oracle or mainframe changed data
capture, this transform supports any data stream if its input requirements are met.

Pivot: Rotates the values in specified columns to rows.

Reverse Pivot: Rotates the values in specified rows to columns.

XML Pipeline: Processes large XML inputs in small batches.
Using the Pivot transform
Introduction
The Pivot and Reverse Pivot transforms let you convert columns to rows and rows back into
columns.
After completing this unit, you will be able to:
• Use the Pivot transform
Explaining the Pivot transform
The Pivot transform creates a new row for each value in a column that you identify as a pivot
column.
It allows you to change how the relationship between rows is displayed. For each value in each
pivot column, Data Services produces a row in the output data set. You can create pivot sets
to specify more than one pivot column.
For example, you could produce a list of discounts by quantity for certain payment terms so
that each type of discount is listed as a separate record, rather than each being displayed in a
unique column.
The Reverse Pivot transform reverses the process, converting rows into columns.
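Conceptually, the Pivot transform's row expansion can be sketched in a few lines of Python. This is only an illustration of the logic, not product code; the column names (EmployeeID, Emp_Salary, Comp, Comp_Type) are borrowed from the activity later in this unit:

```python
# Illustrative sketch of the Pivot transform's row expansion.
# Column names are hypothetical examples, not a product API.

def pivot(rows, non_pivot_cols, pivot_cols, data_col="Comp",
          header_col="Comp_Type", seq_col="Pivot_Seq"):
    """For each input row, emit one output row per pivot column."""
    out = []
    for row in rows:
        for seq, col in enumerate(pivot_cols, start=1):
            new_row = {c: row[c] for c in non_pivot_cols}  # unchanged columns
            new_row[seq_col] = seq          # pivot sequence number
            new_row[header_col] = col       # which column the value came from
            new_row[data_col] = row[col]    # the pivoted value itself
            out.append(new_row)
    return out

source = [{"EmployeeID": 1, "Emp_Salary": 50000,
           "Emp_Bonus": 5000, "Emp_VacationDays": 20}]
target = pivot(source, ["EmployeeID"],
               ["Emp_Salary", "Emp_Bonus", "Emp_VacationDays"])
# One source row becomes three target rows, one per compensation type.
```

Each output row pairs the pivoted value with the name of the column it came from, which is exactly the data field / header column pairing described below.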
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Pivot transform. For more information on the Pivot transform see
“Transforms” Chapter 5 in the Data Services Reference Guide.
252 BusinessObjects Data Integrator XI 3.0: Core Concepts—Learner’s Guide
Inputs/Outputs Data inputs include a data set with rows flagged as NORMAL.
Data outputs include a data set with rows flagged as NORMAL. This target includes the
non-pivoted columns, a column for the sequence number, the data field column, and the pivot
header column.
Options The Pivot transform offers several options:
Option Description
Pivot sequence column Assign a name to the sequence number column. For each row created from a pivot column, Data Services increments and stores a sequence number.
Non-pivot columns Select the columns in the source that are to appear in the target without modification.
Pivot set Identify a number for the pivot set. For each pivot set, you define a group of pivot columns, a pivot data field, and a pivot header name.
Data field column Specify the column that contains the pivoted data. This column contains all of the pivot columns' values.
Header column Specify the name of the column that contains the pivoted column names. This column lists the names of the columns where the corresponding data originated.
Pivot columns Select the columns to be rotated into rows. Describe these columns in the Header column. Describe the data in these columns in the Data field column.
To pivot a table
1. Open the data flow workspace.
2. Add your source object to the workspace.
3. On the Transforms tab of the Local Object Library, click and drag the Pivot or Reverse Pivot
transform to the workspace to the right of your source object.
4. Add your target object to the workspace.
5. Connect the source object to the transform.
6. Connect the transform to the target object.
7. Double-click the Pivot transform to open the transform editor.
8. Click and drag any columns that will not be changed by the transform from the input schema
area to the Non-Pivot Columns area.
9. Click and drag any columns that will be pivoted from the input schema area to the Pivot
Columns area.
If required, you can create more than one pivot set by clicking Add.
10. If desired, change the values in the Pivot sequence column, Data field column, and Header
column fields.
These are the new columns that will be added to the target object by the transform.
11. Click Back to return to the data flow workspace.
Activity: Using the Pivot transform Currently, employee compensation information is loaded into a table with a separate column
each for salary, bonus, and vacation days. For reporting purposes, you need each of these items to be a separate record in the HR datamart.
Objective • Use the Pivot transform to create a separate row for each entry in a new employee
compensation table.
Instructions 1. In the Omega project, create a new batch job called Alpha_HR_Comp_Job with a data flow
called Alpha_HR_Comp_DF.
2. Add the HR_Comp_Update table from the Alpha datastore to the workspace as the source
object.
3. Add the Pivot transform and connect it to the source object.
4. Add the Query transform and connect it to the Pivot transform.
5. Create a new template table called Employee_Comp in the Delta datastore as the target object.
6. In the transform editor for the Pivot transform, specify that the EmployeeID and
date_updated fields are non-pivot columns.
7. Specify that the Emp_Salary, Emp_Bonus, and Emp_VacationDays fields are pivot columns.
8. Specify that the data field column is called Comp, and the header column is called Comp_Type.
9. In the transform editor for the Query transform, map all fields from input schema to output
schema.
10. On the WHERE tab, filter out NULL values for the Comp column.
The expression should be as follows:
Pivot.Comp is not null
11. Execute Alpha_HR_Comp_Job with the default execution properties and save all objects
you have created.
12. Return to the data flow workspace and view data for the target table.
A solution file called SOLUTION_Pivot.atl is included on your resource CD. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.
Using the Hierarchy Flattening transform
Introduction The Hierarchy Flattening transform enables you to break down hierarchical table structures
into a single table to speed up data access.
After completing this unit, you will be able to:
• Use the Hierarchy Flattening transform
Explaining the Hierarchy Flattening transform The Hierarchy Flattening transform constructs a complete hierarchy from parent/child
relationships, and then produces a description of the hierarchy in horizontally- or
vertically-flattened format.
For horizontally-flattened hierarchies, each row of the output describes a single node in the
hierarchy and the path to that node from the root.
For vertically-flattened hierarchies, each row of the output describes a single relationship between ancestor and descendant and the number of nodes the relationship includes. There is a row in the output for each node and all of the descendants of that node. Each node is considered its own descendant and, therefore, is listed one time as both ancestor and descendant.
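Vertical flattening amounts to computing every (ancestor, descendant) pair together with its depth. A minimal Python sketch of that logic, assuming an acyclic hierarchy (this illustrates the idea only, not the product's implementation):

```python
# Sketch of vertical hierarchy flattening: from parent-child rows, produce one
# row per (ancestor, descendant) pair with its depth. Assumes the hierarchy
# has no cycles. Each node appears once as its own ancestor/descendant (depth 0).

def flatten_vertical(edges):
    """edges: list of (parent, child) pairs.
    Returns (ancestor, descendant, depth) rows."""
    children = {}
    nodes = set()
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
        nodes.update((parent, child))

    rows = []
    for root in nodes:                      # every node acts as an ancestor
        stack = [(root, 0)]
        while stack:
            node, depth = stack.pop()
            rows.append((root, node, depth))
            for c in children.get(node, []):
                stack.append((c, depth + 1))
    return rows

# A small CEO -> VP -> Manager reporting chain
rows = flatten_vertical([("CEO", "VP"), ("VP", "Manager")])
```

Note that the depth-0 rows are the self-pairs; a filter such as the activity's `DEPTH > 0` WHERE clause removes them so only real reporting relationships remain.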
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Hierarchy Flattening transform. For more information on the
Hierarchy Flattening transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
Inputs/Outputs Data input includes rows describing individual parent-child relationships. Each row must
contain two columns that function as the keys of the parent and child in the relationship. The
input can also include columns containing attributes describing the parent and/or child.
The input data set cannot include rows with operations other than NORMAL, but can contain
hierarchical data.
For a listing of the target columns, consult the Data Services Reference Guide.
Options The Hierarchy Flattening transform offers several options:
Option Description
Parent column Identifies the column of the source data that contains the
parent identifier in each parent-child relationship.
Child column Identifies the column in the source data that contains the
child identifier in each parent-child relationship.
Flattening type Indicates how the hierarchical relationships are described
in the output.
Option Description
Use maximum length paths
Indicates whether longest or shortest paths are used to describe relationships between descendants and ancestors when the descendant has more than one parent.
Maximum depth Indicates the maximum depth of the hierarchy.
Parent attribute list Identifies a column or columns that are associated with
the parent column.
Child attribute list Identifies a column or columns that are associated with
the child column.
Run as a separate process
Creates a separate sub-data flow process for the
Hierarchy Flattening transform when Data Services
executes the data flow.
Activity: Using the Hierarchy Flattening transform The Employee table in the Alpha datastore contains employee data in a recursive hierarchy.
To determine all reports, direct or indirect, to a given executive or manager would require
complex SQL statements to traverse the hierarchy.
Objective • Flatten the hierarchy to allow more efficient reporting on data.
Instructions 1. In the Omega project, create a new batch job called Alpha_Employees_Report_Job with a
data flow called Alpha_Employees_Report_DF.
2. In the data flow workspace, add the Employee table from the Alpha datastore as the source
object.
3. Create a template table called Manager_Emps in the HR_datamart datastore as the target
object.
4. Add a Hierarchy Flattening transform to the right of the source table and connect the source
table to the transform.
5. In the transform editor for the Hierarchy Flattening transform, select the following options:
Option Value
Flattening Type Vertical
Parent Column REPORTSTO
Option Value
Child Column EMPLOYEEID
Child Attribute List
LASTNAME
FIRSTNAME
BIRTHDATE
HIREDATE
ADDRESS
CITYID
REGIONID
COUNTRYID
PHONE
DEPARTMENTID
LastUpdate
discharge_date
6. Add a Query transform to the right of the Hierarchy Flattening transform and connect the transforms.
7. In the transform editor of the Query transform, create the following output columns:
Column Datatype
MANAGERID varchar(10)
MANAGER_NAME varchar(50)
EMPLOYEEID varchar(10)
EMPLOYEE_NAME varchar(102)
DEPARTMENT varchar(50)
HIREDATE datetime
LASTUPDATE datetime
PHONE varchar(20)
Column Datatype
EMAIL varchar(50)
ADDRESS varchar(200)
CITY varchar(50)
REGION varchar(50)
COUNTRY varchar(50)
DISCHARGE_DATE datetime
DEPTH int
ROOT_FLAG int
LEAF_FLAG int
8. Map the output columns as follows:
Schema In Schema Out
ANCESTOR MANAGERID
DESCENDENT EMPLOYEEID
DEPTH DEPTH
ROOT_FLAG ROOT_FLAG
LEAF_FLAG LEAF_FLAG
C_ADDRESS ADDRESS
C_discharge_date DISCHARGE_DATE
C_EMAIL EMAIL
C_HIREDATE HIREDATE
C_LastUpdate LASTUPDATE
Schema In Schema Out
C_PHONE PHONE
9. Create a mapping expression for the MANAGER_NAME column to look up the manager's
last name from the Employee table in the Alpha datastore based on the employee ID in the
ANCESTOR column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.alpha.employee, 'PRE_LOAD_CACHE', 'MAX'], [LASTNAME], [NULL],
[EMPLOYEEID, '=', Hierarchy_Flattening.ANCESTOR]) SET
("run_as_separate_process"='no')
10. Create a mapping expression for the EMPLOYEE_NAME column to concatenate the
employee's last name and first name, separated by a comma.
The expression should be:
Hierarchy_Flattening.C_LASTNAME || ', ' || Hierarchy_Flattening.C_FIRSTNAME
11. Create a mapping expression for the DEPARTMENT column to look up the name of the
employee's department from the Department table in the Alpha datastore based on the
C_DEPARTMENTID column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.alpha.department, 'PRE_LOAD_CACHE', 'MAX'], [DEPARTMENTNAME],
[NULL], [DEPARTMENTID, '=', Hierarchy_Flattening.C_DEPARTMENTID]) SET
("run_as_separate_process"='no')
12. Create a mapping expression for the CITY column to look up the name of the employee's
city from the City table in the Alpha datastore based on the C_CITYID column of the
Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.alpha.city, 'PRE_LOAD_CACHE', 'MAX'], [CITYNAME], [NULL],
[CITYID, '=', Hierarchy_Flattening.C_CITYID]) SET
("run_as_separate_process"='no')
13. Create a mapping expression for the REGION column to look up the name of the employee's
region from the Region table in the Alpha datastore based on the C_REGIONID column of
the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.alpha.region, 'PRE_LOAD_CACHE', 'MAX'], [REGIONNAME], [NULL],
[REGIONID, '=',Hierarchy_Flattening.C_REGIONID]) SET
("run_as_separate_process"='no')
14. Create a mapping expression for the COUNTRY column to look up the name of the
employee's country from the Country table in the Alpha datastore based on the
C_COUNTRYID column of the Hierarchy Flattening transform.
The expression should be:
lookup_ext([Alpha.alpha.country, 'PRE_LOAD_CACHE', 'MAX'], [COUNTRYNAME],
[NULL], [COUNTRYID, '=', Hierarchy_Flattening.C_COUNTRYID]) SET
("run_as_separate_process"='no')
15. Add a WHERE clause to the Query transform to return only rows where the depth is greater
than zero.
The expression should be as follows:
Hierarchy_Flattening.DEPTH > 0
16. Execute Alpha_Employees_Report_Job with the default execution properties and save all
objects you have created.
17. Return to the data flow workspace and view data for the target table.
Note that 179 rows were written to the target table.
A solution file called SOLUTION_HierarchyFlattening.atl is included on your resource CD. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.
Describing performance optimization
Introduction You can improve the performance of your jobs by pushing down operations to the source or
target database to reduce the number of rows and operations that the engine must retrieve and
process.
After completing this unit, you will be able to:
• List operations that Data Services pushes down to the database
• View SQL code generated by a data flow
• Explore data caching options
• Explain process slicing
Describing push-down operations Data Services examines the database and its environment when determining which operations
to push down to the database:
• Full push-down operations
The Data Services optimizer always tries to do a full push-down operation. Full push-down operations are operations that can be pushed down to the databases so that the data streams directly from the source database to the target database. For example, Data Services sends SQL INSERT INTO... SELECT statements to the target database and it sends SELECT statements to retrieve data from the source.
Data Services can only do full push-down operations to the source and target databases when the following conditions are met:
○ All of the operations between the source table and target table can be pushed down
○ The source and target tables are from the same datastore or they are in datastores that
have a database link defined between them.
• Partial push-down operations
When a full push-down operation is not possible, Data Services tries to push down the
SELECT statement to the source database. Operations within the SELECT statement that
can be pushed to the database include:
Operation Description
Aggregations
Aggregate functions, typically used with a
Group by statement, always produce a data
set smaller than or the same size as the
original data set.
Operation Description
Distinct rows Data Services will only output unique rows
when you use distinct rows.
Filtering Filtering can produce a data set smaller than
or equal to the original data set.
Joins Joins typically produce a data set smaller
than or similar in size to the original tables.
Ordering
Ordering does not affect data set size. Data Services can efficiently sort data sets that fit in memory. Since Data Services does not perform paging (writing out intermediate results to disk), it is recommended that you use a dedicated disk-sorting program such as SyncSort or the DBMS itself to order very large data sets.
Projections
A projection normally produces a smaller data set because it only returns columns referenced by a data flow.
Functions
Most Data Services functions that have equivalents in the underlying database are appropriately translated.
Operations that cannot be pushed down
Data Services cannot push some transform operations to the database. For example:
• Expressions that include Data Services functions that do not have database correspondents.
• Load operations that contain triggers.
• Transforms other than Query.
• Joins between sources that are on different database servers that do not have database links
defined between them.
Similarly, not all operations can be combined into single requests. For example, when a stored
procedure contains a COMMIT statement or does not return a value, you cannot combine the
stored procedure SQL with the SQL for other operations in a query. You can only push
operations supported by the RDBMS down to that RDBMS.
Note: You cannot push built-in functions or transforms to the source database. For best
performance, do not intersperse built-in transforms among operations that can be pushed down
to the database. Database-specific functions can only be used in situations where they will be
pushed down to the database for execution.
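The benefit of a partial push-down can be illustrated outside the product with a small sqlite3 example: the same filter, applied in the SELECT statement sent to the database rather than in the engine after fetching, produces the same result while moving far fewer rows. The table and values are invented for the illustration:

```python
import sqlite3

# Contrast engine-side filtering with a filter pushed down into the SELECT.
# The orders table and its contents are made up for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 10) for i in range(1000)])

# No push-down: fetch every row, then filter in the engine.
fetched = conn.execute("SELECT id, amount FROM orders").fetchall()
engine_side = [r for r in fetched if r[1] > 9900]

# Push-down: the database applies the WHERE clause, so only the
# qualifying rows ever leave the database.
pushed = conn.execute(
    "SELECT id, amount FROM orders WHERE amount > 9900").fetchall()
```

Both paths yield identical results, but without the push-down all 1000 rows had to be retrieved and processed by the engine first.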
Viewing SQL generated by a data flow Before running a job, you can view the SQL generated by the data flow and adjust your design
to maximize the SQL that is pushed down to improve performance. Alter your design to
improve the data flow when necessary.
Keep in mind that Data Services only shows the SQL generated for table sources. Data Services
does not show the SQL generated for SQL sources that are not table sources, such as the lookup
function, the Key Generation transform, the key_generation function, the Table Comparison
transform, and target tables.
To view SQL
1. In the Data Flows tab of the Local Object Library, right-click the data flow and select Display
Optimized SQL from the menu.
The Optimized SQL dialog box displays.
2. In the left pane, select the datastore for the data flow.
The optimized SQL for the datastore displays in the right pane.
Caching data You can improve the performance of data transformations that occur in memory by caching
as much data as possible. By caching data, you limit the number of times the system must
access the database. Cached data must fit into available memory.
Pageable caching Data Services allows administrators to select a pageable cache location to save content over the
2 GB RAM limit. The pageable cache location is set up in Server Manager and the option to use
pageable cache is selected on the Dataflow Properties dialog box.
Persistent caching Persistent cache datastores can be created through the Create New Datastore dialog box by
selecting Persistent Cache as the database type. The newly-created persistent cache datastore
will appear in the list of datastores, and can be used as a source in jobs.
For more information about advanced caching features, see the Data Services Performance
Optimization Guide.
Slicing processes You can also optimize your jobs through process slicing, which involves splitting data flows
into sub-data flows.
Sub-data flows work on smaller data sets and/or fewer transforms so there is less virtual
memory to consume per process. This way, you can leverage more physical memory per data
flow as each sub-data flow can access 2 GB of memory.
This functionality is available through the Advanced tab for the Query transform. You can run
each memory-intensive operation as a separate process.
For more information on process slicing, see the Data Services Performance Optimization Guide.
Using the Data Transfer transform
Introduction The Data Transfer transform allows a data flow to split its processing into two sub-data flows
and push down resource-consuming operations to the database server.
After completing this unit, you will be able to:
• Use the Data Transfer transform
Explaining the Data Transfer transform The Data Transfer transform moves data from a source or the output from another transform
into a transfer object and subsequently reads data from the transfer object. You can use the
Data Transfer transform to push down resource-intensive database operations that occur
anywhere within the data flow. The transfer type can be a relational database table, persistent
cache table, file, or pipeline.
Use the Data Transfer transform to:
• Push down operations to the database server when the transfer type is a database table. You
can push down resource-consuming operations such as joins, GROUP BY, and sorts.
• Define points in your data flow where you want to split processing into multiple sub-data
flows that each process part of the data. Data Services does not need to process the entire
input data in memory at one time. Instead, the Data Transfer transform splits the processing
among multiple sub-data flows that each use a portion of memory.
The next section gives a brief description of the function, data input requirements, options, and
data output results for the Data Transfer transform. For more information on the Data Transfer
transform see “Transforms” Chapter 5 in the Data Services Reference Guide.
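The idea behind the table transfer type can be sketched with sqlite3: once the in-flight rows are written to a transfer table, the join becomes plain SQL that the database server executes. All names here are invented, and this is a conceptual sketch, not how Data Services is implemented internally:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employee (employeeid INTEGER, lastname TEXT)")
conn.executemany("INSERT INTO employee VALUES (?, ?)",
                 [(1, "Smith"), (2, "Jones")])

# Rows produced earlier in the (hypothetical) data flow, held by the engine.
comp_rows = [(1, "Salary", 50000), (2, "Bonus", 5000)]

# Data Transfer idea: write the in-flight rows to a transfer table...
conn.execute("CREATE TABLE pushdown_data "
             "(employeeid INTEGER, comp_type TEXT, comp INTEGER)")
conn.executemany("INSERT INTO pushdown_data VALUES (?, ?, ?)", comp_rows)

# ...so the join is pushed down and runs inside the database server,
# not in the engine.
joined = conn.execute("""
    SELECT e.lastname, p.comp_type, p.comp
    FROM employee e JOIN pushdown_data p ON e.employeeid = p.employeeid
    ORDER BY e.employeeid
""").fetchall()
```

This mirrors the activity later in this unit, where staging the Employee_Comp rows through a transfer table makes the join appear in the optimized SQL.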
Inputs/Outputs When the input data set for the Data Transfer transform is a table or file transfer type, the rows
must be flagged with the NORMAL operation code. When the input data set is a pipeline transfer
type, the rows can be flagged as any operation code.
The input data set must not contain hierarchical (nested) data.
Output data sets have the same schema and same operation code as the input data sets. In the
push down scenario, the output rows are in the sort or GROUP BY order.
The sub-data flow names use the following format, where n is the number of the data flow: dataflowname_n
The execution of the output depends on the temporary transfer type:
For Table or File temporary transfer types, Data Services automatically splits the data flow into
sub-data flows and executes them serially.
For Pipeline transfer types, Data Services splits the data flow into sub-data flows if you specify
the Run as a separate process option in another operation in the data flow. Data Services
executes these sub-data flows that use pipeline in parallel.
Activity: Using the Data Transfer transform The Data Transfer transform can be used to push data down to a database table so that it can
be processed by the database server rather than the Data Services Job Server. In this activity,
you will join data from two database schemas. When the Data Transfer transform is not used,
the join will occur on the Data Services Job Server. When the Data Transfer transform is added
to the data flow the join can be seen in the SQL Query by displaying the optimized SQL for the
data flow.
Objective • Use the Data Transfer transform to optimize performance.
Instructions 1. In the Omega project, create a new batch job called No_Data_Transfer_Job with a data flow
called No_Data_Transfer_DF.
2. In the Delta datastore, import the Employee_Comp table and add it to the
No_Data_Transfer_DF workspace as a source table.
3. Add the Employee table from the Alpha datastore as a source table.
4. Add a Query transform to the data flow workspace and attach both source tables to the
transform.
5. In the transform editor for the Query transform, add the LastName and BirthDate columns
from the Employee table and the Comp_Type and Comp columns from the Employee_Comp
table to the output schema.
6. Add a WHERE clause to join the tables on the EmployeeID columns.
7. Create a template table called Employee_Temp in the Delta datastore as the target object
and connect it to the Query transform.
8. Save the job.
9. In the Local Object Library, use the right-click menu for the No_Data_Transfer_DF data
flow to display the optimized SQL.
Note that the WHERE clause does not appear in either SQL statement.
10. In the Local Object Library, replicate the No_Data_Transfer_DF data flow and rename the
copy Data_Transfer_DF.
11. In the Local Object Library, replicate the No_Data_Transfer_Job job and rename the copy
Data_Transfer_Job.
12. Add the Data_Transfer_Job job to the Omega project.
13. Delete the No_Data_Transfer_DF data flow from the Data_Transfer_Job and add the
Data_Transfer_DF data flow to the job by dragging it from the Local Object Library to the
job's workspace.
14. Delete the connection between the Employee_Comp table and the Query transform.
15. Add a Data Transfer transform between the Employee_Comp table and the Query transform
and connect the three objects.
16. In the transform editor for the Data Transfer transform, select the following options:
Option Value
Transfer Type Table
Table Name alpha.pushdown_data
Note: You must manually enter the name of the table. The table is created when the job
runs and is dropped automatically at the end.
17. In the transform editor for the Query transform, update the WHERE clause to join the
Data_Transfer.employeeid and employee.employeeid fields. Verify the Comp_Type and
Comp columns are mapped to the Data Transfer transform.
18. Save the job.
19. In the Local Object Library, use the right-click menu for the Data_Transfer_DF data flow to
display the optimized SQL.
Note that the WHERE clause appears in the SQL statements.
A solution file called SOLUTION_DataTransfer.atl is included on your resource CD. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.
Using the XML Pipeline transform
Introduction The XML Pipeline transform is used to process large XML files more efficiently by separating
them into small batches.
After completing this unit, you will be able to:
• Use the XML Pipeline transform
Explaining the XML Pipeline transform The XML Pipeline transform is used to process large XML files, one instance of a specified
repeatable structure at a time.
With this transform, Data Services does not need to read the entire XML input into memory
and build an internal data structure before performing the transformation.
This means that an NRDM (nested relational data model) structure is not required to represent the entire XML data input.
Instead, this transform uses a portion of memory to process each instance of a repeatable
structure, then continually releases and re-uses the memory to continuously flow XML data
through the transform.
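The same streaming idea can be illustrated with Python's xml.etree.ElementTree.iterparse, which likewise handles one instance of a repeatable element at a time and releases it, instead of building the whole document in memory. The document and its `<item>` element are invented examples:

```python
# Sketch of the streaming idea behind the XML Pipeline transform: parse one
# instance of a repeatable element at a time rather than loading the whole
# document. The XML content here is an invented example.
import io
import xml.etree.ElementTree as ET

xml_doc = io.BytesIO(b"""<PurchaseOrders>
  <item><name>widget</name></item>
  <item><name>gadget</name></item>
</PurchaseOrders>""")

names = []
for event, elem in ET.iterparse(xml_doc, events=("end",)):
    if elem.tag == "item":
        names.append(elem.findtext("name"))
        elem.clear()  # release the finished instance so memory stays bounded
```

Clearing each finished element is the analogue of the transform continually releasing and re-using memory as instances flow through.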
During execution, Data Services pushes operations of the streaming transform to the XML
source. Therefore, you cannot use a breakpoint between your XML source and an XML Pipeline
transform.
Note:
You can use the XML Pipeline transform to load into a relational or nested schema target. This
course focuses on loading XML data into a relational target.
For more information on constructing nested schemas for your target, refer to the Data Services
Designer Guide.
Inputs/Outputs You can use an XML file or XML message. You can also connect more than one XML Pipeline
transform to an XML source.
When connected to an XML source, the transform editor shows the input and output schema
structures as a root schema containing repeating and non-repeating sub-schemas represented
by these icons:
Icon Schema structure
Root schema and repeating sub-schema
Icon Schema structure
Non-repeating sub-schema
Keep in mind these rules when using the XML Pipeline transform:
• You cannot drag and drop the root level schema.
• You can drag and drop the same child object repeated times to the output schema, but only
if you give each instance of that object a unique name. Rename the mapped instance before
attempting to drag and drop the same object to the output again.
• When you drag and drop a column or sub-schema to the output schema, you cannot then
map the parent schema for that column or sub-schema. Similarly, when you drag and drop
a parent schema, you cannot then map an individual column or sub-schema from under
that parent.
• You cannot map items from two sibling repeating sub-schemas because the XML Pipeline
transform does not support Cartesian products (combining every row from one table with
every row in another table) of two repeatable schemas.
To take advantage of the XML Pipeline transform’s performance, always select a repeatable
column to be mapped. For example, if you map a repeatable schema column, the XML source
produces one row after parsing one item.
Avoid selecting non-repeatable columns that occur structurally after the repeatable schema
because the XML source must then assemble the entire structure of items in memory before
processing. Selecting non-repeatable columns that occur structurally after the repeatable schema
increases memory consumption to process the output into your target.
To map both the repeatable schema and a non-repeatable column that occurs after the repeatable
one, use two XML Pipeline transforms, and use the Query transform to combine the outputs
of the two XML Pipeline transforms and map the columns into one single target.
Options The XML Pipeline is streamlined to support massive throughput of XML data; therefore, it
does not contain additional options other than input and output schemas, and the Mapping
tab.
Activity: Using the XML Pipeline transform Purchase order information is stored in XML files that have repeatable purchase orders and
items, and a non-repeated Total Purchase Orders column. You must combine the customer
name, order date, order items, and the totals into a single relational target table, with one row
per customer per item.
Objectives • Use the XML Pipeline transform to extract XML data.
• Combine the rows required from both XML sources into a single target table joined using
a Query transform
Instructions 1. On the Formats tab of the Local Object Library, create a new file format for an XML schema
called PurchaseOrders_Format, based on the purchaseOrders.xsd file in the Activity_Source
folder. Use a root element of PurchaseOrders.
2. In the Omega project, create a new job called Alpha_Purchase_Orders_Job, with a data flow
called Alpha_Purchase_Orders_DF.
3. In the data flow workspace for Alpha_Purchase_Orders_DF, add the PurchaseOrders_Format file
format as the XML file source object.
4. In the format editor for the file format, point the file format to the pos.xml file in the
Activity_Source folder.
5. Add two instances of the XML Pipeline transform to the data flow workspace and connect
the source object to each.
6. In the transform editor for the first XML Pipeline transform, map the following columns:
Schema In Schema Out
customerName customerName
orderDate orderDate
7. Map the entire item repeatable schema from the input schema to the output schema.
8. In the transform editor for the second XML Pipeline transform, map the following columns:
Schema In Schema Out
customerName customerName
orderDate orderDate
totalPOs totalPOs
9. Add a Query transform to the data flow workspace and connect both XML Pipeline transforms to it.
10. In the transform editor for the Query transform, map both columns and the repeatable
schema from the first XML Pipeline transform from the input schema to the output schema.
Also map the totalPOs column from the second XML Pipeline transform.
11. Unnest the item repeatable schema.
12. Create a WHERE clause to join the inputs from the two XML Pipeline transforms on the
customerName column.
The expression should be as follows:
XML_Pipeline.customerName = XML_Pipeline_1.customerName
13. Add a new template table called Item_POs to the Delta datastore and connect the Query
transform to it.
14. Execute Alpha_Purchase_Orders_Job with the default execution properties and save all
objects you have created.
15. Return to the data flow workspace and view data for the target table.
A solution file called SOLUTION_XMLPipeline.atl is included on your resource CD. To check the solution, import the file and open it to view the data flow design and mapping logic. Do not execute the solution job, as this may overwrite the results in your target table.
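For readers who want to see the data movement outside the Designer, here is a minimal Python sketch of what the completed data flow does: unnest the repeatable item schema and join the two pipeline outputs on customerName. The dictionaries below are hypothetical stand-ins for the parsed pos.xml content, not the actual file.

```python
# Sketch of the activity's logic: unnest the repeatable item schema and
# join the per-order rows with the totalPOs value on customerName.
# All input data here is invented for illustration.

orders = [  # output of the first XML Pipeline transform (items still nested)
    {"customerName": "Acme", "orderDate": "2008-01-15",
     "items": [{"sku": "A1", "qty": 2}, {"sku": "B2", "qty": 1}]},
]
totals = [  # output of the second XML Pipeline transform
    {"customerName": "Acme", "orderDate": "2008-01-15", "totalPOs": 3},
]

# Unnest: one output row per customer per item (step 11).
unnested = [
    {"customerName": o["customerName"], "orderDate": o["orderDate"], **item}
    for o in orders for item in o["items"]
]

# Join on customerName (step 12's WHERE clause).
by_customer = {t["customerName"]: t["totalPOs"] for t in totals}
result = [dict(row, totalPOs=by_customer[row["customerName"]]) for row in unnested]

for row in result:
    print(row)
```

With the sample input, this yields two rows for Acme, one per item, each carrying the totalPOs value.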
Quiz: Using Data Integrator transforms
1. What is the Pivot transform used for?
2. What is the purpose of the Hierarchy Flattening transform?
3. What is the difference between horizontally and vertically flattened hierarchies?
4. List three things you can do to improve job performance.
5. Name three options that can be pushed down to the database.
Lesson summary
After completing this lesson, you are now able to:
• Describe the Data Integrator transforms
• Use the Pivot transform
• Use the Hierarchy Flattening transform
• Describe performance optimization
• Use the Data Transfer transform
• Use the XML Pipeline transform
Answer Key
This section contains the answers to the reviews and/or activities for the applicable lessons.
Quiz: Describing Data Services Page 28
1. List two benefits of using Data Services.
Answer:
○ Create a single infrastructure for data movement to enable faster and lower cost
implementation.
○ Manage data as a corporate asset independent of any single system.
○ Integrate data across many systems and re-use that data for many purposes.
○ Improve performance.
○ Reduce burden on enterprise systems.
○ Prepackage data solutions for fast deployment and quick return on investment (ROI).
○ Cleanse customer and operational data anywhere across the enterprise.
○ Enhance customer and operational data by appending additional information to increase
the value of the data.
○ Match and consolidate data at multiple levels within a single pass for individuals,
households, or corporations.
2. Which of these objects is single-use?
Answer:
b. Project
3. Place these objects in order by their hierarchy: data flows, jobs, projects, and work flows.
Answer: Projects, jobs, work flows, data flows.
4. Which tool do you use to associate a job server with a repository?
Answer: The Data Services Server Manager.
5. Which tool allows you to create a repository?
Answer: The Data Services Repository Manager.
6. What is the purpose of the Access Server?
Answer: The Access Server is a real-time, request-reply message broker that collects incoming
XML message requests, routes them to a real-time service, and delivers a message reply
within a user-specified time frame.
Quiz: Defining source and target metadata Page 66
1. What is the difference between a datastore and a database?
Answer: A datastore is a connection to a database.
2. What are the two methods in which metadata can be manipulated in Data Services objects?
What does each of these do?
Answer:
You can use an object’s options and properties settings to manipulate Data Services objects.
Options control the operation of objects. For example, the name of the database to connect
to is a datastore option.
Properties document the object. For example, the name of the datastore and the date on
which it was created are datastore properties. Properties are merely descriptive of the object
and do not affect its operation.
3. Which of the following is NOT a datastore type?
Answer:
d. File Format
4. What is the difference between a repository and a datastore?
Answer: A repository is a set of tables that hold system objects, source and target metadata,
and transformation rules. A datastore is an actual connection to a database that holds data.
Quiz: Creating batch jobs Page 99
1. Does a job have to be part of a project to be executed in the Designer?
Answer: Yes. Jobs can be created separately in the Local Object Library, but they must be
associated with a project in order to be executed.
2. How do you add a new template table?
Answer: Click and drag the Template Table icon from the tool palette or from the Datastores
tab of the Local Object Library to the workspace.
3. Name the objects contained within a project.
Answer: Examples of objects are: jobs, work flows, and data flows.
4. What factors might you consider when determining whether to run work flows or data
flows serially or in parallel?
Answer:
Consider the following:
○ Whether or not the flows are independent of each other
○ Whether or not the server can handle the processing requirements of flows running at
the same time (in parallel)
Quiz: Troubleshooting batch jobs Page 128
1. List some reasons why a job might fail to execute.
Answer: Incorrect syntax, Job Server not running, port numbers for Designer and Job Server
not matching.
2. Explain the View Data feature.
Answer: View Data allows you to look at the data for a source or target file.
3. What must you define in order to audit a data flow?
Answer: You must define audit points and audit rules when you want to audit a data flow.
4. True or false? The auditing feature is disabled when you run a job with the debugger.
Answer: True.
Quiz: Using functions, scripts, and variables Page 173
1. Describe the differences between a function and a transform.
Answer: Functions operate on single values, such as values in specific columns in a data
set. Transforms operate on data sets, creating, updating, and deleting rows of data.
2. Why are functions used in expressions?
Answer: Functions can be used in expressions to map return values as new output columns.
Adding output columns allows columns that are not in an input data set to be specified in
an output data set.
3. What does a lookup function do? How do the different variations of the lookup function
differ?
Answer: All lookup functions return one row for each row in the source. They differ in how
they choose which of several matching rows to return.
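A rough Python analogue of this behavior follows. The table, column names, and data are invented for illustration, and only the Min/Max policies mentioned in the next answer are modeled; Data Services' own lookup functions offer more options.

```python
# Rough analogue of a lookup: for every source row, return exactly one
# value from the translate table. When several rows match, a return
# policy (min or max here, mirroring Lookup_ext's Return Policy) picks one.
def lookup(source_rows, translate, key, return_col, policy=max):
    out = []
    for row in source_rows:
        matches = [t[return_col] for t in translate if t[key] == row[key]]
        out.append(policy(matches) if matches else None)
    return out

source = [{"cust_id": 1}, {"cust_id": 2}]
translate = [
    {"cust_id": 1, "credit_limit": 500},
    {"cust_id": 1, "credit_limit": 900},  # duplicate match for cust_id 1
    {"cust_id": 2, "credit_limit": 300},
]
print(lookup(source, translate, "cust_id", "credit_limit", policy=max))  # [900, 300]
print(lookup(source, translate, "cust_id", "credit_limit", policy=min))  # [500, 300]
```

Note that one value comes back per source row regardless of how many translate rows match; only the policy changes which value is chosen.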
4. What value would the Lookup_ext function return if multiple matching records were found
on the translate table?
Answer: It depends on the Return Policy setting (Min or Max).
5. Explain the differences between a variable and a parameter.
Answer: A parameter is an expression that passes a piece of information to a work flow,
data flow, or custom function when it is called in a job. A variable is a symbolic placeholder
for values.
6. When would you use a global variable instead of a local variable?
Answer:
○ When the variable will need to be used multiple times within a job.
○ When you want to reduce the development time required for passing values between
job components.
○ When you need to create a dependency between job level global variable name and job
components.
7. What is the recommended naming convention for variables in Data Services?
Answer: Variable names must be preceded by a dollar sign ($). Local variables start with
$L_, while global variables can be denoted by $G_.
8. Which object would you use to define a value that is constant in one environment, but may
change when a job is migrated to another environment?
Answer:
d. Substitution parameter
Quiz: Using platform transforms Page 203
1. What would you use to change a row type from NORMAL to INSERT?
Answer: The Map Operation transform.
2. What is the Case transform used for?
Answer: The Case transform simplifies branch logic in data flows by consolidating case or decision-making logic into one transform, with multiple paths defined in an expression table.
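As an illustration only, the routing behavior can be sketched in Python. The labels, predicates, and rows below are invented; in Data Services the expressions live in the transform's expression table.

```python
# Sketch of Case-transform routing: each label has an expression, and
# each input row is sent down the first path whose expression is true
# (comparable to the "row can be true for one case only" behavior).
def case_route(rows, paths):
    routed = {label: [] for label in paths}
    for row in rows:
        for label, predicate in paths.items():
            if predicate(row):
                routed[label].append(row)
                break
    return routed

rows = [{"region": "NA"}, {"region": "EU"}, {"region": "NA"}]
paths = {
    "north_america": lambda r: r["region"] == "NA",
    "default": lambda r: True,  # catch-all path, checked last
}
routed = case_route(rows, paths)
print(len(routed["north_america"]), len(routed["default"]))  # 2 1
```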
3. Name the transform that you would use to combine incoming data sets to produce a single
output data set with the same schema as the input data sets.
Answer: The Merge transform.
4. A validation rule consists of a condition and an action on failure. When can you use the
action on failure options in the validation rule?
Answer:
You can use the action on failure option only if:
○ The column value failed the validation rule.
○ The Send to Pass or Send to Both option is selected.
5. When would you use the Merge transform versus the SQL transform to merge records?
Answer: The SQL transform performs better than the Merge transform, so it should be used
whenever possible. However, the SQL transform cannot join records from file formats, so
you would need to use the Merge transform for those source objects.
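A minimal sketch of the Merge transform's semantics, which match SQL's UNION ALL: same-schema inputs are concatenated and duplicates are retained. The sample rows are invented.

```python
# Merge-transform semantics: concatenate same-schema inputs (UNION ALL),
# keeping duplicate rows. Sample rows are invented for illustration.
file_a = [{"id": 1, "city": "Paris"}, {"id": 2, "city": "Rome"}]
file_b = [{"id": 2, "city": "Rome"}, {"id": 3, "city": "Oslo"}]

merged = file_a + file_b  # 4 rows; the duplicate Rome row is kept
print(len(merged))  # 4
```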
Quiz: Setting up error handling Page 220
1. List the different strategies you can use to avoid duplicate rows of data when re-loading a
job.
Answer:
○ Using the auto-correct load option in the target table.
○ Including the Table Comparison transform in the data flow.
○ Designing the data flow to completely replace the target table during each execution.
○ Including a preload SQL statement to execute before the table loads.
2. True or false? You can only run a job in recovery mode after the initial run of the job has
been set to run with automatic recovery enabled.
Answer: True.
3. What are the two scripts in a manual recovery work flow used for?
Answer: The first script determines if recovery is required, usually by reading the status in
a status table. The second script updates the status table to indicate successful job execution.
4. Which of the following types of exception can you NOT catch using a try/catch block?
Answer:
b. Syntax errors
Quiz: Capturing changes in data Page 247
1. What are the two most important reasons for using CDC?
Answer: Improving performance and preserving history.
2. Which method of CDC is preferred for the performance gain of extracting the fewest rows?
Answer: Source-based CDC.
3. What is the difference between an initial load and a delta load?
Answer:
An initial load is the first population of a database using data acquisition modules for
extraction, transformation, and load. The first time you execute a batch job, Designer performs
an initial load to create the data tables and populate them.
A delta load incrementally loads data that has been changed or added since the last load
iteration. When you execute your job, the delta load may run several times, loading data
from the specified number of rows each time until all new data has been written to the target
database.
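The distinction can be sketched with an in-memory SQLite table. The table name, columns, and timestamps below are invented; a real delta load would typically track the last load time in a status table.

```python
# Sketch of an initial load vs. a timestamp-based delta load, using an
# in-memory SQLite target. All names and data are invented.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, updated TEXT)")

source = [(1, "2008-01-01"), (2, "2008-01-02")]

# Initial load: first population of the target table.
con.executemany("INSERT INTO target VALUES (?, ?)", source)
last_load = "2008-01-02"

# Delta load: only rows changed or added since the last load iteration.
source += [(3, "2008-02-01")]
delta = [r for r in source if r[1] > last_load]
con.executemany("INSERT INTO target VALUES (?, ?)", delta)

print(con.execute("SELECT COUNT(*) FROM target").fetchone()[0])  # 3
```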
4. What transforms do you typically use for target-based CDC?
Answer: Table Comparison, History Preserving, and Key Generation.
Quiz: Using Data Integrator transforms Page 273
1. What is the Pivot transform used for?
Answer: Use the Pivot transform when you want to group data from multiple columns into
one column while at the same time maintaining information linked to the columns.
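As a rough illustration, the column-to-row rotation the Pivot transform performs can be sketched in Python; the column names and values below are invented.

```python
# Sketch of the Pivot transform's effect: several quarter columns are
# rotated into (header, value) rows, keeping the non-pivot key column.
row = {"region": "West", "q1": 100, "q2": 120, "q3": 90}
pivot_columns = ["q1", "q2", "q3"]

pivoted = [
    {"region": row["region"], "quarter": col, "amount": row[col]}
    for col in pivot_columns
]
for r in pivoted:
    print(r)  # one output row per pivoted column
```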
2. What is the purpose of the Hierarchy Flattening transform?
Answer: The Hierarchy Flattening transform enables you to break down hierarchical table
structures into a single table to speed data access.
3. What is the difference between horizontally and vertically flattened hierarchies?
Answer:
With horizontally flattened hierarchies, each row of the output describes a single node in
the hierarchy and the path to that node from the root.
With vertically flattened hierarchies, each row of the output describes a single relationship
between an ancestor and a descendant and the number of nodes the relationship includes. There
is a row in the output for each node and all of the descendants of that node. Each node is
considered its own descendant and, therefore, is listed one time as both ancestor and
descendant.
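A small Python sketch of vertical flattening under these rules; the parent/child pairs are invented sample data.

```python
# Sketch of vertical hierarchy flattening: one output row per
# (ancestor, descendant, depth) triple, where every node also counts
# as its own descendant at depth 0.
children = {"root": ["a", "b"], "a": ["a1"], "b": [], "a1": []}

def descendants(node):
    """All (descendant, depth) pairs under node, including node itself."""
    pairs = [(node, 0)]
    for child in children[node]:
        pairs += [(d, depth + 1) for d, depth in descendants(child)]
    return pairs

rows = [(anc, desc, depth)
        for anc in children
        for desc, depth in descendants(anc)]

print(len(rows))  # 8: root has 4 descendants, a has 2, b and a1 have 1 each
```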
4. List three things you can do to improve job performance.
Answer:
Choose from the following:
○ Utilize the push-down operations.
○ View SQL generated by a data flow and adjust your design to maximize the SQL that is
pushed down to improve performance.
○ Use data caching.
○ Use process slicing.
5. Name three options that can be pushed down to the database.
Answer:
Choose from the following:
○ Aggregations (typically performed with a GROUP BY)
○ Distinct rows
○ Filtering
○ Joins
○ Ordering
○ Projections
○ Functions that have equivalents in the underlying database
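To see why push-down helps, compare a pushed-down aggregation with fetching detail rows, using an in-memory SQLite database as a stand-in (table name and data are invented).

```python
# Why push-down matters: letting the database do the GROUP BY returns
# one aggregated row per key instead of streaming every detail row to
# the engine for local aggregation.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
con.executemany("INSERT INTO sales VALUES (?, ?)",
                [("NA", 10), ("NA", 20), ("EU", 5)])

# Pushed down: the aggregation runs in the database; 2 rows come back.
pushed = con.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(pushed)  # [('EU', 5), ('NA', 30)]

# Not pushed down: all 3 detail rows are fetched, then aggregated locally.
detail = con.execute("SELECT region, amount FROM sales").fetchall()
print(len(detail))  # 3
```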