
Teradata Best Practices with Informatica PowerCenter 7.1.2

Informatica Confidential. Do not duplicate.


Introduction

This document discusses configuration and how-tos for using PowerCenter 7.1.2 with NCR’s Teradata RDBMS. It covers Teradata basics and describes some “tweaks” which experience has shown may be necessary to deal with some of the “common” practices one may encounter at a Teradata account. The Teradata documentation (especially the MultiLoad, FastLoad and Tpump references) is highly recommended reading material, as is the “External Loader” section of the PowerCenter Server Manager Guide.

Additional Information: All Teradata documentation can be downloaded from the NCR web site (http://www.info.ncr.com/Teradata/eTeradata-BrowseBy.cfm); it is also available on the Informatica Tech Support website (tsspider.informatica.com/Docs/page1.html). There is a nice Teradata FAQ in the Informatica Tech Support knowledge base (it contains a section on how to handle “timestamp” columns). Finally, there is a “Teradata Forum” that provides a wealth of sometimes useful information (http://www.Teradataforum.com).

Teradata Basics

Teradata is a relational database management system from NCR. It offers high performance for very large database tables because of its highly parallel architecture. It is a major player in the retail space. While Teradata can run on other platforms, it is predominantly found on NCR hardware (which runs NCR’s version of Unix). It is very fast and very scalable.

Teradata Hardware

The NCR computers on which Teradata runs support both MPP (Massively Parallel Processing) and SMP (Symmetric Multi-Processing). Each MPP “node” (or semi-autonomous processing unit) can support SMP.

Teradata can be configured to communicate directly with a mainframe’s I/O channel. This is known as “channel attached”. Alternatively, it can be “network attached”, that is, configured to communicate via TCP/IP over a LAN. Since PowerCenter runs on Unix, most of the time you will be dealing with a “network attached” configuration. However, once in a while, a client will want to use their existing “channel attached” configuration in the expectation of better performance. Do not assume that “channel attached” is always faster than “network attached”: similar performance has been observed across a channel attachment and a 100Mb LAN. In addition, “channel attached” requires an additional sequential data move, because the data must be moved from the PowerCenter server to the mainframe before it moves across the mainframe channel to Teradata.

Teradata Software

In the Teradata world, there are Teradata Director Program IDs (TDPIDs), databases and users. The TDPID is simply the name that one uses to connect from a Teradata client to a Teradata server (think of an Oracle “tnsnames.ora” entry). Teradata also treats databases and users somewhat synonymously. A user has a userid, password and space to store tables. A database is basically a user without a login and password (or, a user is a database with a userid and password).
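To make that symmetry concrete, here is a minimal BTEQ sketch; the tdpid, object names, sizes, and credentials are all hypothetical, and it assumes a login with CREATE privileges:

bteq <<'EOF'
.LOGON demo1099/dbc,dbc;
/* a database: space to store tables, but no login */
CREATE DATABASE demo_db AS PERM = 100000000;
/* a user: the same thing plus a userid and password */
CREATE USER demo_user AS PASSWORD = demo_user, PERM = 100000000;
.LOGOFF;
.QUIT;
EOF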

Teradata AMPs are Access Module Processors. Think of AMPs as Teradata’s parallel database engines. Although they are strictly software (“virtual processors” in NCR terminology), Teradata folks often seem to use AMP and hardware “node” interchangeably because in the “old days” an AMP was a piece of hardware.

Client Configuration Basics for Teradata

The client-side configuration is wholly contained in the “hosts” file (/etc/hosts on Unix or winnt\system32\drivers\etc\hosts on Windows). Since PowerCenter does not run on NCR Unix, one should never have to deal with the server side. Teradata uses a naming convention in the “hosts” file: the name of the Teradata instance (i.e. the tdpid, or Teradata Director Program ID) is indicated by the letters and numbers that precede the string “cop1” in a hosts file entry. For example:

127.0.0.1      localhost demo1099cop1
192.168.80.113 curly     pcop1

This tells Teradata that when a client tool references the instance “demo1099”, it should direct requests to “localhost” (IP address 127.0.0.1), and when a client tool references instance “p”, it should direct requests to the server “curly” (IP address 192.168.80.113). There is no tie here to any kind of database-server-specific information (this is not similar to Oracle’s instance ID; tdpid <> Oracle instance id!). That is, the tdpid is used strictly to define the name a client uses to connect to a server. You can really call a server whatever you want.


Teradata does not care. It simply takes the name you specify, looks in the “hosts” file to map <name>cop1 (or cop2, etc.) to an IP address, and then attempts to establish a connection with Teradata at that IP address.

Sometimes you’ll see multiple entries in a hosts file with similar tdpids:

127.0.0.1      localhost demo1099cop1
192.168.80.113 curly_1   pcop1
192.168.80.114 curly_2   pcop2
192.168.80.115 curly_3   pcop3
192.168.80.116 curly_4   pcop4

This setup allows load balancing of clients among multiple Teradata nodes. That is, most Teradata systems have many nodes, and each node has its own IP address. Without the multiple hosts file entries, every client will connect to one node, and eventually this node will be doing more than its “fair share” of client processing. With multiple hosts file entries, if it takes too long for the node specified with the “cop1” suffix (i.e. curly_1) to respond to the client request to connect to “p”, then the client will automatically attempt to connect to the node with the “cop2” suffix (i.e. curly_2), and so forth.

Informatica / Teradata Touch Points

PowerCenter 7.1.2 accesses Teradata through several Teradata tools. Each is described below, along with how it is configured within PowerCenter.

ODBC: Teradata provides 32-bit ODBC drivers for Windows and Unix platforms. If possible, use the ODBC driver from the TTU 7 release (or above) of Teradata’s client software, because this version supports “array reads”. Tests have shown these “new” drivers (3.02) can be 20%-30% faster than the “old” drivers (3.01). The latest release, TTU 8.0, uses ODBC v3.04. Teradata’s ODBC is on a performance par with Teradata’s SQL CLI. In fact, ODBC is Teradata’s recommended SQL interface for their partners.

Do not use ODBC to write to Teradata unless you are writing very small data sets (and even then, you should probably use Tpump, described later, instead), because Teradata’s ODBC is optimized for query access and, hence, is not optimized for writing data. ODBC is good for sourcing and lookups. PowerCenter Designer uses Teradata’s ODBC to import Source and Target tables.

If you are having performance problems, you can use a Command task with a shell script that calls BTEQ to run SQL into an intermediate work table, which PowerCenter could then source.
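A minimal sketch of such a Command task script follows; the tdpid, credentials, table names, and SQL are all hypothetical:

#!/bin/sh
# Stage a heavy query into a work table with BTEQ; a PowerCenter
# session can then source the work table via ODBC.
bteq <<'EOF'
.LOGON demo1099/infatest,infatest;
DROP TABLE infatest.wrk_orders;  /* ignore the error on the first run */
CREATE TABLE infatest.wrk_orders AS
 ( SELECT o.order_id, o.cust_key, o.order_total
   FROM infatest.orders o
   WHERE o.order_date >= DATE '2005-01-01' ) WITH DATA;
.IF ERRORCODE <> 0 THEN .QUIT 8;
.LOGOFF;
.QUIT 0;
EOF
exit $?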


ODBC Windows: (screenshot omitted) On Windows, the setup is a standard DSN created with the Teradata ODBC driver in the ODBC Data Source Administrator.

ODBC Unix

When the PowerCenter server is running on Unix, ODBC is required to read (both sourcing and lookups) from Teradata. As with all Unix ODBC drivers, the key to configuring the Unix ODBC driver is adding the appropriate entries to the “.odbc.ini” file. To correctly configure the “.odbc.ini” file, there must be an entry under [ODBC Data Sources] that points to the Teradata ODBC driver shared library (tdata.sl on HP-UX; the standard shared library extension on other flavors of Unix).

The following example shows the required entries from an actual “.odbc.ini” file (note the path to the driver may be different on each computer):

[ODBC Data Sources]
dBase=MERANT 3.60 dBase Driver
Oracle8=MERANT 3.60 Oracle 8 Driver
Text=MERANT 3.60 Text Driver
Sybase11=MERANT 3.60 Sybase 11 Driver
Informix=MERANT 3.60 Informix Driver
DB2=MERANT 3.60 DB2 Driver
MS_SQLServer7=MERANT SQLServer driver
TeraTest=tdata.sl

[TeraTest]
Driver=/usr/odbc/drivers/tdata.sl
Description=Teradata Test System
DBCName=148.162.247.34

Similar to the client “hosts” file setup, one can specify multiple IP addresses for the DBCName to balance the client load across multiple Teradata nodes. Consult with the Teradata administrator for exact details on this (or copy the entries from the PC client’s “hosts” file; see the Client Configuration Basics section).

Important note: Make sure that the Merant ODBC path precedes the Teradata ODBC path information in the PATH and SHLIB_PATH (or LD_LIBRARY_PATH, etc.) environment variables. This is because both sets of ODBC software use some of the same file names. PowerCenter should use the Merant files because this is the software that has been certified.
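For example, in the profile of the user that starts pmserver (both installation paths below are hypothetical):

PATH=/opt/merant/odbc/bin:/usr/odbc/bin:$PATH
SHLIB_PATH=/opt/merant/odbc/lib:/usr/odbc/lib:$SHLIB_PATH
export PATH SHLIB_PATH
# The Merant entries come first; use LD_LIBRARY_PATH instead of
# SHLIB_PATH on Solaris/Linux.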

Teradata External Loaders

PowerCenter 7.1.2 supports four different Teradata external loaders: Tpump, FastLoad, MultiLoad, and Teradata Warehouse Builder (TWB). The actual Teradata loader executables (tpump, mload, fastload, tbuild) must be accessible to the PowerCenter Server, generally via the PATH environment variable. All of the Teradata loader connections require a value for the TDPID attribute; refer to the first section of this document to understand how to enter that value correctly. All of these loaders require:

- a load file (can be configured to be a stream/pipe and is auto-generated by PowerCenter)
- a control file of commands to tell the loader what to do (also auto-generated by PowerCenter)

All of these loaders will also produce a log file. This log file will be the means to debug the loader if something goes wrong. As these are external loaders, PowerCenter will only receive back from the loader whether it ran successfully or not.
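When a loader is run by hand (for example, from a post-session script as described later), the outcome can be checked from its return code; a sketch, with hypothetical file names, assuming the conventional Teradata utility return codes (0 = success, 4 = warning, 8/12 = error):

mload < td_test.out.ctl > td_test.out.ldrlog 2>&1
rc=$?
if [ $rc -ge 8 ]; then
    echo "MultiLoad failed (rc=$rc); see td_test.out.ldrlog" >&2
    exit $rc
fi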


By default, the input file, control file and log file will be created in the $PMTargetFileDir directory of the PowerCenter Server executing the workflow.


Any of these loaders can be used by configuring the target in the PowerCenter session to be a “File Writer” and then choosing the appropriate loader.


The auto-generated control file can be overridden. Click the “Pencil” icon next to the loader connection name.


Scroll to the bottom of the connection attribute list and click the value next to the “Control File Content Override” attribute. Then click the down arrow.


Click the Generate button and change the control file as you wish. The changed control file is stored in the repository.


Most of the loaders also use some combination of internal Work, Error and Log tables. By default, these will be in the same database as the target table. All of these can now be overridden in the attributes of the connection.

To land the input flat file that the loaders need on disk, the “is staged” attribute must be checked. If the “is staged” attribute is not set, the file will be piped/streamed to the loader.

If one selects the non-staged mode for a loader, one should also set the “checkpoint” property to 0. This effectively turns off the “checkpoint” processing. “Checkpoint” processing is used for recovery/restart of fastload and multiload sessions. However, if one is not using a physical file as input, but rather a named pipe, then the recovery/restart mechanism of the loaders does not work. Not only does this impact performance (i.e. the checkpoint processing is not free and we want to eliminate as much unnecessary overhead as possible), but a non-zero checkpoint value will sometimes cause seemingly random errors and session failures when used with named pipe input (as is the case with “streaming” mode).

Teradata Loader Requirements for PowerCenter Servers on Unix

All Teradata load utilities require a “non-null” standard output and standard error to run properly. “Standard output” (stdout) and “standard error” (stderr) are Unix conventions that determine the default location for a program to write output and error information. When you start the pmserver without explicitly defining stdout and stderr, both point to the current terminal session. If you log out of Unix, Unix redirects stdout and stderr to /dev/null (i.e. a placeholder that throws out anything written to it). At this point, Teradata loader sessions will fail because they do not permit stdout and stderr to be /dev/null.


Therefore, you must start pmserver as follows (first “cd” to the PowerCenter installation directory):

./pmserver ./pmserver.cfg > ./pmserver.out 2>&1

This starts the pmserver using the “pmserver.cfg” config file and points stdout and stderr to the file “pmserver.out”. In this way, stderr and stdout will be defined even after the terminal session logs out. Important note: There are no spaces in the token “2>&1”. This tells Unix to point stderr to the same place stdout is pointing.

As an alternative to this method, you can specify the console output file name in the pmserver.cfg file. That is, information written to “standard output” and “standard error” will go the file specified as follows:

ConsoleOutputFilename=<FILE_NAME>

With this entry in the pmserver.cfg file, one can start the pmserver normally (i.e. ./pmserver).

Partitioned Loading

With PowerCenter v7.x, if one sets a “round robin” partition point on the target definition and sets each target instance to be loaded using the same loader connection instance, then PowerCenter automatically writes all data to the first partition and only starts one instance of FastLoad or MultiLoad. You will know you are getting this behavior if you see the following entry in the session log:

MAPPING> DBG_21684 Target [TD_INVENTORY] does not support multiple partitions. All data will be routed to the first partition.

If you do not see this message, then chances are the session fails with the following error:

WRITER_1_*_1> WRT_8240 Error: The external loader [Teradata Mload Loader] does not support partitioned sessions.
WRITER_1_*_1> Thu Jun 16 11:58:21 2005
WRITER_1_*_1> WRT_8068 Writer initialization failed. Writer terminating.


Tpump

Tpump is an external loader that supports inserts, updates, upserts, deletes and data-driven operations. Multiple Tpump instances can execute simultaneously against the same table because Tpump does not use many resources or require table-level locks. It is often used to “trickle load” a table. As stated earlier, it is a faster way to update a table than ODBC, but it is not as fast as the other loaders.
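For reference, a minimal hand-coded Tpump script looks like the following sketch (PowerCenter generates an equivalent control file automatically; the tdpid, credentials, and names here are hypothetical):

.LOGTABLE infatest.tplog_td_test;
.LOGON demo1099/infatest,infatest;
.BEGIN LOAD SESSIONS 4 PACK 20 ERRORTABLE infatest.et_td_test;
.LAYOUT lay1;
.FIELD CUST_KEY 1 CHAR(12);
.FIELD CUST_NAME 13 CHAR(20);
.DML LABEL insDML;
INSERT INTO infatest.TD_TEST (CUST_KEY, CUST_NAME) VALUES (:CUST_KEY, :CUST_NAME);
.IMPORT INFILE td_test.out LAYOUT lay1 APPLY insDML;
.END LOAD;
.LOGOFF;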


MultiLoad

This is a sophisticated bulk load utility and is the primary method PowerCenter uses to load/update mass quantities of data into Teradata. Unlike bulk load utilities from other vendors, MultiLoad supports inserts, updates, upserts, deletes and data-driven operations in PowerCenter. You can also use variables and embed conditional logic in MultiLoad scripts. It is very fast (millions of rows in a few minutes). It can be resource intensive and will take a table lock.

Cleaning up after a failed MultiLoad: MultiLoad supports sophisticated error recovery. That is, it allows load jobs to be restarted without having to redo all of the prior work. However, for the types of problems normally encountered during a POC (loading null values into a column that does not support nulls, incorrectly formatted date columns), the error recovery mechanisms tend to get in the way. To learn about MultiLoad’s sophisticated error recovery read the MultiLoad manual. To learn how to work around the recovery mechanisms to restart a failed MultiLoad script from scratch, read this section.


MultiLoad puts the target table into the “MultiLoad” state. Upon successful completion, the target table is returned to the “normal” (non-“MultiLoad”) state. Therefore, when a MultiLoad fails for any reason, the table is left in “MultiLoad” state, and one cannot simply re-run the same MultiLoad. MultiLoad will report an error. In addition, MultiLoad also queries the target table’s MultiLoad log table to see if it contains any errors. If a MultiLoad log table exists for the target table, then you also will not be able to rerun your MultiLoad job.

To recover from a failed MultiLoad, one must “release” the target table from the “MultiLoad” state and also drop the MultiLoad log table. One can do this using BTEQ or QueryMan to issue the following commands:

drop table mldlog_<table name>;
release mload <table name>;

Note: The “drop table” command assumes that you are recovering from a MultiLoad script generated by PowerCenter (PowerCenter always names the MultiLoad log table “mldlog_<table name>”). If you are working with a hand-coded MultiLoad script, the name of the MultiLoad log table could be anything.

Here is the actual text from a BTEQ session which cleans up a failed load to the table “td_test” owned by the user “infatest”:

BTEQ -- Enter your DBC/SQL request or BTEQ command:
drop table infatest.mldlog_td_test;

drop table infatest.mldlog_td_test;

*** Table has been dropped.
*** Total elapsed time was 1 second.

BTEQ -- Enter your DBC/SQL request or BTEQ command:
release mload infatest.td_test;

release mload infatest.td_test;

*** Mload has been released.
*** Total elapsed time was 1 second.

Using one instance of MultiLoad to load multiple tables

MultiLoad is a big consumer of resources on a Teradata system. Some systems will have hard limits on the number of concurrent MultiLoad sessions allowed. By default, PowerCenter will start an instance of MultiLoad for every target file. Sometimes this is illegal (if the multiple instances target the same table); other times, it is just expensive. Therefore, a prospect may ask that PowerCenter use a single instance of MultiLoad to load multiple tables (or to load both inserts and updates into the same target table). To make this happen, we are back to heavy editing of the generated MultiLoad script file. Note: This should not be an issue with Tpump because Tpump is not as resource intensive as MultiLoad (and multiple concurrent instances of Tpump can target the same table). Here is the workaround:

1) Use a dummy session (i.e. set test rows to 1 and target a test database) to generate MultiLoad control files for each of the targets.

2) Merge the multiple control files (one per target table) into a single control file (one for all target tables).

3) Configure the session to call MultiLoad from a post-session script using the control file created in step 2. Integrated support cannot be used because each input file is processed sequentially, and this causes problems when combined with PowerCenter’s integrated named pipes and streaming.

Details on “merging” the control files:

1) There is a single log file for each instance of MultiLoad. Therefore, you do not have to change or add anything to the “LOGFILE” statement. However, you might want to change the name of the log table since it may be a log that spans multiple tables.

2) Copy the work and error table delete statements into the common control file.

3) Modify the “BEGIN MLOAD” statement to specify all the tables that the MultiLoad will be hitting.


4) Copy the “Layout” sections into the common control file and give each a unique name. Organize the file such that all the layout sections are grouped together.

5) Copy the “DML” sections into the common control file and give each a unique name. Organize the file such that all the DML sections are grouped together.

6) Copy the “Import” statements into the common control file and modify them to reflect the unique names created for the referenced LAYOUT and DML sections created in steps 4) and 5). Organize the file such that all the Import sections are grouped together.

7) Run “chmod -w” on the newly minted control file so PowerCenter does not overwrite it, or, better yet, name it something different so PowerCenter cannot overwrite it.

8) It’s just that easy!!! Also remember, a single instance of MultiLoad can target at most 5 tables. Therefore, don’t combine more than 5 target files into a common file.

Here’s an example of a control file merged from two default control files:

.DATEFORM ANSIDATE;

.LOGON demo1099/infatest,infatest;

.LOGTABLE infatest.mldlog_TD_TEST;

DROP TABLE infatest.UV_TD_TEST ;
DROP TABLE infatest.WT_TD_TEST ;
DROP TABLE infatest.ET_TD_TEST ;

DROP TABLE infatest.UV_TD_CUSTOMERS ;
DROP TABLE infatest.WT_TD_CUSTOMERS ;
DROP TABLE infatest.ET_TD_CUSTOMERS ;

.ROUTE MESSAGES WITH ECHO TO FILE

c:\LOGS\TgtFiles\td_test.out.ldrlog ;

.BEGIN IMPORT MLOAD TABLES infatest.TD_TEST, infatest.TD_CUSTOMERS ERRLIMIT 1 CHECKPOINT 10000 TENACITY 10000 SESSIONS 1 SLEEP 6;

/* Begin Layout Section */

.Layout InputFileLayout1;

.Field CUST_KEY 1 CHAR( 12) NULLIF CUST_KEY = '*' ;

.Field CUST_NAME 13 CHAR( 20) NULLIF CUST_NAME = '*' ;

.Field CUST_DATE 33 CHAR( 10) NULLIF CUST_DATE = '*' ;

.Field CUST_DATEmm 33 CHAR( 2) ;

.Field CUST_DATEdd 36 CHAR( 2) ;

.Field CUST_DATEyyyy 39 CHAR( 4) ;

.Field CUST_DATEtd CUST_DATEyyyy||'/'||CUST_DATEmm||'/'||CUST_DATEdd NULLIF CUST_DATE = '*' ;
.Filler EOL_PAD 43 CHAR( 2) ;


.Layout InputFileLayout2;

.Field CUSTOMER_KEY 1 CHAR( 12) ;

.Field CUSTOMER_ID 13 CHAR( 12) ;

.Field COMPANY 25 CHAR( 50) NULLIF COMPANY = '*' ;

.Field FIRST_NAME 75 CHAR( 30) NULLIF FIRST_NAME = '*' ;

.Field LAST_NAME 105 CHAR( 30) NULLIF LAST_NAME = '*' ;

.Field ADDRESS1 135 CHAR( 72) NULLIF ADDRESS1 = '*' ;

.Field ADDRESS2 207 CHAR( 72) NULLIF ADDRESS2 = '*' ;

.Field CITY 279 CHAR( 30) NULLIF CITY = '*' ;

.Field STATE 309 CHAR( 2) NULLIF STATE = '*' ;

.Field POSTAL_CODE 311 CHAR( 10) NULLIF POSTAL_CODE = '*' ;

.Field PHONE 321 CHAR( 30) NULLIF PHONE = '*' ;

.Field EMAIL 351 CHAR( 30) NULLIF EMAIL = '*' ;

.Field REC_STATUS 381 CHAR( 1) NULLIF REC_STATUS = '*' ;

.Filler EOL_PAD 382 CHAR( 2) ;

/* End Layout Section */

/* begin DML Section */

.DML Label tagDML1;
INSERT INTO infatest.TD_TEST ( CUST_KEY , CUST_NAME , CUST_DATE ) VALUES ( :CUST_KEY , :CUST_NAME , :CUST_DATEtd ) ;

.DML Label tagDML2;
INSERT INTO infatest.TD_CUSTOMERS ( CUSTOMER_KEY , CUSTOMER_ID , COMPANY , FIRST_NAME , LAST_NAME , ADDRESS1 , ADDRESS2 , CITY , STATE , POSTAL_CODE , PHONE , EMAIL , REC_STATUS ) VALUES ( :CUSTOMER_KEY , :CUSTOMER_ID , :COMPANY , :FIRST_NAME , :LAST_NAME , :ADDRESS1 , :ADDRESS2 , :CITY , :STATE , :POSTAL_CODE , :PHONE , :EMAIL , :REC_STATUS ) ;

/* end DML Section */

/* Begin Import Section */

.Import Infile c:\LOGS\TgtFiles\td_test.out Layout InputFileLayout1 Format Unformat Apply tagDML1 ;

.Import Infile c:\LOGS\TgtFiles\td_customers.out Layout InputFileLayout2 Format Unformat Apply tagDML2;

/* End Import Section */

.END MLOAD;

.LOGOFF;


Multiple workflows that MultiLoad to the same table

Since MultiLoad puts a lock on the table, all MultiLoad sessions must handle wait events so they do not try to access the table simultaneously. Also, any log files should be given unique names for the same reason.
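One low-tech way to satisfy both requirements is to wrap the loader call in a script that serializes access with a lock and writes a uniquely named log; a sketch, with hypothetical paths and file names:

LOCKDIR=/tmp/mload_td_test.lock
until mkdir "$LOCKDIR" 2>/dev/null; do
    sleep 30    # another MultiLoad holds the table; wait
done
mload < td_test.ctl > td_test.$$.ldrlog 2>&1    # $$ makes the log name unique
rc=$?
rmdir "$LOCKDIR"
exit $rc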

FastLoad

As the name suggests, this is a very fast utility to load data into Teradata; in fact, it is the fastest method to load data into Teradata. However, there is one major restriction: the target table must be empty.
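Because of this restriction, a common pattern is a pre-session command that empties the target first; a minimal BTEQ sketch (the tdpid, credentials, and table name are hypothetical):

bteq <<'EOF'
.LOGON demo1099/infatest,infatest;
DELETE FROM infatest.TD_TEST ALL;
.IF ERRORCODE <> 0 THEN .QUIT 8;
.LOGOFF;
.QUIT 0;
EOF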


Teradata Warehouse Builder (TWB)

Teradata Warehouse Builder (TWB) is a single utility that was intended to replace FastLoad, MultiLoad, Tpump and FastExport. It was to support a single scripting environment with different “modes”, where each “mode” roughly equates to one of the legacy utilities. It also was to support parallel loading (i.e. multiple instances of a TWB client could run and load the same table at the same time, something the legacy loaders cannot do). PowerCenter supports TWB. Unfortunately, NCR/Teradata does not: TWB has never been formally released (never went “GA”). According to NCR, the release was delayed primarily because of issues with the mainframe version. If you find a prospect willing to use TWB, please do use it; its ability to support parallel load clients makes some things quite a bit easier.


PowerCenter 7.1.2 PAM for Teradata

Platform: NCR Teradata, on Unix and NT. Every server/client pairing below is supported for both Source and Target.

Server Version | Client Software & Version
V2R4           | TUF 6.1, TTU 6.1.1, TTU 6.2
V2R4.1         | TUF 6.1, TTU 6.1.1, TTU 6.2, TTU 7
V2R5           | TTU 7
V2R5.1         | TTU 7
V2R6           | TTU 8 (supported Teradata ODBC driver is v3.0.4.1)

PowerCenter uses the following components of the Teradata Tools and Utilities (TTU): the Teradata ODBC driver, FastLoad, MultiLoad, and Tpump. The version numbers of the individual Teradata client components vary with each release (e.g. TTU 7 contains ODBC 3.02, FastLoad 7.05, MultiLoad 3.03, and Tpump 1.07). The Teradata Tools and Utilities (TTU) were previously called the Teradata Utilities Foundation (TUF). Compatibility between a particular version of the Teradata RDBMS and the Teradata client software is determined by Teradata -- not Informatica.

The Teradata Client/Teradata RDBMS pairings listed here represent our understanding based on Teradata's documentation. Note that the minimum version number for the Teradata ODBC driver is 3.00.01.04. Teradata has made many fixes to the 3.02 ODBC driver; if you are using this driver, please contact NCR support for the latest maintenance release.


Future Support of Teradata:

FastExport

FastExport is just as its name implies: a utility to extract data from Teradata very quickly. It will be supported in PowerCenter 7.1.3 for Teradata sources. It should extract large tables used for lookups more quickly than ODBC.
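Until that support arrives, FastExport can be run from a script; a minimal hand-coded sketch (the tdpid, credentials, and names are hypothetical):

fexp <<'EOF' > td_customers.exp.log 2>&1
.LOGTABLE infatest.fxlog_td_customers;
.LOGON demo1099/infatest,infatest;
.BEGIN EXPORT SESSIONS 4;
.EXPORT OUTFILE td_customers.dat;
SELECT CUSTOMER_KEY, COMPANY FROM infatest.TD_CUSTOMERS;
.END EXPORT;
.LOGOFF;
EOF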

Pushdown

This is also commonly referred to as ELT. The Zeus release of PowerCenter will be able to generate SQL that executes in the Teradata database server, replacing certain transformations that would normally run in the PowerCenter Server. This is critical for Teradata because it is common for the Source and Target to reside in the same database.
