
    Best-Practice ETL for KALIDO 8 using

    Ascential DataStage

    October 2004

    Gary Powell

    Senior KALIDO Application Consultant


Table of contents

1 Introduction
    The need for data integration tools in an enterprise data warehouse
2 How ETL changes in a KALIDO environment
3 Understanding the ETL target
4 Best Practice Techniques
4.1 Extracting data from the source
    Working with the business model
    Full or delta extraction
    Handling time-variant reference data using transaction date
    Handling changing codes
4.2 Transforming data
    Data summarization and allocation
    Currency conversions
    Writing reusable transforms
4.3 Loading data
4.4 Scheduling and job sequencing
    Minimize build and maintenance requirements
    Trap and handle errors caused by process failure
    Optimizing performance
5 Conclusions


1 Introduction

Global 2000 companies now face greater regulatory and shareholder pressure than ever to

    increase corporate accountability, transparency and performance. As a result, they are

    creating enterprise data warehouses which enable multiple, corporate-wide views of business

    performance across disparate systems and organizations.

    Unfortunately, it takes months (12-18 is not uncommon) to custom build or modify a data

    warehouse, and many are never even completed. As a result, business people fail to receive

    the timely management information they need to make high-quality decisions.

    To address this challenge, companies are taking a more iterative approach to enterprise data

    warehousing using KALIDO, which automates the creation and maintenance of enterprise

    data warehouses and master data throughout their lifecycle. The KALIDO application suite

    automatically adapts data warehouses and their associated master data to new business needs

    based on changes made to real-world business models. Kalido customers create and modify

    data warehouses 75% faster and at half the cost of traditional approaches.

    The need for data integration tools in an enterprise data warehouse

    A data warehouse typically sources data from a wide range of IT systems and organizations.

    To help deliver this data to the data warehouse, companies deploy tools such as Ascential

    Enterprise Integration Suite, which includes:

• Ascential ProfileStage - data profiling to evaluate source data content and structure

• Ascential QualityStage - data cleansing to find & reconcile low-quality or redundant data

• Ascential DataStage - data extraction, transformation and loading (ETL)

• Ascential MetaStage - metadata management for definitions and history of business data

    Figure 1 - Kalido and Ascential Products

[Diagram: data from any source - CRM, ERP, SCM, RDBMS, legacy systems, EAI/messaging, web services, XML/EDI and legacy data warehouses - flows through the Ascential Enterprise Integration Suite (ProfileStage, QualityStage and DataStage on a parallel execution engine, with shared metadata management and real-time integration services) into the KALIDO data warehouse, whose adaptive services core auto-builds the schema and content from the business model and master data and in turn builds universes, cubes, marts and reports for business intelligence.]


    This paper outlines a best-practice approach for using DataStage with KALIDO. As you will

    see, the use of DataStage differs from a custom-build environment, with less emphasis on

    data transformation and loading, but more emphasis on easy and efficient extraction from

    multiple data sources, and the management of data as it is moved into the data warehouse.

In the rest of this paper a basic familiarity with standard ETL and data warehousing concepts is assumed; however, specific knowledge of KALIDO or DataStage is not required.


2 How ETL changes in a KALIDO environment

A good way of thinking about ETL in a KALIDO data warehouse environment is that it allows you to work with data at a logical level rather than at the physical level.

Typically, a data warehouse is built around a logical data model, which characterizes data as a number of related entities. This is then implemented as a physical data model, consisting of a number of database tables. There are many ways to map the logical model to a physical data model, depending on requirements such as speed of retrieval and ease of maintenance.

Here is a logical data model drawn in KALIDO notation, a business-friendly way of expressing logical data models for enterprise data warehouses. KALIDO refers to these as business models. Fact data is located in the ovals, and reference data hierarchies are located in the boxes. Arrows between reference data signify "is classified by", with dotted arrows meaning the classification is optional. The green box means "classification can be to either level".

[Diagram: reference data hierarchies for Product (Packed Product, Brand, Pack Type, Product Group), Time (Day, Month, Quarter, Year), Customer (Customer Account, Delivery Point, Region) and Industrial Classification (Industry, Industry Group), with fact ovals for Product Sales (Volume, Revenue, Distribution cost) and Target Sales (Target volume).]

Figure 2 - Business Model (Logical Data Model)

Customarily, physical schemas (usually star or snowflake) are built from this model:

[Diagrams: the star schema holds Product, Customer and Time dimension tables joined to the Product Sales and Target Sales fact tables; the snowflake schema breaks each dimension into separate attribute and association tables (for example Packed Product, Brand, Pack Type and Product Group attributes) joined to the same fact tables.]

Figure 3 - Star Physical Model    Figure 4 - Snowflake Physical Model


    The role of DataStage in a custom-built data warehousing environment is to get the data from

    the source systems into the correct tables in the data warehouse as defined by the physical

    model. The architecture is as follows:

Figure 5 - Role of DataStage in a traditional data warehousing environment

    In this architecture, DataStage is tightly coupled with the physical data warehouse design.

    This has three important implications:

    1. Physical data models often change during implementation even though the business model

    does not. For instance, as data warehouse volume grows it may be necessary to move

    from a snowflake schema - which is easy to populate but inefficient to query - to a star

    schema - which is easy to query but harder to populate. This usually involves considerable

    rework of the associated DataStage jobs.

    2. As the business model changes over time, the physical data model generally becomes

    more complicated due to the need to maintain historical versions of the model. DataStage

    jobs become correspondingly more complicated as a result.

    3. Physical data models require surrogate keys for most data entities, which are used instead

    of the natural unique identifier for an object (such as Payroll # for an employee) because

    they enable the object ID to be preserved over time even if the natural key changes.

    DataStage has to manage the mapping between surrogate keys in the data warehouse

    and natural keys in the source, which adds development overhead to each DataStage job.

Now let's see what happens when DataStage is used in conjunction with a KALIDO data warehouse. The great strength of KALIDO is that it allows its users to work at the business model level - it transparently (without any user intervention) maps the business model onto a physical set of tables and manages the interface between them. Only the KALIDO business model is exposed to DataStage, and the role of ETL is simplified because:

• KALIDO maps between the logical and physical layer, so physical design changes in the data warehouse do not affect DataStage jobs.

• KALIDO handles changes to the business model over time, allowing historical and current models to exist concurrently. DataStage jobs are only required to support the current business model.

• KALIDO maps natural keys to surrogates. DataStage jobs only need to supply natural keys.



    In practice, this means the role of DataStage is to provide a list of the instances of each object

    defined in the business model. By "instance" we mean a specific Customer, Product, or Sales

Transaction. KALIDO loads this list into the data warehouse, performing the relevant logical-to-physical mappings internally as it processes the data. This is illustrated in Figure 6:

    Figure 6 - DataStage in a KALIDO architecture

    Figure 6 shows how DataStage, instead of sending data directly to the data warehouse as it

    would be done in a traditional, custom-built data warehousing architecture, now sends it to a

    staging area. The staging area is just a place for storing the list of instances for each object

    prior to loading into the KALIDO data warehouse, and is usually a collection of tables in their

    own schema in the data warehouse database.

In the staging area, the objects are stored without any associations, which illustrates another key difference between KALIDO and a traditional warehouse, namely that application logic is managed by KALIDO and not by DataStage. Validation rules such as 'every Delivery Point must refer to a Customer' are encoded into the KALIDO business model, and KALIDO verifies

    must refer to a Customer' are encoded into the KALIDO business model, and KALIDO verifies

    these rules as it receives the data. Therefore, although the Delivery Point list in the staging

    area has a column referencing the Customer list, the DataStage job does not have to check

    the validity of the references. This is another major simplification over the custom

    environment where much of the application logic is encoded in the ETL layer.

    So, KALIDO dramatically reduces the complexity of ETL jobs by handling the logical to physical

    data model mappings centrally rather than relying on each ETL job to deal with these

    mappings locally. Does this mean it also reduces the need for DataStage? No! In fact, the

    need for DataStage has increased. This is because KALIDO delivers a unique benefit that is

    almost unheard of in traditional data warehousing - very rapid response to change. As new

    systems are integrated, as the scope of the data warehouse increases, as the business model

    evolves, KALIDO quickly adapts to the changes. This puts great pressure on the ETL tool to

    provide up-to-date data. If data cannot be extracted from a new source, then whether the

    data warehouse is ready to receive it is irrelevant.



    DataStage is designed for rapid development, and its functionality complements KALIDO:

• Ascential PACKs for SAP, PeopleSoft, Siebel, Oracle and other enterprise applications. The predefined interfaces, transformations and metadata models offered in the PACKs make it easier to find and use the relevant data.

• DataStage graphical design environment. Greater productivity over manual programming, greater reuse and manageability. Except for the most complex transformations, it is much easier to work with a visual data flow representation than lines of code.

• DataStage performance. As the KALIDO data warehouse scope increases, more data is processed in shorter timeframes. DataStage has a parallel engine that scales as data volume increases.

• DataStage scheduling and sequencing. Moving data in and out of KALIDO data warehouses requires the management of many interdependent processes. DataStage has a graphical tool for building sequences with conditional dependencies, error handling and parallel execution.

DataStage and KALIDO are highly complementary, allowing for simpler, iterative RAD-style development. However, to get the maximum benefit they must be integrated properly. To do this, the ETL target structure must be understood, which is described next.


3 Understanding the ETL target

The target of your ETL jobs in the context of a KALIDO data warehouse is a staging area, and each job delivers a list of instances for one object in the business model. KALIDO loads the lists into the data warehouse.

In KALIDO, a staging-area-to-warehouse mapping (called a load definition) is defined for each object. There are two load definition types: one for reference data (such as Customers and Products) and the other for fact data (Product Sales). There are slight differences between the two. Looking at a reference data load definition, KALIDO uses a similar style of interface to DataStage for this type of definition, so it should appear straightforward:

    Figure 7 - KALIDO reference data load definition

    The right-hand panel displays the object from the business model. The left-hand and central

    panes show the mappings between the column headings of the input table in the staging area

    and the business model object components. Every object has at minimum two components:

• A code - the natural code from the source system, which will be unique among the instances of that object

• A name - which describes the instance, and doesn't need to be unique

    Objects may also have parents or additional attributes. In this case, the model defines Packed

    Product as having three mandatory parents: Brand, Pack Type and Product Group, so these

    must be included as part of the mapping.

    Every definition also has a Transaction Date. Over time, reference data changes - new

    Customers are added, Products are assigned to different Brands, etc. The transaction date is

    simply the date when this change occurred.

    The column names on the left are the field names in the staging table. They can be anything

    as long as they match the name given in the mapping. Alternative field names might be:

    Figure 8 - KALIDO reference data load definition, alternative field names


    This second style of field names can be very useful when reusing generic staging area tables

    and DataStage jobs for more than one object, which will be described later.

    Load definitions for fact data are very similar. Here is a definition for Product Sales:

    Figure 9 - KALIDO fact data load definition

    This mapping shows the staging area table has three columns containing the natural codes for

    the reference data associated with the sale: the Delivery Point code, Packed Product code and

Sale Day. It also has the numeric data associated with the transaction: Distribution Costs, Revenue and Volume. Lastly, it has a transaction date. The Transaction Date serves a slightly

    different purpose from the day of Sale - it relates the transaction to the time-variant reference

    data. Suppose the name of a particular product has recently changed; because this is a fact

    data load definition we only supply the code for the product and not its name. The transaction

    date then allows us to determine if the sale occurred before or after the name change.

This is essentially all that's necessary to know about the ETL targets - straightforward tables with one row per instance of an object, one column per object component, and a date column that lets KALIDO track how the reference data is changing over time.
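To make the shape of this target concrete, here is a minimal Python sketch (not DataStage or KALIDO code) of building such a reference-data staging table for Packed Product; the table name, column names and sample values are illustrative assumptions rather than KALIDO system labels.

```python
# Minimal sketch of a reference-data staging table: one row per instance,
# one column per business-model component, plus a transaction date.
# All names and values below are illustrative assumptions.
import csv
from datetime import date

STAGING_COLUMNS = [
    "PACKED_PRODUCT_CODE",   # natural code, unique among instances
    "PACKED_PRODUCT_NAME",   # descriptive name, need not be unique
    "BRAND_CODE",            # mandatory parent
    "PACK_TYPE_CODE",        # mandatory parent
    "PRODUCT_GROUP_CODE",    # mandatory parent
    "TRANSACTION_DATE",      # date the reference data change took effect
]

rows = [
    {
        "PACKED_PRODUCT_CODE": "PP-1001",
        "PACKED_PRODUCT_NAME": "Cola 330ml Can",
        "BRAND_CODE": "COLA",
        "PACK_TYPE_CODE": "CAN330",
        "PRODUCT_GROUP_CODE": "SOFT",
        "TRANSACTION_DATE": date(2004, 10, 1).isoformat(),
    },
]

with open("stg_packed_product.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=STAGING_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```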

That's all the background needed to discuss best-practice techniques for using DataStage with KALIDO. This is what we turn to for the rest of the paper.


4 Best Practice Techniques

This section deals with best practices for use of KALIDO in conjunction with DataStage and is broken down into the four functional areas of ETL:

1. Extracting data from any source

2. Transforming data

3. Loading into the staging area

4. Process control and scheduling

4.1 Extracting data from the source

Working with the business model

    Data extraction is driven by the KALIDO business model, which defines the objects, mandatory

    and optional object components, and relationships to other objects. For each object there is a

    corresponding ETL output, and the DataStage Designer has to locate in the source systems

    data corresponding to the object components. In some cases this is easier than others - an

    object in the model may come from a single table in a single source system, or it may be

    composed of records and fields spread across several tables and systems.

    Kalido recommends that you do not start extensive ETL development until the KALIDO

    business model is stable. Allow business modelers to sign off on a release that will be used for

    the initial ETL jobs. During this exploratory stage, DataStage and other tools in the Ascential

    Enterprise Integration Suite will be used to verify the correctness of the model, but it's too

    early to begin production development. Even after an initial version of the model has been

    signed off, it will continually change throughout the lifecycle of the data warehouse. Because

    of this, you should establish a clearly understood and enforced change-control process to track

    all modifications.

    By default, you should only create one extract per object. When you prototype a KALIDO data

    warehouse, it is possible to create KALIDO load definitions that combine objects such as

    multiple levels of a hierarchy or a mix of reference data and fact data. In production it is

    strongly recommended that the designer follows the rule of creating one extract per object.

    Over the lifecycle of the data warehouse the uniformity and consistency of one extract per

    object pays dividends because the ETL components are easier to understand, reuse and

    modify.

    Full or delta extraction

    When extracting data, a key decision is whether to extract all object instances, or just new or

    modified instances. For example, there may be one million customers, and each week about

    five thousand customers are added and a few hundred change their address or other details.

    DataStage can pass all one million Customer records to KALIDO and ask it to calculate what's

    different, or the changes can be calculated during the ETL and only these passed to KALIDO.

    Which is best? This depends on which is fastest and also which is easiest to implement.

For some objects, detecting changes is easy - a 'last modified' field or a change log at the

    source can be used to filter records. In this case, change detection should be done at the

    source. If there is no obvious way of filtering the data, the designer has to weigh the relative


    merits of building a change detection algorithm in the ETL versus passing the full dataset to

    KALIDO. Generally, delta detection is done in DataStage if there are a great number of

    records. In our example of a million customer records, with 0.5% changing per week, it is

    likely that the performance benefits of building a change detection algorithm into the ETL job

    outweigh development costs. This is because if delta detection is done by KALIDO, then before

    it starts processing, one million records have to be extracted from the source, passed through

    DataStage, and loaded into staging tables. Delta detection at the source avoids this overhead.

    There are many algorithms for change detection. A simple approach is to concatenate the

    component values of an object into a string and compare it to the same string when the object

    was last extracted. Another is to use the SQL 'minus' operator which subtracts one set of

    records from another leaving just the differences. The best algorithm will depend on the exact

    circumstances - if in doubt Kalido consultants can provide experienced advice.
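As an illustration of the 'concatenate and compare' approach, here is a minimal Python sketch (not DataStage syntax); the function names, record structure and choice of hash are assumptions made for the example.

```python
# Minimal sketch of concatenate-and-compare delta detection: keep a fingerprint
# of each record from the previous extract and emit only new or changed records.
import hashlib

def fingerprint(record: dict, components: list[str]) -> str:
    # Concatenate the component values in a fixed order and hash the result.
    joined = "|".join(str(record.get(c, "")) for c in components)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def detect_deltas(current: list[dict], previous_fps: dict[str, str],
                  key: str, components: list[str]) -> list[dict]:
    deltas = []
    for rec in current:
        fp = fingerprint(rec, components)
        if previous_fps.get(rec[key]) != fp:   # new instance or changed values
            deltas.append(rec)
        previous_fps[rec[key]] = fp            # remember for the next run
    return deltas
```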

    In contrast to reference data, fact data does not often change after it has been created. A

    Product name may slowly change over time, but fact data is typically a point-in-time event -

    either a Product was sold on a particular day or it wasn't. If fact data does change - in our

    model we have a fact table called 'Target Sales' which may undergo revisions - KALIDO

handles changes similarly to reference data. Given the high data volumes involved, this type of delta detection would normally be done during the ETL.

    Handling time-variant reference data using transaction date

    The correct handling of time-variant reference data is essential for the delivery of meaningful

    business intelligence. If we reclassify a product as a different brand, all new product sales will

    be recorded against the new brand. A report by brand will show a sudden drop in revenue for

    one brand and a rise in another. But, to understand the yearly growth of brand sales,

    management will need at least two other versions of this report:

• A report as if the product had remained in the old brand

• A report as if the product had always been part of the new brand.

    Figure 10 - The importance of time variance

[Chart: quarterly brand sales from Q3-01 to Q2-04, annotated to show the point at which the new product was added to the brand and the brand sales line excluding the new product.]


    Figure 10 illustrates the importance of time variance to the business. In a traditional, custom-

    built data warehouse, including fully flexible time variance makes the physical schema much

    more complicated, with a corresponding rise in the complexity of DataStage jobs. With

    KALIDO, time variance is handled automatically for all reference data. All the ETL has to do is

    to provide KALIDO with a timestamp indicating the date of any reference data changes.

    So what value do we choose for the transaction date? There are two types of changes we need

    to consider - new reference data coming into existence, and modifications to existing data. In

    practice, the dates we use for these changes are closely tied in with the fact data - a Product

    must come into existence before it can be sold, and if a product has been re-branded, the sale

    must have a transaction date that corresponds to the correct brand of the product at the time.

    Ideally there will be some date field in the source that can be used for this purpose. If no

suitable date exists, our recommendation is as follows (a minimal sketch of the rule follows this list):

• For creation dates use a constant historical date, such as 1/1/2000 - it generally does not matter if a Product is deemed to have been created some time before it is sold, whereas the opposite will cause a data quality error.

• For modifications use the extract date. If we extract Product data nightly it is normally sufficient to record that the change happened sometime during that day. For greater accuracy increase the extract frequency.
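Here is the minimal sketch mentioned above; the constant date and parameter names are assumptions made for illustration.

```python
from datetime import date

CONSTANT_CREATION_DATE = date(2000, 1, 1)  # assumed constant historical date

def transaction_date(source_date, is_new_instance, extract_date):
    """Pick the transaction date for a reference data record."""
    if source_date is not None:      # prefer a genuine date from the source
        return source_date
    if is_new_instance:              # creations: constant historical date
        return CONSTANT_CREATION_DATE
    return extract_date              # modifications: the extract date
```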

    Handling changing codes

    A final consideration during extraction is to make sure that a data item can still be correctly

    identified by the data warehouse if its natural code changes. KALIDO must be told about the

    code change, and this can be done via a load definition in the normal way. If you look back to

    the load definition in Figure 7, you'll see the component 'New Packed Product Code' - that's

    what this is used for. You need to make sure that any extracts loaded into KALIDO before the

    code change use the old code, and all subsequent extracts use the new code.

Summary

• Design DataStage jobs around the business model with one extract per object - use a formal change control process to track changes to the model

• For objects with a million or more records, do delta detection during the ETL stage

• If no timestamp exists in the source for changing reference data, use a constant date for the creation date of new objects, and use the extract date for modifications

• Check for changing natural codes and make sure changes are loaded into KALIDO before using the new codes

4.2 Transforming data

Transformation typically refers to the transformation of data from the physical schema of the source to the physical schema of the target. This has already been discussed, so this section concentrates on additional topics relating to transformation.

    Data summarization and allocation

    Customarily, data warehouses feature a host of transformations that summarize data up the

    levels of a hierarchy, and allocate fact data to lower levels. This changes with KALIDO.


    For example, daily sales totals from an ERP system are summarized into weekly totals. This

    can be done before or after loading into the data warehouse. In a custom environment, the

    ETL effort is typically the same - DataStage does the summarization in either case. With

    KALIDO, summarizations are performed inside the data warehouse. KALIDO has extensive

    tools to efficiently summarize data to the level required for business reporting after loading

    into the data warehouse. Therefore ETL summarization is only required when summarizing

data prior to loading into the data warehouse.

    When we summarize prior to loading we lose information. The normal advice is to load data

    into the data warehouse at the lowest level of granularity available and summarize it in

    KALIDO. KALIDO copes well with large fact data volumes - there are multi-terabyte KALIDO

    implementations and also implementations that load millions of fact records per day. Data only

    needs to be summarized before loading if volumes are exceptionally high. A good compromise

    is to load the most recent data at the lowest level and use KALIDO to summarize historic data

    to a higher level and purge the low level data. So all data for the current year may be held at

    the day level, but previous years will be stored at the weekly level. Careful architecting like

    this within KALIDO usually enables data to be loaded at the lowest level available regardless of

    the data volumes.

    The opposite of a summarization is an allocation. In an allocation we process fact data to a

    lower level than it exists in the source system. An example is an algorithm that takes

    marketing costs per brand and allocates them across all the products in that brand as part of

calculating unit cost per product. Or a manager's salary may be allocated across his or her

    team members to estimate the true cost of hiring new employees.

    KALIDO does not have built-in allocation functionality. Therefore, allocations must be done

    during the ETL or reporting stages. Doing allocations in reports is possible but often requires

    an expert report developer, whereas doing allocations during ETL is generally straightforward.

    KALIDO is often used as the source for allocations because, as the central repository of data

from across the business, KALIDO already holds the necessary fact data in the most accessible form. DataStage queries KALIDO for the source data, performs the allocations, and loads the

    results back into KALIDO as additional facts.
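As a simple illustration of allocation logic that could sit in the ETL layer, here is a hedged Python sketch; the even-split rule and data structures are assumptions made for the example, not a KALIDO or DataStage feature.

```python
# Minimal sketch: spread a marketing cost held at Brand level evenly across the
# Products in that brand. A real allocation rule might weight by sales volume.

def allocate_brand_cost(brand_costs: dict[str, float],
                        products_by_brand: dict[str, list[str]]) -> dict[str, float]:
    per_product = {}
    for brand, cost in brand_costs.items():
        products = products_by_brand.get(brand, [])
        if not products:
            continue
        share = cost / len(products)           # even allocation across products
        for product in products:
            per_product[product] = per_product.get(product, 0.0) + share
    return per_product

# Example: a 9,000 marketing spend for brand "COLA" allocated across 3 products.
costs = allocate_brand_cost({"COLA": 9000.0},
                            {"COLA": ["PP-1001", "PP-1002", "PP-1003"]})
# costs == {"PP-1001": 3000.0, "PP-1002": 3000.0, "PP-1003": 3000.0}
```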

    There is no need to do allocations or summarizations just to bring all data to the same level

    before loading into the data warehouse. The different sources are modeled as separate objects

    in the business model, and KALIDO selects and summarizes the data as needed to satisfy

    specific reports. Look at the business model in Figure 2 - we see that Target Sales are loaded

    at a quarterly level but Product Sales at a daily level. We do not need to do anything at the

    ETL stage to allow us to compare the two - KALIDO automatically summarizes actual sales to

    the quarterly level before passing to the reporting tool.

    Currency conversions

    A common transformation is to convert fact data from one currency to another. KALIDO has

    extensive built-in currency conversion functions, so the rule here is to take data in the original

    currency of the source system, supply KALIDO with tables of exchange rate data, and let

    KALIDO perform the currency conversion during report generation. KALIDO can also convert

    units of measure - for instance ounces to grams.


    Writing reusable transforms

    In a non-KALIDO environment, the need to work at the physical schema level leads to

    DataStage jobs that are difficult to reuse. With KALIDO, the physical schema of the staging

    area is very simple and is the same for all jobs. Hence the potential for reuse is great. Many

    transformations will apply across multiple ETL tasks, such as transforming dates into a

common format. Transforms should be written to be reused across jobs. DataStage has a variety of methods for doing this, the simplest being to write the transformation as a function.
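For example, a date-normalization transform written once and reused by every job might look like the following Python sketch (in DataStage this would typically be a shared routine; the accepted formats and names here are assumptions).

```python
# Minimal sketch of a reusable transform: normalize dates arriving in several
# source formats into one common format used by every staging table.
from datetime import datetime

SOURCE_FORMATS = ["%d/%m/%Y", "%Y%m%d", "%d-%b-%Y"]  # assumed source formats

def to_standard_date(value: str) -> str:
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

# to_standard_date("05/10/2004") == "2004-10-05"
# to_standard_date("20041005")   == "2004-10-05"
```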

    Simple transformations, such as splitting or merging rows or columns, can be done in KALIDO.

    However, these are only used when prototyping, when an ETL tool may not be available. In

    general it is better to centralize all transformations in the same place for ease of maintenance.

Summary

• Do not summarize fact data before loading into the data warehouse

• Do use DataStage for data allocations prior to loading into the data warehouse

• Load source data in the source currency. KALIDO can convert currencies at report time

• Code transformations to be reusable - there is much more potential for reusing them in a KALIDO warehouse than in a custom solution

4.3 Loading data

From the DataStage perspective, 'loading' means loading data into the staging area where

    KALIDO picks it up and loads it into the data warehouse. After staging area data has been

    processed it can be deleted.

    KALIDO can load data from flat files or database tables. Throughout this paper we have

    assumed the data is loaded from tables. This is because database tables are easier to

    manipulate than flat files, which is especially useful if additional transformations are required

once data has been put into the staging area tables. Best practice is to locate the staging area tables as a separate schema within the data warehouse database. This simplifies

    housekeeping tasks such as backups.

    The staging area table structure can take two forms. First, there can be one table per business

    model object. Each table will have column names corresponding to the system labels of the

    object components in KALIDO (as shown in Figure 7). The advantage of this approach is

    readability as it is obvious which column refers to which component. The disadvantage is that

    a typical business model will have many logical objects so the staging area will have a large

    number of tables. This can be a maintenance burden, although the burden can be minimized

    with scripts that automate the initial table creation and other common processes.

    Alternatively, groups of objects can be stored in the same table, with generic column headings

    such as 'Entity 1,' 'Entity 2,' etc. (as shown in Figure 8). The generic table has an additional

    column which contains the object name, and KALIDO filters the table so that only records for

    that object are loaded. The advantage of this approach is that the staging area has as few as

    two tables - one for reference data and one for transaction data. This simplifies staging area

    management, and also simplifies the creation of processes that work across objects, such as

    data purging. To avoid confusion over which columns refer to which components, views should

    be created that map the generic column names to the component names. The DataStage job


    inserts data through the views and KALIDO reads data from the views. This combines column

    readability with the flexibility of generic tables. As the view generation can be easily

    automated, this is the recommended staging area design.
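Because the views follow a simple pattern, their creation can be scripted. The following Python sketch generates the DDL for one such view; the table, view and column names are assumptions made for illustration, not KALIDO system labels.

```python
# Minimal sketch: generate a view that maps the generic staging columns onto
# readable component names for one business-model object.

def build_view_ddl(view_name: str, staging_table: str, object_name: str,
                   column_map: dict[str, str]) -> str:
    select_list = ",\n  ".join(
        f"{generic} AS {component}" for generic, component in column_map.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {staging_table}\n"
        f"WHERE OBJECT_NAME = '{object_name}'"
    )

ddl = build_view_ddl(
    view_name="STG_PACKED_PRODUCT_V",
    staging_table="STG_REFERENCE_DATA",
    object_name="PACKED PRODUCT",
    column_map={"ENTITY_1": "PACKED_PRODUCT_CODE",
                "ENTITY_2": "PACKED_PRODUCT_NAME",
                "ENTITY_3": "BRAND_CODE",
                "TXN_DATE": "TRANSACTION_DATE"},
)
print(ddl)  # run the generated statement against the staging schema
```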

Summary

• Design the staging area as a small number of generic tables stored in a separate schema of the data warehouse database

• Use views to map component names onto the generic column headings for readability

4.4 Scheduling and job sequencing

This is the final focus of the paper, and it is an area of great importance.

    A data warehouse is typically updated on a nightly basis, and a host of processes need to be

    carefully coordinated to marshal data in and out of the data warehouse, including:

• Extracting reference and fact data from sources and loading it into the staging area

• KALIDO loading data into the warehouse from the staging area

• KALIDO processing the data into star schemas and standalone data marts for reporting

• Report generation by the reporting tool

• Housekeeping processes, such as database backups and purging the staging area

    The natural place to build and manage these processes is in DataStage. DataStage has a

    powerful set of tools that organizes jobs into process flows called 'sequences.' Sequences can

    control DataStage jobs as well as KALIDO and other third party processes.

    Job sequence implementation and optimization should start early and be generously

    resourced. Sequence designer goals include:

• Minimize build and maintenance requirements

• Trap and handle errors caused by process failure and poor data quality

• Optimize performance

    These are key topics which each require a full discussion.

    Minimize build and maintenance requirements

    This is best achieved by building "generic" sequences so that we only need to build and

    maintain a small handful of distinct processes.

    What is meant by generic? As an example, consider a sequence for loading reference data

    into KALIDO from the staging area. Reference data is organized into hierarchies, and within

each hierarchy, parent objects must be loaded before their children. Typically you would create a graphical sequence for each hierarchy, manually dragging the job for each object into

    the sequencer and linking them to run in the correct order. As the data model changes

    throughout the data warehouse lifecycle, the sequences will require regular maintenance.

    A better solution is to assume the business model will keep changing, and build a sequence

    that is driven by the business model, working out dynamically the order in which data needs


    to be loaded. KALIDO stores a complete metadata description of the model as a set of

    database views. These can be queried as the sequence executes. A simple algorithm might be:

• Use the metadata views to build a list of the objects in each dimension, sorted so that parent objects are listed before child objects

• Loop through this list loading data for each object in turn

    This algorithm can be built graphically using the DataStage sequence designer (for examples

    of DataStage sequences see Figures 10 and 11). The visual representation, like a flowchart, is

    straightforward to understand and modify.
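To show the flavor of such a model-driven load loop, here is a minimal Python sketch; in practice this logic would live in a DataStage sequence querying the KALIDO metadata views, and the data structures and load call used here are assumptions made for illustration.

```python
# Minimal sketch of a parent-first load loop driven by model metadata.

def parent_first_order(parents: dict[str, list[str]]) -> list[str]:
    """Order objects so that every parent appears before its children."""
    ordered, seen = [], set()

    def visit(obj: str):
        if obj in seen:
            return
        seen.add(obj)
        for parent in parents.get(obj, []):
            visit(parent)
        ordered.append(obj)

    for obj in parents:
        visit(obj)
    return ordered

def run_reference_load(parents: dict[str, list[str]], load_object) -> None:
    for obj in parent_first_order(parents):
        load_object(obj)   # e.g. trigger the staging-to-warehouse load for obj

# Example dimension: parents must load before Packed Product.
hierarchy = {"Packed Product": ["Brand", "Pack Type", "Product Group"],
             "Brand": [], "Pack Type": [], "Product Group": []}
run_reference_load(hierarchy, load_object=print)
```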

    If sequences are built ad-hoc, it's easy to end up with a proliferation of sequences each of

which does similar things. By using KALIDO business model metadata, we can instead build a

    small number of general-purpose, adaptable solutions. With a little care, and a good

    knowledge of DataStage sequences and KALIDO metadata views, it's possible to dramatically

    reduce both the number of sequences and the ongoing maintenance cost.

Trap and handle errors caused by process failure

Any job can fail for a number of reasons: hardware crashes, lack of disk space, missing

    mandatory fields, wrong date formats, duplicate records with inconsistent values, etc.

    Therefore, all possible failure cases need to be identified, and error handling designed for each

    one. Coding the error handling mechanism is usually straightforward because of the abundant

    error handling functionality in DataStage. The hard part is deciding what to do.

    This is especially true when handling data quality errors. Suppose the Customer associated

    with a new Delivery Point is not entered into the source system. KALIDO will reject the record

    when it tries to load it into the data warehouse. Do we fix the data in the source and re-

    extract it - who is available to correct the data and how do we notify them of the problem? Do

    we leave the source as it is and fix the data in the staging area - how do we ensure this does

    not create problems later on because the source and the data warehouse have different

values? Or do we change the design of the source so that this problem cannot arise?

    Establishing business processes to handle these problems can take a long time. Data

    warehouses suffer acutely from 'business paralysis' because they cut across multiple levels of

    the organization and bring together people and departments who have not previously worked

    together. The business needs time to establish ownership for problems, develop procedures

    for resolving them, and train staff. These matters need to be investigated right at the start of

    the project, and the business needs to be involved from day one.

    KALIDO MDM is an application in the KALIDO suite, which addresses this by integrating

business people into the process. It allows them to manage master reference data collaboratively in the context of quality control workflows, and to formally approve reference

    data before it is released to the data warehouse. Ascential QualityStage complements

    KALIDO MDM by standardizing, matching and de-duplicating data according to business rules.

    Detailed exploration of KALIDO MDM and QualityStage is outside the scope of this paper.


    Optimizing performance

    Like error handling, data warehouse performance is something that must be designed in from

    the start. The faster data is delivered to the end user the more timely and useful it is. Data

    warehouses are usually refreshed on an overnight basis, and if users are in multiple time

    zones, 'overnight' may last just a few hours.

The goal of the process flow designer is to make maximum use of the available hardware so that the load is

    spread evenly across time and hardware, rather than hardware experiencing short bursts of

    activity and long periods of idleness. Jobs should also be scalable so they can take full

    advantage of new hardware. A key to better performance is parallel processing. There are two

    types of parallel processing, both of which should be used wherever possible:

    Break individual jobs into parallel streams. Consider a DataStage job which extracts data from

    a database server, transforms the data on the DataStage server and saves it to the staging

    area database server. Often none of the hardware components is heavily stressed during this

    process, and in such cases DataStage can divide source data into independent partitions and

    run them in parallel. The number of partitions can be increased until all hardware is operating

efficiently, independent of the DataStage job design. As a result, the job is processed several times faster. If hardware is upgraded, the number of partitions can be increased. It is easy to

    include such parallelism in DataStage jobs, especially if it is built in from the beginning.

Figure 11 - Parallelism within a single DataStage job

    During job execution, data is automatically divided into the number of partitions the user

    specified and automatically re-partitioned between stages. This is described below.


    Parallel Execution

    In creating a DataStage dataflow diagram, the user concentrates on the sequential flow of

    large collections of records through a sequence of processing steps. Users do not need to

    worry about the underlying architecture of the multiprocessor computer that will be used for

    running the application. DataStage Enterprise Edition provides a clean separation between the

    sequential expression of the workflow of the data integration application and the parallel

    execution of the application in the production computing environment.

    DataStage Enterprise Edition exploits both pipeline parallelism and partition parallelism to

    achieve high throughput and performance:

    o Data pipelining means that when the application begins to run, records get pulled from the

    source system and move through the sequence of processing functions defined in the

    dataflow graph. The records are flowing through the pipeline using [virtual] data sets

    which makes it possible to move the records through the sequence of processing functions

    without having to land the records to disk.

    o Data partitioning is an approach to parallelism that involves breaking up the record set

    into partitions, or subsets of records. Data partitioning generally provides good, linear

    increases in application performance. DataStage Enterprise Edition supports automatic

    repartitioning of records as they are moving through the application flow, using a broad

    range of partitioning approaches including hash, range, entire, random, round robin, same

    and DB2.
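To make the idea of partition parallelism concrete, here is a minimal conceptual Python sketch (this is what the parallel engine manages automatically; it is not DataStage code); the partition key and transformation are assumptions made for the example.

```python
# Minimal sketch: hash-partition a record set and process partitions concurrently.
from concurrent.futures import ProcessPoolExecutor
from itertools import chain

def partition(records, key, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        parts[hash(rec[key]) % n_partitions].append(rec)
    return parts

def transform(partition_records):
    # Placeholder for the per-partition transformation work.
    return [{**rec, "processed": True} for rec in partition_records]

def run_parallel(records, key="CUSTOMER_CODE", n_partitions=4):
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(transform, partition(records, key, n_partitions))
    return list(chain.from_iterable(results))
```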

    Users create a simple sequential dataflow graph using the Enterprise Edition Designer canvas.

    When constructing the sequential dataflow graph, users do not have to worry about the

    underlying hardware architecture or number of processors. A separate configuration file

    defines the resources (processors, memory, disk) of the underlying multiprocessor computing

    system. The configuration provides a clean separation between the creation of the sequential

dataflow graph and the parallel execution of the application, which greatly simplifies the development of scalable data integration systems that execute in parallel.

DataStage Enterprise Edition's architecture allows users to scale application performance

    effortlessly by adding hardware resources without having to change the data integration

    application. The same application can run on a one-processor system, an SMP system, a

    cluster of SMP systems, or an MPP system with near-linear increases in performance without

    changing the application. DataStage Enterprise Edition also supports grid computing. Grid


computing takes advantage of all distributed computing resources - processor and memory - available on the network to create a single system image.

It is impossible to know in advance how much effort needs to be spent on performance-tuning

    sequences. Expect to revise them over time as more is understood about the capacity of the

    hardware and the data volumes, and allow plenty of time for this in the project plan.

Summary

• Minimize the number of job sequences by writing them in generic form driven by the KALIDO business model metadata

• Investigate data quality issues and the business processes to resolve them at project start

• Design sequences to leverage parallelism where possible

5 Conclusions

This paper began by establishing the need for enterprise data warehouses which are predicated on the assumption of business change. KALIDO is a data warehousing solution that addresses this need by allowing the designer to work at the logical data level rather than the physical data level.

    KALIDO is complementary to the Ascential Enterprise Integration Suite. The iterative KALIDO

    approach relies on data integration software which can keep up with the rapid pace of

    development. Ascential DataStage is a tool that can do this, provided it is used correctly.

    KALIDO simplifies the transformation and loading part of ETL by loading data from a simple

    staging area consisting of lists of data for each object in the business model. DataStage only

    has to provide data in this straightforward, uniform format, rather than put the data through

    further transformations to support the underlying physical table structure. This greatly reduces

the complexity of DataStage jobs. Other KALIDO features such as built-in data validation, surrogate key management, time variance, summarization and currency conversion reduce this complexity still further. Extracting data easily from the source systems remains a key task, and DataStage connectivity to mainframes, enterprise applications, databases, and real-time message queues is vital for quickly integrating new source systems.

    DataStage is also responsible for marshaling the data from source system to KALIDO. The

    challenging task of the sequence designer is to build solutions that are generic (reusable

    across lots of individual jobs), handle poor quality data and other types of job failures, and use

    parallelism to optimize performance. Sequence design should begin early, with plenty of

    resources allocated throughout the lifecycle of the data warehouse.

For further information on any of the topics raised in this paper, please contact [email protected] or [email protected]. To find out more about KALIDO,

    please visit the white paper section of our website at http://www.kalido.com/library. To learn

    more about Ascential DataStage and other Ascential Enterprise Integration Suite software,

    please visit www.ascential.com.


For more information please contact us

I: www.kalido.com
E: [email protected]

Kalido
25 Burlington Mall Road
Burlington, MA 01803
Tel: +1 781 229 6006

Kalido
8 York Road
London SE1 7NA, United Kingdom
Tel: +44 (0) 20 7934 3300

Kalido
17 Square Edouard VII
F-75009 Paris, France
