Best-Practice ETL for KALIDO 8 using
Ascential DataStage
October 2004
Gary Powell
Senior KALIDO Application Consultant
Copyright 2004 Kalido www.kalido.com
Table of contents
1 Introduction
  The need for data integration tools in an enterprise data warehouse
2 How ETL changes in a KALIDO environment
3 Understanding the ETL target
4 Best Practice Techniques
4.1 Extracting data from the source
  Working with the business model
  Full or delta extraction
  Handling time-variant reference data using transaction date
  Handling changing codes
4.2 Transforming data
  Data summarization and allocation
  Currency conversions
  Writing reusable transforms
4.3 Loading data
4.4 Scheduling and job sequencing
  Minimize build and maintenance requirements
  Trap and handle errors caused by process failure
  Optimizing performance
5 Conclusions
1 Introduction
Global 2000 companies now face greater regulatory and shareholder pressure than ever to
increase corporate accountability, transparency and performance. As a result, they are
creating enterprise data warehouses which enable multiple, corporate-wide views of business
performance across disparate systems and organizations.
Unfortunately, it takes months (12-18 is not uncommon) to custom build or modify a data
warehouse, and many are never even completed. As a result, business people fail to receive
the timely management information they need to make high-quality decisions.
To address this challenge, companies are taking a more iterative approach to enterprise data
warehousing using KALIDO, which automates the creation and maintenance of enterprise
data warehouses and master data throughout their lifecycle. The KALIDO application suite
automatically adapts data warehouses and their associated master data to new business needs
based on changes made to real-world business models. Kalido customers create and modify
data warehouses 75% faster and at half the cost of traditional approaches.
The need for data integration tools in an enterprise data warehouse
A data warehouse typically sources data from a wide range of IT systems and organizations.
To help deliver this data to the data warehouse, companies deploy tools such as Ascential
Enterprise Integration Suite, which includes:
Ascential ProfileStage - data profiling to evaluate source data content and structure
Ascential QualityStage - data cleansing to find & reconcile low-quality or redundant data
Ascential DataStage - data extraction, transformation and loading (ETL)
Ascential MetaStage - metadata management for definitions and history of business data
Figure 1 - Kalido and Ascential Products
(The diagram shows data flowing from any source - CRM, ERP, SCM, RDBMS, legacy systems, EAI/messaging, web services, XML/EDI and legacy data warehouses - through the Ascential Enterprise Integration Suite, where ProfileStage, QualityStage and DataStage run on a parallel execution engine with shared metadata management, real-time integration services and enterprise connectivity, into KALIDO Business Intelligence, where the business model and master data drive the automatic build of the data warehouse schema and content, and of universes, cubes, marts and reports.)
This paper outlines a best-practice approach for using DataStage with KALIDO. As you will
see, the use of DataStage differs from a custom-build environment, with less emphasis on
data transformation and loading, but more emphasis on easy and efficient extraction from
multiple data sources, and the management of data as it is moved into the data warehouse.
The rest of this paper assumes a basic familiarity with standard ETL and data warehousing
concepts; however, no specific knowledge of KALIDO or DataStage is required.
2 How ETL changes in a KALIDO environment
A good way of thinking about ETL in a KALIDO data warehouse environment is that it allows
you to work with data at a logical level rather than at the physical level.
Typically, a data warehouse is built around a logical data model, which characterizes data as a
number of related entities. This is then implemented as a physical data model, consisting of a
number of database tables. There are many ways to map the logical to a physical data model
depending on data requirements, such as speed of retrieval and ease of maintenance.
Here is a logical data model drawn in KALIDO notation, a business-friendly way of expressing
logical data models for enterprise data warehouses. KALIDO refers to these as business
models. Fact data is located in the ovals, and reference data hierarchies are located in the
boxes. Arrows between reference data signify "is classified by", with dotted arrows meaning
the classification is optional. The green box means "classification can be to either level".
(The model contains three reference data hierarchies - Product (Packed Product, Brand, Pack Type, Product Group), Time (Day, Month, Quarter, Year) and Customer (Customer Account, Delivery Point, Region, Industrial Classification, Industry, Industry Group) - and two fact objects: Product Sales (Volume, Revenue, Distribution cost) and Target Sales (Target volume).)
Figure 2 - Business Model (Logical Data Model)
Customarily, physical schemas (usually star or snowflake) are built from this model:
(Figure 3 shows a star schema: Product, Customer and Time dimension tables joined to Product Sales and Target Sales fact tables. Figure 4 shows the equivalent snowflake schema: separate attribute tables for each level - Packed Product, Brand, Pack Type, Product Group, Customer Account, Delivery Point, Region, Industry, Industry Group, Day, Month, Year - linked by association tables to the same fact tables.)
Figure 3 - Star Physical Model; Figure 4 - Snowflake Physical Model
The role of DataStage in a custom-built data warehousing environment is to get the data from
the source systems into the correct tables in the data warehouse as defined by the physical
model. The architecture is as follows:
Figure 5 Role of DataStage in a traditional data warehousing environment
In this architecture, DataStage is tightly coupled with the physical data warehouse design.
This has three important implications:
1. Physical data models often change during implementation even though the business model
does not. For instance, as data warehouse volume grows it may be necessary to move
from a snowflake schema - which is easy to populate but inefficient to query - to a star
schema - which is easy to query but harder to populate. This usually involves considerable
rework of the associated DataStage jobs.
2. As the business model changes over time, the physical data model generally becomes
more complicated due to the need to maintain historical versions of the model. DataStage
jobs become correspondingly more complicated as a result.
3. Physical data models require surrogate keys for most data entities, which are used instead
of the natural unique identifier for an object (such as Payroll # for an employee) because
they enable the object ID to be preserved over time even if the natural key changes.
DataStage has to manage the mapping between surrogate keys in the data warehouse
and natural keys in the source, which adds development overhead to each DataStage job.
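The surrogate-key bookkeeping described in point 3 can be sketched as follows. This is a minimal illustration in Python (not DataStage code) of the mapping a custom-built ETL layer must maintain between natural keys in the source and surrogate keys in the warehouse; the class and key names are hypothetical.

```python
# Illustrative sketch of the surrogate-key bookkeeping a DataStage job
# must perform in a custom-built warehouse: each natural key from the
# source (e.g. a Payroll #) is mapped to a stable warehouse surrogate.
class SurrogateKeyMap:
    def __init__(self):
        self._map = {}
        self._next = 1

    def lookup(self, natural_key):
        """Return the surrogate for a natural key, allocating one if new."""
        if natural_key not in self._map:
            self._map[natural_key] = self._next
            self._next += 1
        return self._map[natural_key]
```

Because the surrogate, not the natural key, identifies the object in the warehouse, the mapping must be persisted and consulted on every load - exactly the per-job overhead that KALIDO removes by handling the mapping centrally.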
Now let's see what happens when DataStage is used in conjunction with a KALIDO data
warehouse. The great strength of KALIDO is that it allows its users to work at the business
model level - it transparently (without any user intervention) maps the business model onto a
physical set of tables and manages the interface between them. Only the KALIDO business
model is exposed to DataStage, and the role of ETL is simplified because:
KALIDO maps between the logical and physical layers, so physical design changes in the
data warehouse do not affect DataStage jobs.
KALIDO handles changes to the business model over time, allowing historical and current
models to exist concurrently. DataStage jobs are only required to support the current
business model.
KALIDO maps natural keys to surrogates. DataStage jobs only need to supply natural
keys.
In practice, this means the role of DataStage is to provide a list of the instances of each object
defined in the business model. By "instance" we mean a specific Customer, Product, or Sales
Transaction. KALIDO loads this list into the data warehouse, performing the relevant logical-
to-physical mappings internally as it processes the data. This is illustrated in Figure 6:
Figure 6 - DataStage in a KALIDO architecture
Figure 6 shows how DataStage, instead of sending data directly to the data warehouse as it
would in a traditional, custom-built data warehousing architecture, now sends it to a
staging area. The staging area is just a place for storing the list of instances for each object
prior to loading into the KALIDO data warehouse, and is usually a collection of tables in their
own schema in the data warehouse database.
In the staging area, the objects are stored without any associations, which illustrates another
key difference between KALIDO and a traditional warehouse, namely that application logic is managed by KALIDO and not by DataStage. Validation rules such as 'every Delivery Point
must refer to a Customer' are encoded into the KALIDO business model, and KALIDO verifies
these rules as it receives the data. Therefore, although the Delivery Point list in the staging
area has a column referencing the Customer list, the DataStage job does not have to check
the validity of the references. This is another major simplification over the custom
environment where much of the application logic is encoded in the ETL layer.
So, KALIDO dramatically reduces the complexity of ETL jobs by handling the logical to physical
data model mappings centrally rather than relying on each ETL job to deal with these
mappings locally. Does this mean it also reduces the need for DataStage? No! In fact, the
need for DataStage has increased. This is because KALIDO delivers a unique benefit that is
almost unheard of in traditional data warehousing - very rapid response to change. As new
systems are integrated, as the scope of the data warehouse increases, as the business model
evolves, KALIDO quickly adapts to the changes. This puts great pressure on the ETL tool to
provide up-to-date data. If data cannot be extracted from a new source, then whether the
data warehouse is ready to receive it is irrelevant.
DataStage is designed for rapid development, and its functionality complements KALIDO:
Ascential PACKs for SAP, PeopleSoft, Siebel, Oracle and other enterprise applications.
The predefined interfaces, transformations and metadata models offered in the PACKs
make it easier to find and use the relevant data.
DataStage graphical design environment. Greater productivity than manual
programming, with greater reuse and manageability. Except for the most complex
transformations, it is much easier to work with a visual data flow representation than lines
of code.
DataStage performance. As the KALIDO data warehouse scope increases, more data is
processed in shorter timeframes. DataStage has a parallel engine that scales as data
volume increases.
DataStage scheduling and sequencing. Moving data in and out of KALIDO data
warehouses requires the management of many interdependent processes. DataStage has
a graphical tool for building sequences with conditional dependencies, error handling and
parallel execution.
DataStage and KALIDO are highly complementary, allowing for simpler and iterative RAD-style
development. However, to get the maximum benefit they must be integrated properly. To do
this, the ETL target structure must be understood, which is described next.
3 Understanding the ETL target
The target of your ETL jobs in the context of a KALIDO data warehouse is a staging area, and each
job delivers a list of instances for one object in the business model. KALIDO loads the lists into
the data warehouse.
In KALIDO, a staging-area-to-warehouse mapping (called a load definition) is defined for
each object. There are two load definition types: one for reference data (such as Customers
and Products) and the other for fact data (such as Product Sales). There are slight differences
between the two. Looking at a reference data load definition, KALIDO uses a similar style of
interface to DataStage for this type of definition, so it should appear straightforward:
Figure 7 - KALIDO reference data load definition
The right-hand panel displays the object from the business model. The left-hand and central
panes show the mappings between the column headings of the input table in the staging area
and the business model object components. Every object has at minimum two components:
A code - the natural code from the source system, which will be unique among the
instances of that object
A name - which describes the instance, and doesn't need to be unique
Objects may also have parents or additional attributes. In this case, the model defines Packed
Product as having three mandatory parents: Brand, Pack Type and Product Group, so these
must be included as part of the mapping.
Every definition also has a Transaction Date. Over time, reference data changes - new
Customers are added, Products are assigned to different Brands, etc. The transaction date is
simply the date when this change occurred.
The column names on the left are the field names in the staging table. They can be anything
as long as they match the name given in the mapping. Alternative field names might be:
Figure 8 - KALIDO reference data load definition, alternative field names
This second style of field names can be very useful when reusing generic staging area tables
and DataStage jobs for more than one object, which will be described later.
Load definitions for fact data are very similar. Here is a definition for Product Sales:
Figure 9 - KALIDO fact data load definition
This mapping shows the staging area table has three columns containing the natural codes for
the reference data associated with the sale: the Delivery Point code, Packed Product code and
Sale Day. It also has the numeric data associated with the transaction - Distribution Costs,
Revenue and Volume. Lastly, it has a transaction date. The Transaction Date serves a slightly
different purpose from the day of Sale - it relates the transaction to the time-variant reference
data. Suppose the name of a particular product has recently changed; because this is a fact
data load definition we only supply the code for the product and not its name. The transaction
date then allows us to determine if the sale occurred before or after the name change.
This is essentially all that's necessary to know about the ETL targets - straightforward tables
with one row per instance of an object, one column per object component, and a date column
that lets KALIDO track how the reference data is changing over time.
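To make the target shape concrete, here is a minimal Python sketch of staging rows for the Packed Product reference object from the model: one row per instance, one column per component, plus the transaction date. The column names are hypothetical - in practice they only have to match the names given in the KALIDO load definition mapping.

```python
from datetime import date

# Illustrative staging-area rows for the 'Packed Product' reference
# object. Column names are assumptions for illustration only.
def packed_product_row(code, name, brand, pack_type, product_group, tx_date):
    return {
        "PP_CODE": code,                   # natural code, unique per instance
        "PP_NAME": name,                   # descriptive name, need not be unique
        "BRAND_CODE": brand,               # mandatory parent: Brand
        "PACK_TYPE_CODE": pack_type,       # mandatory parent: Pack Type
        "PROD_GROUP_CODE": product_group,  # mandatory parent: Product Group
        "TRANSACTION_DATE": tx_date,       # date the instance was created/changed
    }

staging_rows = [
    packed_product_row("PP001", "Cola 330ml Can", "BR01", "CAN330", "SOFT",
                       date(2004, 10, 1)),
    packed_product_row("PP002", "Cola 2l Bottle", "BR01", "BOT2000", "SOFT",
                       date(2004, 10, 1)),
]
```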
That's all the background needed to discuss best-practice techniques for using DataStage with
KALIDO, which is what the rest of this paper covers.
4 Best Practice Techniques
This section deals with best practices for use of KALIDO in conjunction with DataStage and is
broken down into the four functional areas of ETL:
1. Extracting data from any source
2. Transforming data
3. Loading into the staging area
4. Process control and scheduling
4.1 Extracting data from the source
Working with the business model
Data extraction is driven by the KALIDO business model, which defines the objects, mandatory
and optional object components, and relationships to other objects. For each object there is a
corresponding ETL output, and the DataStage designer has to locate the data in the source
systems corresponding to the object components. In some cases this is easier than in others - an
object in the model may come from a single table in a single source system, or it may be
composed of records and fields spread across several tables and systems.
Kalido recommends that you do not start extensive ETL development until the KALIDO
business model is stable. Allow business modelers to sign off on a release that will be used for
the initial ETL jobs. During this exploratory stage, DataStage and other tools in the Ascential
Enterprise Integration Suite will be used to verify the correctness of the model, but it's too
early to begin production development. Even after an initial version of the model has been
signed off, it will continually change throughout the lifecycle of the data warehouse. Because
of this, you should establish a clearly understood and enforced change-control process to track
all modifications.
By default, you should only create one extract per object. When you prototype a KALIDO data
warehouse, it is possible to create KALIDO load definitions that combine objects such as
multiple levels of a hierarchy or a mix of reference data and fact data. In production it is
strongly recommended that the designer follows the rule of creating one extract per object.
Over the lifecycle of the data warehouse the uniformity and consistency of one extract per
object pays dividends because the ETL components are easier to understand, reuse and
modify.
Full or delta extraction
When extracting data, a key decision is whether to extract all object instances, or just new or
modified instances. For example, there may be one million customers, and each week about
five thousand customers are added and a few hundred change their address or other details.
DataStage can pass all one million Customer records to KALIDO and ask it to calculate what's
different, or the changes can be calculated during the ETL and only these passed to KALIDO.
Which is best? This depends on which is fastest and also which is easiest to implement.
For some objects, detecting changes is easy - a 'last modified' field or a change log at the
source can be used to filter records. In this case, change detection should be done at the
source. If there is no obvious way of filtering the data, the designer has to weigh the relative
merits of building a change detection algorithm in the ETL versus passing the full dataset to
KALIDO. Generally, delta detection is done in DataStage if there are a great number of
records. In our example of a million customer records, with 0.5% changing per week, it is
likely that the performance benefits of building a change detection algorithm into the ETL job
outweigh development costs. This is because if delta detection is done by KALIDO, then before
it starts processing, one million records have to be extracted from the source, passed through
DataStage, and loaded into staging tables. Delta detection at the source avoids this overhead.
There are many algorithms for change detection. A simple approach is to concatenate the
component values of an object into a string and compare it to the same string when the object
was last extracted. Another is to use the SQL 'minus' operator which subtracts one set of
records from another leaving just the differences. The best algorithm will depend on the exact
circumstances - if in doubt Kalido consultants can provide experienced advice.
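The 'concatenate and compare' approach can be sketched as follows. This is a minimal Python illustration under stated assumptions: records are dicts keyed by field name, and the previous run's fingerprints would normally be persisted between extracts rather than held in memory.

```python
import hashlib

# Minimal sketch of concatenate-and-compare change detection: hash the
# concatenated component values of each object and compare against the
# fingerprint from the previous extract run.
def detect_deltas(records, key_field, previous_hashes):
    """Return (changed_records, new_hashes) for this extract run."""
    new_hashes = {}
    changed = []
    for rec in records:
        key = rec[key_field]
        # Concatenate all component values in a stable order and hash them
        fingerprint = hashlib.md5(
            "|".join(str(rec[f]) for f in sorted(rec)).encode("utf-8")
        ).hexdigest()
        new_hashes[key] = fingerprint
        if previous_hashes.get(key) != fingerprint:
            changed.append(rec)  # new or modified since the last extract
    return changed, new_hashes
```

On the first run every record is a delta; on subsequent runs only new or modified instances are passed on to the staging area, avoiding the overhead of pushing the full dataset through DataStage and into KALIDO.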
In contrast to reference data, fact data does not often change after it has been created. A
Product name may slowly change over time, but fact data is typically a point-in-time event -
either a Product was sold on a particular day or it wasn't. If fact data does change - in our
model we have a fact table called 'Target Sales' which may undergo revisions - KALIDO
handles the changes in the same way as reference data changes. Given the high data volumes
involved, this type of delta detection would normally be done during the ETL.
Handling time-variant reference data using transaction date
The correct handling of time-variant reference data is essential for the delivery of meaningful
business intelligence. If we reclassify a product as a different brand, all new product sales will
be recorded against the new brand. A report by brand will show a sudden drop in revenue for
one brand and a rise in another. But, to understand the yearly growth of brand sales,
management will need at least two other versions of this report:
A report as if the product had remained in the old brand
A report as if the product had always been part of the new brand.
Figure 10 - The importance of time variance
(The chart plots quarterly brand sales from Q3-01 to Q2-04, marking the quarter in which the new product is added to the brand and the lower trend line showing brand sales without the new product.)
Figure 10 illustrates the importance of time variance to the business. In a traditional, custom-
built data warehouse, including fully flexible time variance makes the physical schema much
more complicated, with a corresponding rise in the complexity of DataStage jobs. With
KALIDO, time variance is handled automatically for all reference data. All the ETL has to do is
to provide KALIDO with a timestamp indicating the date of any reference data changes.
So what value do we choose for the transaction date? There are two types of changes we need
to consider - new reference data coming into existence, and modifications to existing data. In
practice, the dates we use for these changes are closely tied in with the fact data - a Product
must come into existence before it can be sold, and if a product has been re-branded, the sale
must have a transaction date that corresponds to the correct brand of the product at the time.
Ideally there will be some date field in the source that can be used for this purpose. If no
suitable date exists, our recommendation is as follows:
For creation dates use a constant historical date, such as 1/1/2000 - it generally does
not matter if a Product is deemed to have been created some time before it is sold,
whereas the opposite will cause a data quality error.
For modifications use the extract date. If we extract Product data nightly it is normally
sufficient to record that the change happened sometime during that day. For greater
accuracy increase the extract frequency.
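The recommendation above can be sketched as a small decision function. This is an illustrative Python sketch, assuming no usable date field exists in the source; the constant 1/1/2000 creation date and the function name are from the recommendation, not from any KALIDO API.

```python
from datetime import date

# Sketch of the transaction-date recommendation: prefer a real source
# date; otherwise use a constant historical date for creations and the
# extract date for modifications.
CREATION_DEFAULT = date(2000, 1, 1)

def transaction_date(is_new_instance, extract_date, source_date=None):
    if source_date is not None:
        return source_date       # always prefer a genuine source date
    if is_new_instance:
        return CREATION_DEFAULT  # safe: deemed created before it is first sold
    return extract_date          # change happened sometime during this extract
```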
Handling changing codes
A final consideration during extraction is to make sure that a data item can still be correctly
identified by the data warehouse if its natural code changes. KALIDO must be told about the
code change, and this can be done via a load definition in the normal way. If you look back to
the load definition in Figure 7, you'll see the component 'New Packed Product Code' - that's
what this is used for. You need to make sure that any extracts loaded into KALIDO before the
code change use the old code, and all subsequent extracts use the new code.
Summary
Design DataStage jobs around the business model with one extract per object - use a
formal change control process to track changes to the model
For objects with a million or more records, do delta detection during the ETL stage
If no timestamp exists in the source for changing reference data, use a constant date
for the creation date of new objects, and use the extract date for modifications
Check for changing natural codes and make sure changes are loaded into KALIDO
before using the new codes
4.2 Transforming data
Transformation typically refers to the transformation of data from the physical schema of the
source to the physical schema of the target. This has already been discussed, so this section
concentrates on additional topics relating to transformation.
Data summarization and allocation
Customarily, data warehouses feature a host of transformations that summarize data up the
levels of a hierarchy, and allocate fact data to lower levels. This changes with KALIDO.
For example, daily sales totals from an ERP system are summarized into weekly totals. This
can be done before or after loading into the data warehouse. In a custom environment, the
ETL effort is typically the same - DataStage does the summarization in either case. With
KALIDO, summarizations are performed inside the data warehouse. KALIDO has extensive
tools to efficiently summarize data to the level required for business reporting after loading
into the data warehouse. Therefore ETL summarization is only required when summarizing
data prior to loading into the data warehouse.
When we summarize prior to loading we lose information. The normal advice is to load data
into the data warehouse at the lowest level of granularity available and summarize it in
KALIDO. KALIDO copes well with large fact data volumes - there are multi-terabyte KALIDO
implementations and also implementations that load millions of fact records per day. Data only
needs to be summarized before loading if volumes are exceptionally high. A good compromise
is to load the most recent data at the lowest level and use KALIDO to summarize historic data
to a higher level and purge the low level data. So all data for the current year may be held at
the day level, but previous years will be stored at the weekly level. Careful architecting like
this within KALIDO usually enables data to be loaded at the lowest level available regardless of
the data volumes.
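The daily-to-weekly summarization discussed above can be sketched in a few lines. KALIDO would normally perform this inside the warehouse; the Python sketch below is purely to illustrate the granularity trade-off, and the tuple layout of the input records is an assumption.

```python
from collections import defaultdict
from datetime import date

# Sketch of daily-to-weekly summarization. Each sale is a tuple of
# (sale_date, product_code, volume); output is keyed by ISO week.
def summarize_to_weekly(daily_sales):
    weekly = defaultdict(float)
    for sale_date, product, volume in daily_sales:
        iso_year, iso_week, _ = sale_date.isocalendar()
        weekly[(iso_year, iso_week, product)] += volume
    return dict(weekly)
```

Once summarized, the day-level detail is gone - which is exactly why the advice is to load at the lowest granularity and let the warehouse summarize, rather than summarize in the ETL.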
The opposite of a summarization is an allocation. In an allocation we process fact data to a
lower level than it exists in the source system. An example is an algorithm that takes
marketing costs per brand and allocates them across all the products in that brand as part of
calculating unit cost per product. Or a manager's salary may be allocated across his or her
team members to estimate the true cost of hiring new employees.
KALIDO does not have built-in allocation functionality. Therefore, allocations must be done
during the ETL or reporting stages. Doing allocations in reports is possible but often requires
an expert report developer, whereas doing allocations during ETL is generally straightforward.
KALIDO is often used as the source for allocations because, as the central repository of data
from across the business, KALIDO already holds the necessary fact data in the most accessible
form. DataStage queries KALIDO for the source data, performs the allocations, and loads the
results back into KALIDO as additional facts.
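An allocation of the kind described, brand-level marketing cost spread across the brand's products, can be sketched as follows. The proportional-to-volume basis and the field names are illustrative assumptions; the results would be loaded back into KALIDO as additional facts.

```python
# Sketch of a brand-to-product cost allocation: a brand-level marketing
# cost is spread across the brand's products in proportion to their
# sales volume.
def allocate_brand_cost(brand_cost, product_volumes):
    """product_volumes: {product_code: sales_volume} for one brand."""
    total = sum(product_volumes.values())
    if total == 0:
        return {p: 0.0 for p in product_volumes}
    return {p: brand_cost * v / total for p, v in product_volumes.items()}
```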
There is no need to do allocations or summarizations just to bring all data to the same level
before loading into the data warehouse. The different sources are modeled as separate objects
in the business model, and KALIDO selects and summarizes the data as needed to satisfy
specific reports. Look at the business model in Figure 2 - we see that Target Sales are loaded
at a quarterly level but Product Sales at a daily level. We do not need to do anything at the
ETL stage to allow us to compare the two - KALIDO automatically summarizes actual sales to
the quarterly level before passing to the reporting tool.
Currency conversions
A common transformation is to convert fact data from one currency to another. KALIDO has
extensive built-in currency conversion functions, so the rule here is to take data in the original
currency of the source system, supply KALIDO with tables of exchange rate data, and let
KALIDO perform the currency conversion during report generation. KALIDO can also convert
units of measure - for instance ounces to grams.
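The rule above - load facts in the source currency and convert only at report time using a supplied rate table - can be pictured with a small sketch. The currencies and rates below are invented for illustration:

```python
# Hedged sketch of report-time currency conversion: facts stay in their
# source currency, and a rate table (supplied to KALIDO by the ETL) is
# applied only when a report is generated. Rates are illustrative.

rates_to_usd = {"GBP": 1.80, "EUR": 1.25, "USD": 1.00}  # hypothetical rates

def to_usd(amount, currency):
    # Convert a single fact value at report time using the rate table
    return amount * rates_to_usd[currency]

facts = [(100.0, "GBP"), (200.0, "EUR")]           # stored as loaded
report_values = [to_usd(a, c) for a, c in facts]   # converted for the report
```

Because the stored facts are untouched, the same data can be reported in any currency for which a rate table exists.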
Writing reusable transforms
In a non-KALIDO environment, the need to work at the physical schema level leads to
DataStage jobs that are difficult to reuse. With KALIDO, the physical schema of the staging
area is very simple and is the same for all jobs. Hence the potential for reuse is great. Many
transformations will apply across multiple ETL tasks, such as transforming dates into a
common format. Transforms should be written to be reused across jobs. DataStage has a
variety of methods for doing this, the simplest being to write the transformation as a function.
Simple transformations, such as splitting or merging rows or columns, can be done in KALIDO.
However, these are only used when prototyping, when an ETL tool may not be available. In
general it is better to centralize all transformations in the same place for ease of maintenance.
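The date-normalization example mentioned above is a natural candidate for a reusable function. The sketch below is illustrative (not DataStage syntax); the source date formats and the common YYYYMMDD target format are assumptions:

```python
# One way to package a transformation for reuse across jobs: a single
# function that normalizes dates from several assumed source formats
# into one common warehouse format.
from datetime import datetime

SOURCE_FORMATS = ["%d/%m/%Y", "%Y-%m-%d", "%d-%b-%Y"]  # assumed source formats

def to_common_date(text):
    """Return the date in the warehouse's common YYYYMMDD format."""
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y%m%d")
        except ValueError:
            continue  # try the next known source format
    raise ValueError("unrecognized date format: %r" % text)
```

Any job extracting from any source can then call the same function, so a change to the common format is made in one place.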
Summary
Do not summarize fact data before loading into the data warehouse.
Do use DataStage for data allocations prior to loading into the data warehouse
Load source data in the source currency. KALIDO can convert currencies at report time
Code transformations to be reusable - there is much more potential for reusing them
in a KALIDO warehouse than a custom solution
4.3 Loading data
From the DataStage perspective, 'loading' means loading data into the staging area where
KALIDO picks it up and loads it into the data warehouse. After staging area data has been
processed it can be deleted.
KALIDO can load data from flat files or database tables. Throughout this paper we have
assumed the data is loaded from tables. This is because database tables are easier to
manipulate than flat files, which is especially useful if additional transformations are required
once data has been put into the staging area tables. Best practice is to locate the staging area
tables as a separate schema within the data warehouse database. This simplifies
housekeeping tasks such as backups.
The staging area table structure can take two forms. First, there can be one table per business
model object. Each table will have column names corresponding to the system labels of the
object components in KALIDO (as shown in Figure 7). The advantage of this approach is
readability as it is obvious which column refers to which component. The disadvantage is that
a typical business model will have many logical objects so the staging area will have a large
number of tables. This can be a maintenance burden, although the burden can be minimized
with scripts that automate the initial table creation and other common processes.
Alternatively, groups of objects can be stored in the same table, with generic column headings
such as 'Entity 1,' 'Entity 2,' etc. (as shown in Figure 8). The generic table has an additional
column which contains the object name, and KALIDO filters the table so that only records for
that object are loaded. The advantage of this approach is that the staging area has as few as
two tables - one for reference data and one for transaction data. This simplifies staging area
management, and also simplifies the creation of processes that work across objects, such as
data purging. To avoid confusion over which columns refer to which components, views should
be created that map the generic column names to the component names. The DataStage job
inserts data through the views and KALIDO reads data from the views. This combines column
readability with the flexibility of generic tables. As the view generation can be easily
automated, this is the recommended staging area design.
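Since the view generation is the part recommended for automation, a sketch of it may help. The table name, object-name column, and `Entity N` headings below follow the generic layout described above but are otherwise hypothetical, and the DDL is emitted as a string rather than run against a real database:

```python
# Sketch of automating view generation over a generic staging table: for
# each business-model object, emit a CREATE VIEW that renames the generic
# columns (Entity 1, Entity 2, ...) to the object's component names and
# filters on the object-name column. All identifiers are hypothetical.

def staging_view_ddl(object_name, component_names,
                     table="STG_REFERENCE", name_col="OBJECT_NAME"):
    cols = ", ".join(
        '"Entity %d" AS %s' % (i + 1, comp)
        for i, comp in enumerate(component_names)
    )
    return (
        "CREATE VIEW V_%s AS SELECT %s FROM %s WHERE %s = '%s'"
        % (object_name, cols, table, name_col, object_name)
    )

ddl = staging_view_ddl("PRODUCT", ["PRODUCT_CODE", "PRODUCT_NAME", "BRAND_CODE"])
```

Running such a script for every object in the business model yields one readable view per object over as few as two physical tables.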
Summary
Design the staging area as a small number of generic tables stored in a separate
schema of the data warehouse database
Use views to map component names onto the generic column headings for readability
4.4 Scheduling and job sequencing
This is the final focus of the paper, and it is an area of great importance.
A data warehouse is typically updated on a nightly basis, and a host of processes need to be
carefully coordinated to marshal data in and out of the data warehouse, including:
Extracting reference and fact data from sources and loading it into the staging area
KALIDO loading data into the warehouse from the staging area
KALIDO processing the data into star schemas and standalone data marts for reporting
Report generation by the reporting tool
Housekeeping processes, such as database backups and purging the staging area
The natural place to build and manage these processes is in DataStage. DataStage has a
powerful set of tools that organizes jobs into process flows called 'sequences.' Sequences can
control DataStage jobs as well as KALIDO and other third party processes.
Job sequence implementation and optimization should start early and be generously
resourced. Sequence designer goals include:
Minimize build and maintenance requirements
Trap and handle errors caused by process failure and poor data quality
Optimize performance
These are key topics which each require a full discussion.
Minimize build and maintenance requirements
This is best achieved by building "generic" sequences so that we only need to build and
maintain a small handful of distinct processes.
What is meant by generic? As an example, consider a sequence for loading reference data
into KALIDO from the staging area. Reference data is organized into hierarchies, and within
each hierarchy, parent objects must be loaded before their children. Typically you would
create a graphical sequence for each hierarchy, manually dragging the job for each object into
the sequencer and linking them to run in the correct order. As the data model changes
throughout the data warehouse lifecycle, the sequences will require regular maintenance.
A better solution is to assume the business model will keep changing, and build a sequence
that is driven by the business model, working out dynamically the order in which data needs
to be loaded. KALIDO stores a complete metadata description of the model as a set of
database views. These can be queried as the sequence executes. A simple algorithm might be:
Use the metadata views to build a list of the objects in each dimension, sorted so that
parent objects are listed before child objects
Loop through this list loading data for each object in turn
This algorithm can be built graphically using the DataStage sequence designer (for examples
of DataStage sequences see Figures 10 and 11). The visual representation, like a flowchart, is
straightforward to understand and modify.
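The ordering step of that algorithm can be sketched outside the sequence designer. The parent-child map below stands in for what would be read from the KALIDO metadata views (that query is omitted), and the object names are illustrative:

```python
# Sketch of the ordering step in the generic sequence: given parent-child
# relationships read from the KALIDO metadata views, sort the objects so
# every parent is loaded before its children. Object names are illustrative.

def load_order(parent_of):
    """parent_of: {object: parent or None}. Returns objects parent-first."""
    order = []
    seen = set()

    def visit(obj):
        if obj in seen:
            return
        parent = parent_of.get(obj)
        if parent is not None:
            visit(parent)   # queue the parent before this object
        seen.add(obj)
        order.append(obj)

    for obj in parent_of:
        visit(obj)
    return order

hierarchy = {"Brand": None, "Product": "Brand", "SKU": "Product"}
sequence = load_order(hierarchy)
# The sequence loop would then load data for each object in this order
```

Because the order is derived from the metadata at run time, the sequence needs no editing when the business model changes.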
If sequences are built ad hoc, it's easy to end up with a proliferation of sequences, each of
which does similar things. By using KALIDO business model metadata, we can instead build a
small number of general-purpose, adaptable solutions. With a little care, and a good
knowledge of DataStage sequences and KALIDO metadata views, it's possible to dramatically
reduce both the number of sequences and the ongoing maintenance cost.
Trap and handle errors caused by process failure
Any job can fail for a number of reasons: hardware crashes, lack of disk space, missing
mandatory fields, wrong date formats, duplicate records with inconsistent values, etc.
Therefore, all possible failure cases need to be identified, and error handling designed for each
one. Coding the error handling mechanism is usually straightforward because of the abundant
error handling functionality in DataStage. The hard part is deciding what to do.
This is especially true when handling data quality errors. Suppose the Customer associated
with a new Delivery Point is not entered into the source system. KALIDO will reject the record
when it tries to load it into the data warehouse. Do we fix the data in the source and re-
extract it - who is available to correct the data and how do we notify them of the problem? Do
we leave the source as it is and fix the data in the staging area - how do we ensure this does
not create problems later on because the source and the data warehouse have different
values? Or do we change the design of the source so that this problem cannot arise?
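Whatever the business decides, the trapping itself is simple: records failing a mandatory-field check are diverted to a reject stream before the load rather than being allowed to fail inside KALIDO. A minimal sketch, with hypothetical field names based on the Delivery Point example:

```python
# Illustrative sketch of trapping a data-quality error before load: rows
# whose mandatory fields are missing are routed to a reject list for
# follow-up rather than passed to the warehouse load. Field names are
# hypothetical, echoing the Delivery Point / Customer example.

def split_rejects(rows, mandatory=("DELIVERY_POINT", "CUSTOMER")):
    good, rejects = [], []
    for row in rows:
        missing = [f for f in mandatory if not row.get(f)]
        if missing:
            # Keep the row and the reason so someone can fix the source
            rejects.append((row, "missing: " + ", ".join(missing)))
        else:
            good.append(row)
    return good, rejects

rows = [{"DELIVERY_POINT": "DP1", "CUSTOMER": "C9"},
        {"DELIVERY_POINT": "DP2", "CUSTOMER": None}]
good, rejects = split_rejects(rows)
```

The reject list then feeds whichever resolution process the business has agreed: notify the data owner, patch the staging area, or hold the record for a later load.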
Establishing business processes to handle these problems can take a long time. Data
warehouses suffer acutely from 'business paralysis' because they cut across multiple levels of
the organization and bring together people and departments who have not previously worked
together. The business needs time to establish ownership for problems, develop procedures
for resolving them, and train staff. These matters need to be investigated right at the start of
the project, and the business needs to be involved from day one.
KALIDO MDM, an application in the KALIDO suite, addresses this by integrating
business people into the process. It allows them to manage master reference data
collaboratively in the context of quality control workflows, and to formally approve reference
data before it is released to the data warehouse. Ascential QualityStage complements
KALIDO MDM by standardizing, matching and de-duplicating data according to business rules.
Detailed exploration of KALIDO MDM and QualityStage is outside the scope of this paper.
Optimizing performance
Like error handling, data warehouse performance is something that must be designed in from
the start. The faster data is delivered to the end user the more timely and useful it is. Data
warehouses are usually refreshed on an overnight basis, and if users are in multiple time
zones, 'overnight' may last just a few hours.
The goal of the process flow designer is to make full use of the available hardware, spreading
the load evenly across time and machines rather than letting the hardware alternate between
short bursts of activity and long periods of idleness. Jobs should also be scalable so they can take full
advantage of new hardware. A key to better performance is parallel processing. There are two
types of parallel processing, both of which should be used wherever possible:
Break individual jobs into parallel streams. Consider a DataStage job which extracts data from
a database server, transforms the data on the DataStage server and saves it to the staging
area database server. Often none of the hardware components is heavily stressed during this
process, and in such cases DataStage can divide source data into independent partitions and
run them in parallel. The number of partitions can be increased until all hardware is operating
efficiently, independent of the DataStage job design. As a result, the job is processed several
times faster. If hardware is upgraded, the number of partitions can be increased. It is easy to
include such parallelism in DataStage jobs, especially if it is built in from the beginning.
Figure 11 Parallelism within a single DataStage job
During job execution, data is automatically divided into the number of partitions the user
specified and automatically re-partitioned between stages. This is described below.
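The partitioning idea itself is easy to sketch outside DataStage. The sketch below shows only the assignment of rows to independent partitions (here by a simple hash of a key column); the parallel workers that would then process each partition are left to the runtime, and all names and the hash rule are illustrative:

```python
# Sketch of partition parallelism: rows are assigned to N independent
# partitions by hashing a key column, so each partition can be transformed
# by a separate process. Only the partitioning step is shown.

def hash_partition(rows, key, n_partitions):
    partitions = [[] for _ in range(n_partitions)]
    for row in rows:
        # A simple deterministic hash keeps all rows for a key together
        bucket = sum(ord(ch) for ch in str(row[key])) % n_partitions
        partitions[bucket].append(row)
    return partitions

rows = [{"id": "A1"}, {"id": "B2"}, {"id": "A1"}, {"id": "C3"}]
parts = hash_partition(rows, "id", 3)
# Every copy of a given key lands in the same partition
```

Keeping all rows for a key in one partition is what lets each partition be processed without coordinating with the others.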
Parallel Execution
In creating a DataStage dataflow diagram, the user concentrates on the sequential flow of
large collections of records through a sequence of processing steps. Users do not need to
worry about the underlying architecture of the multiprocessor computer that will be used for
running the application. DataStage Enterprise Edition provides a clean separation between the
sequential expression of the workflow of the data integration application and the parallel
execution of the application in the production computing environment.
DataStage Enterprise Edition exploits both pipeline parallelism and partition parallelism to
achieve high throughput and performance:
o Data pipelining means that when the application begins to run, records get pulled from the
source system and move through the sequence of processing functions defined in the
dataflow graph. The records flow through the pipeline using virtual data sets, which makes
it possible to move the records through the sequence of processing functions without having
to land them to disk.
o Data partitioning is an approach to parallelism that involves breaking up the record set
into partitions, or subsets of records. Data partitioning generally provides good, linear
increases in application performance. DataStage Enterprise Edition supports automatic
repartitioning of records as they are moving through the application flow, using a broad
range of partitioning approaches including hash, range, entire, random, round robin, same
and DB2.
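The pipelining described in the first bullet can be pictured with a generator chain: each stage pulls records from the previous one, so rows flow from extract through transform to load without ever being written to an intermediate file. The stage bodies below are placeholders, not real DataStage operators:

```python
# Sketch of pipeline parallelism using Python generators: each stage pulls
# one record at a time from the previous stage, so nothing is landed to
# disk between stages. Stage bodies are illustrative placeholders.

def extract(source_rows):
    for row in source_rows:          # pull one record at a time
        yield row

def transform(rows):
    for row in rows:
        yield {**row, "amount": row["amount"] * 2}  # placeholder transform

def load(rows):
    return list(rows)                # stands in for the staging-area insert

loaded = load(transform(extract([{"amount": 1}, {"amount": 3}])))
```

In DataStage Enterprise Edition the same effect is achieved by the engine itself, with partition parallelism layered on top of the pipeline.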
Users create a simple sequential dataflow graph using the Enterprise Edition Designer canvas.
When constructing the sequential dataflow graph, users do not have to worry about the
underlying hardware architecture or number of processors. A separate configuration file
defines the resources (processors, memory, disk) of the underlying multiprocessor computing
system. The configuration provides a clean separation between the creation of the sequential
dataflow graph and the parallel execution of the application, which greatly simplifies the
development of scalable data integration systems that execute in parallel.
DataStage Enterprise Edition's architecture allows users to scale application performance
effortlessly by adding hardware resources without having to change the data integration
application. The same application can run on a one-processor system, an SMP system, a
cluster of SMP systems, or an MPP system with near-linear increases in performance without
changing the application. DataStage Enterprise Edition also supports grid computing. Grid
computing takes advantage of all distributed computing resources (processor and memory)
available on the network to create a single system image.
It is impossible to know in advance how much effort needs to be spent performance tuning
sequences. Expect to revise them over time as more is understood about the capacity of the
hardware and the data volumes, and allow plenty of time for this in the project plan.
Summary
Minimize the number of job sequences by writing them in generic form driven by the
KALIDO business model meta data
Investigate data quality issues and the business processes to resolve them at project start
Design sequences to leverage parallelism where possible
5 Conclusions
This paper began by establishing the need for enterprise data warehouses which are
predicated on the assumption of business change. KALIDO is a data warehousing solution that
meets this need by allowing the designer to work at the logical data level rather than the physical
data level.
KALIDO is complementary to the Ascential Enterprise Integration Suite. The iterative KALIDO
approach relies on data integration software which can keep up with the rapid pace of
development. Ascential DataStage is a tool that can do this, provided it is used correctly.
KALIDO simplifies the transformation and loading part of ETL by loading data from a simple
staging area consisting of lists of data for each object in the business model. DataStage only
has to provide data in this straightforward, uniform format, rather than put the data through
further transformations to support the underlying physical table structure. This greatly reduces
the complexity of DataStage jobs. Other KALIDO features such as built-in data validation,
surrogate key management, time variance, summarization and currency conversion reduce
this complexity still further. Extracting data easily from the source systems remains a key
task, and DataStage's connectivity to mainframes, enterprise applications, databases, and
real-time message queues is vital for quickly integrating new source systems.
DataStage is also responsible for marshaling the data from source system to KALIDO. The
challenging task of the sequence designer is to build solutions that are generic (reusable
across lots of individual jobs), handle poor quality data and other types of job failures, and use
parallelism to optimize performance. Sequence design should begin early, with plenty of
resources allocated throughout the lifecycle of the data warehouse.
For further information on any of the topics raised in this paper, please email
[email protected] or [email protected]. To find out more about KALIDO,
please visit the white paper section of our website at http://www.kalido.com/library. To learn
more about Ascential DataStage and other Ascential Enterprise Integration Suite software,
please visit www.ascential.com.
www.kalido.com

For more information please contact us

Kalido
25 Burlington Mall Road
Burlington, MA 01803
Tel: +1 781 229 6006

Kalido
8 York Road
London SE1 7NA
United Kingdom
Tel: +44 (0) 20 7934 3300

Kalido
17 Square Edouard VII
F-75009 Paris
France