
    Best-Practice ETL for KALIDO 8 using

    Ascential DataStage

    October 2004

    Gary Powell

    Senior KALIDO Application Consultant


Table of contents

1 Introduction
    The need for data integration tools in an enterprise data warehouse
2 How ETL changes in a KALIDO environment
3 Understanding the ETL target
4 Best Practice Techniques
4.1 Extracting data from the source
    Working with the business model
    Full or delta extraction
    Handling time-variant reference data using transaction date
    Handling changing codes
4.2 Transforming data
    Data summarization and allocation
    Currency conversions
    Writing reusable transforms
4.3 Loading data
4.4 Scheduling and job sequencing
    Minimize build and maintenance requirements
    Trap and handle errors caused by process failure
    Optimizing performance
5 Conclusions


1 Introduction

Global 2000 companies now face greater regulatory and shareholder pressure than ever to

    increase corporate accountability, transparency and performance. As a result, they are

    creating enterprise data warehouses which enable multiple, corporate-wide views of business

    performance across disparate systems and organizations.

    Unfortunately, it takes months (12-18 is not uncommon) to custom build or modify a data

    warehouse, and many are never even completed. As a result, business people fail to receive

    the timely management information they need to make high-quality decisions.

    To address this challenge, companies are taking a more iterative approach to enterprise data

    warehousing using KALIDO, which automates the creation and maintenance of enterprise

    data warehouses and master data throughout their lifecycle. The KALIDO application suite

    automatically adapts data warehouses and their associated master data to new business needs

    based on changes made to real-world business models. Kalido customers create and modify

    data warehouses 75% faster and at half the cost of traditional approaches.

    The need for data integration tools in an enterprise data warehouse

    A data warehouse typically sources data from a wide range of IT systems and organizations.

    To help deliver this data to the data warehouse, companies deploy tools such as Ascential

    Enterprise Integration Suite, which includes:

• Ascential ProfileStage - data profiling to evaluate source data content and structure

• Ascential QualityStage - data cleansing to find & reconcile low-quality or redundant data

• Ascential DataStage - data extraction, transformation and loading (ETL)

• Ascential MetaStage - metadata management for definitions and history of business data

    Figure 1 - Kalido and Ascential Products

[Diagram: data from any source - CRM, ERP, SCM, RDBMS, legacy systems, EAI/messaging, web services, XML/EDI and legacy data warehouses - flows through the Ascential Enterprise Integration Suite (ProfileStage, QualityStage and DataStage on a parallel execution engine, with shared metadata management and real-time integration services) into the KALIDO data warehouse, whose adaptive services core auto-builds the schema and content from the business model and master data and in turn builds universes, cubes, marts and reports for business intelligence.]


    This paper outlines a best-practice approach for using DataStage with KALIDO. As you will

    see, the use of DataStage differs from a custom-build environment, with less emphasis on

    data transformation and loading, but more emphasis on easy and efficient extraction from

    multiple data sources, and the management of data as it is moved into the data warehouse.

In the rest of this paper a basic familiarity with standard ETL and data warehousing concepts is assumed; however, specific knowledge of KALIDO or DataStage is not required.


2 How ETL changes in a KALIDO environment

A good way of thinking about ETL in a KALIDO data warehouse environment is that it allows you to work with data at a logical level rather than at the physical level.

Typically, a data warehouse is built around a logical data model, which characterizes data as a number of related entities. This is then implemented as a physical data model, consisting of a number of database tables. There are many ways to map the logical model to a physical data model, depending on requirements such as speed of retrieval and ease of maintenance.

Here is a logical data model drawn in KALIDO notation, a business-friendly way of expressing logical data models for enterprise data warehouses. KALIDO refers to these as business models. Fact data is located in the ovals, and reference data hierarchies are located in the boxes. Arrows between reference data signify "is classified by", with dotted arrows meaning the classification is optional. The green box means "classification can be to either level".

[Diagram: reference data hierarchies for Product (Packed Product, Brand, Pack Type, Product Group), Time (Day, Month, Quarter, Year), Customer (Customer Account, Delivery Point, Region) and Industrial Classification (Industry, Industry Group), with fact ovals for Product Sales (Volume, Revenue, Distribution cost) and Target Sales (Target volume).]

Figure 2 - Business Model (Logical Data Model)

Customarily, physical schemas (usually star or snowflake) are built from this model:

[Diagrams: the star schema holds Product, Customer and Time dimension tables joined to the Product Sales and Target Sales fact tables; the snowflake schema breaks each dimension into separate attribute and association tables (for example Packed Product, Brand, Pack Type and Product Group attributes) joined to the same fact tables.]

Figure 3 - Star Physical Model    Figure 4 - Snowflake Physical Model


    The role of DataStage in a custom-built data warehousing environment is to get the data from

    the source systems into the correct tables in the data warehouse as defined by the physical

    model. The architecture is as follows:

Figure 5 - Role of DataStage in a traditional data warehousing environment

    In this architecture, DataStage is tightly coupled with the physical data warehouse design.

    This has three important implications:

    1. Physical data models often change during implementation even though the business model

    does not. For instance, as data warehouse volume grows it may be necessary to move

    from a snowflake schema - which is easy to populate but inefficient to query - to a star

    schema - which is easy to query but harder to populate. This usually involves considerable

    rework of the associated DataStage jobs.

    2. As the business model changes over time, the physical data model generally becomes

    more complicated due to the need to maintain historical versions of the model. DataStage

    jobs become correspondingly more complicated as a result.

    3. Physical data models require surrogate keys for most data entities, which are used instead

    of the natural unique identifier for an object (such as Payroll # for an employee) because

    they enable the object ID to be preserved over time even if the natural key changes.

    DataStage has to manage the mapping between surrogate keys in the data warehouse

    and natural keys in the source, which adds development overhead to each DataStage job.

Now let's see what happens when DataStage is used in conjunction with a KALIDO data warehouse. The great strength of KALIDO is that it allows its users to work at the business model level - it transparently (without any user intervention) maps the business model onto a physical set of tables and manages the interface between them. Only the KALIDO business model is exposed to DataStage, and the role of ETL is simplified because:

• KALIDO maps between the logical and physical layer, so physical design changes in the data warehouse do not affect DataStage jobs.

• KALIDO handles changes to the business model over time, allowing historical and current models to exist concurrently. DataStage jobs are only required to support the current business model.

• KALIDO maps natural keys to surrogates. DataStage jobs only need to supply natural keys.



    In practice, this means the role of DataStage is to provide a list of the instances of each object

    defined in the business model. By "instance" we mean a specific Customer, Product, or Sales

Transaction. KALIDO loads this list into the data warehouse, performing the relevant logical-to-physical mappings internally as it processes the data. This is illustrated in Figure 6:

    Figure 6 - DataStage in a KALIDO architecture

    Figure 6 shows how DataStage, instead of sending data directly to the data warehouse as it

    would be done in a traditional, custom-built data warehousing architecture, now sends it to a

    staging area. The staging area is just a place for storing the list of instances for each object

    prior to loading into the KALIDO data warehouse, and is usually a collection of tables in their

    own schema in the data warehouse database.

In the staging area, the objects are stored without any associations, which illustrates another key difference between KALIDO and a traditional warehouse, namely that application logic is managed by KALIDO and not by DataStage. Validation rules such as 'every Delivery Point must refer to a Customer' are encoded into the KALIDO business model, and KALIDO verifies

    must refer to a Customer' are encoded into the KALIDO business model, and KALIDO verifies

    these rules as it receives the data. Therefore, although the Delivery Point list in the staging

    area has a column referencing the Customer list, the DataStage job does not have to check

    the validity of the references. This is another major simplification over the custom

    environment where much of the application logic is encoded in the ETL layer.

    So, KALIDO dramatically reduces the complexity of ETL jobs by handling the logical to physical

    data model mappings centrally rather than relying on each ETL job to deal with these

    mappings locally. Does this mean it also reduces the need for DataStage? No! In fact, the

    need for DataStage has increased. This is because KALIDO delivers a unique benefit that is

    almost unheard of in traditional data warehousing - very rapid response to change. As new

    systems are integrated, as the scope of the data warehouse increases, as the business model

    evolves, KALIDO quickly adapts to the changes. This puts great pressure on the ETL tool to

    provide up-to-date data. If data cannot be extracted from a new source, then whether the

    data warehouse is ready to receive it is irrelevant.



    DataStage is designed for rapid development, and its functionality complements KALIDO:

• Ascential PACKs for SAP, PeopleSoft, Siebel, Oracle and other enterprise applications. The predefined interfaces, transformations and metadata models offered in the PACKs make it easier to find and use the relevant data.

• DataStage graphical design environment. Greater productivity over manual programming, greater reuse and manageability. Except for the most complex transformations, it is much easier to work with a visual data flow representation than lines of code.

• DataStage performance. As the KALIDO data warehouse scope increases, more data is processed in shorter timeframes. DataStage has a parallel engine that scales as data volume increases.

• DataStage scheduling and sequencing. Moving data in and out of KALIDO data warehouses requires the management of many interdependent processes. DataStage has a graphical tool for building sequences with conditional dependencies, error handling and parallel execution.

DataStage and KALIDO are highly complementary, allowing for simpler, iterative RAD-style development. However, to get the maximum benefit they must be integrated properly. To do this, the ETL target structure must be understood, which is described next.


3 Understanding the ETL target

The target of your ETL jobs in the context of a KALIDO data warehouse is a staging area, and each job delivers a list of instances for one object in the business model. KALIDO loads the lists into the data warehouse.

In KALIDO, a staging-area-to-warehouse mapping (called a load definition) is defined for each object. There are two load definition types: one for reference data (such as Customers and Products) and the other for fact data (Product Sales). There are slight differences between the two. Looking at a reference data load definition, KALIDO uses a similar style of interface to DataStage for this type of definition, so it should appear straightforward:

    Figure 7 - KALIDO reference data load definition

    The right-hand panel displays the object from the business model. The left-hand and central

    panes show the mappings between the column headings of the input table in the staging area

    and the business model object components. Every object has at minimum two components:

• A code - the natural code from the source system, which will be unique among the instances of that object

• A name - which describes the instance, and doesn't need to be unique

    Objects may also have parents or additional attributes. In this case, the model defines Packed

    Product as having three mandatory parents: Brand, Pack Type and Product Group, so these

    must be included as part of the mapping.

    Every definition also has a Transaction Date. Over time, reference data changes - new

    Customers are added, Products are assigned to different Brands, etc. The transaction date is

    simply the date when this change occurred.

    The column names on the left are the field names in the staging table. They can be anything

    as long as they match the name given in the mapping. Alternative field names might be:

    Figure 8 - KALIDO reference data load definition, alternative field names


    This second style of field names can be very useful when reusing generic staging area tables

    and DataStage jobs for more than one object, which will be described later.

    Load definitions for fact data are very similar. Here is a definition for Product Sales:

    Figure 9 - KALIDO fact data load definition

    This mapping shows the staging area table has three columns containing the natural codes for

    the reference data associated with the sale: the Delivery Point code, Packed Product code and

Sale Day. It also has the numeric data associated with the transaction: Distribution Costs, Revenue and Volume. Lastly, it has a transaction date. The Transaction Date serves a slightly

    different purpose from the day of Sale - it relates the transaction to the time-variant reference

    data. Suppose the name of a particular product has recently changed; because this is a fact

    data load definition we only supply the code for the product and not its name. The transaction

    date then allows us to determine if the sale occurred before or after the name change.

This is essentially all that's necessary to know about the ETL targets - straightforward tables with one row per instance of an object, one column per object component, and a date column that lets KALIDO track how the reference data is changing over time.
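To make the shape of this target concrete, here is a minimal Python sketch (not DataStage or KALIDO code) of building such a reference-data staging table for Packed Product; the table name, column names and sample values are illustrative assumptions rather than KALIDO system labels.

```python
# Minimal sketch of a reference-data staging table: one row per instance,
# one column per business-model component, plus a transaction date.
# All names and values below are illustrative assumptions.
import csv
from datetime import date

STAGING_COLUMNS = [
    "PACKED_PRODUCT_CODE",   # natural code, unique among instances
    "PACKED_PRODUCT_NAME",   # descriptive name, need not be unique
    "BRAND_CODE",            # mandatory parent
    "PACK_TYPE_CODE",        # mandatory parent
    "PRODUCT_GROUP_CODE",    # mandatory parent
    "TRANSACTION_DATE",      # date the reference data change took effect
]

rows = [
    {
        "PACKED_PRODUCT_CODE": "PP-1001",
        "PACKED_PRODUCT_NAME": "Cola 330ml Can",
        "BRAND_CODE": "COLA",
        "PACK_TYPE_CODE": "CAN330",
        "PRODUCT_GROUP_CODE": "SOFT",
        "TRANSACTION_DATE": date(2004, 10, 1).isoformat(),
    },
]

with open("stg_packed_product.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=STAGING_COLUMNS)
    writer.writeheader()
    writer.writerows(rows)
```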

That's all the background needed to discuss best-practice techniques for using DataStage with KALIDO. This is what we turn to for the rest of the paper.


4 Best Practice Techniques

This section deals with best practices for use of KALIDO in conjunction with DataStage and is broken down into the four functional areas of ETL:

1. Extracting data from any source

2. Transforming data

3. Loading into the staging area

4. Process control and scheduling

4.1 Extracting data from the source

Working with the business model

    Data extraction is driven by the KALIDO business model, which defines the objects, mandatory

    and optional object components, and relationships to other objects. For each object there is a

    corresponding ETL output, and the DataStage Designer has to locate in the source systems

    data corresponding to the object components. In some cases this is easier than others - an

    object in the model may come from a single table in a single source system, or it may be

    composed of records and fields spread across several tables and systems.

    Kalido recommends that you do not start extensive ETL development until the KALIDO

    business model is stable. Allow business modelers to sign off on a release that will be used for

    the initial ETL jobs. During this exploratory stage, DataStage and other tools in the Ascential

    Enterprise Integration Suite will be used to verify the correctness of the model, but it's too

    early to begin production development. Even after an initial version of the model has been

    signed off, it will continually change throughout the lifecycle of the data warehouse. Because

    of this, you should establish a clearly understood and enforced change-control process to track

    all modifications.

    By default, you should only create one extract per object. When you prototype a KALIDO data

    warehouse, it is possible to create KALIDO load definitions that combine objects such as

    multiple levels of a hierarchy or a mix of reference data and fact data. In production it is

    strongly recommended that the designer follows the rule of creating one extract per object.

    Over the lifecycle of the data warehouse the uniformity and consistency of one extract per

    object pays dividends because the ETL components are easier to understand, reuse and

    modify.

    Full or delta extraction

    When extracting data, a key decision is whether to extract all object instances, or just new or

    modified instances. For example, there may be one million customers, and each week about

    five thousand customers are added and a few hundred change their address or other details.

    DataStage can pass all one million Customer records to KALIDO and ask it to calculate what's

    different, or the changes can be calculated during the ETL and only these passed to KALIDO.

    Which is best? This depends on which is fastest and also which is easiest to implement.

For some objects, detecting changes is easy - a 'last modified' field or a change log at the

    source can be used to filter records. In this case, change detection should be done at the

    source. If there is no obvious way of filtering the data, the designer has to weigh the relative


    merits of building a change detection algorithm in the ETL versus passing the full dataset to

    KALIDO. Generally, delta detection is done in DataStage if there are a great number of

    records. In our example of a million customer records, with 0.5% changing per week, it is

    likely that the performance benefits of building a change detection algorithm into the ETL job

    outweigh development costs. This is because if delta detection is done by KALIDO, then before

    it starts processing, one million records have to be extracted from the source, passed through

    DataStage, and loaded into staging tables. Delta detection at the source avoids this overhead.

    There are many algorithms for change detection. A simple approach is to concatenate the

    component values of an object into a string and compare it to the same string when the object

    was last extracted. Another is to use the SQL 'minus' operator which subtracts one set of

    records from another leaving just the differences. The best algorithm will depend on the exact

    circumstances - if in doubt Kalido consultants can provide experienced advice.
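As an illustration of the 'concatenate and compare' approach, here is a minimal Python sketch (not DataStage syntax); the function names, record structure and choice of hash are assumptions made for the example.

```python
# Minimal sketch of concatenate-and-compare delta detection: keep a fingerprint
# of each record from the previous extract and emit only new or changed records.
import hashlib

def fingerprint(record: dict, components: list[str]) -> str:
    # Concatenate the component values in a fixed order and hash the result.
    joined = "|".join(str(record.get(c, "")) for c in components)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

def detect_deltas(current: list[dict], previous_fps: dict[str, str],
                  key: str, components: list[str]) -> list[dict]:
    deltas = []
    for rec in current:
        fp = fingerprint(rec, components)
        if previous_fps.get(rec[key]) != fp:   # new instance or changed values
            deltas.append(rec)
        previous_fps[rec[key]] = fp            # remember for the next run
    return deltas
```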

    In contrast to reference data, fact data does not often change after it has been created. A

    Product name may slowly change over time, but fact data is typically a point-in-time event -

    either a Product was sold on a particular day or it wasn't. If fact data does change - in our

    model we have a fact table called 'Target Sales' which may undergo revisions - KALIDO

handles changes similarly to reference data. Given the high data volumes involved, this type of delta detection would normally be done during the ETL.

    Handling time-variant reference data using transaction date

    The correct handling of time-variant reference data is essential for the delivery of meaningful

    business intelligence. If we reclassify a product as a different brand, all new product sales will

    be recorded against the new brand. A report by brand will show a sudden drop in revenue for

    one brand and a rise in another. But, to understand the yearly growth of brand sales,

    management will need at least two other versions of this report:

• A report as if the product had remained in the old brand

• A report as if the product had always been part of the new brand.

    Figure 10 - The importance of time variance

[Chart: quarterly brand sales from Q3-01 to Q2-04, annotated to show the point at which the new product was added to the brand and the brand sales line excluding the new product.]


    Figure 10 illustrates the importance of time variance to the business. In a traditional, custom-

    built data warehouse, including fully flexible time variance makes the physical schema much

    more complicated, with a corresponding rise in the complexity of DataStage jobs. With

    KALIDO, time variance is handled automatically for all reference data. All the ETL has to do is

    to provide KALIDO with a timestamp indicating the date of any reference data changes.

    So what value do we choose for the transaction date? There are two types of changes we need

    to consider - new reference data coming into existence, and modifications to existing data. In

    practice, the dates we use for these changes are closely tied in with the fact data - a Product

    must come into existence before it can be sold, and if a product has been re-branded, the sale

    must have a transaction date that corresponds to the correct brand of the product at the time.

    Ideally there will be some date field in the source that can be used for this purpose. If no

suitable date exists, our recommendation is as follows (a minimal sketch of the rule follows this list):

• For creation dates use a constant historical date, such as 1/1/2000 - it generally does not matter if a Product is deemed to have been created some time before it is sold, whereas the opposite will cause a data quality error.

• For modifications use the extract date. If we extract Product data nightly it is normally sufficient to record that the change happened sometime during that day. For greater accuracy increase the extract frequency.
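Here is the minimal sketch mentioned above; the constant date and parameter names are assumptions made for illustration.

```python
from datetime import date

CONSTANT_CREATION_DATE = date(2000, 1, 1)  # assumed constant historical date

def transaction_date(source_date, is_new_instance, extract_date):
    """Pick the transaction date for a reference data record."""
    if source_date is not None:      # prefer a genuine date from the source
        return source_date
    if is_new_instance:              # creations: constant historical date
        return CONSTANT_CREATION_DATE
    return extract_date              # modifications: the extract date
```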

    Handling changing codes

    A final consideration during extraction is to make sure that a data item can still be correctly

    identified by the data warehouse if its natural code changes. KALIDO must be told about the

    code change, and this can be done via a load definition in the normal way. If you look back to

    the load definition in Figure 7, you'll see the component 'New Packed Product Code' - that's

    what this is used for. You need to make sure that any extracts loaded into KALIDO before the

    code change use the old code, and all subsequent extracts use the new code.

Summary

• Design DataStage jobs around the business model with one extract per object - use a formal change control process to track changes to the model

• For objects with a million or more records, do delta detection during the ETL stage

• If no timestamp exists in the source for changing reference data, use a constant date for the creation date of new objects, and use the extract date for modifications

• Check for changing natural codes and make sure changes are loaded into KALIDO before using the new codes

4.2 Transforming data

Transformation typically refers to the transformation of data from the physical schema of the source to the physical schema of the target. This has already been discussed, so this section concentrates on additional topics relating to transformation.

    Data summarization and allocation

    Customarily, data warehouses feature a host of transformations that summarize data up the

    levels of a hierarchy, and allocate fact data to lower levels. This changes with KALIDO.


    For example, daily sales totals from an ERP system are summarized into weekly totals. This

    can be done before or after loading into the data warehouse. In a custom environment, the

    ETL effort is typically the same - DataStage does the summarization in either case. With

    KALIDO, summarizations are performed inside the data warehouse. KALIDO has extensive

    tools to efficiently summarize data to the level required for business reporting after loading

    into the data warehouse. Therefore ETL summarization is only required when summarizing

data prior to loading into the data warehouse.

    When we summarize prior to loading we lose information. The normal advice is to load data

    into the data warehouse at the lowest level of granularity available and summarize it in

    KALIDO. KALIDO copes well with large fact data volumes - there are multi-terabyte KALIDO

    implementations and also implementations that load millions of fact records per day. Data only

    needs to be summarized before loading if volumes are exceptionally high. A good compromise

    is to load the most recent data at the lowest level and use KALIDO to summarize historic data

    to a higher level and purge the low level data. So all data for the current year may be held at

    the day level, but previous years will be stored at the weekly level. Careful architecting like

    this within KALIDO usually enables data to be loaded at the lowest level available regardless of

    the data volumes.

    The opposite of a summarization is an allocation. In an allocation we process fact data to a

    lower level than it exists in the source system. An example is an algorithm that takes

    marketing costs per brand and allocates them across all the products in that brand as part of

calculating unit cost per product. Or a manager's salary may be allocated across his or her

    team members to estimate the true cost of hiring new employees.

    KALIDO does not have built-in allocation functionality. Therefore, allocations must be done

    during the ETL or reporting stages. Doing allocations in reports is possible but often requires

    an expert report developer, whereas doing allocations during ETL is generally straightforward.

    KALIDO is often used as the source for allocations because, as the central repository of data

from across the business, KALIDO already holds the necessary fact data in the most accessible form. DataStage queries KALIDO for the source data, performs the allocations, and loads the

    results back into KALIDO as additional facts.
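As a simple illustration of allocation logic that could sit in the ETL layer, here is a hedged Python sketch; the even-split rule and data structures are assumptions made for the example, not a KALIDO or DataStage feature.

```python
# Minimal sketch: spread a marketing cost held at Brand level evenly across the
# Products in that brand. A real allocation rule might weight by sales volume.

def allocate_brand_cost(brand_costs: dict[str, float],
                        products_by_brand: dict[str, list[str]]) -> dict[str, float]:
    per_product = {}
    for brand, cost in brand_costs.items():
        products = products_by_brand.get(brand, [])
        if not products:
            continue
        share = cost / len(products)           # even allocation across products
        for product in products:
            per_product[product] = per_product.get(product, 0.0) + share
    return per_product

# Example: a 9,000 marketing spend for brand "COLA" allocated across 3 products.
costs = allocate_brand_cost({"COLA": 9000.0},
                            {"COLA": ["PP-1001", "PP-1002", "PP-1003"]})
# costs == {"PP-1001": 3000.0, "PP-1002": 3000.0, "PP-1003": 3000.0}
```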

    There is no need to do allocations or summarizations just to bring all data to the same level

    before loading into the data warehouse. The different sources are modeled as separate objects

    in the business model, and KALIDO selects and summarizes the data as needed to satisfy

    specific reports. Look at the business model in Figure 2 - we see that Target Sales are loaded

    at a quarterly level but Product Sales at a daily level. We do not need to do anything at the

    ETL stage to allow us to compare the two - KALIDO automatically summarizes actual sales to

    the quarterly level before passing to the reporting tool.

    Currency conversions

    A common transformation is to convert fact data from one currency to another. KALIDO has

    extensive built-in currency conversion functions, so the rule here is to take data in the original

    currency of the source system, supply KALIDO with tables of exchange rate data, and let

    KALIDO perform the currency conversion during report generation. KALIDO can also convert

    units of measure - for instance ounces to grams.


    Writing reusable transforms

    In a non-KALIDO environment, the need to work at the physical schema level leads to

    DataStage jobs that are difficult to reuse. With KALIDO, the physical schema of the staging

    area is very simple and is the same for all jobs. Hence the potential for reuse is great. Many

    transformations will apply across multiple ETL tasks, such as transforming dates into a

common format. Transforms should be written to be reused across jobs. DataStage has a variety of methods for doing this, the simplest being to write the transformation as a function.
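For example, a date-normalization transform written once and reused by every job might look like the following Python sketch (in DataStage this would typically be a shared routine; the accepted formats and names here are assumptions).

```python
# Minimal sketch of a reusable transform: normalize dates arriving in several
# source formats into one common format used by every staging table.
from datetime import datetime

SOURCE_FORMATS = ["%d/%m/%Y", "%Y%m%d", "%d-%b-%Y"]  # assumed source formats

def to_standard_date(value: str) -> str:
    for fmt in SOURCE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {value!r}")

# to_standard_date("05/10/2004") == "2004-10-05"
# to_standard_date("20041005")   == "2004-10-05"
```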

    Simple transformations, such as splitting or merging rows or columns, can be done in KALIDO.

    However, these are only used when prototyping, when an ETL tool may not be available. In

    general it is better to centralize all transformations in the same place for ease of maintenance.

Summary

• Do not summarize fact data before loading into the data warehouse

• Do use DataStage for data allocations prior to loading into the data warehouse

• Load source data in the source currency. KALIDO can convert currencies at report time

• Code transformations to be reusable - there is much more potential for reusing them in a KALIDO warehouse than in a custom solution

4.3 Loading data

From the DataStage perspective, 'loading' means loading data into the staging area where

    KALIDO picks it up and loads it into the data warehouse. After staging area data has been

    processed it can be deleted.

    KALIDO can load data from flat files or database tables. Throughout this paper we have

    assumed the data is loaded from tables. This is because database tables are easier to

    manipulate than flat files, which is especially useful if additional transformations are required

once data has been put into the staging area tables. Best practice is to locate the staging area tables as a separate schema within the data warehouse database. This simplifies

    housekeeping tasks such as backups.

    The staging area table structure can take two forms. First, there can be one table per business

    model object. Each table will have column names corresponding to the system labels of the

    object components in KALIDO (as shown in Figure 7). The advantage of this approach is

    readability as it is obvious which column refers to which component. The disadvantage is that

    a typical business model will have many logical objects so the staging area will have a large

    number of tables. This can be a maintenance burden, although the burden can be minimized

    with scripts that automate the initial table creation and other common processes.

    Alternatively, groups of objects can be stored in the same table, with generic column headings

    such as 'Entity 1,' 'Entity 2,' etc. (as shown in Figure 8). The generic table has an additional

    column which contains the object name, and KALIDO filters the table so that only records for

    that object are loaded. The advantage of this approach is that the staging area has as few as

    two tables - one for reference data and one for transaction data. This simplifies staging area

    management, and also simplifies the creation of processes that work across objects, such as

    data purging. To avoid confusion over which columns refer to which components, views should

    be created that map the generic column names to the component names. The DataStage job


    inserts data through the views and KALIDO reads data from the views. This combines column

    readability with the flexibility of generic tables. As the view generation can be easily

    automated, this is the recommended staging area design.
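Because the views follow a simple pattern, their creation can be scripted. The following Python sketch generates the DDL for one such view; the table, view and column names are assumptions made for illustration, not KALIDO system labels.

```python
# Minimal sketch: generate a view that maps the generic staging columns onto
# readable component names for one business-model object.

def build_view_ddl(view_name: str, staging_table: str, object_name: str,
                   column_map: dict[str, str]) -> str:
    select_list = ",\n  ".join(
        f"{generic} AS {component}" for generic, component in column_map.items()
    )
    return (
        f"CREATE OR REPLACE VIEW {view_name} AS\n"
        f"SELECT\n  {select_list}\n"
        f"FROM {staging_table}\n"
        f"WHERE OBJECT_NAME = '{object_name}'"
    )

ddl = build_view_ddl(
    view_name="STG_PACKED_PRODUCT_V",
    staging_table="STG_REFERENCE_DATA",
    object_name="PACKED PRODUCT",
    column_map={"ENTITY_1": "PACKED_PRODUCT_CODE",
                "ENTITY_2": "PACKED_PRODUCT_NAME",
                "ENTITY_3": "BRAND_CODE",
                "TXN_DATE": "TRANSACTION_DATE"},
)
print(ddl)  # run the generated statement against the staging schema
```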

Summary

• Design the staging area as a small number of generic tables stored in a separate schema of the data warehouse database

• Use views to map component names onto the generic column headings for readability

4.4 Scheduling and job sequencing

This is the final focus of the paper, and it is an area of great importance.

    A data warehouse is typically updated on a nightly basis, and a host of processes need to be

    carefully coordinated to marshal data in and out of the data warehouse, including:

• Extracting reference and fact data from sources and loading it into the staging area

• KALIDO loading data into the warehouse from the staging area

• KALIDO processing the data into star schemas and standalone data marts for reporting

• Report generation by the reporting tool

• Housekeeping processes, such as database backups and purging the staging area

    The natural place to build and manage these processes is in DataStage. DataStage has a

    powerful set of tools that organizes jobs into process flows called 'sequences.' Sequences can

    control DataStage jobs as well as KALIDO and other third party processes.

    Job sequence implementation and optimization should start early and be generously

    resourced. Sequence designer goals include:

• Minimize build and maintenance requirements

• Trap and handle errors caused by process failure and poor data quality

• Optimize performance

    These are key topics which each require a full discussion.

    Minimize build and maintenance requirements

    This is best achieved by building "generic" sequences so that we only need to build and

    maintain a small handful of distinct processes.

    What is meant by generic? As an example, consider a sequence for loading reference data

    into KALIDO from the staging area. Reference data is organized into hierarchies, and within

each hierarchy, parent objects must be loaded before their children. Typically you would create a graphical sequence for each hierarchy, manually dragging the job for each object into

    the sequencer and linking them to run in the correct order. As the data model changes

    throughout the data warehouse lifecycle, the sequences will require regular maintenance.

    A better solution is to assume the business model will keep changing, and build a sequence

    that is driven by the business model, working out dynamically the order in which data needs


    to be loaded. KALIDO stores a complete metadata description of the model as a set of

    database views. These can be queried as the sequence executes. A simple algorithm might be:

• Use the metadata views to build a list of the objects in each dimension, sorted so that parent objects are listed before child objects

• Loop through this list loading data for each object in turn

    This algorithm can be built graphically using the DataStage sequence designer (for examples

    of DataStage sequences see Figures 10 and 11). The visual representation, like a flowchart, is

    straightforward to understand and modify.
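To show the flavor of such a model-driven load loop, here is a minimal Python sketch; in practice this logic would live in a DataStage sequence querying the KALIDO metadata views, and the data structures and load call used here are assumptions made for illustration.

```python
# Minimal sketch of a parent-first load loop driven by model metadata.

def parent_first_order(parents: dict[str, list[str]]) -> list[str]:
    """Order objects so that every parent appears before its children."""
    ordered, seen = [], set()

    def visit(obj: str):
        if obj in seen:
            return
        seen.add(obj)
        for parent in parents.get(obj, []):
            visit(parent)
        ordered.append(obj)

    for obj in parents:
        visit(obj)
    return ordered

def run_reference_load(parents: dict[str, list[str]], load_object) -> None:
    for obj in parent_first_order(parents):
        load_object(obj)   # e.g. trigger the staging-to-warehouse load for obj

# Example dimension: parents must load before Packed Product.
hierarchy = {"Packed Product": ["Brand", "Pack Type", "Product Group"],
             "Brand": [], "Pack Type": [], "Product Group": []}
run_reference_load(hierarchy, load_object=print)
```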

    If sequences are built ad-hoc, it's easy to end up with a proliferation of sequences each of

which does similar things. By using KALIDO business model metadata, we can instead build a

    small number of general-purpose, adaptable solutions. With a little care, and a good

    knowledge of DataStage sequences and KALIDO metadata views, it's possible to dramatically

    reduce both the number of sequences and the ongoing maintenance cost.

Trap and handle errors caused by process failure

Any job can fail for a number of reasons: hardware crashes, lack of disk space, missing

    mandatory fields, wrong date formats, duplicate records with inconsistent values, etc.

    Therefore, all possible failure cases need to be identified, and error handling designed for each

    one. Coding the error handling mechanism is usually straightforward because of the abundant

    error handling functionality in DataStage. The hard part is deciding what to do.

    This is especially true when handling data quality errors. Suppose the Customer associated

    with a new Delivery Point is not entered into the source system. KALIDO will reject the record

    when it tries to load it into the data warehouse. Do we fix the data in the source and re-

    extract it - who is available to correct the data and how do we notify them of the problem? Do

    we leave the source as it is and fix the data in the staging area - how do we ensure this does

    not create problems later on because the source and the data warehouse have different

values? Or do we change the design of the source so that this problem cannot arise?

    Establishing business processes to handle these problems can take a long time. Data

    warehouses suffer acutely from 'business paralysis' because they cut across multiple levels of

    the organization and bring together people and departments who have not previously worked

    together. The business needs time to establish ownership for problems, develop procedures

    for resolving them, and train staff. These matters need to be investigated right at the start of

    the project, and the business needs to be involved from day one.

    KALIDO MDM is an application in the KALIDO suite, which addresses this by integrating

business people into the process. It allows them to manage master reference data collaboratively in the context of quality control workflows, and to formally approve reference

    data before it is released to the data warehouse. Ascential QualityStage complements

    KALIDO MDM by standardizing, matching and de-duplicating data according to business rules.

    Detailed exploration of KALIDO MDM and QualityStage is outside the scope of this paper.


    Optimizing performance

    Like error handling, data warehouse performance is something that must be designed in from

    the start. The faster data is delivered to the end user the more timely and useful it is. Data

    warehouses are usually refreshed on an overnight basis, and if users are in multiple time

    zones, 'overnight' may last just a few hours.

The goal of the process flow designer is to make maximum use of the available hardware so that the load is

    spread evenly across time and hardware, rather than hardware experiencing short bursts of

    activity and long periods of idleness. Jobs should also be scalable so they can take full

    advantage of new hardware. A key to better performance is parallel processing. There are two

    types of parallel processing, both of which should be used wherever possible:

    Break individual jobs into parallel streams. Consider a DataStage job which extracts data from

    a database server, transforms the data on the DataStage server and saves it to the staging

    area database server. Often none of the hardware components is heavily stressed during this

    process, and in such cases DataStage can divide source data into independent partitions and

    run them in parallel. The number of partitions can be increased until all hardware is operating

efficiently, independent of the DataStage job design. As a result, the job is processed several times faster. If hardware is upgraded, the number of partitions can be increased. It is easy to

    include such parallelism in DataStage jobs, especially if it is built in from the beginning.

Figure 11 - Parallelism within a single DataStage job

    During job execution, data is automatically divided into the number of partitions the user

    specified and automatically re-partitioned between stages. This is described below.


    Parallel Execution

    In creating a DataStage dataflow diagram, the user concentrates on the sequential flow of

    large collections of records through a sequence of processing steps. Users do not need to

    worry about the underlying architecture of the multiprocessor computer that will be used for

    running the application. DataStage Enterprise Edition provides a clean separation between the

    sequential expression of the workflow of the data integration application and the parallel

    execution of the application in the production computing environment.

    DataStage Enterprise Edition exploits both pipeline parallelism and partition parallelism to

    achieve high throughput and performance:

    o Data pipelining means that when the application begins to run, records get pulled from the

    source system and move through the sequence of processing functions defined in the

    dataflow graph. The records are flowing through the pipeline using [virtual] data sets

    which makes it possible to move the records through the sequence of processing functions

    without having to land the records to disk.

    o Data partitioning is an approach to parallelism that involves breaking up the record set

    into partitions, or subsets of records. Data partitioning generally provides good, linear

    increases in application performance. DataStage Enterprise Edition supports automatic

    repartitioning of records as they are moving through the application flow, using a broad

    range of partitioning approaches including hash, range, entire, random, round robin, same

    and DB2.
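To make the idea of partition parallelism concrete, here is a minimal conceptual Python sketch (this is what the parallel engine manages automatically; it is not DataStage code); the partition key and transformation are assumptions made for the example.

```python
# Minimal sketch: hash-partition a record set and process partitions concurrently.
from concurrent.futures import ProcessPoolExecutor
from itertools import chain

def partition(records, key, n_partitions):
    parts = [[] for _ in range(n_partitions)]
    for rec in records:
        parts[hash(rec[key]) % n_partitions].append(rec)
    return parts

def transform(partition_records):
    # Placeholder for the per-partition transformation work.
    return [{**rec, "processed": True} for rec in partition_records]

def run_parallel(records, key="CUSTOMER_CODE", n_partitions=4):
    with ProcessPoolExecutor(max_workers=n_partitions) as pool:
        results = pool.map(transform, partition(records, key, n_partitions))
    return list(chain.from_iterable(results))
```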

    Users create a simple sequential dataflow graph using the Enterprise Edition Designer canvas.

    When constructing the sequential dataflow graph, users do not have to worry about the

    underlying hardware architecture or number of processors. A separate configuration file

    defines the resources (processors, memory, disk) of the underlying multiprocessor computing

    system. The configuration provides a clean separation between the creation of the sequential

dataflow graph and the parallel execution of the application, which greatly simplifies the development of scalable data integration systems that execute in parallel.

DataStage Enterprise Edition's architecture allows users to scale application performance

    effortlessly by adding hardware resources without having to change the data integration

    application. The same application can run on a one-processor system, an SMP system, a

    cluster of SMP systems, or an MPP system with near-linear increases in performance without

    changing the application. DataStage Enterprise Edition also supports grid computing. Grid


computing takes advantage of all distributed computing resources - processor and memory - available on the network to create a single system image.

It is impossible to know in advance how much effort needs to be spent on performance-tuning

    sequences. Expect to revise them over time as more is understood about the capacity of the

    hardware and the data volumes, and allow plenty of time for this in the project plan.

Summary

• Minimize the number of job sequences by writing them in generic form driven by the KALIDO business model metadata

• Investigate data quality issues and the business processes to resolve them at project start

• Design sequences to leverage parallelism where possible

5 Conclusions

This paper began by establishing the need for enterprise data warehouses which are predicated on the assumption of business change. KALIDO is a data warehousing solution that addresses this need by allowing the designer to work at the logical data level rather than the physical data level.

    KALIDO is complementary to the Ascential Enterprise Integration Suite. The iterative KALIDO

    approach relies on data integration software which can keep up with the rapid pace of

    development. Ascential DataStage is a tool that can do this, provided it is used correctly.

    KALIDO simplifies the transformation and loading part of ETL by loading data from a simple

    staging area consisting of lists of data for each object in the business model. DataStage only

    has to provide data in this straightforward, uniform format, rather than put the data through

    further transformations to support the underlying physical table structure. This greatly reduces

the complexity of DataStage jobs. Other KALIDO features such as built-in data validation, surrogate key management, time variance, summarization and currency conversion reduce this complexity still further. Extracting data easily from the source systems remains a key task, and DataStage connectivity to mainframes, enterprise applications, databases, and real-time message queues is vital for quickly integrating new source systems.

    DataStage is also responsible for marshaling the data from source system to KALIDO. The

    challenging task of the sequence designer is to build solutions that are generic (reusable

    across lots of individual jobs), handle poor quality data and other types of job failures, and use

    parallelism to optimize performance. Sequence design should begin early, with plenty of

    resources allocated throughout the lifecycle of the data warehouse.

For further information on any of the topics raised in this paper, please contact [email protected] or [email protected]. To find out more about KALIDO,

    please visit the white paper section of our website at http://www.kalido.com/library. To learn

    more about Ascential DataStage and other Ascential Enterprise Integration Suite software,

    please visit www.ascential.com.


For more information please contact us

I: www.kalido.com
E: [email protected]

Kalido
25 Burlington Mall Road
Burlington, MA 01803
Tel: +1 781 229 6006

Kalido
8 York Road
London SE1 7NA, United Kingdom
Tel: +44 (0) 20 7934 3300

Kalido
17 Square Edouard VII
F-75009 Paris, France
