
ON TIME DATA WAREHOUSING JOURNAL

ON TIME CONSULTING SERVICES INC. NEWSLETTER
DATE 03-01-05 | VOLUME 2, ISSUE 1 | www.ontimec.com

ON TIME CONSULTING SERVICES: ORACLE EXPERTS

On Time's team of experts can get you moving, starting with servers and ending with applications that put your data to work for you. We can design the fastest, most efficient systems that maximize the return on your investment. We design OFA (Optimal Flexible Architecture) compliant databases that are both fast and fault tolerant.

We maintain a full staff of Oracle professionals who are trained on the latest techniques and tools that can make your project a success.

In this first quarter 2005 edition of the newsletter we are focusing on ETL. It seems like every day we run into people who want to know the best tools and strategies to make ETL work better. We hope you enjoy this jam-packed issue devoted to ETL.

"We have found you and your team to be capable (your expertise in Oracle databases and Oracle Warehouse Builder is noteworthy), professional, and a pleasure to work with. I would highly recommend On Time Consulting Services to anyone responsible for prompt, accurate, and successful delivery of Oracle-based projects."
-- Ray Riggins, Data Management & Applications Support, OCTA

On Time Consulting Services has extensive experience in utilizing the latest development tools from Oracle, including 9iAS. Our team of Data Warehouse Developers can reorganize and optimize your data from any UNIX or Microsoft applications into an Oracle Data Warehouse using OWB. We have extensive experience in the Public Transit Industry and other industries, creating Data Marts from the following applications:

• Genfare International's (GFI) Odyssey Bus Fare Collection System
• Giro's Hastus Transit Scheduling
• ACORS Coach Operators
• ETS Maintenance
• SunGard's IFAS Accounting
• Lawson Software HRIS/Payroll

GET THE ON TIME ADVANTAGE!

• Highly trained Oracle Experts
• Data Warehouse Expertise
• Customized Training Programs
• Breadth of Applications Experience
• Flexible

INSIDE THIS ISSUE:

• On Time Story
• Enterprise ETL "The Real Scoop"
• What Is ETL?
• OCTA Case Study
• Set the Stage with Data Preparation
• A Systematic Approach to ETL Design
• Determine Your Data Requirements
• Understanding DIF Architecture
• On Time Capability


ENTERPRISE ETL—THE REAL SCOOP


In December 2004, Philip Russom, with Connie Moore and Colin Teubner of Forrester Research, authored the report "How to Evaluate Enterprise ETL," which identified scalability, connectivity and collaboration as the leading requirements. These were applied to a Forrester Wave assessment of 14 vendor products to determine which were most capable of enterprise ETL.

ETL software included in the evaluation:

• Ascential Software: DataStage
• Business Objects: Data Integrator
• DataMirror: Transformation Server
• Evolutionary Technologies (ETI): ETI Solution
• Group 1: Data Flow
• Hummingbird: Genio
• IBM: DB2 Warehouse Manager
• Informatica: PowerCenter
• iWay Software: DataMigrator
• Microsoft: Data Transformation Services (DTS)
• Oracle: Oracle Warehouse Builder (OWB)
• Pervasive: Data Integrator
• SAS: Enterprise ETL Server
• Sunopsis: Sunopsis

According to the report's executive summary, "As no surprise, Ascential and Informatica are still the technology leaders, with Business Objects, DataMirror, Oracle, and SAS trailing. However, the research uncovered a few surprises. ETL scalability is higher, yet more attainable. Vendors provide rich connectivity options, but users employ very little of it. And collaboration is driving changes in ETL development tools and best practices."




Scalability

One of the most dramatic findings of the enterprise ETL study is that vendor products and user practices are better equipped to deliver ETL scalability than ever before. Since 2000, most ETL vendors have focused on scalability as a top requirement, meaning today's tools are better. In addition, IT professionals have become more savvy about selecting ETL tools, meaning today's users are smarter. Every facet of an ETL server is becoming multithreaded, and best practices are better understood.

To determine metrics for ETL scalability, Forrester collected performance numbers from ETL users during the research. Those numbers fell into groups that revealed three categories of scalability: moderate, high and extreme.

• Extreme scalability: Ascential, Informatica, Oracle
• High scalability: Business Objects, ETI, DataMirror, iWay, Sunopsis
• Moderate scalability: Hummingbird, IBM, Microsoft, Pervasive, SAS


Connectivity

Vendor ETL products currently support connectivity to a greater range of data source and target types than ever before. The bad news is that most users tap into a very small subset of an ETL product's connectivity capabilities. Leading vendors continue to produce more connectors, making the list of connectivity options from these vendors longer than ever. However, users buy few connectors, choosing them "a la carte," even though vendors support many types of sources and targets. Finally, since connectors incur an additional cost and flat files don't, some users control those costs by relying heavily on flat files as sources, whether they want to or not.

Vendor products in this study fall into three levels of connectivity: advanced, standard and base. ETL tools at the advanced connectivity level are differentiated by native access to legacy databases and EAI platforms, and most connection interfaces support write access, not just read. The standard connectivity level focuses on standard sources and targets, like relational databases, packaged applications and XML documents, plus base connectivity options. Finally, options supported at the base connectivity level are just flat files, relational access via ODBC/JDBC, and perhaps native support for one or two relational databases.

• Advanced connectivity: Ascential, Business Objects, ETI, Informatica, iWay, SAS
• Standard connectivity: Oracle, Group 1, DataMirror, Sunopsis, Hummingbird
• Base connectivity: IBM, Pervasive, Microsoft


Collaboration

According to the study, "In the last two or three years, leading ETL vendors have scrambled to supply demand for collaborative functions, while others have ignored the new demand. Because collaborative ETL is still an emerging requirement, support for it in tools is weak on average and usage is spotty. Even so, demand and usage for this emerging requirement will increase quickly as ETL teams grow larger." ETL teams should look for tools that satisfy requirements such as source-code management, versioning in development, versioning in rollout and a server-based repository.

Vendor ETL products fall into three categories based on support for the collaborative requirements: complete collaborative capability (checkout, check-in, repository management, and versioning for objects, projects and rollout), incomplete collaborative capability (a subset of collaborative functions, which is still the norm in ETL tools; the subset varies from tool to tool), and little or no collaborative capability (collaborative support is rudimentary or completely absent). Based on the study's findings, all three categories have room for improvement.

• Complete collaborative capability: Informatica, Oracle, Ascential, Business Objects, ETI
• Incomplete collaborative capability: Hummingbird, iWay, SAS, Microsoft, Sunopsis
• Little or no collaborative capability: DataMirror, Group 1, IBM, Pervasive

Which tools are best for enterprise ETL? Figure 1, the Forrester Wave chart in the report, outlines the top ETL vendors.

In terms of choosing an ETL vendor, Forrester listed several recommendations in its report:

Remain diligent about scalability. Although scalability is now easier to attain, users should still select enterprise ETL and related integration products based on it, because a lack of scalability prevents a project from reaching its long-term goals. Scalability is a moving target that grows more challenging over time as data volumes and architectural complexity increase. Users should think big, even in early phases of an implementation when volume and complexity are small, by overbuilding to enable future scalability.

Replace flat files with direct connectivity. Flat files are the most latent form of data integration, and they typically contain more data than is required for a business purpose. As business processes move closer to real time, expect to replace these opportunistically with direct connections to data sources and targets. Direct connection can also enable SQL access and changed data capture, thus reducing data volumes for greater scalability. But numerous direct connections demand a wide range of connectivity options from an ETL tool, so select tools carefully.

Architectural design is key to enterprise ETL. Given the size and complexity of enterprise ETL implementations, don't expect an enterprise ETL product to solve all your problems single-handedly. Key server performance characteristics like scalability and high availability are as much due to architectural designs (both micro and macro) as to product capability. Make as many design decisions as possible before creating an evaluation list for ETL products, then apply design considerations along with other criteria to tool selection.

Enterprise ETL is expensive, so set budgetary expectations accordingly. Big configurations with big teams of ETL developers are typical of enterprise ETL, amounting to rather high costs for licenses and payroll. Cost out each enterprise ETL project thoroughly, noting the license costs explained in this report, balanced by a careful count of ETL full-time employees and required staff augmentation. Beware of common hidden costs, like extra training, consulting to get over performance hurdles, additional connectors, and increased server hardware to enable greater scalability.

The bottom line, according to the report, is that with scalability more attainable, priorities of vendors and users will shift to the next requirement: right-time information delivery. Additionally, collaboration will change the face of ETL tools and users. "With all the integration technologies taking a more central role in IT, each must change to accommodate to the greater number of projects and developers that result," the study reported. Finally, with integration products implemented, companies are beginning to move to the next level by making these interoperate.

The full study, "How to Evaluate Enterprise ETL," is available from Forrester Research.

What is ETL?

ETL - extract, transform and load - is the set of processes by which data is extracted from numerous databases, applications and systems, transformed as appropriate, and loaded into target systems, including, but not limited to, data warehouses, data marts and analytical applications.

More than half of all development work for data warehousing projects is typically dedicated to the design and implementation of ETL processes. Poorly designed ETL processes are costly to maintain, change and update, so it is critical to make the right choices about the technology and tools used to develop and maintain them.

Some of the numerous technological approaches and solutions available on the market include:

• Traditional engine-based ETL products
• RDBMS proprietary solutions
• Third-generation ETL based on a SQL code-generation approach
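To make the extract-transform-load steps defined above concrete, here is a minimal sketch in Python. The file layout, field names and cleanup rules are our own invention for illustration, not any vendor's API; a production implementation would live in an ETL tool such as OWB.

    import csv
    import sqlite3

    SAMPLE = "daily_ridership.csv"

    def write_sample():
        # Stand-in for a source system export so the sketch runs end to end.
        with open(SAMPLE, "w", newline="") as f:
            w = csv.DictWriter(f, fieldnames=["route", "riders", "ride_date"])
            w.writeheader()
            w.writerow({"route": " 57a ", "riders": "420",
                        "ride_date": "2005-03-01T05:00"})

    def extract(path):
        # Extract: read raw rows from the source export.
        with open(path, newline="") as f:
            return list(csv.DictReader(f))

    def transform(rows):
        # Transform: type conversions plus simple business rules.
        return [{
            "route": r["route"].strip().upper(),   # standardize route codes
            "riders": int(r["riders"]),            # cast text to a number
            "ride_date": r["ride_date"][:10],      # keep YYYY-MM-DD only
        } for r in rows]

    def load(rows, db_path="warehouse.db"):
        # Load: write the conformed rows into a target warehouse table.
        con = sqlite3.connect(db_path)
        con.execute("CREATE TABLE IF NOT EXISTS ridership "
                    "(route TEXT, riders INTEGER, ride_date TEXT)")
        con.executemany("INSERT INTO ridership VALUES "
                        "(:route, :riders, :ride_date)", rows)
        con.commit()
        con.close()

    if __name__ == "__main__":
        write_sample()
        load(transform(extract(SAMPLE)))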


OCTA DATA WAREHOUSE CASE STUDY

The Orange County Transportation Authority (OCTA)

OCTA is the county's primary transportation agency. Located in Southern California, the OCTA is focused on rail, freeways, streets, and public transportation. Its primary mission is to create, coordinate, finance, and deliver an easy-to-use public transportation network which keeps Orange County's 2.8 million people moving.

While Southern California is known for its freeways, more than 60,000 Orange County residents a day rely on OCTA buses to travel to and from work, school, and a variety of other places. OCTA takes this responsibility seriously, placing more than 460 buses on 80 routes during peak hours. This activity generates over 5 million transactions of bus ridership data per month.

OCTA’s Information Challenge

OCTA relies on a variety of complex systems to support Bus Operations, including the following: Genfare GFI Farebox system for bus ridership, fare collection, and bus pass processing; HASTUS vehicle and crew scheduling system for establishing bus routes and timetables; SunGard Bi-Tech IFAS accounting, a financial and personnel management system for government and not-for-profit entities; and a variety of custom applications designed to support the activities of Coach Operators and Maintenance Employees.

OCTA's initial approach to information management and decision support was to create an individual data store (mostly using MS-Access) tailored to address a specific need for information. Over a number of years, a variety of separate databases were created, resulting in a chaotic environment of inconsistent (and often conflicting) information that could not support the volume of data generated by ever-increasing bus ridership.

The On Time Data Warehouse Solution

On Time Consulting Services was engaged by OCTA to develop two new data warehouses: one for Bus Ridership, and the other for Bus Pass usage. Oracle Warehouse Builder (OWB) was used to extract transactions from the Genfare and HASTUS systems into a staging area, then to load each data warehouse with the appropriate "facts" and "dimensions". The data warehousing capabilities of an Oracle 9i database allow for quick and efficient summarization of "facts" by a variety of "dimensions"; for example, number of bus riders (fact) by bus route (dimension) and time of day (dimension). The Bus Ridership and Bus Pass data warehouses have assimilated over 100 million ridership transactions to provide summarized (and consistent) information that gives OCTA a clearer understanding of revenues, costs, utilization, and ridership. With an enhanced ability to analyze Bus Operations, OCTA is positioned to continually improve its services to the citizens of Orange County. Improved visibility of Bus Ridership and Bus Pass information has also led to internal process changes to improve the quality of data in the source (Genfare and HASTUS) systems.
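For readers who want to see the fact/dimension idea in miniature, the following Python sketch builds a toy star schema and runs the riders-by-route-and-time-of-day rollup described above. The table names, column names and data are illustrative only, not OCTA's actual schema.

    import sqlite3

    # A toy star schema: one fact table keyed to two dimensions.
    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE dim_route (route_key INTEGER PRIMARY KEY, route_name TEXT);
    CREATE TABLE dim_time  (time_key INTEGER PRIMARY KEY, time_of_day TEXT);
    CREATE TABLE fact_ridership (
        route_key INTEGER REFERENCES dim_route(route_key),
        time_key  INTEGER REFERENCES dim_time(time_key),
        riders    INTEGER
    );
    INSERT INTO dim_route VALUES (1, 'Route 57'), (2, 'Route 43');
    INSERT INTO dim_time  VALUES (1, 'AM Peak'), (2, 'PM Peak');
    INSERT INTO fact_ridership VALUES (1, 1, 420), (1, 2, 380), (2, 1, 510);
    """)

    # Summarize the fact by both dimensions, the kind of quick rollup
    # the warehouse exists to answer.
    for row in con.execute("""
        SELECT r.route_name, t.time_of_day, SUM(f.riders)
        FROM fact_ridership f
        JOIN dim_route r ON r.route_key = f.route_key
        JOIN dim_time  t ON t.time_key  = f.time_key
        GROUP BY r.route_name, t.time_of_day
    """):
        print(row)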

In addition to the Bus Ridership and Bus Pass data warehouses, On Time Consulting Services is also engaged in extending OCTA's data warehouse environment by developing a number of specialized Data Marts. Each data mart is being developed using OWB/Oracle 9i, sharing (where appropriate) common dimensions such as time (month, week, day, time of day), employee (coach operator), bus number, etc. One of these data marts consolidates information for each maintenance and mechanic employee from attendance, training, certification, shift assignments, and all other non-payroll related data.

To learn more about On Time Consulting please visit us at www.ontimec.com or contact us toll-free at 866-239-3326.



You've successfully launched your new system, but how can you get the information it carries?

Let our Team of Data Warehousing experts help. Our experience with Oracle Warehouse Builder (OWB) enables us to efficiently

• unload - Extract
• repackage - Transform
• and store - Load

your valuable cargo of data in a state-of-the-art Oracle Data Warehouse.

What's your information destination? Then let us help you find:

• the right information
• at the right time
• in the right format

using tools such as Oracle's Discoverer.

4848 Lakeview Ave., Suite 100H
Yorba Linda, CA 92886
Voice: 714.693.8111
Fax: 714.693.0617


SET THE STAGE WITH DATA PREPARATION

Data preparation is the core set of processes for data integration. These processes gather data from diverse source systems, transform it according to business and technical rules, and stage it for later steps in its life cycle when it becomes information used by information consumers.

Don't be lulled by the silver bullet a vendor or a consultant may try to sell you, saying that all you need to do is point a business intelligence (BI) tool at the source system and your business users will be off and running (supposedly with real-time data, too!).

That oversimplifies data integration into a connectivity issue alone. If it were that easy, every ERP vendor would be offering modules to do just that, rather than have you struggle for years to produce the reports that your business users are requesting. In addition, ERP vendors would not be continually evolving data warehouse (DW) solutions over the years if data integration were merely a connectivity problem.

The reason the solution is not just point and click (using the BI tool) is because data preparation involves many steps that may be very complex depending on how your business operates, as well as how your applications are implemented within your enterprise. Physically accessing the data in source systems may be easy, but transforming it into information is another story altogether (and something you don't appreciate if you have only been dealing with proofs of concept or single business solutions rather than an enterprise-wide, real data solution).

The steps involved in data preparation are: gathering, reformatting, consolidating and validating, transforming (business rules and technical conversions), performing data quality checks and storing data. Figure 1 depicts the data preparation processes in a logical progression. Although the sequence is accurate, the physical implementation may involve a combination of many of these processes into a sequence of source-to-target mapping operations or workflows. Also, although the diagram implies that you're storing intermediate results in a database, it's only a logical representation. Intermediate mappings may involve the use of memory or temporary tables based on performance and efficiency considerations rather than persistent storage in database tables.
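As a rough illustration of that logical progression (and only that; the stage rules and row layout below are invented for the sketch), the steps can be chained like so:

    # Each stage is a function over a list of row dicts; in a real project
    # these would be mappings or workflows in an ETL tool.

    def gather(sources):
        rows = []
        for read_source in sources:          # pull from each source system
            rows.extend(read_source())
        return rows

    def reformat(rows):                      # technical conversion
        return [{**r, "amount": float(r["amount"])} for r in rows]

    def consolidate(rows):                   # dedupe across sources
        seen, merged = set(), []
        for r in rows:
            if r["txn_id"] not in seen:
                seen.add(r["txn_id"])
                merged.append(r)
        return merged

    def transform(rows):                     # a sample business rule
        for r in rows:
            r["large_txn"] = r["amount"] >= 10_000
        return rows

    def quality_check(rows):                 # reject impossible values
        return [r for r in rows if r["amount"] >= 0]

    def store(rows):                         # stand-in for a database load
        print(f"storing {len(rows)} prepared rows")

    def prepare(sources):
        return store(quality_check(transform(
            consolidate(reformat(gather(sources))))))

    prepare([lambda: [{"txn_id": 1, "amount": "12500.00"},
                      {"txn_id": 1, "amount": "12500.00"}]])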


Figure 1: Data Preparation Processes

Data preparation processes easily capture the lion's share of the work of any DW or BI project. Industry analysts estimate it consumes 60 to 75 percent of the project time. Project delays and cost overruns are frequently tied to underestimating the amount of time and resources necessary to complete data preparation or, even more frequently, to do the rework necessary when the project initially skimps on these activities and then data consistency, accuracy and quality issues arise.

More than ETL

Almost everyone - IT, vendors, consultants and industry analysts - associates data preparation with extract, transform and load (ETL) along with the tools in the marketplace that perform these processes. ETL tools focus on source (getting data from source systems) to target (putting data in data warehouses, operational data stores and data marts) mapping and associated transformations. Although the resulting ETL processes are part of the data preparation processes, they are only a subset. ETL tools are just that, tools to be used, and they do not negate designing an overall architecture and the processes to implement that architecture. People understand and include the gathering, reformatting, transforming, cleansing and storing processes of data preparation, but they often fail to include, or oversimplify, the consolidation and data quality processes. You need to understand the data preparation processes, design an architecture to implement them and then use your ETL tool (and potentially other tools too) to deliver these processes to transform data into consistent, accurate and timely information that your business can use.

As we mentioned last month, data profiling is a critical process needed to understand the content of the source systems. Data profiling is a critical precursor to data preparation (i.e., you have to understand the data in order to prepare it for BI). In addition, data modeling is another process that complements data integration. When I was a software engineer, my first manager told me that there are two approaches to developing software: design and code or code and rework. The latter appears to be initially faster, but the former gets you there sooner with quality, consistent code. In addition, constant rework raises the overall cost and reduces the trust your user has in your code. "If we fail to plan, we plan to fail."

Taking the First Step

ETL tools have been around for more than a decade, have grown significantly in functionality, and are offered by many vendors. When talking to anyone in the industry or reading articles, you would assume that everyone is using these tools on every DW project; however, Gartner states, "ETL tools are not commonplace in large enterprises. The majority of data warehousing implementations still rely on custom coding to deliver their ETL processes."[1]

Despite the press releases and articles, why has custom coding dominated the DW landscape? Let's divide the world into then and now. Many data warehouses were started years ago and are in their second or later generation. These data warehouses are already legacy applications, with many of their tool decisions made based on the context of these tools when the DW was initially built. At that time, many of the ETL tools were very expensive, required the IT staff to learn yet another complex tool and, quite frankly, were limited in their functionality. Faced with these conditions, many DW project teams decided to create custom code to develop the ETL processes because it was quicker and cheaper (and they did not know how to justify the cost to their business). What many companies have experienced is that the total cost of ownership (TCO) that includes license fees, initial development costs and, most importantly, maintenance costs has shifted the argument to favor using ETL tools over custom coding; and this does not even account for many additional benefits of IT productivity, source code maintenance, impact analysis and meta data management.


Data preparation is not just about extracting and loading data. Ensuring data quality is often a critical activity that is required for the business to effectively use the data. This activity is always oversimplified and underestimated. It is often the activity that slows your project; or, if done inadequately, it is the activity that causes the most problems with the business users' assessment of the usefulness of the data. As mentioned earlier, business requirements and expectations need to be developed with the business users to establish the critical success factors in this area. Most people consider data quality to be checking for error conditions, data format conversions and mappings, and conditional processing of business rules based on a combination of dimension values or ranges. These types of data quality processes can be adequately performed with an ETL tool if planned for and designed into the overall data preparation architecture. However, a more important aspect of data quality is data consistency rather than simply error detection or data conversion. Data may be valid in its transactional form when brought into a data warehouse, but become inconsistent (and incorrect) when transformed into business information. Data preparation needs to take into account these transformations and ensure data quality and consistency. Tools alone will not enforce this consistency unless it is designed into the processes.
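The kinds of checks described above (an error condition, a format mapping, and a conditional business rule over a combination of values) can be sketched as follows; the field names, code domain and rule thresholds are hypothetical:

    VALID_FARE_TYPES = {"CASH", "PASS", "TRANSFER"}

    def check_row(row):
        errors = []
        # Error condition: required field missing or out of range.
        if row.get("riders") is None or row["riders"] < 0:
            errors.append("riders missing or negative")
        # Format mapping: conform free-text codes to a known domain.
        fare = str(row.get("fare_type", "")).strip().upper()
        if fare not in VALID_FARE_TYPES:
            errors.append(f"unknown fare_type {fare!r}")
        # Conditional business rule over a combination of values.
        if fare == "TRANSFER" and row.get("revenue", 0) > 0:
            errors.append("transfers should not carry revenue")
        return errors

    print(check_row({"riders": 12, "fare_type": "transfer", "revenue": 1.25}))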

There are some instances of data quality that can be handled by tools. These tools are often categorized as data cleansing tools and manage a variety of complex data quality issues that many companies encounter. A significant example is customer names and addresses. A customer or supplier name (person or company) may be input in different variations such as IBM, IBM Corp. or International Business Machines Corporation. Many companies dealing with large customer lists, for example, have purchased specialized data cleansing software. Another example of data cleansing is the householding that financial institutions perform, where they link household or family members' personal and business accounts both for their customers' convenience, as well as for their own convenience in promoting a full range of their services to their customers. There are other examples of industry-specific data cleansing tools that may be appropriate to use in addition to the ETL tools, based on the business requirements relating to data quality.
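The name-variation problem can be illustrated crudely; real cleansing tools use fuzzy matching and reference data rather than a hand-built lookup table like this one:

    # A deliberately crude illustration of name standardization.
    ALIASES = {
        "ibm": "International Business Machines Corporation",
        "ibm corp.": "International Business Machines Corporation",
        "international business machines corporation":
            "International Business Machines Corporation",
    }

    def standardize_name(raw):
        key = " ".join(raw.lower().split())   # collapse whitespace, lowercase
        return ALIASES.get(key, raw)          # fall back to the raw value

    for name in ("IBM", "IBM Corp.", "Intl Bus. Machines"):
        print(name, "->", standardize_name(name))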

Data cleansing tools are important tools in many companies' arsenals in ensuring data quality and consistency. However, too often, many in our industry equate data cleansing tools with data quality. While the data cleansing tools provide data preparation processes for data quality, they are not the only processes that are needed in that area. Many see the data cleansing tool as a silver bullet. Others feel if they cannot afford those tools (or they are not applicable to their situation), then they can forget about data quality. You can never forget about data quality, and you have to build data quality processes throughout data preparation.

In future columns, we will explore various data integration and ETL issues such as: Should you hand code or use an ETL tool? Should you standardize on an ETL tool? Should you purchase a data cleansing tool? Should you purchase a pre-built solution for CPM? Surprisingly, conventional wisdom isn't always the best solution for you. We will discuss the benefits, drawbacks and compromises involved with your choices. The goal is to separate fact from fiction and arm you with the knowledge to make an informed decision.

Reference:

1. Gartner, "Slow Growth Ahead for ETL Tools Market," November 7, 2003, page 1.


On Time Consulting Services Inc.

When you need expert advice on critical data processing issues, be sure to contact us.

Serving clients nationwide since 1997.

Your number one source for Oracle Data Warehousing Expertise.


A Systematic Approach to ETL Design


Many data warehouse efforts struggle with determining the best methodology and implementation strategy for performing extract, transform and load (ETL) processing for their particular enterprise. Once the business requirements for the warehouse project have been defined, the source systems analyzed and a physical data model developed, you are ready to begin designing your ETL processes. A systematic approach needs to be applied to designing them. This column describes the use of a staged approach for designing ETL process flows. Each process is directed toward loading a particular target table or cluster of tables, depending on the business requirements and technical design considerations.

A staged transformation approach is used to optimize data acquisition and transformation from source systems. This method provides increased flexibility and extensibility to ETL developers. The stages provide a single processing method for initial and successive loads to the data warehouse, and a transformation process for the target table(s) that can adapt easily to changes in the source systems or warehouse model design. The combined use of meta data tags, sometimes referred to as operational meta data, in the warehouse schema design and ETL transformation processes allows for improved capabilities in loading and maintenance designs. Not all stages will be used for every target table; it depends on the business requirements and the complexity of the business rules needed in the transformation.

SOURCE VERIFICATION

The first stage, source verification, performs the access and extraction of data from the source system. This stage builds a temporal view of the data at the time of extraction. The source extract built in this stage is included in backups of the entire batch cycle for reload purposes and for audit/reconciliation purposes during testing. If audit files are provided from the source system, these files are compared against the extract files to verify such items as row counts, byte counts, amount totals or hash sums. During this stage, both technical and business meta data can be captured and verified against the meta data repository (if available). Verification of business rules unique to the source system can also be applied during this stage, using the repository to identify any exceptions or errors.
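Here is a small sketch of the audit/reconciliation idea, assuming a pipe-delimited extract with the amount in the third column and an audit file carrying control totals (both assumptions of ours):

    import hashlib

    def extract_totals(lines):
        # Compute row count, amount total and a hash sum over the extract.
        row_count, amount_total = 0, 0.0
        digest = hashlib.md5()
        for line in lines:
            row_count += 1
            amount_total += float(line.split("|")[2])
            digest.update(line.encode())
        return row_count, round(amount_total, 2), digest.hexdigest()

    def verify(lines, audit):
        rows, total, md5 = extract_totals(lines)
        assert rows == audit["row_count"], "row count mismatch"
        assert total == audit["amount_total"], "amount total mismatch"
        assert md5 == audit["hash_sum"], "hash sum mismatch"

    lines = ["1001|2005-01-03|25.50", "1002|2005-01-03|14.25"]
    # In practice the audit totals come from the source system; here we
    # derive the expected hash ourselves just to make the demo run.
    verify(lines, {"row_count": 2, "amount_total": 39.75,
                   "hash_sum": extract_totals(lines)[2]})
    print("extract reconciles with source audit totals")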

SOURCE ALTERATION

The second stage, source alteration, can perform a variety of transformations unique to the source, depending on business requirements. These transformation options include integration of data from multiple source systems based on priority ranking or availability, integration of data from secondary sources, splitting of source system files into multiple work files for multiple target table loads (clusters), and application of business logic and conversions unique to the source systems. Meta data tags such as source system identifiers, production-key-active-in-source-system flags and confidence level indicators can be applied during this transformation stage.

COMMON INTERCHANGE

The third stage, common interchange, applies business rules and/or transformation logic that are frequent across multiple target tables. Examples of transformation logic applied during this stage include referential integrity (e.g., population of fact table surrogate keys from dimension tables) and application of enterprise definitions and business rules from the meta data repository (e.g., code values to ISO standard formats).
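The surrogate key population mentioned above can be sketched as a simple dimension lookup; the key values and the unknown-member convention are our own illustration:

    # Replace natural keys on incoming fact rows with surrogate keys
    # looked up from a dimension.
    route_dim = {"57A": 101, "43B": 102}   # natural key -> surrogate key

    def attach_surrogate_keys(fact_rows, unknown_key=-1):
        keyed = []
        for row in fact_rows:
            # Unmatched rows get a default key so they can be routed to an
            # exception table instead of silently breaking referential
            # integrity.
            row["route_key"] = route_dim.get(row.pop("route_code"),
                                             unknown_key)
            keyed.append(row)
        return keyed

    print(attach_surrogate_keys([{"route_code": "57A", "riders": 420},
                                 {"route_code": "99Z", "riders": 7}]))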

TARGET LOAD DETERMINATION

The fourth stage, target load determination, performs final formatting of data to produce load-ready files for the target table, identifies and segregates rows to be inserted versus updated (if applicable), applies remaining technical meta data tagging and processes data into the relational database management system (RDBMS). Loading data into the RDBMS can be considered a separate stage depending on your preferences. Data rows processed from the source system up to this stage for the current batch cycle are compared against records loaded in previous cycles to determine whether insertion or update is required. An example of this is a dimension table that uses the slowly changing dimension (SCD) type 2 method for updates. The load-ready files built in this stage are included in backups of the entire batch cycle for reload purposes and for audit/reconciliation purposes during testing.
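Here is a deliberately simplified sketch of the insert-versus-update decision for an SCD type 2 dimension; real designs carry surrogate keys and effective/expiration dates, which we have abbreviated:

    from datetime import date

    dim = {"EMP1": {"name": "J. Smith", "depot": "Garden Grove",
                    "current": True, "valid_from": date(2004, 1, 1)}}

    def apply_scd2(natural_key, incoming, today=date(2005, 3, 1)):
        existing = dim.get(natural_key)
        if existing is None:
            # New member: straight insert.
            dim[natural_key] = {**incoming, "current": True,
                                "valid_from": today}
            return "insert"
        if (existing["name"], existing["depot"]) == \
           (incoming["name"], incoming["depot"]):
            return "no change"
        # Changed member: expire the current version, insert a new one.
        existing["current"] = False
        dim[f"{natural_key}@{today}"] = {**incoming, "current": True,
                                         "valid_from": today}
        return "update (new version inserted)"

    print(apply_scd2("EMP1", {"name": "J. Smith", "depot": "Anaheim"}))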


Depending on the database management system (DBMS) used for the warehouse and the volume of records being processed, the high-speed parallel load process of the database engine may be used in this stage in place of the standard database load utility. Faster load performance can sometimes be achieved through truncating and reinserting all data rows versus updating, due to optimizations made available through use of the DBMS's high-speed parallel load utility.

AGGREGATION

The fifth and final stage, aggregation, uses the load-ready files from the fourth stage to build aggregation tables needed to improve query performance against the warehouse. This stage is typically applied only against fact table target load processes. Care needs to be taken to ensure that aggregated records end up with the correct surrogate keys from the dimension tables for the rollup levels required in reports.
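A miniature version of the aggregation stage, rolling detail fact rows up to a route/month grain while keeping the surrogate keys intact (the row layout is invented):

    from collections import defaultdict

    detail = [
        {"route_key": 101, "month_key": 200501, "riders": 420},
        {"route_key": 101, "month_key": 200501, "riders": 380},
        {"route_key": 102, "month_key": 200501, "riders": 510},
    ]

    def build_aggregate(rows):
        totals = defaultdict(int)
        for r in rows:
            # Group strictly by surrogate keys so the rollup still joins
            # cleanly back to the dimension tables used by reports.
            totals[(r["route_key"], r["month_key"])] += r["riders"]
        return [{"route_key": rk, "month_key": mk, "riders": v}
                for (rk, mk), v in totals.items()]

    print(build_aggregate(detail))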

This five-stage approach provides a modular and adaptable means to efficiently load a data warehouse. ETL process designs can easily adapt to changes in source systems or the addition or removal of source systems without affecting the entire ETL workflow. Additionally, changes to the data warehouse schema will cause minimal impact to ETL processes through the insulation provided by the various transformation stages.

Determine your data requirements

Determining your data requirements includes:

• Gathering business requirements

• Determining data and quality needs

• Data profiling, or understanding the data sources and their associated quality, both in each source system and across multiple source systems, if applicable (a small profiling sketch follows this list)

• Performing a data quality assessment against the metrics the business has requested

• Defining the gap between what data is available and its quality versus what the business has requested

• Revising business expectations or project costs and determining the selected data solution

• Modeling the necessary data stores (staging areas, data warehouse, operational data store and data mart(s)), both from a logical perspective (to confirm business requirements) and a physical perspective (to enable implementation)

At the end, you'll know what data to source and store.
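As promised above, here is a bare-bones example of the kind of per-column statistics the data profiling step produces (the sample rows are invented):

    rows = [
        {"route": "57A", "riders": 420},
        {"route": None,  "riders": 380},
        {"route": "57A", "riders": None},
    ]

    def profile(rows):
        # Per-column null counts, distinct-value counts and value ranges.
        for col in rows[0].keys():
            values = [r[col] for r in rows if r[col] is not None]
            print(f"{col}: nulls={len(rows) - len(values)}, "
                  f"distinct={len(set(values))}, "
                  f"min={min(values)}, max={max(values)}")

    profile(rows)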

Regularly updated documentation is critical during this process. The integration process, both from an information technology (IT) and business perspective, must be documented, verified and made available to everyone building and using the systems. In order to share this information with business users, it is highly recommended that you have a writer "translate" the IT jargon and system-specific transformations into something business users can understand.


UNDERSTANDING DATA INFORMATION FRAMEWORK (DIF) ARCHITECTURE

A solid information architecture can transform scattered data into information that your business uses to operate and plan for the future. Data is gathered, transformed using business rules and technical conversions, staged in databases and made available to business users to report and analyze. Data flows from creation through transformation to information, just as materials flow from the supplier (enterprise resource planning systems, transaction systems), to the factory (data integration), to warehouses (data warehouses) and finally to retail stores (data marts and business analytic applications).

The information architecture depicted in Figure 2 illustrates the processes that transform data into useful information as it progresses from data sources to the business intelligence (BI) tools where it is used in reports by business users. It is a conventional hub-and-spoke architecture, simplified to show a single data warehouse and multiple data marts. Subsequent columns will examine more complex information architectures that include cubes, sub-marts, federated data marts, federated data warehouses, operational data stores (ODSs), closed-loop systems and analytic applications.

Figure 2: DIF Information Architecture

DATA SOURCES

Traditionally, data sources were just transaction systems, legacy applications and enterprise resource planning (ERP) systems. Now, data sources can include front-office systems such as customer relationship management (CRM) and Web analytics, business analytic applications, data from enterprise partners such as suppliers and customers, and forecasting and budgeting systems. Although the proliferation of different data sources makes things seem more complicated, it doesn't affect the underlying structure of the architecture. It's important that all these different sources of data be included in the architecture to prevent disparate silos of information.

DATA PREPARATION

Data preparation includes gathering, reformatting, consolidating, transforming, cleansing and storing data, both in staging areas and in the data warehouse. During this step, unforeseen data quality problems can suddenly arise. If you're using business analytics, your data preparation process may need to handle additional dimensions, measures or metrics.

Data preparation includes three key steps: source system data conversion, data cleansing and business transformation.

First, the source system conversions convert the source system data into the data warehouse target format. It may be necessary to consolidate multiple data sources to provide single, consistent definitions. The consolidation involves validation by querying dimensions or reference files to determine how to conform the data and provide referential integrity (in database terms).

The next step, data cleansing, analyzes the data beyond record-by-record checking. Name-and-address cleansing and householding are two examples.

Finally, business transformations turn data into business information. They place data in the context of a business state such as a type of inventory, customer order or medical procedure. Transformation might also attach dimensional context (foreign keys) to data such as a region, business division, or product or service grouping.

DATA FRANCHISING

Data franchising is the process of reconstructing data into information for reporting and analysis with business intelligence (BI) tools. Often, data is franchised from a data warehouse into data marts or cubes. The data is filtered, reorganized, transformed, summarized/aggregated and stored.

For example, suppose you have global sales data and need to franchise a U.S. sales-by-product-line data mart. Data franchising would filter only U.S. sales records and associate them with product information. This would reorganize the data in order to couple sales transactions with product-related information.
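The U.S. sales-by-product-line example can be written out directly as filter, associate and summarize steps; the data and field names are invented:

    from collections import defaultdict

    sales = [
        {"country": "US", "sku": "A1", "amount": 100.0},
        {"country": "DE", "sku": "A1", "amount": 80.0},
        {"country": "US", "sku": "B2", "amount": 40.0},
    ]
    products = {"A1": "Fareboxes", "B2": "Passes"}   # sku -> product line

    def franchise_us_sales(sales, products):
        by_line = defaultdict(float)
        for s in sales:
            if s["country"] != "US":                   # filter: U.S. only
                continue
            by_line[products[s["sku"]]] += s["amount"] # associate + aggregate
        return dict(by_line)

    print(franchise_us_sales(sales, products))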

INFORMATION ACCESS AND ANALYTICS

Information access and analytics are the processes that retrieve the data stored in the information architecture and deliver it to business users in the form of information products. These can take the form of reports, spreadsheets, alerts, graphics, analytic applications and slice-and-dice cubes. BI tools, spreadsheets and analytic applications, sometimes through portals, enable these processes.

DATA AND META DATA MANAGEMENT

Data and meta data management are the behind-the-scenes processes that pass data and meta data between the other processes. It's important to follow standards and document procedures during this phase to ensure that it is easy for IT and business users to maintain and understand the data.



ON TIME CONSULTING SERVICES INC.

On Time Consulting Services Inc. is committed to making our clients successful by providing a comprehensive suite of Information Technology (IT) consulting services, designed to provide expertise in all phases of a complete systems development life cycle. From planning your next IT project to mentoring your in-house IT personnel, our approach is to capitalize on our extensive industry experience and in-depth knowledge of technology to deliver the maximum value to our clients.

On Time has a variety of experienced technologists who use a customer-centric approach to design and deliver projects on time and under budget. Our expertise has enabled us to contribute to the success of clients in a variety of industries, including Healthcare, Public Transportation, News Media, Retail, Oil & Gas, Food Service, Gaming, Staffing, and Distribution.

Based in Yorba Linda, CA, On Time employs a network of consultants throughout the US and Canada who can help meet your challenges of creating total business solutions that encompass every…

4848 Lakeview Ave. #100H, Yorba Linda, CA 92886
Phone: 714-693-8111 / 866-239-3326
Fax: 714-693-0617
Contact: [email protected]

ON TIME PROVIDES ORACLE TRAINING

On Time Consulting Services can provide a broad range of training solutions for your company.

We can provide customized classroom training (based on a technical skills assessment) that will deliver Oracle skills at the point where they can be put into practice.

We can also develop classroom training exercises that are based on actual (and familiar) data to enhance the learning process.

We can come alongside your existing Oracle resources and/or recent Oracle course graduates to provide personal mentoring while developing solutions to immediate assignments or issues.

Some of the areas we can provide training in are:

• Introduction to Data Warehousing
• Data Warehouse Database Design
• Data Warehousing Fundamentals
• Oracle 9i Warehouse Builder: Implementation
• Oracle 9i: Data Warehouse Administration
• Oracle 9i New Features Overview
• Oracle 9i Designer: First Class
• Oracle 9i Database Administration Fundamentals I & II
• Oracle 9i Database: Implement Partitioning
• Oracle 9i Data Guard Administration
• Oracle 9i Database Performance Tuning
• SQL Statement Tuning

Oracle 9iAS & 9iDS:

• Oracle 9iAS Release 2: Basic Administration
• Oracle 9iDS Release 2: Discoverer for End Users
• Oracle 9i Discoverer Administration

Contact us for more information!

YOUR BEST SOURCE FOR ORACLE DATA WAREHOUSING INFORMATION

Bibliography:

"A Systematic Approach to ETL Design," Michael Jennings, DM Review.
"Set the Stage with Data Preparation," DM Review.
"Enterprise ETL - The Real Scoop," Forrester Research.

WE'RE ON THE WEB AT WWW.ONTIMEC.COM


On Time Consulting Services Inc.
Corporate Newsletter, First Quarter 2005
Volume 2, Issue 1

From:
On Time Consulting Services Inc.
P.O. Box 580
Yorba Linda, CA 92885-0580