TRANSCRIPT
Testing ETL: Don't dig a tunnel with a toothpick
Iain McCowatt, http://exploringuncertainty.com
[email protected] | @imccowatt | imccowatt
© Can Stock Photo Inc. / frenta
Introduction: Scope of Presentation
• Functional testing of ETL, primarily for accuracy
• Logic testing of ETL, i.e. verification rather than validation
Context: Types of ETL Logic
Logic Type | Description | Extract | Transform | Load
Selection | Selection of source data | ✓ | ✓ |
Mapping | Simple transfer of data from source field to destination field | | ✓ |
Transformation | Manipulation / normalization of data from source to destination | | ✓ |
Load | Load strategy (e.g. append/merge) | | | ✓
Referential Integrity | | | | ✓
Exception | Management of exceptions (invalid data, invalid lookups, orphan records etc.) | | ✓ | ✓
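To make the taxonomy concrete, here is a sketch in Oracle SQL (not from the presentation; table and column names are hypothetical) showing several logic types at work in a single statement:

```sql
INSERT INTO target_customer (cust_id, full_name, status)
SELECT s.cust_id,                                       -- Mapping: straight field transfer
       TRIM(s.first_name) || ' ' || TRIM(s.last_name),  -- Transformation: normalization
       NVL(s.status, 'UNKNOWN')                         -- Exception: default for missing data
FROM   source_customer s
WHERE  s.record_type = 'CUST';                          -- Selection: choice of source data
```

Load logic would appear as the choice of INSERT (append) versus MERGE (merge), and referential integrity as constraints or lookups on the target.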
Context: Other Factors
• Project constraints: time, cost, quality
• Size
• Complexity
• Types of load (e.g. full, delta etc.)
• Data volume
• Degree of specification
• Tester skills
Test Strategy: Data
Approach | Indications | Contraindications
Production | Need to test with volume; need noise | Data privacy issues
Synthetic | Need to cover logic | Little visibility into the logic
Conditioned | Need volume, noise AND coverage | Data privacy issues; little visibility into the logic
Test Strategy: Oracles
Approach | Indications | Contraindications
Source Data | Minimal selection & transformation | Any complexity
Previous Implementation | Regression testing; previous implementation available | Significant change from previous implementation
Baseline Data | Regression testing; baseline data available | Significant change from previous implementation
Parallel Implementation | No appropriate oracle is available, e.g. new implementation | Significant time and cost constraints; testers lack development skills
Test Strategy: Verification
Approach | Indications | Contraindications
Profiling | Logic is of minimal complexity | Data integrity is critical
Visual Diff | Data integrity is critical | Large data volume; multiple types of load; time or cost is critical
Automated Comparator | Data integrity is critical; testing with large data volumes or multiple types of load | Testers lack development skills
Case Study: Context
• Previous release overran: time critical
• Regulatory project: quality (data integrity) critical
• Significant transformation logic (>130 tables, >950 fields, >2200 conditions), some highly complex, but reasonably specified
• Multiple types of load with different logic (full, delta)
• Preference for testing with production data
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned | Baseline Data | Automated Comparator
 | Parallel Implementation |
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned ✓ | Baseline Data | Automated Comparator
 | Parallel Implementation |
Client preference was for production data, reliability concerns suggested a need for "noisy" data, and the complexity of the transformations drove a need for coverage: conditioned data was the compromise.
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned ✓ | Baseline Data | Automated Comparator
 | Parallel Implementation ✓ |
Complex transformations ruled out the Source Data oracle, and significant changes from the previous release ruled out Previous Implementation and Baseline Data. With no other suitable oracle available, a Parallel Implementation was selected despite time and cost constraints.
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned ✓ | Baseline Data | Automated Comparator ✓
 | Parallel Implementation ✓ |
The importance of data integrity ruled out Profiling; given the time and cost constraints, the data volumes and repetition across different types of load ruled out Visual Diff. An Automated Comparator was selected as the more efficient option.
Case Study: Approach
[Diagram: Source → Extract Process → Extracted Data → Transformation Process → Staged Data → Load Process → Target. At both the transformation and load stages a Parallel Implementation generates Expected Results, which are reconciled against the actual output.]
Case Study: Approach
[Diagram as above]
Due to data privacy concerns, all test data was scrubbed. A client (internal) team tested the extract process.
Source data was used to build the expected result set for the transformation process; both source and target (pre-load) data were used to build the expected result set for the load process, as sketched below.
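A minimal sketch of deriving expected results for a transformation directly from source data (the rules and names here are hypothetical, not the project's actual logic; the point is that the expected result set is computed independently of the ETL tool under test):

```sql
-- Re-implement the specified transformation rules against the source,
-- independently of the ETL tool, to produce the expected result set
INSERT INTO expected_account (account_no, balance_band)
SELECT s.account_no,
       CASE
         WHEN s.balance < 0     THEN 'OVERDRAWN'
         WHEN s.balance < 10000 THEN 'STANDARD'
         ELSE                        'PREMIUM'
       END
FROM   source_account s
WHERE  s.status <> 'PURGED';   -- mirror the specified selection logic
```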
Case Study: Approach
[Diagram as above]
Note: the extract process was tested by another team.
Reconciliation is a comparison of the expected result data set generated by the parallel implementation and the actual result data set generated by the ETL tool. Records in the relative complements (i.e. mismatches) indicate potential problems.
[Diagram: Venn diagram of the Expected Data Set and the Actual Data Set; the non-intersecting regions are the mismatches.]
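In Oracle SQL the two relative complements fall out of a pair of MINUS queries; a minimal sketch, assuming hypothetical expected_account and actual_account tables of identical shape:

```sql
-- Expected but not actual: rows the ETL failed to produce, or produced incorrectly
SELECT * FROM expected_account
MINUS
SELECT * FROM actual_account;

-- Actual but not expected: spurious or incorrectly transformed rows
SELECT * FROM actual_account
MINUS
SELECT * FROM expected_account;
```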
Case Study: Process
[Process cycle: Mine Data → Condition Data → Determine Expected Results → Execute ETL → Reconcile Results → Isolate and Report Bugs]
Case Study: Tool
• Some things should be automated; other things should not
• A custom tool was created using Oracle (SQL, PL/SQL)
• This serves as both a parallel implementation and comparator
• A generic framework was used, enabling:
– Easier addition of new transformations
– Control over the scope of any given execution (see the sketch below)
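As an illustration of such a framework (the names are hypothetical, not the project's actual code), a generic driver can dispatch to one stored procedure per transformation, with control data determining what is in scope for a given run:

```sql
-- Hypothetical control table, one row per transformation:
--   CREATE TABLE etl_test_control (
--     proc_name VARCHAR2(30),  -- procedure that generates expected results
--     in_scope  CHAR(1)        -- 'Y' to include in this execution
--   );
BEGIN
  FOR c IN (SELECT proc_name
            FROM   etl_test_control
            WHERE  in_scope = 'Y') LOOP
    -- invoke each in-scope expected-result generator dynamically
    EXECUTE IMMEDIATE 'BEGIN ' || c.proc_name || '; END;';
  END LOOP;
END;
/
```

Adding a new transformation then means writing one new stored procedure and one new control row, rather than changing the driver.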
Case Study: Tool
[Diagram: a Driver (stored procedures) reads Control Data and records Tracking Data; it drives a Parallel Implementation (stored procedures), which reads the Source and generates Expected Results, and a Comparator (stored procedures), which reconciles Expected Results against the Target and writes Mismatch Data.]
Case Study: Tool
[Diagram as above]
The Driver controls the scope of execution: expected result generation, reconciliation, or both, down to individual tables / fields / conditions.
The Parallel Implementation is responsible for expected result generation; generation is not dependent on a specific set of input data.
Decision coverage data is used to guide data conditioning, as sketched below.
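One way to realize this (a sketch under assumed names; the deck does not show the real schema) is for the parallel implementation to tag each expected value with the ID of the specification condition that produced it; conditions that no row exercises are then coverage gaps for data conditioning to fill:

```sql
-- Tag each expected value with the condition that produced it (tracking data)
INSERT INTO expected_account (account_no, balance_band, band_cond_id)
SELECT s.account_no,
       CASE
         WHEN s.balance < 0     THEN 'OVERDRAWN'
         WHEN s.balance < 10000 THEN 'STANDARD'
         ELSE                        'PREMIUM'
       END,
       CASE
         WHEN s.balance < 0     THEN 'COND_01'
         WHEN s.balance < 10000 THEN 'COND_02'
         ELSE                        'COND_03'
       END
FROM   source_account s;

-- Conditions never exercised by the current data set: candidates for conditioning
SELECT cond_id FROM spec_condition
MINUS
SELECT band_cond_id FROM expected_account;
```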
Case Study: Tool
[Diagram as above]
The Comparator is responsible for reconciling expected and actual results.
The "secret sauce": tracking data is used to pinpoint the conditions for which mismatches are detected. The final mismatch report identifies which conditions have been executed, and which appear to have problems.
Reconciliation is performed in two phases (see the sketch below):
• Key matching, to identify key-level mismatches
• Field matching of the intersection, to identify field-level mismatches
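A minimal sketch of the two phases, continuing the hypothetical tables above (account_no standing in for the key):

```sql
-- Phase 1: key matching. Keys expected but absent from the actual results
-- (swap the operands to find spurious keys)
SELECT account_no FROM expected_account
MINUS
SELECT account_no FROM actual_account;

-- Phase 2: field matching over the key intersection
SELECT e.account_no,
       e.balance_band AS expected_band,
       a.balance_band AS actual_band,
       e.band_cond_id                 -- tracking data pinpoints the condition at fault
FROM   expected_account e
JOIN   actual_account a ON a.account_no = e.account_no
WHERE  DECODE(e.balance_band, a.balance_band, 0, 1) = 1;  -- DECODE treats two NULLs as a match
```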
Case Study: Tool
• On a shared "run-of-the-mill" development box, the tool was able to reconcile >10 million cells (rows × columns) per minute, enabling execution and reconciliation of a complete load in a matter of hours
• This use of "large scale mechanized checking" enabled a degree of coverage not possible on the previous release, where a more manually intensive approach was applied
Case Study: A Note on Tools
• This is an example of a heavyweight oracle and comparator, designed for reuse and high performance with large data volumes
– Other projects may require neither
• Similar tools can be constructed very quickly using off-the-shelf database software
– For an earlier project, MS Access was used to construct a parallel implementation in a matter of days, and a comparator in under an hour
Case Study: Team
• Data literate, with a mix of manual testers and traditional automators
• Prior SQL but no prior PL/SQL experience
• Ready, willing and able to learn whatever needed to be learned in order to test
• Ready to challenge each other, and to be challenged
Case Study: Lessons Learned
• The tool was "just a tool": human factors were more important, e.g.:
– Availability of information
– Responsibilities and handoffs
– Communication of changes
Case Study: Lessons Learned
• The tool, and the testing, were heuristic in nature:
– Tests (whether manual or automated) can have bugs in them: tests that do not reveal problems don't "pass", they simply do not reveal problems
– Our checks were only as good as the requirements and design, and were blind to requirement and design bugs
– We were forced to use composite keys where true keys could not be replicated, increasing our blindness to certain logic bugs (see the sketch below)
– In some cases, conditions were not mutually exclusive, making isolation of any corresponding mismatch less precise (i.e. at a field rather than condition level)
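To illustrate the composite-key compromise (hypothetical names again): when a target's surrogate key cannot be reproduced by the parallel implementation, reconciliation has to match on a composite of business fields instead, and any duplicates that collapse under that composite go unseen:

```sql
-- The surrogate key cannot be predicted, so match on a composite business key;
-- duplicate rows that share the composite are invisible to this comparison
SELECT e.branch_code, e.account_no, e.open_date
FROM   expected_account e
MINUS
SELECT a.branch_code, a.account_no, a.open_date
FROM   actual_account a;
```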