TRANSCRIPT
Testing ETL: Don't dig a tunnel with a toothpick
Iain McCowatt, http://exploringuncertainty.com
[email protected] | @imccowatt | imccowatt
© Can Stock Photo Inc. / frenta
Introduction: Scope of Presentation
• Functional testing of ETL, primarily for accuracy
• Logic testing of ETL, i.e. verification rather than validation
Context: Types of ETL Logic
Logic Type | Description | Extract | Transform | Load
Selection | Selection of source data | ✓ | ✓ |
Mapping | Simple transfer of data from source field to destination field | | ✓ |
Transformation | Manipulation / normalization of data from source to destination | | ✓ |
Load | Load strategy (e.g. append/merge) | | | ✓
Referential Integrity | | | | ✓
Exception | Management of exceptions (invalid data, invalid lookups, orphan records etc.) | | ✓ | ✓
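To make the taxonomy concrete, here is a sketch in Oracle SQL (not from the presentation; table and column names are hypothetical) showing several logic types at work in a single statement:

```sql
INSERT INTO target_customer (cust_id, full_name, status)
SELECT s.cust_id,                                       -- Mapping: straight field transfer
       TRIM(s.first_name) || ' ' || TRIM(s.last_name),  -- Transformation: normalization
       NVL(s.status, 'UNKNOWN')                         -- Exception: default for missing data
FROM   source_customer s
WHERE  s.record_type = 'CUST';                          -- Selection: choice of source data
```

Load logic would appear as the choice of INSERT (append) versus MERGE (merge), and referential integrity as constraints or lookups on the target.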
Context: Other Factors
• Project constraints: time, cost, quality
• Size
• Complexity
• Types of load (e.g. full, delta etc.)
• Data volume
• Degree of specification
• Tester skills
Test Strategy: Data
Approach | Indications | Contraindications
Production | Need to test with volume; need noise | Data privacy issues
Synthetic | Need to cover logic | Little visibility into the logic
Conditioned | Need volume, noise AND coverage | Data privacy issues; little visibility into the logic
Test Strategy: Oracles
Approach | Indications | Contraindications
Source Data | Minimal selection & transformation | Any complexity
Previous Implementation | Regression testing; previous implementation available | Significant change from previous implementation
Baseline Data | Regression testing; baseline data available | Significant change from previous implementation
Parallel Implementation | No appropriate oracle is available, e.g. new implementation | Significant time and cost constraints; testers lack development skills
Test Strategy: Verification
Approach | Indications | Contraindications
Profiling | Logic is of minimal complexity | Data integrity is critical
Visual Diff | Data integrity is critical | Large data volume; multiple types of load; time or cost is critical
Automated Comparator | Data integrity is critical; testing with large data volumes or multiple types of load | Testers lack development skills
Case Study: Context
• Previous release overran: time critical
• Regulatory project: quality (data integrity) critical
• Significant transformation logic (>130 tables, >950 fields, >2200 conditions), some highly complex, but reasonably specified
• Multiple types of load with different logic (full, delta)
• Preference for testing with production data
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned | Baseline Data | Automated Comparator
 | Parallel Implementation |
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned ✓ | Baseline Data | Automated Comparator
 | Parallel Implementation |
Client preference was for production data, reliability concerns suggested a need for "noisy" data, and the complexity of the transformations drove a need for coverage: conditioned data was the compromise.
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned ✓ | Baseline Data | Automated Comparator
 | Parallel Implementation ✓ |
Complex transformations ruled out the Source Data oracle, and significant changes from the previous release ruled out Previous Implementation and Baseline Data. With no other suitable oracle available, a Parallel Implementation was selected despite time and cost constraints.
Case Study: Strategy
Data | Oracle | Verification
Production | Source Data | Profiling
Synthetic | Previous Implementation | Visual Diff
Conditioned ✓ | Baseline Data | Automated Comparator ✓
 | Parallel Implementation ✓ |
The importance of data integrity ruled out Profiling; given the time and cost constraints, the data volumes and repetition across different types of load ruled out Visual Diff. An Automated Comparator was selected as the more efficient option.
Case Study: Approach
[Diagram: Source → Extract Process → Extracted Data → Transformation Process → Staged Data → Load Process → Target. At both the transformation and load stages a Parallel Implementation generates Expected Results, which are reconciled against the actual output.]
Case Study: Approach
[Diagram as above]
Due to data privacy concerns, all test data was scrubbed. A client (internal) team tested the extract process.
Source data was used to build the expected result set for the transformation process; both source and target (pre-load) data were used to build the expected result set for the load process, as sketched below.
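A minimal sketch of deriving expected results for a transformation directly from source data (the rules and names here are hypothetical, not the project's actual logic; the point is that the expected result set is computed independently of the ETL tool under test):

```sql
-- Re-implement the specified transformation rules against the source,
-- independently of the ETL tool, to produce the expected result set
INSERT INTO expected_account (account_no, balance_band)
SELECT s.account_no,
       CASE
         WHEN s.balance < 0     THEN 'OVERDRAWN'
         WHEN s.balance < 10000 THEN 'STANDARD'
         ELSE                        'PREMIUM'
       END
FROM   source_account s
WHERE  s.status <> 'PURGED';   -- mirror the specified selection logic
```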
Case Study: Approach
[Diagram as above]
Note: the extract process was tested by another team.
Reconciliation is a comparison of the expected result data set generated by the parallel implementation and the actual result data set generated by the ETL tool. Records in the relative complements (i.e. mismatches) indicate potential problems.
[Diagram: Venn diagram of the Expected Data Set and the Actual Data Set; the non-intersecting regions are the mismatches.]
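In Oracle SQL the two relative complements fall out of a pair of MINUS queries; a minimal sketch, assuming hypothetical expected_account and actual_account tables of identical shape:

```sql
-- Expected but not actual: rows the ETL failed to produce, or produced incorrectly
SELECT * FROM expected_account
MINUS
SELECT * FROM actual_account;

-- Actual but not expected: spurious or incorrectly transformed rows
SELECT * FROM actual_account
MINUS
SELECT * FROM expected_account;
```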
Case Study: Process
[Process cycle: Mine Data → Condition Data → Determine Expected Results → Execute ETL → Reconcile Results → Isolate and Report Bugs]
Case Study: Tool
• Some things should be automated; other things should not
• A custom tool was created using Oracle (SQL, PL/SQL)
• This serves as both a parallel implementation and comparator
• A generic framework was used, enabling:
– Easier addition of new transformations
– Control over the scope of any given execution (see the sketch below)
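As an illustration of such a framework (the names are hypothetical, not the project's actual code), a generic driver can dispatch to one stored procedure per transformation, with control data determining what is in scope for a given run:

```sql
-- Hypothetical control table, one row per transformation:
--   CREATE TABLE etl_test_control (
--     proc_name VARCHAR2(30),  -- procedure that generates expected results
--     in_scope  CHAR(1)        -- 'Y' to include in this execution
--   );
BEGIN
  FOR c IN (SELECT proc_name
            FROM   etl_test_control
            WHERE  in_scope = 'Y') LOOP
    -- invoke each in-scope expected-result generator dynamically
    EXECUTE IMMEDIATE 'BEGIN ' || c.proc_name || '; END;';
  END LOOP;
END;
/
```

Adding a new transformation then means writing one new stored procedure and one new control row, rather than changing the driver.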
Case Study: Tool
[Diagram: a Driver (stored procedures) reads Control Data and records Tracking Data; it drives a Parallel Implementation (stored procedures), which reads the Source and generates Expected Results, and a Comparator (stored procedures), which reconciles Expected Results against the Target and writes Mismatch Data.]
Case Study: Tool
[Diagram as above]
The Driver controls the scope of execution: expected result generation, reconciliation, or both, down to individual tables / fields / conditions.
The Parallel Implementation is responsible for expected result generation; generation is not dependent on a specific set of input data.
Decision coverage data is used to guide data conditioning, as sketched below.
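One way to realize this (a sketch under assumed names; the deck does not show the real schema) is for the parallel implementation to tag each expected value with the ID of the specification condition that produced it; conditions that no row exercises are then coverage gaps for data conditioning to fill:

```sql
-- Tag each expected value with the condition that produced it (tracking data)
INSERT INTO expected_account (account_no, balance_band, band_cond_id)
SELECT s.account_no,
       CASE
         WHEN s.balance < 0     THEN 'OVERDRAWN'
         WHEN s.balance < 10000 THEN 'STANDARD'
         ELSE                        'PREMIUM'
       END,
       CASE
         WHEN s.balance < 0     THEN 'COND_01'
         WHEN s.balance < 10000 THEN 'COND_02'
         ELSE                        'COND_03'
       END
FROM   source_account s;

-- Conditions never exercised by the current data set: candidates for conditioning
SELECT cond_id FROM spec_condition
MINUS
SELECT band_cond_id FROM expected_account;
```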
Case Study: Tool
[Diagram as above]
The Comparator is responsible for reconciling expected and actual results.
The "secret sauce": tracking data is used to pinpoint the conditions for which mismatches are detected. The final mismatch report identifies which conditions have been executed, and which appear to have problems.
Reconciliation is performed in two phases (see the sketch below):
• Key matching, to identify key-level mismatches
• Field matching of the intersection, to identify field-level mismatches
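A minimal sketch of the two phases, continuing the hypothetical tables above (account_no standing in for the key):

```sql
-- Phase 1: key matching. Keys expected but absent from the actual results
-- (swap the operands to find spurious keys)
SELECT account_no FROM expected_account
MINUS
SELECT account_no FROM actual_account;

-- Phase 2: field matching over the key intersection
SELECT e.account_no,
       e.balance_band AS expected_band,
       a.balance_band AS actual_band,
       e.band_cond_id                 -- tracking data pinpoints the condition at fault
FROM   expected_account e
JOIN   actual_account a ON a.account_no = e.account_no
WHERE  DECODE(e.balance_band, a.balance_band, 0, 1) = 1;  -- DECODE treats two NULLs as a match
```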
Case Study: Tool
• On a shared "run-of-the-mill" development box, the tool was able to reconcile >10 million cells (rows × columns) per minute, enabling execution and reconciliation of a complete load in a matter of hours
• This use of "large scale mechanized checking" enabled a degree of coverage not possible on the previous release, where a more manually intensive approach was applied
Case Study: A Note on Tools
• This is an example of a heavyweight oracle and comparator, designed for reuse and high performance with large data volumes
– Other projects may require neither
• Similar tools can be constructed very quickly using off-the-shelf database software
– For an earlier project, MS Access was used to construct a parallel implementation in a matter of days, and a comparator in under an hour
Case Study: Team
• Data literate, with a mix of manual testers and traditional automators
• Prior SQL but no prior PL/SQL experience
• Ready, willing and able to learn whatever needed to be learned in order to test
• Ready to challenge each other, and to be challenged
Case Study: Lessons Learned
• The tool was "just a tool": human factors were more important, e.g.:
– Availability of information
– Responsibilities and handoffs
– Communication of changes
Case Study: Lessons Learned
• The tool, and the testing, were heuristic in nature:
– Tests (whether manual or automated) can have bugs in them: tests that do not reveal problems don't "pass", they simply do not reveal problems
– Our checks were only as good as the requirements and design, and were blind to requirement and design bugs
– We were forced to use composite keys where true keys could not be replicated, increasing our blindness to certain logic bugs (see the sketch below)
– In some cases, conditions were not mutually exclusive, making isolation of any corresponding mismatch less precise (i.e. at a field rather than condition level)
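To illustrate the composite-key compromise (hypothetical names again): when a target's surrogate key cannot be reproduced by the parallel implementation, reconciliation has to match on a composite of business fields instead, and any duplicates that collapse under that composite go unseen:

```sql
-- The surrogate key cannot be predicted, so match on a composite business key;
-- duplicate rows that share the composite are invisible to this comparison
SELECT e.branch_code, e.account_no, e.open_date
FROM   expected_account e
MINUS
SELECT a.branch_code, a.account_no, a.open_date
FROM   actual_account a;
```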