Upload: davidmwalker

Post on 15-Nov-2014

Page 1: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Quality: Common Problems & Checks

David M. Walker

+44 (0) 7050 028 911 - http://www.datamgmt.com

Data Management & Warehousing

Date: 24 April 2009 Location: Zagreb, Croatia

Page 2: ETIS09 - Data Quality - Common Problems & Checks - Presentation

24 April 2009 © 2009 Data Management & Warehousing Page 2

Agenda

•  Introduction
•  Common Problems
•  Automated Checking
•  Profiling Checks
•  Conclusions

Page 3: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Introduction

•  Data Quality problems are a SOURCE SYSTEM issue and an ETL issue
    –  They just manifest themselves in the data warehouse
•  Prevention is better than cure
    –  Fixing the source system or the ETL is ALWAYS cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself
•  Data Quality is a continuous process
    –  It is never finished and always needs to be monitored

Page 4: ETIS09 - Data Quality - Common Problems & Checks - Presentation

The Impact of Poor Data Quality

•  Devalues the data warehouse
    –  Discourages people from trusting or using the system, which curtails the life of the data warehouse
•  Highlights failings in the source system and/or the business process
    –  Businesses would rather fix the data in the data warehouse, at any cost, and pretend that there isn’t a source system problem

Page 5: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Common Problems

•  11 types of problem cover the most common data quality issues

•  They usually reflect poor design and/or implementation of systems

•  Most can be fixed or monitored and managed to limit the impact

Page 6: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Referential Issues

•  Keys that are not unique
    –  Systems that do not enforce a unique primary key, or flat files or spreadsheets
    –  Also generated by ETL that creates a surrogate key incorrectly
•  Referential Integrity Failures
    –  Where referential integrity is not enforced, values are created in the child table that are not in the parent table
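Both referential problems can be found with two short queries. The sketch below uses Python's built-in sqlite3 module; the `customer` and `account` tables and their rows are invented for illustration and are not from the presentation.

```python
import sqlite3

# Hypothetical tables: no PRIMARY KEY or FOREIGN KEY is declared,
# so nothing stops duplicate keys or orphaned child rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (customer_id INTEGER, name TEXT);
    CREATE TABLE account  (account_id INTEGER, customer_id INTEGER);
    INSERT INTO customer VALUES (1, 'Ana'), (2, 'Ivan'), (2, 'Ivan B.');
    INSERT INTO account  VALUES (10, 1), (11, 2), (12, 99);
""")

# Keys that are not unique: any key value that appears more than once.
duplicate_keys = conn.execute("""
    SELECT customer_id, COUNT(*) FROM customer
    GROUP BY customer_id HAVING COUNT(*) > 1
""").fetchall()

# Referential integrity failures: child rows with no matching parent row.
orphans = conn.execute("""
    SELECT a.account_id, a.customer_id FROM account a
    LEFT JOIN customer c ON c.customer_id = a.customer_id
    WHERE c.customer_id IS NULL
""").fetchall()

print(duplicate_keys)  # [(2, 2)]
print(orphans)         # [(12, 99)]
```

Both queries return the offending rows, so an empty result means the check passed.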

Page 7: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Type Issues

•  Format Errors
    –  Typically in Date/Time type fields
    –  02/04/2009: 2nd April (UK) or 4th Feb (US)
•  Inappropriate Data Types
    –  Storing Dates in Character Strings
    –  20090624 as YYYYMMDD format string
    –  But what about 20090230?
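The format ambiguity is easy to demonstrate. This is a minimal sketch using the standard datetime module; the `FORMATS` table and the `interpretations` helper are hypothetical names introduced here. A value that parses successfully under more than one format cannot be trusted without knowing the source convention.

```python
from datetime import date, datetime

# Candidate conventions for a slash-separated date string (illustrative only).
FORMATS = {"UK (DD/MM/YYYY)": "%d/%m/%Y", "US (MM/DD/YYYY)": "%m/%d/%Y"}

def interpretations(raw):
    """Return every date the string could mean, keyed by convention."""
    found = {}
    for label, fmt in FORMATS.items():
        try:
            found[label] = datetime.strptime(raw, fmt).date()
        except ValueError:
            pass  # not a valid date under this convention
    return found

print(interpretations("02/04/2009"))  # two different dates: ambiguous
print(interpretations("13/04/2009"))  # only the UK reading is valid
```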

Page 8: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Model Issues

•  De-normalised tables
    –  Commonly created for performance reasons
    –  Inherently duplicates data
    –  Often gets out of sync
•  Data/Column Retirement
    –  Upgrade to system retires a column
    –  ETL continues to use the old column
•  Poor Table/Column Naming
    –  Don’t assume that a column does what it says
    –  Don’t assume that a column is still being used for its original purpose

Page 9: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Content Issues

•  Null Values
    –  Systems that have many optional fields will often have missing values
    –  Null values allow rows to be silently omitted from queries
•  Inappropriate Values
    –  Databases allow special characters and/or leading/trailing white space
    –  “Data<space>Quality” != “Data<tab>Quality”
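A small demonstration of how nulls drop out of query results, again using sqlite3. The `account` table and the `count` helper are assumptions made for this sketch. A row with a NULL status matches neither `status = 'closed'` nor `status <> 'closed'`, so the two counts do not add up to the total.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (account_id INTEGER, status TEXT);
    INSERT INTO account VALUES (1, 'open'), (2, 'closed'), (3, NULL);
""")

def count(where="1 = 1"):
    """Count rows in the hypothetical account table matching a predicate."""
    sql = f"SELECT COUNT(*) FROM account WHERE {where}"
    return conn.execute(sql).fetchone()[0]

total = count()
closed = count("status = 'closed'")
not_closed = count("status <> 'closed'")  # the NULL row is silently omitted
nulls = count("status IS NULL")

print(total, closed, not_closed, nulls)  # 3 1 1 1

# White space differences are just as easy to miss by eye:
print("Data Quality" == "Data\tQuality")  # False
```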

Page 10: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Feed Issues

•  Missing Data
    –  Where a stream of files is loaded by ETL, a dropped file can go unnoticed
    –  Common with CDR type loads in Telcos
•  Late Data
    –  A short term data quality issue
    –  Leaves users believing there is a problem
    –  Produces inconsistent reporting over time
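One way to notice a dropped file is to compare what was loaded against what was expected. The sketch below assumes one feed file per calendar day, which is an assumption of this example rather than something the slides state; `missing_feed_dates` is a hypothetical helper.

```python
from datetime import date, timedelta

def missing_feed_dates(loaded, start, end):
    """Return the days in [start, end] for which no file was loaded."""
    days = (end - start).days + 1
    expected = {start + timedelta(days=i) for i in range(days)}
    return sorted(expected - set(loaded))

loaded = [date(2009, 4, 20), date(2009, 4, 21), date(2009, 4, 23)]
gaps = missing_feed_dates(loaded, date(2009, 4, 20), date(2009, 4, 24))
print(gaps)  # the 22nd and the 24th are missing
```

A day reported as missing may simply be late, so the same check run again later separates late data from lost data.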

Page 11: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Automated Checking

•  Regularly run checks
•  Broad Coverage across systems
    –  100s and 1000s not 10s of queries
    –  Run in a low priority loop in the background
•  Used against:
    –  Sources
    –  Data Staging
    –  Data Warehouse
•  No Product Required
    –  We often implement this as a controlling shell script and lots of small scripts, one for each check
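The controller can be very small. The slides describe a shell script driving many small scripts; the sketch below shows the same shape in Python purely for illustration. It assumes each check is a callable that returns the number of offending rows, with zero meaning the check passed. The check names and the `run_checks` function are hypothetical.

```python
def run_checks(checks):
    """Run every check and record a status; one broken check must not stop the rest."""
    results = {}
    for name, check in checks.items():
        try:
            offending = check()
            results[name] = "green" if offending == 0 else f"red ({offending})"
        except Exception as exc:
            results[name] = f"error ({exc})"
    return results

# Stand-ins for real checks, each of which would normally run one query.
checks = {
    "accounts_without_status": lambda: 0,
    "orphan_accounts": lambda: 3,
    "unreachable_source": lambda: 1 / 0,
}
results = run_checks(checks)
for name, status in results.items():
    print(f"{name}: {status}")
```

Keeping each check tiny and independent is what makes it practical to grow from tens of checks to hundreds or thousands.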

Page 12: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Trending

•  Absolute Trending
    –  Track an expected value over time
        •  e.g. Returned Mail is usually less than 500 items per day
        •  If the value is 500 or less the status is green, 501 to 1000 amber and >1000 red
•  Statistical Process Control (SPC) Trending
    –  Track an expected value where the value changes over time
        •  e.g. Telco CDRs: expect more as the company grows
        •  Don’t want to be continuously changing the threshold
        •  Compare current load to historical means
        •  If current load within 2 Standard Deviations – Green, 3 Standard Deviations – Amber, 4 or more Standard Deviations – Red
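Both kinds of trending reduce to a few lines. This is a sketch under stated assumptions: the function names are invented, the sample standard deviation is used, and because the slide does not say what happens between 3 and 4 standard deviations, anything beyond 3 is treated as red here.

```python
from statistics import mean, stdev

def absolute_status(value, amber_above=500, red_above=1000):
    """Fixed thresholds: suitable when the expected value is stable."""
    if value > red_above:
        return "red"
    if value > amber_above:
        return "amber"
    return "green"

def spc_status(current, history):
    """Compare the current load with the historical mean, in standard deviations."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return "green" if current == mu else "red"
    deviations = abs(current - mu) / sigma
    if deviations <= 2:
        return "green"
    if deviations <= 3:
        return "amber"
    return "red"  # assumption: everything beyond 3 standard deviations

history = [100, 102, 98, 101, 99]   # invented daily load counts
print(absolute_status(750))         # amber
print(spc_status(101, history))     # green
print(spc_status(110, history))     # red
```

Because `spc_status` recomputes the mean and deviation from recent history, the thresholds move with the business and do not need to be edited by hand.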

Page 13: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Flow Control

•  Flow Control
    –  ETL manipulates data
        •  Joins, De-duplicates, Filters, Aggregates, etc.
    –  Use the formula: Source Count – Filtered Count – DeDup Count – Target Count = 0
•  Trusted Source
    –  Compare the result with a third system
    –  e.g. Does the Count of Switch CDRs = Count of those processed by the billing system = Count of those processed in the DWH?
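The flow control formula translates directly into code. In this sketch the counts are assumed to be reported independently by each ETL stage; the function names and the numbers are illustrative only.

```python
def flow_balance(source, filtered, deduplicated, target):
    """Rows unaccounted for by the ETL; zero means every row is explained."""
    return source - filtered - deduplicated - target

def counts_agree(*counts):
    """Trusted source check: every system should report the same count."""
    return len(set(counts)) == 1

print(flow_balance(1000, 40, 10, 950))  # 0
print(flow_balance(1000, 40, 10, 947))  # 3
print(counts_agree(950, 950, 947))      # False
```

A non-zero balance does not say where the rows went, only that a stage lost or gained rows it did not report.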

Page 14: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Business Rules Based

•  Specific rules to match known business rules
    –  Account holders > 18 (Sys Date – DoB)
    –  Account holders < 115
    –  Credit Card numbers are 16 digits long
    –  Number of accounts without a status
•  Result should yield Zero
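Each rule can be written as a count of the rows that break it, so a healthy result is zero for every rule. The sketch below uses invented account records and hypothetical function names, and it treats the age bounds as inclusive (18 to 115), which is an assumption. The reference date is passed in so that the result is repeatable.

```python
from datetime import date

def age_on(dob, today):
    """Whole years between a date of birth and the reference date."""
    return today.year - dob.year - ((today.month, today.day) < (dob.month, dob.day))

def is_16_digits(card):
    return len(card) == 16 and card.isascii() and card.isdigit()

def rule_violations(accounts, today):
    """Count the rows breaking each rule; every count should be zero."""
    return {
        "age_out_of_range": sum(
            1 for a in accounts if not 18 <= age_on(a["dob"], today) <= 115
        ),
        "card_not_16_digits": sum(1 for a in accounts if not is_16_digits(a["card"])),
        "missing_status": sum(1 for a in accounts if not a["status"]),
    }

accounts = [
    {"dob": date(1970, 5, 1), "card": "4111111111111111", "status": "open"},
    {"dob": date(2000, 1, 1), "card": "41111111", "status": None},
]
violations = rule_violations(accounts, today=date(2009, 4, 24))
print(violations)
```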

Page 15: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Automated Checking - Ops

•  Managed by exception
    –  Red given priority
    –  Amber are always followed up
•  Massive number of checks
    –  100s are good
    –  1000s are better
•  Presentation
    –  Alerts, RAG, Graphical, Numerical, etc.

Page 16: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Profiling Checks

•  Run manually because they need to be interpreted by a human

•  Leads to new business rules being added to the automated checks

•  Can be done with simple reporting tools or commercial data profiling tools


Page 17: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Frequency Outliers

•  Count discrete values in a table and check items that occur many more or fewer times than normal
    –  e.g. DoB 01-01-01 many times more common than any other value indicates a source default and something that needs work
    –  e.g. Count of SMS messages significantly lower on a given day may equate to a genuine system failure and therefore not a DQ problem
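A frequency profile is a count per distinct value followed by a comparison against what is typical. In this sketch "typical" is taken to be the median count and the factor of 10 is arbitrary; both are assumptions of the example, and the result still needs a human to interpret it.

```python
from collections import Counter
from statistics import median

def frequency_outliers(values, factor=10):
    """Return values whose count is far above or far below the median count."""
    counts = Counter(values)
    typical = median(counts.values())
    return {
        value: n
        for value, n in counts.items()
        if n > factor * typical or n * factor < typical
    }

# Invented dates of birth: one value is suspiciously common.
dobs = ["1901-01-01"] * 50 + ["1975-03-02", "1980-07-19", "1980-07-19", "1968-11-30"]
outliers = frequency_outliers(dobs)
print(outliers)  # {'1901-01-01': 50}
```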

Page 18: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Maximum & Minimum

•  Determine what a valid range for any value should be
    –  e.g. age between 18 and 115
    –  Immediately finds individual data quality issues that can be resolved
    –  Allows an analyst to create new business rules to prevent future problems
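A range profile needs only the minimum, the maximum and the values that fall outside the agreed bounds. The helper name and the sample ages below are made up for this sketch; nulls are skipped because they are a separate check.

```python
def range_profile(values, low, high):
    """Summarise a column against an expected inclusive range."""
    present = [v for v in values if v is not None]
    return {
        "min": min(present),
        "max": max(present),
        "out_of_range": sorted(v for v in present if not low <= v <= high),
    }

ages = [34, 52, None, 17, 208, 61]
profile = range_profile(ages, low=18, high=115)
print(profile)  # {'min': 17, 'max': 208, 'out_of_range': [17, 208]}
```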

Page 19: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Sequential Keys

•  If a system has a sequential key:

Max Value – Min Value + 1 – Count = 0

•  If this is true: is it too perfect for an operational system, and therefore test data?

•  If this is false: what has caused the gaps, and are the deletions intentional?
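For a gap-free run of unique integer keys, the number of keys equals max minus min plus one, so any shortfall is the number of missing keys. The sketch below assumes the keys are unique integers; `sequence_gaps` is a hypothetical helper.

```python
def sequence_gaps(keys):
    """Return (number of missing keys, the missing keys) for unique integer keys."""
    lo, hi = min(keys), max(keys)
    gap_count = hi - lo + 1 - len(keys)
    missing = sorted(set(range(lo, hi + 1)) - set(keys))
    return gap_count, missing

print(sequence_gaps([1, 2, 3, 4]))  # (0, [])
print(sequence_gaps([1, 2, 4, 7]))  # (3, [3, 5, 6])
```

Listing the missing keys, not just counting them, helps when asking the source system owners whether the deletions were intentional.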

Page 20: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Types

•  Validation of mis-used data types before loading
    –  e.g. Dates in Character fields
    –  Format: YYYYMMDD
    –  Check MM between 01 and 12
    –  Check DD between 01 and 31
    –  Check MMDD does not include 0230, 0231
    –  etc.
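Rather than listing every impossible month and day combination by hand, the string can be handed to a real date type, which rejects all of them including 29 February in non-leap years. A minimal sketch, with `is_valid_yyyymmdd` as a hypothetical helper name:

```python
import re
from datetime import date

def is_valid_yyyymmdd(text):
    """True only if text is exactly eight digits that form a real calendar date."""
    if not re.fullmatch(r"[0-9]{8}", text):
        return False
    try:
        date(int(text[:4]), int(text[4:6]), int(text[6:]))
    except ValueError:
        return False  # e.g. month 13, day 00 or 30 February
    return True

for candidate in ["20090624", "20090230", "20091301", "2009624", "20090229"]:
    print(candidate, is_valid_yyyymmdd(candidate))
```

Only the first candidate is valid: 20090230 and 20090229 are not real dates, 20091301 has month 13, and 2009624 is the wrong length.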

Page 21: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Skewed Pattern Profiling

•  Looking for specific patterns in data
    –  e.g. UK National Insurance Numbers (?) have the format AA 99 99 99 A
    –  Pattern match all values looking for exceptions
•  Number Lengths are a special case
    –  e.g. Credit Card Numbers are 16 digits long
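A common way to profile patterns is to reduce each value to its shape, with letters mapped to A and digits to 9, and then count the shapes. The sketch below checks only the shape quoted on the slide and makes no claim about the full official rules for National Insurance numbers; the function names and sample values are invented.

```python
from collections import Counter

def shape(value):
    """Reduce a value to its pattern: letters become A, digits become 9."""
    return "".join(
        "A" if ch.isalpha() else "9" if ch.isdigit() else ch for ch in value
    )

def pattern_exceptions(values, expected):
    """Return the values whose shape differs from the expected pattern."""
    return [v for v in values if shape(v) != expected]

ni_numbers = ["AB 12 34 56 C", "AB123456C", "ZZ 99 88 77 D", "A1 23 45 67 B"]
print(Counter(shape(v) for v in ni_numbers))
print(pattern_exceptions(ni_numbers, "AA 99 99 99 A"))

# Number lengths are the special case where the expected shape is all digits.
cards = ["4111111111111111", "411111111111"]
print(pattern_exceptions(cards, "9" * 16))
```

Counting shapes first is useful when the expected pattern is not known in advance, because the dominant shape usually reveals it.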

Page 22: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Content Checking

•  Content Checking is the manual review of character strings

•  Needs a good understanding of the nature of the data

•  Often determines the need to do analysis of other types


Page 23: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Nulls & White Space

•  Nulls
    –  Fields that have a large proportion of nulls are usually not useful
    –  Also common is a default status of null (e.g. Account is either closed or null)
•  White Space
    –  Not Null fields with a single space
    –  Tab instead of space
    –  Leading/Trailing white space
    –  Double White Space: “David<space><space>Walker”
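The four white space cases can each be detected with a one-line test. This is a sketch; the issue labels and the `whitespace_issues` helper are names chosen for this example.

```python
import re

def whitespace_issues(value):
    """Return labels for the white space problems found in one string value."""
    issues = []
    stripped = value.strip()
    if value != "" and stripped == "":
        issues.append("blank")             # e.g. a NOT NULL field holding one space
    if "\t" in value:
        issues.append("tab")
    if stripped != "" and value != stripped:
        issues.append("leading/trailing")
    if re.search(r" {2,}", stripped):
        issues.append("double space")
    return issues

for sample in [" ", "David\tWalker", " David Walker ", "David  Walker", "David Walker"]:
    print(repr(sample), whitespace_issues(sample))
```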

Page 24: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Punctuation & Control Chars

•  Punctuation
    –  CSV files that are not properly quoted cause field shifts
    –  Address lines with extra commas
•  Control Characters
    –  Data fields that contain character codes 0 to 31 and 127 to 159 are often ‘invisible’ when viewed in queries but cause failures
    –  Also be aware of ‘code-page’ specifics
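Both problems can be caught before loading. The sketch below uses the standard csv and re modules; the sample file content and the helper names are invented. Note that the 0 to 31 range includes tab, line feed and carriage return, which may be legitimate in some fields.

```python
import csv
import io
import re

# Character codes 0 to 31 and 127 to 159, as listed on the slide.
CONTROL_CHARS = re.compile(r"[\x00-\x1f\x7f-\x9f]")

def control_char_codes(text):
    """Return the numeric codes of any control characters in the text."""
    return [ord(ch) for ch in CONTROL_CHARS.findall(text)]

def rows_with_wrong_field_count(csv_text, expected):
    """Return 1-based row numbers whose field count suggests a field shift."""
    reader = csv.reader(io.StringIO(csv_text))
    return [n for n, row in enumerate(reader, start=1) if len(row) != expected]

csv_text = (
    "id,name,address\n"
    '1,Ana,"1 High St, Zagreb"\n'   # quoted: the comma stays inside the field
    "2,Ivan,2 Low St, Split\n"      # unquoted: the extra comma shifts the fields
)
print(rows_with_wrong_field_count(csv_text, expected=3))  # [3]
print(control_char_codes("Zagreb\x07\x85"))               # [7, 133]
```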

Page 25: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Problem Management Matrix


Page 26: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Continuous DQ Process


Page 27: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Quality is FREE …

… as long as you are prepared to INVEST HEAVILY in it

Philip Crosby 1980

Especially true of Data Quality


Page 28: ETIS09 - Data Quality - Common Problems & Checks - Presentation

Data Quality: Common Problems & Checks

David M. Walker

+44 (0) 7050 028 911 - http://www.datamgmt.com

Data Management & Warehousing

Thank You