etis09 - data quality - common problems & checks - presentation
TRANSCRIPT
Data Quality: Common Problems & Checks
David M. Walker [email protected]
+44 (0) 7050 028 911 - http://www.datamgmt.com
Data Management & Warehousing
Date: 24 April 2009 Location: Zagreb, Croatia
24 April 2009 © 2009 Data Management & Warehousing Page 2
Agenda
• Introduction • Common Problems • Automated Checking • Profiling Checks • Conclusions
Introduction
• Data Quality problems are a SOURCE SYSTEM issue and an ETL issue – They just manifest themselves in the data warehouse
• Prevention is better than cure – Fixing the source system or the ETL is ALWAYS
cheaper and more effective than cleaning the data in the ETL or in the Data Warehouse itself
• Data Quality is a continuous process – It is never finished and always needs to be monitored
24 April 2009 © 2009 Data Management & Warehousing Page 3
The Impact of Poor Data Quality
• Devalues the data warehouse – Discourages people from trusting or using the
system and therefore curtailing the life of the data warehouse
• Highlights failings in the source system and/or the business process – Businesses would rather fix at any cost in the
data warehouse and pretend that there isn’t a source system problem
24 April 2009 © 2009 Data Management & Warehousing Page 4
Common Problems
• 11 types of problem that account for the most common problems
• They usually reflect poor design and/or implementation of systems
• Most can be fixed or monitored and managed to limit the impact
24 April 2009 © 2009 Data Management & Warehousing Page 5
Referential Issues
• Keys that are not unique – Systems that do not enforce the unique
primary key or flat files or spreadsheets – Also generated by ETL that creates a
surrogate key incorrectly • Referential Integrity Failures
– Where referentially integrity is not enforced values in the child table are created that are not in the parent table
24 April 2009 © 2009 Data Management & Warehousing Page 6
Data Type Issues
• Format Errors – Typically in Date/Time type fields – 02/04/2009 2nd April (UK) or 4th Feb (US)
• Inappropriate Data Types – Storing Dates in Character Strings
20090624 as YYYYMMDD format string – But what about 20090230?
24 April 2009 © 2009 Data Management & Warehousing Page 7
Data Model Issues
• De-normalised tables – Commonly created for performance reasons – Inherently duplicates data – Often gets out of sync
• Data/Column Retirement – Upgrade to system retires a column – ETL continues to use the old column
• Poor Table/Column Naming – Don’t assume that a column does what it says – Don’t assume that a column is still being used for it’s original
purpose
24 April 2009 © 2009 Data Management & Warehousing Page 8
Data Content Issues
• Null Values – Systems that have many optional fields will
often have missing values – Null values allow rows to be silently omitted
from queries • Inappropriate Values
– Databases allow special characters and/or leading/trailing white space
– “DataspaceQuality” != “DatatabQuality” 24 April 2009 © 2009 Data Management & Warehousing Page 9
Data Feed Issues
• Missing Data – Where a stream of files are loaded by ETL if
one is dropped it can go un-noticed – Common with CDR type loads in Telcos
• Late Data – A short term data quality issue – Leaves users believing there is a problem – Produces inconsistent reporting over time
24 April 2009 © 2009 Data Management & Warehousing Page 10
Automated Checking
• Regularly run checks • Broad Coverage across systems
– 100s and 1000s not 10s of queries – Run in a low priority loop in the background
• Used against: – Sources – Data Staging – Data Warehouse
• No Product Required – We often implement this as a controlling shell script and lots of
small scripts, one for each check
24 April 2009 © 2009 Data Management & Warehousing Page 11
Trending
• Absolute Trending – Track an expected value over time
• e.g. Returned Mail is usually less than 500 items per day • If the value is <500 status is green, 501 to 1000 amber and >1000
red
• Statistical Process Control (SPC) Trending – Track an expected value where the value changes over time
• e.g. Telco CDRs – expect more as the company grows • Don’t want to be continuously changing the threshold • Compare current load to historical means • If current load within 2 Standard Deviations – Green,
3 Standard Deviations – Amber, 4 or more Standard Deviations - Red
24 April 2009 © 2009 Data Management & Warehousing Page 12
Flow Control
• Flow Control – ETL manipulates data
• Joins, De-duplicates, Filters, Aggregates, etc
– Use the formula: Source Count - Filtered Count – DeDup Count – Target Count = 0
• Trusted Source – Compare the result with a third system – e.g. Does the Count of Switch CDRs =
Count of those processed by the billing system Count of those processed in the DWH
24 April 2009 © 2009 Data Management & Warehousing Page 13
Business Rules Based
• Specific rules to match known business rules – Account holders > 18 (Sys Date – DoB) – Account holders < 115 – Credit Card numbers are 16 digits long – Number of accounts without a status
• Result should yield Zero
24 April 2009 © 2009 Data Management & Warehousing Page 14
Automated Checking - Ops
• Managed by exception – Red given priority – Amber are always followed up
• Massive number of checks – 100’s are good – 1000’s are better
• Presentation – Alerts, RAG, Graphical, Numerical, etc.
24 April 2009 © 2009 Data Management & Warehousing Page 15
Data Profiling Checks
• Run manually because they need to be interpreted by a human
• Leads to new business rules being added to the automated checks
• Can be done with simple reporting tools or commercial data profiling tools
24 April 2009 © 2009 Data Management & Warehousing Page 16
Frequency Outliers
• Count discreet values in a table and check items with many more or less than normal – e.g. DoB 01-01-01 many times more common
than any other value indicates source default and something that needs work
– e.g. Count of SMS messages significantly lower on a given day may equate to a genuine system failure and therefore not a DQ problem
24 April 2009 © 2009 Data Management & Warehousing Page 17
Maximum & Minimum
• Determine what a valid range for any value should be – e.g. age between 18 and 115 – Immediately finds individual data quality
issues that can be resolved – Allows an analyst to create new business
rules to prevent future problems
24 April 2009 © 2009 Data Management & Warehousing Page 18
Sequential Keys
• If a system has a sequential key:
Max Value – Min Value – Count = 0
• If this is true – is it too perfect for an operational system and therefore test data
• If this is false – what has caused the gaps, are the deletions intentional?
24 April 2009 © 2009 Data Management & Warehousing Page 19
Data Types
• Validation of mis-used data types before loading – e.g. Dates in Character fields – Format: YYYYMMDD – Check MM between 01 and 12 – Check DD between 01 and 31 – Check MMDD does not include 0230, 0231 – etc.
24 April 2009 © 2009 Data Management & Warehousing Page 20
Skewed Pattern Profiling
• Looking for specific patterns in data – e.g. UK National Insurance Numbers (?) have
the format AA 99 99 99 A – Pattern match all values looking for
exceptions • Number Lengths are a special case
– e.g Credit Card Numbers are 16 digits long
24 April 2009 © 2009 Data Management & Warehousing Page 21
Content Checking
• Content Checking is the manual review of character strings
• Needs a good understanding of the nature of the data
• Often determines the need to do analysis of other types
24 April 2009 © 2009 Data Management & Warehousing Page 22
Nulls & White Space
• Nulls – Fields that have large proportion of nulls are usually
not useful – Also common is default status of null
(e.g. Account us either closed or null) • White Space
– Not Null fields with a single space – Tab instead of space – Leading/Trailing white space – Double White Space: “DavidSpaceSpaceWalker”
24 April 2009 © 2009 Data Management & Warehousing Page 23
Punctuation & Control Chars
• Punctuation – CSV files that are not properly quoted perform
field shifts – Address lines with extra commas
• Control Characters – Data fields that contain ASCII character codes
0 to 31 and 127 to 159 are often ‘invisible’ when viewed in queries but cause failures
– Also be aware of ‘code-page’ specifics 24 April 2009 © 2009 Data Management & Warehousing Page 24
Problem Management Matrix
24 April 2009 © 2009 Data Management & Warehousing Page 25
Continuous DQ Process
24 April 2009 © 2009 Data Management & Warehousing Page 26
Quality is FREE …
… as long as you are prepared to INVEST HEAVILY in it
Philip Crosby 1980
Especially true of Data Quality
24 April 2009 © 2009 Data Management & Warehousing Page 27
Data Quality: Common Problems & Checks
David M. Walker [email protected]
+44 (0) 7050 028 911 - http://www.datamgmt.com
Data Management & Warehousing
Thank You