A Framework for Exploring Data Quality in a Large Data System
Willard Hom
Institute on Research & Statistics
April 8, 2004
Problem: How can we plan to explore data quality in a large data system?
Basic Response: Match the needs for data quality exploration with your resources/contexts for doing so.
Note on Topic Coverage
• Focus on traditional MIS data (numeric and string variables).
• Not covered are other data forms (audio, visual, GIS, and narrative text/reports).
• MIS staff and expertise are obviously critical, so this talk focuses mostly on contributions researchers can make.
Reasons to Explore Data Quality
• Meet professional duty as researchers.
• Help us to judge the types of analysis that we can do with a data system.
• Prevent others from misusing data.
• Facilitate the improvement of data.
• Counter frequent myths about your data.
• Help justify the agency’s mission & funding (esp. when data is necessary for funding).
Why Exploring Data Quality Is Hard for Organizations
• Inexperience in the topic (and lack of expertise).
• Concrete added expense but no clear value.
• Sunk costs (incl. pressure for a time series).
• Finding errors will not mean perfect data will result.
• Political & administrative sensitivity.
• Not usually mandated.
Some Dimensions of Data Quality
• Accuracy.*
• Completeness.*
• Consistency.
• Currency.
• Accessibility (Usability).
* areas where researchers or statisticians can contribute most effectively
Accuracy
• Closeness to “true” value of a variable.
• Unbiased (absence of systematic error).
• Level of precision (or “coarseness” of the recorded values).
Completeness
• The degree of coverage (in terms of “cases” and of “variables”) for the analysis of a “target population.”*
• No, or very few, missing values where true values exist.
*also issues relating to longitudinal studies and explanatory modeling.
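As a rough illustration of the missing-value side of completeness, the sketch below computes the share of missing entries per variable. The field names (`birth_year`, `goal`) and the set of missing codes are hypothetical, and a blank is counted as missing even though in a real system it might legitimately mean "not applicable."

```python
def missing_rates(records, variables, missing_codes=(None, "", "NA")):
    """Return the fraction of records with a missing value for each variable.

    records: list of dicts, one per case (hypothetical layout).
    missing_codes: values treated as missing (an assumption; real systems
    often use their own codes).
    """
    rates = {}
    for var in variables:
        # A field that is absent from the record is treated like None.
        missing = sum(1 for rec in records if rec.get(var) in missing_codes)
        rates[var] = missing / len(records)
    return rates


records = [
    {"birth_year": 1980, "goal": "transfer"},
    {"birth_year": None, "goal": ""},
    {"birth_year": 1975, "goal": "job"},
    {"goal": "NA"},
]
rates = missing_rates(records, ["birth_year", "goal"])
print(rates)
```

This covers only missing values within cases; coverage of the target population needs an outside data source.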
Consistency
• Equivalence of “instrumentation” and formatting across time, space, and other “batches” of the data collection/management environment.
Currency
• Minimal lag time between occurrence of new phenomenon and the availability of data values in the system to represent the new phenomenon.
• Minimal lag time between discovery of errors and the correction of those errors in historical data.
Accessibility/Usability
• Ease of manipulation by target users (file format, record format, field format, system compatibility)
• Clarity of metadata for proper data analysis.
• Breadth of access (authority for use).
Some Factors in Choosing Which Variables to Explore
• Risk from errors in a variable.
• Ease of error detection for a variable.
• Ease of error correction for a variable.
• Cost of error detection for a variable.
• Cost of error correction for a variable.
Two Basic Data Exploration Tracks
• Data editing and testing.
• Process analysis.
• Both are important to use.
Data Editing/Testing
• Screen for allowable range of values.
• Screen for outliers (univariate or multivariate).
• Statistical quality control methods.
• Screen within a record or across records.
• These can prevent some error and can detect some error, but they rarely find root causes.
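The first two screens above can be sketched as follows. This is a minimal illustration, not a prescribed method: the variable (units attempted), the allowable range of 0 to 30, and the sample values are all hypothetical. The univariate outlier screen uses the modified z-score based on the median and the median absolute deviation, with the 3.5 cutoff suggested by Iglewicz & Hoaglin (1993).

```python
from statistics import median


def range_screen(values, low, high):
    """Return the positions of values outside the allowable range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]


def modified_z_outliers(values, cutoff=3.5):
    """Return the positions of values whose modified z-score exceeds the cutoff.

    The modified z-score is 0.6745 * (x - median) / MAD, where MAD is the
    median absolute deviation from the median.
    """
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        # More than half the values are identical; the score is undefined.
        return []
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > cutoff]


# Hypothetical "units attempted" values for eight student records.
units = [12, 15, 9, 14, 13, 60, -3, 12]
out_of_range = range_screen(units, 0, 30)
outliers = modified_z_outliers(units)
print(out_of_range, outliers)
```

Both screens only flag records for review; a flagged value may still be true, and an unflagged one may still be wrong.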
Some Caveats for Editing/Testing
• Some outliers are true values while some inliers are not. (Error detection is complicated.)
• Testing depends upon the analyst’s ability to use some “gold standard” in a comparison.
• Variables with restricted range of measurement or on a categorical scale present a different challenge than outliers for variables using an interval scale (with no range restriction or truncation).
Process Analysis
• Analyze each step in the data’s history ---which includes the initial data generating step and all ensuing steps in the data processing---right down to the user of a final report or analysis.
Some Caveats for Process Analysis
• This is a multi-disciplinary concept (needing at least MIS, social scientists, and subject matter experts).
• This can cost far more (in time and resources) than the editing/testing track.
• The metadata factor is important here.
• It is critical for finding the root cause of data error.
Administrative Issues in Data Error
• Publish data so that data originators can correct errors in the system (a feedback loop)---another benefit of data “usability.”
• Consider the incentives that data creators or intermediaries have to bias data on purpose (so alter the incentives or monitor closely).
• Consider factors that can motivate the production of more accurate and complete data (show that data get used and how the costs of error will hit them)---especially when lack of effort is the cause.
How Researchers Can Add Value
• Models in social psychology and economics to understand a data generation/processing system.
• Field observation (and interviews) of data collection process.
• Verbal protocol methods for process actors.
• Experimentation to develop improvements.
• Statistical tools for sampling (incl. audit sampling), outliers, odd data patterns, control charts, and validity/reliability studies.
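One of the statistical tools listed above, the control chart, might be applied to data quality roughly as follows. This is a sketch under assumptions: each batch is a hypothetical data submission (for example, one per reporting unit), the error counts are invented, and the limits are the conventional three-sigma limits of a p-chart.

```python
import math


def p_chart_flags(batches):
    """Return positions of batches whose error rate falls outside 3-sigma limits.

    batches: list of (records_submitted, records_with_errors) pairs.
    The center line is the pooled error rate across all batches.
    """
    total_n = sum(n for n, _ in batches)
    total_errors = sum(e for _, e in batches)
    p_bar = total_errors / total_n
    flagged = []
    for i, (n, errors) in enumerate(batches):
        sigma = math.sqrt(p_bar * (1 - p_bar) / n)
        lower = max(0.0, p_bar - 3 * sigma)
        upper = min(1.0, p_bar + 3 * sigma)
        rate = errors / n
        if rate < lower or rate > upper:
            flagged.append(i)
    return flagged


# Hypothetical batches: (records submitted, records failing an edit check).
batches = [(1000, 20), (1000, 25), (1000, 18), (1000, 60), (1000, 22)]
flagged = p_chart_flags(batches)
print(flagged)
```

A flagged batch signals that something in the process changed; finding out what changed is the job of process analysis.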
Some Examples of Exploration
• Determining the number of CC students in an academic year with a bachelor’s degree.*
• Validating the students’ self-reported goals for CC enrollment (at time of initial registration).*
• Checking the accuracy of a flag for first-time CC student.**
* hypothetical example
** actual historical example
Determining the number of CC students in an academic year with a bachelor’s degree.
• A differential fee for CC students who have a BA/BS could motivate students to misreport the prior attainment of a BA/BS.
• A partial test for potential reporting bias could use databases of higher ed. enrollment to check for degree status in a random sample of CC students.
• We could use the sample proportion (in lieu of the “population” proportion in the MIS) if the MIS proportion lies outside the sample’s 95% confidence interval.
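The comparison in the last bullet can be sketched numerically. The sample size, the count of verified degree holders, and the MIS proportion below are hypothetical, and the interval is the simple normal-approximation (Wald) interval, which is only one of several reasonable choices.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Return the normal-approximation 95% confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width


def mis_outside_ci(mis_proportion, successes, n):
    """True if the MIS proportion lies outside the sample's 95% interval."""
    low, high = proportion_ci(successes, n)
    return not (low <= mis_proportion <= high)


# Hypothetical audit: 48 of 400 sampled students verified as holding a BA/BS,
# while the MIS reports that 7% of students hold one.
low, high = proportion_ci(48, 400)
use_sample = mis_outside_ci(0.07, 48, 400)
print(round(low, 4), round(high, 4), use_sample)
```

With these invented figures the MIS proportion falls below the interval, which would be consistent with under-reporting of prior degrees.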
Validating the students’ self-reported goals for CC enrollment (at time of initial registration).
• Student-reported goals may lack validity if students give the question no cognitive effort.
• A sample of CC students could be re-interviewed (phone or face-to-face) to check the reliability of the initial response (a case of test-retest reliability).
• A qualitative evaluation could use field observation and/or post-survey de-briefing.
• Researchers could use the verbal protocol method to understand the ways that students interpret the question.
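For the re-interview check described above, agreement between the initial and repeated responses could be summarized as in the sketch below. The goal categories and responses are hypothetical, and Cohen's kappa is offered here as one common chance-corrected agreement measure for categorical answers, not as the method the slides prescribe.

```python
from collections import Counter


def percent_agreement(first, second):
    """Share of cases where the initial and re-interview answers match."""
    matches = sum(a == b for a, b in zip(first, second))
    return matches / len(first)


def cohens_kappa(first, second):
    """Agreement corrected for the agreement expected by chance."""
    n = len(first)
    observed = percent_agreement(first, second)
    counts_1, counts_2 = Counter(first), Counter(second)
    expected = sum(counts_1[k] * counts_2[k] for k in counts_1) / (n * n)
    if expected == 1:
        # Every answer falls in one category on both occasions.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical goals reported at registration and at re-interview.
initial = ["transfer", "transfer", "job", "job",
           "personal", "transfer", "job", "personal"]
retest = ["transfer", "job", "job", "job",
          "personal", "transfer", "transfer", "personal"]
agreement = percent_agreement(initial, retest)
kappa = cohens_kappa(initial, retest)
print(agreement, round(kappa, 3))
```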
Checking the accuracy of a flag for first-time CC student.
• CC students mark their status as “first-time CC students” or some other category, but field staff noted apparent reporting errors.
• The state-wide MIS has records of CC enrollment by individual student over a span of years.
• Programmers checked each new cohort of CC students for any prior CC enrollment.
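A check of this kind might look roughly like the following. This is not the program that was actually used; the record layout, student identifiers, and integer term codes are hypothetical and serve only to show the logic of comparing the flag against enrollment history.

```python
def find_flag_errors(cohort, enrollment_history):
    """Compare each student's first-time flag with prior enrollment records.

    cohort: list of (student_id, term, flagged_first_time) tuples.
    enrollment_history: dict mapping student_id to a list of term codes in
    which the student was enrolled (integer codes that sort chronologically).
    """
    errors = []
    for student_id, term, flagged_first_time in cohort:
        earlier = [t for t in enrollment_history.get(student_id, ()) if t < term]
        if flagged_first_time and earlier:
            errors.append((student_id, "flagged first-time, prior enrollment found"))
        elif not flagged_first_time and not earlier:
            # Only a candidate error: the student may have enrolled earlier
            # somewhere the system does not cover.
            errors.append((student_id, "not flagged first-time, no prior enrollment found"))
    return errors


# Hypothetical history and new cohort for term 20041.
history = {"A1": [20021, 20041], "B2": [20041], "C3": [20031, 20041]}
cohort = [("A1", 20041, True), ("B2", 20041, True), ("C3", 20041, False)]
flag_errors = find_flag_errors(cohort, history)
print(flag_errors)
```

The check depends on each student having a unique identifier that is stable across years.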
Some Pitfalls in Data Exploration
• If cases lack unique identifiers, you can’t use another data source to cross-check for data agreement.
• Even if cross-referencing indicates disagreement in data values, we may not know which source, if either of them, has the correct values.
• To find coverage errors (target population errors), you need alternate data and a statistical analysis of population profiles (because total N’s may agree).
• Survey data demand special methods such as re-interviewing and instrument validation.
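The point about coverage errors can be illustrated with a small profile comparison. In the sketch below, the age groups, MIS counts, and reference shares from an alternate source are all hypothetical; the statistic is the ordinary chi-square goodness-of-fit statistic, and the 5% critical values are the standard table values for 1 to 4 degrees of freedom.

```python
# Standard chi-square critical values at the 5% level, by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def profile_chi_square(observed_counts, reference_shares):
    """Chi-square statistic comparing observed group counts with the counts
    expected if the population followed the reference shares."""
    total = sum(observed_counts.values())
    stat = 0.0
    for group, share in reference_shares.items():
        expected = total * share
        observed = observed_counts.get(group, 0)
        stat += (observed - expected) ** 2 / expected
    return stat


# Hypothetical: the MIS total matches the alternate source (1,000 students),
# but the age profile does not.
mis_counts = {"under 25": 500, "25-39": 300, "40+": 200}
reference_shares = {"under 25": 0.40, "25-39": 0.35, "40+": 0.25}
stat = profile_chi_square(mis_counts, reference_shares)
df = len(reference_shares) - 1
profiles_differ = stat > CRITICAL_05[df]
print(round(stat, 2), profiles_differ)
```

A significant difference shows only that the two sources disagree; it does not show which one describes the target population correctly.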
Rule of Thumb 1:
Do data exploration as near to the data generating step as possible; this will help in achieving correct data.
Reasons for RoT 1:
• As time and distance from the data source increase, the chances of getting correct data decrease.
• As more time passes and the data move further into a data system, the risk that bad data will spread grows---making amelioration more difficult.
Rule of Thumb 2:
It’s impossible to achieve perfect data: seek to find the levels of quality that are critical.
Reasons for RoT 2:
• In large data systems, some loss of quality is inevitable.
• Usually, we cannot afford to achieve perfect data.
• Usually, we can achieve analytical goals with less-than-perfect data.
Rule of Thumb 3:
With limited resources, we will need to trade off breadth for depth in data exploration.
Reasons for RoT 3:
• In-depth data exploration takes time and expertise—which agencies usually have in limited supply.
Rule of Thumb 4:
Expertise in statistical analysis and research in the relevant subject area are indispensable to effective data quality exploration (in concert with MIS staff).
Reasons for RoT 4:
• Staff who have only MIS expertise (with no expertise in statistical analysis or the relevant research topic):
1. cannot fully understand the level of data quality (accuracy and completeness) needed, and
2. cannot use the various statistical methods to detect potentially erroneous data.
How Critical Is Data Quality for Your System?
• Are the fates of clients dependent on data quality?
• Is program funding or program evaluation directly linked to your data?
• Does your job depend upon data quality? (If your data are poor, will it be outsourced?)
Can You Document Your Current Data Quality?
• Do you have a system in place that prevents data errors?
• Do you have a system in place that measures data quality (and detects error)?
• How rigorous are your steps to prevent error and your measures of data quality?
Is There A Credibility Gap?
• Do analysts/decision-makers downplay the reports or conclusions that are based on your data system?
• Do analysts/decision-makers prefer alternate data sources when they draw their conclusions?
What Are Your Capacities?
• Does your agency have close control over the data system (that is, “cradle-to-grave”)?
• Does your agency have researchers with the skill/education to explore system data quality?
• Does your agency have MIS staff with skill/education to explore system data quality?
• Does your agency have time, staff availability, and funds to undertake data quality exploration?
• Is the data system a stable, long-term operation?
Is There Management Support?
• Does management want a short-term solution---basically a “defensive” agenda---just find ways to rebut criticisms of your data’s quality?
• Does management want a long-term solution to data quality issues---a comprehensive strategy to prevent error and to raise quality?
A Suggested Framework (assuming a bottom-up mode)
Assess importance of DQ.
Can you document current DQ?
Is there a credibility gap?
What are your capacities?
Is there management support?
P.S. Some Factors in Initial Data Quality Problems
• Researchers may not have an active role in the design of data systems (an administrative and political issue)---if the agency has qualified researchers at all.
• Researchers may not have enough time, tools, or special training to help plan valuable outputs that a proposed data system could deliver.
• Analytical needs change but systems often do not adapt well to emerging needs or environmental changes.
“Nutshell” Bibliography
• Dasu, T. & T. Johnson. (2003). Exploratory Data Mining and Data Cleaning. John Wiley & Sons: New York.
• Iglewicz, B. & D.C. Hoaglin. (1993). How to Detect and Handle Outliers. American Society for Quality Control: Milwaukee, Wisconsin.
“Nutshell” Bibliography (cont.)
• Naus, J.I. (1975). Data Quality Control and Editing. Marcel Dekker: New York.
• Olson, J.E. (2003). Data Quality: The Accuracy Dimension. Morgan Kaufmann: San Francisco.
• Redman, T.C. (1992). Data Quality: Management and Technology. Bantam: New York.
Willard Hom
• Director of Research & Planning Unit, Chancellor’s Office, California Community Colleges, 1102 Q Street, Sacramento, CA 95814-6511
• E-mail: [email protected]
• Phone: (916) 327-5887