data management lab: session 3 slides
DESCRIPTION
Data Management Lab: Session 3 slides (more details at http://ulib.iupui.edu/digitalscholarship/dataservices/datamgmtlab)
What you will learn:
1. Build awareness of research data management issues associated with digital data.
2. Introduce methods to address common data management issues and facilitate data integrity.
3. Introduce institutional resources supporting effective data management methods.
4. Build proficiency in applying these methods.
5. Build strategic skills that enable attendees to solve new data management problems.
TRANSCRIPT
Research Data Management
Spring 2014: Session 3
Practical strategies for better results
University Library Center for Digital Scholarship
QUALITY ASSURANCE & CONTROL MODULE 3
LEARNING OUTCOMES
• Develop procedures for quality assurance and quality control activities.
Data Integrity
1. Data have integrity if they have been maintained without unauthorized alteration or destruction
2. Data have integrity if they have a complete or whole structure. (http://www.princeton.edu/~achaney/tmve/wiki100k/docs/Data_integrity.html)
Data Quality
• Fitness for use (depends on context of your questions)
• Data quality is the most important aspect of data management
• Ensured by
– Sufficient resources and expertise
– Paying close attention to the design of data collection instruments
– Creating appropriate entry, validation, and reporting processes
– Ongoing QC processes
– Understanding the data collected
Chapman, 2005
Dept of Biostatistics – Data Management, IUSM
Data Quality Standards
• Check data for logical consistency.
• Check data for reasonableness.
• Ensure adherence to sound estimation methodologies.
• Ensure adherence to monetary submission standards for stolen and recovered property.
• Ensure that other statistical edit functions are processed within established parameters.
FBI: http://www.fbi.gov/about-us/cjis/ucr/data_quality_guidelines
Dept of Biostatistics – Data Management, IUSM
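The first two checks, logical consistency and reasonableness, can be sketched in a few lines of Python. This is a minimal illustration only: the field names (`age`, `height_m`, `weight_kg`, `bmi`) and the numeric limits are assumptions made for the example, not standards taken from the FBI guidelines or from this lab.

```python
def check_record(record):
    """Return a list of problems found in one record (empty list = passes)."""
    problems = []

    # Reasonableness: values fall in a plausible range (limits are assumptions).
    if not 0 <= record["age"] <= 120:
        problems.append("age out of range")
    if not 0.5 <= record["height_m"] <= 2.5:
        problems.append("height out of range")

    # Logical consistency: BMI must agree with weight / height squared.
    if record["height_m"] > 0:
        expected_bmi = record["weight_kg"] / record["height_m"] ** 2
        if abs(expected_bmi - record["bmi"]) > 0.5:
            problems.append("bmi inconsistent with height and weight")

    return problems


# Hypothetical records for illustration.
good = {"age": 34, "height_m": 1.70, "weight_kg": 65.0, "bmi": 22.5}
bad = {"age": 340, "height_m": 1.70, "weight_kg": 65.0, "bmi": 30.0}

print(check_record(good))
print(check_record(bad))
```

The point of the sketch is that a written standard ("age must be between 0 and 120") becomes something a computer can enforce on every record.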
Data Entry and Manipulation
• Strategies for preventing errors from entering a dataset
• Activities to ensure quality of data before collection
• Activities that involve monitoring and maintaining the quality of data during the study
Data Entry and Manipulation
• Define & enforce standards
◦ Formats
◦ Codes
◦ Measurement units
◦ Metadata
• Assign responsibility for data quality
◦ Be sure assigned person is educated in QA/QC
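One way to enforce standards for formats, codes, and units is to write them down as a machine-readable codebook and check every record against it. The sketch below is hypothetical throughout: the field names, the code values, and the date format are assumptions chosen for illustration.

```python
import re

# Hypothetical codebook: every name, code, and format below is an assumption.
CODEBOOK = {
    "sex": {"allowed": {1, 2, 9}},                    # 1 = female, 2 = male, 9 = missing
    "visit_date": {"pattern": r"\d{4}-\d{2}-\d{2}"},  # YYYY-MM-DD
    "height_cm": {"min": 50, "max": 250},             # unit stated in the field name
}


def violations(record):
    """Return a list of codebook violations for one record."""
    found = []
    for field, rule in CODEBOOK.items():
        value = record.get(field)
        if value is None:
            found.append(f"{field}: missing")
        elif "allowed" in rule and value not in rule["allowed"]:
            found.append(f"{field}: code {value!r} not in codebook")
        elif "pattern" in rule and not re.fullmatch(rule["pattern"], str(value)):
            found.append(f"{field}: bad format {value!r}")
        elif "min" in rule and not rule["min"] <= value <= rule["max"]:
            found.append(f"{field}: {value!r} outside {rule['min']}-{rule['max']}")
    return found


print(violations({"sex": 1, "visit_date": "2014-02-03", "height_cm": 170}))
print(violations({"sex": 3, "visit_date": "02/03/2014", "height_cm": 1.7}))
```

In the second record the height was entered in metres rather than centimetres, which the range rule catches as a unit error.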
Quality Assurance v. Control
• QA: set of processes, procedures, and activities that are initiated prior to data collection to ensure the expected level of quality will be reached and data integrity will be maintained.
• QC: a system for verifying and maintaining a desired level of quality in a product or service.
http://c2.com/cgi/wiki?QualityAssuranceIsNotQualityControl
Quality Assurance in Practice
• CRF (data collection instrument) review & validation
• System/process testing & validation
• Training, education, communication of a team
• Standard Operating Procedures, Standard Operating Guidelines
• Site audits
Dept of Biostatistics – Data Management, IUSM
Quality Control in Practice
• Set of processes, procedures, and activities associated with monitoring, detection, and action during and after data collection.
• Examples:
– Errors in individual data fields
– Systematic errors
– Violation of protocol
– Staff performance issues
– Fraud or scientific misconduct
Dept of Biostatistics – Data Management, IUSM
Activity
Define data quality standards for the following variables:
• Age
• Height
• BMI
• Life satisfaction scale
• Number of close friends
Don’t forget to upload this to Box. Suggested file name “Data Quality Standards”
References
1. Department of Biostatistics – Data Management Team, Indiana University School of Medicine (2013). Data Management including REDCap. (provided via email)
2. Chapman, A. D. 2005. Principles of Data Quality, version 1.0. Report for the Global Biodiversity Information Facility, Copenhagen. ISBN 87-92020-03-8. http://www.gbif.org/resources/2829
3. DataONE Education Module: Data Quality Control and Assurance. DataONE. From http://www.dataone.org/sites/all/documents/L05_DataQualityControlAssurance.pptx
DATA COLLECTION MODULE 3
LEARNING OUTCOMES
• Describe key considerations for selecting data collection tools.
Choose your tools wisely
Allie Brosh, 2010
Activity
Draft data collection instrument
See document “DataMgmtLab-Spr14-CollectionCodingEntry_EX”
Don’t forget to upload this to Box. Suggested file name “Data Collection Tool”
References
1. Brosh, A. 2010. Boyfriend doesn’t have ebola. Probably. http://hyperboleandahalf.blogspot.com/2010/02/boyfriend-doesnt-have-ebola-probably.html
DATA CODING & ENTRY MODULE 3
LEARNING OUTCOMES
• Use best practices for coding.
• Use best practices for data entry.
Goals of Data Entry
• Publishable results!
– Valid data that are organized to support smooth analysis
• Easy to import into analytical program
• Minimize manipulations and errors
• Has a logical [data] structure
Activity
Draft data coding scheme for data entry
• Review data entry best practices document in Box
Don’t forget to upload this to Box. Suggested file name “Coding Scheme”
References
1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx
2. Tilmes, C. (2011). Data Management 101 for the Earth Scientist presented at the AGU Workshop. From http://wiki.esipfed.org/index.php/2011AGUworkshop
3. Scott, T. (2012). Guidelines to Data Collection and Data Entry, Vanderbilt CRC Research Skills Workshop Series. From http://www.mc.vanderbilt.edu/gcrc/workshop_files/2012-09-07.pdf
DATA SCREENING & CLEANING MODULE 3
LEARNING OUTCOMES
• Develop a screening and cleaning protocol and/or checklist.
Data Entry and Manipulation
Data Contamination
• Process or phenomenon, other than the one of interest, that affects the variable value
• Erroneous values
CC image by Michael Coghlan on Flickr
Data Entry and Manipulation
• Errors of Commission
◦ Incorrect or inaccurate data entered
◦ Examples: malfunctioning instrument, mistyped data
• Errors of Omission
◦ Data or metadata not recorded
◦ Examples: inadequate documentation, human error, anomalies in the field
CC image by Nick J Webb on Flickr
Data Entry and Manipulation
• Double entry
◦ Data keyed in by two independent people
◦ Check for agreement with computer verification
• Record a reading of the data and transcribe from the recording
• Use text-to-speech program to read data back
CC image by weskriesel on Flickr
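The "check for agreement with computer verification" step of double entry can be sketched as a field-by-field comparison of the two keyed copies. This is a minimal sketch under assumptions: the record IDs, field names, and values are hypothetical, and real projects often rely on a data entry tool's built-in comparison instead.

```python
def compare_entries(first, second):
    """Compare two independently keyed copies of the same records.

    Each argument maps a record ID to a dict of field values.
    Returns a list of (record_id, field, first_value, second_value).
    """
    mismatches = []
    for record_id in sorted(set(first) | set(second)):
        a = first.get(record_id, {})
        b = second.get(record_id, {})
        for field in sorted(set(a) | set(b)):
            if a.get(field) != b.get(field):
                mismatches.append((record_id, field, a.get(field), b.get(field)))
    return mismatches


# Hypothetical entries keyed by two different people.
entry_a = {"P01": {"age": 15, "friends": 3}, "P02": {"age": 14, "friends": 5}}
entry_b = {"P01": {"age": 15, "friends": 3}, "P02": {"age": 41, "friends": 5}}

for row in compare_entries(entry_a, entry_b):
    print(row)  # ('P02', 'age', 14, 41)
```

Each mismatch is resolved by going back to the original form, not by guessing which copy is right.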
Data Entry and Manipulation
• Design data storage well
◦ Minimize the number of times items must be entered repeatedly
◦ Use consistent terminology
◦ Atomize data: one cell per piece of information
• Document changes to data
◦ Avoids duplicate error checking
◦ Allows undo if necessary
Data Entry and Manipulation
• Make sure data line up in proper columns
• No missing, impossible, or anomalous values
• Perform statistical summaries
CC image by chesapeakeclimate on Flickr
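The three screening checks above (columns line up, no missing or impossible values, statistical summaries) can be sketched with Python's standard library. The file contents, column names, and expected column count are hypothetical and chosen only to show each kind of problem once.

```python
import csv
import io
import statistics

# Hypothetical data: the columns and values are made up for illustration.
RAW = """id,age,height_cm
P01,15,162
P02,14,
P03,16,1580
P04,15,171,extra
"""


def screen(text, expected_columns=3):
    """Return (issues, summary) for the height_cm column of a small CSV."""
    body = list(csv.reader(io.StringIO(text)))[1:]  # skip the header row
    issues, heights = [], []
    for line_no, row in enumerate(body, start=2):
        if len(row) != expected_columns:   # data do not line up in columns
            issues.append((line_no, "wrong number of columns"))
            continue
        if row[2] == "":                   # missing value
            issues.append((line_no, "height_cm missing"))
            continue
        heights.append(float(row[2]))
    summary = {
        "n": len(heights),
        "min": min(heights),
        "max": max(heights),
        "median": statistics.median(heights),
    }
    return issues, summary


issues, summary = screen(RAW)
print(issues)
print(summary)
```

The summary does not flag the impossible height by itself, but a maximum of 1580 cm is obvious to anyone who reads it, which is why summaries belong in every screening protocol.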
Data Entry and Manipulation
• Look for outliers
◦ Outliers are extreme values for a variable given the statistical model being used
◦ The goal is not to eliminate outliers but to identify potential data contamination
Data Entry and Manipulation
• Methods to look for outliers
◦ Graphical
– Normal probability plots
– Regression
– Scatter plots
◦ Maps
◦ Subtract values from mean
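The last method, subtracting values from the mean, is commonly scaled by the standard deviation to give a z-score. The sketch below assumes that variant; the threshold of 2 standard deviations and the sample heights are assumptions for illustration, and flagged values are candidates for review, not for automatic deletion.

```python
import statistics


def flag_outliers(values, threshold=2.0):
    """Return (index, value, z) for values more than `threshold` SDs from the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation; needs 2+ values
    if sd == 0:
        return []
    flagged = []
    for i, v in enumerate(values):
        z = (v - mean) / sd
        if abs(z) > threshold:
            flagged.append((i, v, round(z, 2)))
    return flagged


# Hypothetical heights in cm; one value looks like a misplaced decimal point.
heights_cm = [158, 162, 160, 165, 159, 161, 163, 1580, 164, 160]
print(flag_outliers(heights_cm))
```

A caution on the threshold: in a sample of n values, no sample z-score can exceed (n - 1) / sqrt(n), so with ten observations a cutoff of 3 would never flag anything.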
Data Entry and Manipulation
• Data contamination results from a factor not examined by the study that alters data values
• Data error types: commission or omission
• Quality assurance and quality control are strategies for
◦ preventing errors from entering a dataset
◦ ensuring data quality for entered data
◦ monitoring and maintaining data quality throughout the project
• Identify and enforce quality assurance and quality control measures throughout the Data Life Cycle
Discussion
Using the Data Review Checklist, evaluate the HBSC codebook “DataMgmtLab-Spr14_DataReviewChecklist_EX”
What screening & cleaning procedures were used?
Data Entry and Manipulation
1. D. Edwards, in Ecological Data: Design, Management and Processing, WK Michener and JW Brunt, Eds. (Blackwell, New York, 2000), pp. 70-91. Available at www.ecoinformatics.org/pubs
2. R. B. Cook, R. J. Olson, P. Kanciruk, L. A. Hook, Best practices for preparing ecological data sets to share and archive. Bull. Ecol. Soc. Amer. 82, 138-141 (2001).
3. A. D. Chapman, “Principles of Data Quality. Report for the Global Biodiversity Information Facility” (Global Biodiversity Information Facility, Copenhagen, 2004). Available at http://www.gbif.org/communications/resources/print-and-online-resources/download-publications/bookelets/
References
1. Cook, 2013, NACP Best Data Management Practices Workshop. From http://daac.ornl.gov/NACP_AIM_2013/04_data_management_cook_2013.02.03.ppt
2. Simmhan, Y. L., Plale, B., & Gannon, D. (2005). A survey of data provenance in e-Science. SIGMOD Record, 34(3), 31-36. From http://www.sigmod.org/publications/sigmod-record/0509/p31-special-sw-section-5.pdf
3. Ram, S. (2012). Emerging Role of Social Media in Data Sharing and Management. From http://www.slideshare.net/INSITEUA/provenance-management-to-enable-data-sharing
AUTOMATION MODULE 3
LEARNING OUTCOMES
• Explain why automation provides better provenance than manual processes.
• Identify effective tools for automating data processing and analysis.
Choose your tools wisely
• Documents
• Excel
• Access
• SPSS, Minitab
• Mathematica, MATLAB, Scilab
• SAS, Stata
• R
• MapReduce
• NVivo, Atlas.ti, Dedoose, HyperRESEARCH, etc.
http://www.dataone.org/all-software-tools
Data Formats; Version 1.0
Overview
• Spreadsheets are amazingly flexible, and are commonly used for data collection, analysis and management
• Spreadsheets are seldom self-documenting, and seldom well-documented
• Subtle (and not so subtle) errors are easily introduced during entry, manipulation and analysis
• Spreadsheet conventions – often ad hoc and evolutionary – may change or be applied inconsistently
• Spreadsheet file formats are proprietary and thus generally unacceptable for long-term archival purposes
Data Entry and Manipulation
Spreadsheets
• Great for charts, graphs, calculations
• Flexible about cell content type: cells in the same column can contain numbers or text
• Lack record integrity: a column can be sorted independently of all others
• Easy to use, but harder to maintain as the complexity and size of the data grow
Databases
• Easy to query to select portions of data
• Data fields are typed
– For example, only integers are allowed in integer fields
• Columns cannot be sorted independently of each other
• Steeper learning curve than a spreadsheet
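Typed data fields can be demonstrated with SQLite through Python's built-in `sqlite3` module. SQLite is used here only as a convenient stand-in for a relational database; the table, column names, and age limits are hypothetical. Because SQLite is lenient about column types by default, the sketch adds an explicit CHECK constraint to enforce the integer rule.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical table; the CHECK constraint makes the field type and range explicit.
conn.execute(
    """
    CREATE TABLE participant (
        id  TEXT PRIMARY KEY,
        age INTEGER NOT NULL
            CHECK (typeof(age) = 'integer' AND age BETWEEN 0 AND 120)
    )
    """
)


def try_insert(pid, age):
    """Insert one row; return True if the database accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO participant (id, age) VALUES (?, ?)", (pid, age)
            )
        return True
    except sqlite3.IntegrityError:
        return False


print(try_insert("P01", 15))         # a valid integer age
print(try_insert("P02", "fifteen"))  # text in an integer field
print(try_insert("P03", 150))        # an integer outside the allowed range
```

A spreadsheet would accept all three rows without complaint; here the database rejects the second and third at entry time.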
NACP Best Data Management Practices, February 3, 2013
5. Preserve information (cont)
• Use a scripted language to process data
– R Statistical package (free, powerful)
– SAS
– MATLAB
• Processing scripts are records of processing
– Scripts can be revised, rerun
• Graphical User Interface-based analyses may seem easy, but don’t leave a record
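To make "processing scripts are records of processing" concrete, here is a minimal sketch of a scripted cleaning step. It is written in Python rather than the R, SAS, or MATLAB named on the slide, and the column names and cleaning rules are assumptions for illustration. Every transformation is written down in code and logged, so the analysis can be rerun or revised later.

```python
import csv
import io


def process(raw_text):
    """Clean a small CSV; return (cleaned_rows, log_of_steps)."""
    log = []
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    log.append(f"read {len(rows)} rows")

    # Step 1: drop rows with no height recorded.
    kept = [r for r in rows if r["height_cm"].strip()]
    log.append(f"dropped {len(rows) - len(kept)} rows with missing height_cm")

    # Step 2: add height in metres as a new column; the raw value is kept.
    for r in kept:
        r["height_m"] = round(float(r["height_cm"]) / 100, 2)
    log.append("added height_m = height_cm / 100")

    return kept, log


# Hypothetical raw data.
raw = "id,height_cm\nP01,162\nP02,\nP03,171\n"
cleaned, steps = process(raw)
for step in steps:
    print(step)
```

The same edits made by hand in a spreadsheet would leave no trace of which rows were dropped or why.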
Provenance, Audit Trails, etc.
• “…information that helps determine the derivation history of a data product, starting from its original sources.” (Simmhan et al, 2005)
– Ancestral data products from which the data evolved
– Process of transformation of these ancestral data products
• Uses: data quality, audit trail, replication recipe, attribution, informational
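The two parts of provenance named above, the ancestral data and the process of transformation, can be captured in a small record saved alongside each derived file. This is one possible sketch, not a standard provenance format: the record layout, file name, and step descriptions are all assumptions.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(source_name, source_bytes, steps):
    """Describe where a derived data product came from and how it was made."""
    return {
        # Ancestral data product, identified by name and content hash.
        "source": source_name,
        "source_sha256": hashlib.sha256(source_bytes).hexdigest(),
        # Process of transformation, in the order the steps were applied.
        "steps": list(steps),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


# Hypothetical source file contents and processing steps.
record = provenance_record(
    "survey_raw.csv",
    b"id,height_cm\nP01,162\n",
    ["dropped rows with missing height_cm", "added height_m = height_cm / 100"],
)
print(json.dumps(record, indent=2))
```

The content hash lets a later reader confirm that the raw file they hold is the same one the derived data came from.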
More Considerations
• Field names & descriptions
• Structured entry
• Validation
• Record integrity
• Missing data
• Data/field types
• File types: common, open documented standard
• Output required for analysis and visualization
Demonstration & Discussion
Run [analysis] in Excel and Stata. Compare output.
• What features does Stata have that Excel does not?
• How do these features support provenance and data integrity?
References
1. DataONE Education Module: Data Entry and Manipulation. DataONE. From http://www.dataone.org/sites/all/documents/L04_DataEntryManipulation.pptx