Conquering Chaos in the Age of Networked Science:
Research Data Management*
*Adaptation of the NECDMC First Module
Kathryn M. Houk, MLIS
Tufts University Hirsh Health Sciences Library
Wednesday, June 4, 2014
Librarians: Your Partners in Research
Today’s Objectives
Recognize what research data is and what data
management entails
Recognize why managing data is important
Identify common data management issues
Learn best practices and resources for managing these
issues
Learn about how the library can help you identify data
management resources, tools, and best practices
What is Data?
• “Research data, unlike other types of information, is
collected, observed, or created, for purposes of
analysis to produce original research results”
(University of Edinburgh).
• Observational
• Experimental
• Simulation data
• Derived or compiled data
Why Should I Manage it?
• Transparency & Integrity
• Compliance
Science & Personal Benefits
• Who uses your data now?
• Who COULD use your data?
• Shared/Open Data
• Scientific progress
• Impact on your career
• Citation counts
What if I Don’t Consider RDM?
Data Sharing and Management Snafu in 3 Short Acts:
A data management horror story by Karen Hanson, Alisa Surkis and Karen Yacobucci.
http://www.youtube.com/watch?v=N2zK3sAtr-4
Data Management Planning vs. a DMP
Data Management Plans
• What types of data will be created?
• Who will own, have access to, and be responsible
for managing these data?
• What equipment and methods will be used to
capture and process data?
• Where will data be stored during and after?
Simplified Data Management Plan
1. Types of data
• What types of data will you be creating or capturing? (experimental measures, observational or qualitative, model simulation, existing)
• How will you capture, create, and/or process the data? (Identify instruments, software, imaging, etc. used)
2. Contextual Details (Metadata) Needed to Make Data Meaningful to others
• What file formats and naming conventions will you be using?
3. Storage, Backup and Security
• Where and on what media will you store the data?
• What is your backup plan for the data?
• How will you manage data security?
4. Provisions for Protection/Privacy
• How are you addressing any ethical or privacy issues (IRB, anonymization of data)?
• Who will own any copyright or intellectual property rights to the data?
5. Policies for re-use
• What restrictions need to be placed on re-use of your data?
6. Policies for access and sharing
• What is the process for gaining access to your data?
7. Plan for archiving and preservation of access
• What is your long-term plan for preservation and maintenance of the data?
Creating a DMP & Considering Long-Term DM Issues
• Read the case study provided
• Your group is assigned a set of questions (labeled Group 1-6) to answer as best you can
• The first set of questions is from one section of the simplified DMP
• The second set of questions highlights an issue that arises in day-to-day or long-term management of research data (a more detailed level)
• Elect a group speaker
• Each group will discuss their answers
• We will go over the issue associated with your section, common problems, and best practices
Group 1
• DMP Section 1: Types of Data
1. What types (e.g. images, lists of readings, text documents) of
data are being collected for this study?
2. What analytical methods and tools are being used in this study?
3. What types of data will be generated from these analytical tools and methods?
• Detailed Planning
1. What naming conventions are being used in the lab?
2. Is there a structure for saving files in the lab?
3. What kind of information would you include in a naming convention for files?
4. What kinds of things would you avoid in naming/labeling files?
Issue: Records Management
• Does this sound familiar?
• Inconsistently labeled files
• in multiple versions…
• inside poorly structured folders…
• stored on multiple media…
• in multiple locations…
• and in various formats…
Issue: Records Management
• Best Practices:
• Avoid special characters in a file name.
• Use capitals or underscores instead of periods or spaces.
• Use 25 or fewer characters.
• Use documented & standardized descriptive information
about the project/experiment.
• Use date format ISO 8601: YYYYMMDD.
• Include a version number.
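The naming practices above can be sketched as a small helper. This is a minimal illustration, not a prescribed tool: the function name `build_file_name`, the field order (project, description, date, version), and the choice to apply the 25-character limit to the name without its extension are all assumptions made for the example.

```python
import re
from datetime import date

MAX_LENGTH = 25  # limit from the best practices above (assumed to exclude the extension)


def build_file_name(project: str, description: str, day: date,
                    version: int, extension: str) -> str:
    """Build a name such as 'HRT_Cells_20140604_v01.csv' (hypothetical layout)."""
    def clean(part: str) -> str:
        # Split on anything that is not a letter or digit (spaces, periods,
        # special characters), then join the words using capitals.
        words = re.split(r"[^A-Za-z0-9]+", part)
        return "".join(w[:1].upper() + w[1:] for w in words if w)

    stem = "_".join([
        clean(project),
        clean(description),
        day.strftime("%Y%m%d"),   # ISO 8601 basic format: YYYYMMDD
        f"v{version:02d}",        # zero-padded version number
    ])
    if len(stem) > MAX_LENGTH:
        raise ValueError(f"'{stem}' has {len(stem)} characters; limit is {MAX_LENGTH}")
    return f"{stem}.{extension.lstrip('.')}"


print(build_file_name("HRT", "cells", date(2014, 6, 4), 1, "csv"))
```

Raising an error for over-long names, rather than silently truncating them, keeps the descriptive parts of the name intact and forces a shorter, documented abbreviation.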
Group 2
• DMP Section 2: Contextual Details (Metadata)
1. What contextual details would the researcher need to document to make her data meaningful to others?
2. How would a lack of naming and labeling conventions impact later data access by other researchers and possibly herself?
• Detailed Planning
1. What general information do you think is needed to make scientific data discoverable? (e.g. think of a search screen with a dropdown menu of fields you can search: Title, Author, Genre, etc.)
2. Are you aware of any metadata standards for the life or health sciences?
3. Do you think all metadata has to be hand-entered or recorded?
4. How would you ensure lab members knew to collect and record specific information in standard ways?
Issue: Metadata
• How will someone make sense of your data, e.g. the cells
and values of your spreadsheet?
• What universal or disciplinary standards could be used to
label your data?
• How can you describe a data set to make it
discoverable?
Issue: Metadata
• Biology and health-specific metadata examples
Issue: Metadata
• Title
• Creator
• Identifier
• Subject
• Funders
• Rights
• Access information
• Language
• Dates
• Location
• Methodology
• Data processing
• Sources
• List of file names
• File Formats
• File structure
• Variable list
• Code lists
• Versions
• Checksums
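Some of the elements listed above, such as checksums and file names, can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: the sidecar file name (`<data file>.metadata.json`), the JSON layout, and the function names are hypothetical choices for illustration, not a metadata standard.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_metadata_record(data_file: Path, record: dict) -> Path:
    """Write a JSON sidecar next to the data file (hypothetical layout).

    `record` holds hand-entered fields such as title, creator, or funders;
    the file name and checksum are filled in automatically.
    """
    full_record = dict(record)
    full_record["file_name"] = data_file.name
    full_record["checksum_sha256"] = sha256_of(data_file)
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(full_record, indent=2), encoding="utf-8")
    return sidecar
```

Recomputing the checksum later and comparing it with the stored value shows whether the file has changed or been corrupted since the record was written.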
Issue: Metadata
• Best Practices
• Describe the contents of data files
• Define the parameters and the units of each parameter
• Explain the formats for dates, time, geographic coordinates,
and other parameters
• Define any coded values
• Describe quality flags or qualifying values
• Define missing values
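One way to apply these practices is to keep a small codebook alongside each data file. The sketch below is illustrative only: the `Variable` class, the `render_codebook` function, and the three example variables (`visit_date`, `sbp`, `smoker`) are invented for the example and do not come from any particular study or standard.

```python
from dataclasses import dataclass, field


@dataclass
class Variable:
    """One column of a data file, documented for later readers."""
    name: str
    description: str
    units: str = "none"
    date_format: str = ""                        # e.g. YYYYMMDD, if the column is a date
    missing_value: str = "NA"                    # how missing values are recorded
    codes: dict = field(default_factory=dict)    # meaning of any coded values


def render_codebook(variables: list) -> str:
    """Render the variable definitions as plain text."""
    lines = []
    for var in variables:
        lines.append(f"{var.name}: {var.description}")
        lines.append(f"  units: {var.units}")
        if var.date_format:
            lines.append(f"  format: {var.date_format}")
        lines.append(f"  missing value: {var.missing_value}")
        for code, meaning in var.codes.items():
            lines.append(f"  code {code} = {meaning}")
    return "\n".join(lines)


# Hypothetical variables, for illustration only
codebook = render_codebook([
    Variable("visit_date", "Date of clinic visit", date_format="YYYYMMDD"),
    Variable("sbp", "Systolic blood pressure", units="mmHg", missing_value="-999"),
    Variable("smoker", "Smoking status", codes={"0": "no", "1": "yes"}),
])
print(codebook)
```

Saving the output as a plain-text file next to the data keeps the definitions readable without any special software.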
Group 3
• DMP Section 3: Data Backup, Storage, and Security
1. Where and on what media will the data from each source be
stored?
2. How, how often, and where will the data be backed up?
3. Are there any security concerns for the data and have they been addressed?
• Detailed Planning
1. How many copies of your data do you think you should have
and where should you keep them?
2. Is there any group on campus you think could help you with backup and security/access concerns?
3. What are some good data storage and backup practices you know about or practice?
Issue: Backup & Security
• How often should data be backed up?
• How many copies of data should you have?
• Where can you store your data?
• How much server space can I get?
Issue: Backup & Security
• Best Practices
• Make 3 copies (original + external/local + external/remote)
• Have them geographically distributed (local vs. remote)
• Use a hard drive (e.g. Vista backup, Mac Time Machine, UNIX rsync) or tape backup system
• Cloud storage: some examples of private-sector storage resources include Amazon S3, Elephant Drive, Jungle Disk, Mozy, and Carbonite
• Unencrypted is ideal for storing your data because it will make it most easily read by you and others in the future, but if you do need to encrypt your data because of human subjects then:
• Keep passwords and keys on paper (2 copies), and in a PGP (Pretty Good Privacy) encrypted digital file
• Uncompressed is also ideal for storage, but if you need to compress to conserve space, limit compression to your 3rd backup copy
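The "3 copies" practice can be sketched as a copy-and-verify step. This is a minimal illustration that assumes the local and remote backup locations are already reachable as ordinary folders (for example an external drive and a mounted network share); the function names and any paths passed in are hypothetical.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path: Path) -> str:
    """SHA-256 of a file (reads the whole file; fine for small files)."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def back_up(original: Path, backup_dirs: list) -> list:
    """Copy `original` into each backup folder and verify every copy."""
    expected = file_digest(original)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / original.name
        shutil.copy2(original, target)   # copy2 also preserves timestamps
        if file_digest(target) != expected:
            raise OSError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies
```

Calling `back_up` with one local and one remote folder yields the original plus two verified copies. It does not replace a scheduled backup system such as those named above.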
Group 4
• DMP Sections 4. Data protection/privacy and 5. Policies for reuse of data
1. How is the lab addressing any privacy or ethical issues?
2. Who will own any copyright or intellectual property rights to the data?
3. Are there any restrictions to the reuse of the data?
• Detailed Planning
1. Are there any reasons to not share or reuse data? Are these ethical or cultural issues?
2. Will having public funding affect data sharing and reuse differently than having private funding?
3. Who has the right to make decisions about reuse of your data?
Issue: Ownership & Retention
• Intellectual Property Policy
• IRB data retention policy
• Funders’ data retention policy
• Publishers’ data retention policy
• Federal and State laws
Issue: Ownership & Retention
• How long is long enough?
Issue: Ownership & Retention
• IRB OHRP Requirements: 45 CFR 46 requires research records to be retained
for at least 3 years after the completion of the research.
• HIPAA Requirements: Any research that involved collecting identifiable health
information is subject to HIPAA requirements. As a result records must be
retained for a minimum of 6 years after each subject signed an authorization.
• FDA Requirements (21 CFR 312.62(c)): Any research that involved drugs,
devices, or biologics being tested in humans must have records retained for a
period of 2 years following the date a marketing application is approved for the
drug for the indication for which it is being investigated; or, if no application is
to be filed or if the application is not approved for such indication, until 2 years
after the investigation is discontinued and FDA is notified.
• VA Requirements: At present records for any research that involves the VA
must be retained indefinitely per VA federal regulatory requirements.
• Intellectual Property Requirements - Any research data used to support a
patent through must be retained for the life of the patent in accordance with
Intellectual Property Policy.
• Check with your Funder and Publisher Requirements
• Questions of data validity: If there are questions or allegations about the validity
of the data or appropriate conduct of the research, you must retain all of the
original research data until such questions or allegations have been completely
resolved.
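When several of the fixed-period rules above apply to the same study, the longest one governs. The sketch below only illustrates that date arithmetic, using the periods exactly as stated above (3, 6, and 2 years). The function and parameter names are hypothetical, and the open-ended cases (VA records, patents, funder and publisher rules, unresolved questions about validity) are deliberately not modeled.

```python
from datetime import date
from typing import Optional


def add_years(day: date, years: int) -> date:
    """Shift a date by whole years; 29 February falls back to 28 February."""
    try:
        return day.replace(year=day.year + years)
    except ValueError:
        return day.replace(year=day.year + years, day=28)


def earliest_disposal_date(research_completed: date,
                           last_hipaa_authorization: Optional[date] = None,
                           fda_approval_or_discontinuation: Optional[date] = None) -> date:
    """Latest end date among the fixed retention periods that apply."""
    # 45 CFR 46: at least 3 years after completion of the research
    candidates = [add_years(research_completed, 3)]
    if last_hipaa_authorization is not None:
        # HIPAA: 6 years after each authorization, so the last one signed governs
        candidates.append(add_years(last_hipaa_authorization, 6))
    if fda_approval_or_discontinuation is not None:
        # FDA: 2 years after marketing approval, or after discontinuation and notification
        candidates.append(add_years(fda_approval_or_discontinuation, 2))
    return max(candidates)


print(earliest_disposal_date(date(2014, 6, 4), last_hipaa_authorization=date(2013, 1, 15)))
```

The result is a lower bound only; any of the unmodeled requirements can extend it.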
Group 5
• DMP Section 6: Policies for access and sharing
1. How will others be able to gain future access to the study
data?
2. How does the graduate student plan to link her datasets to her published article?
• Detailed Planning
1. Could there be a use for the graduate student’s data that was
not used in the published article?
2. Are the data the student collected open formats or proprietary (will people need specialized software to access and interpret the data)?
a) How would this affect future accessibility & reuse?
Group 6
• DMP Section 7: Plan for archiving and preservation of access
1. What is the long-term strategy for maintaining, curating and archiving the data?
2. Where will the data be stored?
3. What contextual data (data that describes your data) or other related data will be included in the archive?
• Detailed Planning
1. What data should be included in an archive?
2. Do you know of any data repositories that you could use for your data?
3. How can you ensure that your data is discoverable and interpretable?
4. How long should the data be maintained? What factors affect the length of time you retain your data?
Issue: Long-Term Planning
• What will happen to my data after my project ends?
• How can I appraise the value of my data?
• What are my options for archiving and preserving my
data?
• What are my options for publishing and sharing data?
Data Formats
• Is the file format open (i.e. open source) or closed
(i.e. proprietary)?
• Is a particular software package required to read
and work with the data file? If so, the software
package, version, and operating system platform
should be cited in the metadata
• Do multiple files comprise the data file structure? If
so, that should be specified in the metadata
Open vs. Proprietary Formats Used in Research Labs
Issue: Long-Term Planning
• Best Practices
• When choosing a file format, select a consistent
format that can be read well into the future and is
independent of changes in applications.
• Non-proprietary: Open, documented standard,
Unencrypted, Uncompressed, ASCII formatted
files will be readable into the future.
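As a rough first check on the "ASCII formatted" criterion, a folder can be scanned for files that contain non-ASCII bytes, which usually indicates a binary, compressed, or encrypted format. This is a minimal sketch with hypothetical function names. It cannot tell whether a format is open or documented, and it requires Python 3.7 or later for `bytes.isascii`.

```python
from pathlib import Path


def is_plain_ascii(path: Path) -> bool:
    """Return True if every byte in the file is in the 7-bit ASCII range."""
    return path.read_bytes().isascii()


def non_ascii_files(folder: Path) -> list:
    """List the names of files in `folder` that are not plain ASCII."""
    return sorted(
        p.name for p in folder.iterdir()
        if p.is_file() and not is_plain_ascii(p)
    )
```

Files flagged by `non_ascii_files` are candidates for export to a plain-text format such as CSV before archiving.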
Issue: Long-Term Planning
• Librarians can help:
• Identify file formats suitable for long-term preservation
• Interpret your funder or publisher’s repository
requirements
• Find and evaluate a suitable repository for your data
• Upload your data sets to a repository
• Make your data in a repository searchable and
discoverable
• Create a DOI and persistent ID
• Choose metadata standards for increased
discoverability
Issue: Data Stewardship
• Challenges
• Team Science
• Managing Laboratory Notebooks
• Rotating Lab Personnel
Issue: Data Stewardship
• Best Practices
• Define roles and assign responsibilities for data
management
• Identify skills needed to perform tasks outlined in
DMP and match to available staff
• Develop training plans for continuity
• Assign responsible parties and monitor results
How the Library Can Help:
• Teach you, your lab, or your classes about data management best practices
• Write a data management and/or sharing plan
• Comply with federal, funder, and publisher data sharing policies
• Find & submit your data to a repository
• Find standards to describe & label your data & data files
• Find a data set
• Cite others’ data
• Publish a data set
• Get a DOI for a data set
• Measure the citation impact of your data set
• Build a collection of research data that others can search & access
• Archive & preserve your data
• Learn about copyright & license issues surrounding your data
Find Help
• Ask your librarian if the library can help!
• Make it known you are interested in receiving assistance from the library
• Ask your IT department for information on storage and security available
• Let them help you make a backup and storage plan
Learn More
• Data Management Principles & Education:
• Research Data MANTRA
• DataONE: Best Practices
• UK Data Archive
• MIT Data Management and Publishing Guide
• Data Management Plans
• Digital Curation Centre
• DMPTool2
• DataONE: Data Management Planning
Works Cited
Lamar Soutter Library, University of Massachusetts Medical School. 2014. “New England Collaborative Data Management Curriculum: Module 1.” http://library.umassmed.edu/necdmc.
DataONE. 2013. “Best Practices for Data Management.” http://www.dataone.org/best-practices.
MIT Libraries. 2013. “Data Management and Publishing.” MIT. http://libraries.mit.edu/guides/subjects/data-management/index.html.
Office of Research Integrity. 2013. “Data Management.” United States Department of Health and Human Services. United States Federal Government. http://ori.hhs.gov/education/products/rcradmin/topics/data/open.shtml.
Special thanks to Jen Ferguson, Richard Moore and Glenn Gaudette for permission to use their slides.