CODE AND DATA MANAGEMENT
Toni Rosati, Lynn Yarmey
Data Management is Important! Because…
… Reproducibility is the foundation of science
… Journals are starting to require data deposit
… You want to get credit for producing data (data citations)
… Others can use and build on your work (data reuse)
… Recreating a figure from a 2006 paper shouldn’t be painful
… Funders tell us so (see NSF, NIH, NOAA, etc.)
Outline
• Back up often
• Sharing code
• File naming
• Metadata
• Sharing data
• A data search tool
Back up
Tips:
- 1 working copy on your computer
- 1 copy on infrastructure near you
- 1 copy on infrastructure far away
But why would you only back up when you can do so much more?...
SHARE!!
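Returning to the three-copy tip under "Back up": a minimal Python sketch of one way to automate it, assuming the "near" and "far" infrastructure are both reachable as mounted directories. The function names and the checksum step are illustrative assumptions, not something the slides prescribe.

```python
from __future__ import annotations

import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(working_copy: Path, destinations: list[Path]) -> list[Path]:
    """Copy one file into each destination directory and verify every copy.

    `destinations` is a hypothetical list such as a lab server mount
    ("near") and an off-site mount ("far").
    """
    expected = sha256_of(working_copy)
    copies = []
    for directory in destinations:
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / working_copy.name
        shutil.copy2(working_copy, target)  # copy2 also keeps timestamps
        if sha256_of(target) != expected:
            raise IOError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies


# Example call (hypothetical paths):
# back_up(Path("results.csv"), [Path("/mnt/lab_server"), Path("/mnt/offsite")])
```

Comparing checksums after each copy is a cheap way to confirm that a backup is actually usable before you need it.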
Why Share Code?
• Good backup
• Collaboration
• People don’t have to contact you to get and understand the code
• Faster and easier than other options (emailing individuals or sharing on servers)
• ……
Why Share Code?
• Version control
• Commenting gives a brief public history
• Work on multiple computers with the same code: flexibility in where you work (no USB drive necessary)
• Keep code with metadata/user instructions
• No bureaucracy
• FREE!
What is Git?
• Git is a distributed revision control and source code management (SCM) system capable of dealing with non-linear workflows
• “As with most other distributed revision control systems, and unlike most client-server systems, every Git working directory is a full-fledged repository with complete history and full version tracking capabilities, independent of network access or a central server.” (Wikipedia)
GitHub
Sharing Code – GitHub.com
GitHub serves as the location of record for VIC at: https://github.com/UW-Hydro/VIC
File Naming
• Make names unique and meaningful!
• Include (as appropriate):
- Project name or acronym
- Study title
- Location
- Data type
- Researcher initials
- Date
- Data stage
- Version number
- File type
Think “long-term”
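The naming elements above can be combined in many ways. Here is a minimal Python sketch of one possible convention. The field order, the underscore separator, the YYYYMMDD date, and the zero-padded version number are all assumptions made for illustration, and the example values are hypothetical.

```python
import re
from datetime import date


def build_file_name(project, location, data_type, initials,
                    collected, stage, version, extension):
    """Assemble a file name from the elements listed on the slide.

    Assumed convention: fields joined by "_", date as YYYYMMDD,
    version as v01, v02, ... Any run of characters that is not a
    letter or digit inside a field is replaced by a single "-".
    """
    parts = [
        project,
        location,
        data_type,
        initials,
        collected.strftime("%Y%m%d"),
        stage,
        f"v{version:02d}",
    ]
    cleaned = [re.sub(r"[^A-Za-z0-9]+", "-", part).strip("-") for part in parts]
    return "_".join(cleaned) + "." + extension.lstrip(".")


# Hypothetical example values:
name = build_file_name("ACADIS", "Barrow", "soil temp", "TR",
                       date(2013, 6, 1), "raw", 1, "csv")
print(name)  # ACADIS_Barrow_soil-temp_TR_20130601_raw_v01.csv
```

Putting the date in year-month-day order means that an alphabetical listing is also a chronological one, which helps when you "think long-term".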
Metadata
What would someone unfamiliar with your data need in order to evaluate, understand, and reuse them? How about someone:
- who works in your lab?
- from a different lab in your field?
- who is in a related interdisciplinary field?
- who researches a completely different area?
- who works for a newspaper? Congress?
Metadata is the difference between:
Metadata is Data about Data
• Units?
• Resolution?
• What do the column names mean?
• Caveats? Known data issues or missing values?
• How were the data collected?
• Where did the forcing data come from?
• How many layers were used in this model?
“Information that describes the content, quality, condition, origin, and other characteristics of data or other pieces of information. Metadata for spatial data may describe and document its subject matter; how, when, where, and by whom the data was collected; availability and distribution information; its projection, scale, resolution, and accuracy; and its reliability with regard to some standard. Metadata consists of properties and documentation. Properties are derived from the data source (for example, the coordinate system and projection of the data), while documentation is entered by a person (for example, keywords used to describe the data).” Esri
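One lightweight way to keep answers to the questions above next to the data is a "sidecar" file. The Python sketch below writes one as JSON. The field names and the `.metadata.json` suffix are assumptions chosen for illustration, not a metadata standard; a real project would normally follow a community standard for its discipline.

```python
import json
from pathlib import Path

# Hypothetical minimum set of fields, taken from the questions on the slide.
REQUIRED_FIELDS = ["units", "resolution", "columns",
                   "known_issues", "collection_method"]


def write_metadata(data_file: Path, metadata: dict) -> Path:
    """Write metadata to '<data file name>.metadata.json' beside the data.

    Raises ValueError if any of the assumed required fields is missing,
    so that gaps are noticed when the file is created rather than years later.
    """
    missing = [key for key in REQUIRED_FIELDS if key not in metadata]
    if missing:
        raise ValueError("Missing metadata fields: " + ", ".join(missing))
    sidecar = data_file.with_name(data_file.name + ".metadata.json")
    sidecar.write_text(json.dumps(metadata, indent=2, sort_keys=True),
                       encoding="utf-8")
    return sidecar


# Example call (all values hypothetical):
# write_metadata(Path("soil_temp.csv"), {
#     "units": {"temp": "degrees Celsius", "depth": "cm"},
#     "resolution": "hourly",
#     "columns": {"temp": "soil temperature", "depth": "sensor depth"},
#     "known_issues": "missing values are coded as -9999",
#     "collection_method": "buried thermistor string",
# })
```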
Metadata
• What happens without good metadata?
• You have no idea what the data mean
• You think you understand the data, so you use them…
• …but you use them totally wrong
• You waste hours (or days) trying to find out more about the data
Sharing Data
These days, Dr. Hodes said, “the old model in which researchers jealously guarded their data is no longer applicable.” http://www.nytimes.com/2011/04/04/health/04alzheimer.html
Sharing/Finding Data
www.nsidc.org/acadis/search
Organize now…. or….
Thank you!
Data Reuse
• Our team enables Arctic sciences by ensuring datasets are well documented and can be understood by re-users.
• The trick with data re-use is to find the dataset…
• then become familiar enough with a dataset…
• to be able to combine it with other data …
• and extract accurate results.
Data Curation
• Metadata
• Usability
• Documentation
• Training
• Re-use
• Tools
• A little marketing
• Partnering
• Consensus building
• Data management plans for grant proposals
• Integrating social and physical sciences
• Data quality checks
• Data analysis
DOIs and Citations
• Digital Object Identifiers (DOIs) officially name a resource.
• A DOI is essentially a stable, permanent URL.
• Information about a digital object may change over time, including where to find it, but its DOI name will not change.
• “The DOI System provides a framework for persistent identification, managing intellectual content, managing metadata, linking customers with content suppliers, facilitating electronic commerce, and enabling automated management of media.” (DataCite.org)
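To make the "stable, permanent URL" point concrete: a DOI name is turned into a link by placing it after the resolver address https://doi.org/, and the resolver then redirects to wherever the object currently lives. The Python sketch below does only that string step. The function name and the example DOI are hypothetical, and the validity check is deliberately loose.

```python
from urllib.parse import quote

# Common ways a DOI is written in reference lists.
KNOWN_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)


def doi_to_url(doi: str) -> str:
    """Return the https://doi.org/ link for a DOI name.

    Accepts a bare DOI name or one written with a common prefix.
    Only a rough sanity check is applied: the name must start
    with "10." and contain a "/".
    """
    name = doi.strip()
    for prefix in KNOWN_PREFIXES:
        if name.lower().startswith(prefix):
            name = name[len(prefix):]
            break
    if not name.startswith("10.") or "/" not in name:
        raise ValueError(f"Does not look like a DOI name: {doi!r}")
    # Percent-encode characters that are unsafe in a URL, keeping "/".
    return "https://doi.org/" + quote(name, safe="/")


# Hypothetical DOI, for illustration only:
print(doi_to_url("doi:10.1234/example-dataset"))
# https://doi.org/10.1234/example-dataset
```

Because the citation stores the DOI name rather than the landing-page address, the link keeps working even if the dataset moves to a different server.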
Beyond ACADIS – Other Resources
General Info and help -
Earth Science Information Partners (ESIP): http://wiki.esipfed.org/
UVA Libraries: http://www2.lib.virginia.edu/brown/data/
Data Management Plan and other tools –
DMP Tool: https://dmp.cdlib.org/
DataOne: https://www.dataone.org/cattools/Data%20and%20Metadata%20Management
Metadata -
Excel Plug-in tool (in development): http://www.cdlib.org/cdlinfo/2011/09/01/facilitating-data-management-dcxl/
Lists of Standards (not complete!) for bio, climate, ecology, oceanography - http://marinemetadata.org/conventions
Stanford-based portal for medical/bio - http://bioportal.bioontology.org/resources