Software citation and a proposal (NSF workshop at Harvard Medical School)
@jameshowison
Software in the scientific literature:
Software mentions and a provocative proposal
James Howison
Information School, University of Texas at Austin
This material is based upon work supported by the National Science Foundation under Grant No. SMA-1064209.
What does a citation do, anyway?
• Gives credit for contribution
  – A key reward that drives activity in science
  – Sits alongside publications, grants, promotions, and prizes
  – Rewards drive type of artifacts and collaboration
• Explains the method used
  – Citations assist in knowing what was done
  – Provenance
  – Replication and extension
How problematic are current practices?
• How is software mentioned in papers?
• How accessible and reusable is the software mentioned?
• How well do these mentions perform the functions of citation?
github.com/jameshowison/softcite
DOI: 10.6084/m9.figshare.1146366
Howison, J., & Bullard, J. (2015). Software in the scientific literature: Problems with seeing, finding, and using software mentioned in the biology literature. Journal of the Association for Information Science and Technology (JASIST), doi: 10.1002/asi.23538
Sample and Method
• 90 randomly selected articles from the biology literature, published between 2000 and 2010.
• Journals stratified across Journal Impact Factor to balance coverage with influence
Content analysis scheme
Manual content analysis (3 coders, Kappa)
1. Identifying mentions
  – Read article, locate a mention of a piece of software
2. Identify in-text characteristics of mention
  – Name of software? URL? Date? Version number? In bibliography? Cite to paper/manual/webpage?
3. Functions of mention
  – Identifiable? Findable? Accessible? Source? Match preferred citation?
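The slide notes only "3 coders, Kappa" and does not say which kappa statistic was computed. As a rough illustration, the sketch below assumes pairwise Cohen's kappa between coders on a single yes/no code; the coder names and labels are made up for the example and are not the study's data.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(a, b):
    """Cohen's kappa for two coders who labelled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: share of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement from each coder's marginal label frequencies.
    count_a, count_b = Counter(a), Counter(b)
    labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical codes for "is the mention findable?" on six mentions.
codes = {
    "coder_1": ["yes", "yes", "no", "no", "yes", "no"],
    "coder_2": ["yes", "no", "no", "no", "yes", "no"],
    "coder_3": ["yes", "yes", "no", "yes", "yes", "no"],
}

pairwise = {
    (p, q): cohens_kappa(codes[p], codes[q])
    for p, q in combinations(codes, 2)
}

for pair, kappa in pairwise.items():
    print(pair, round(kappa, 3))
```

With three coders, a multi-rater statistic such as Fleiss' kappa would be another reasonable choice; the pairwise version is used here only because it is the simplest to read.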
https://github.com/jameshowison/softcite/blob/master/data/software-citation-coding.ttl
How many mentions?
• 59 articles mentioned software, 31 did not.
• There were 286 distinct mentions of software.
• Those mentions were to 146 distinct pieces of software.
  – This includes general purpose (e.g., Microsoft Excel) and science-specific software (e.g., DENZO, BLAST).
Types of mentions
Mention Type | Example
Cite to Publication | … was calculated using biosys (Swofford & Selander 1981).
Cite to Project Name or Website | … using the program Autodecay version 4.0.29 PPC (Eriksson 1998). Reference List has: ERIKSSON, T. 1998. Autodecay, vers. 4.0.29 Stockholm: Department of Botany.
Like Instrument | … calculated by t-test using the Prism 3.0 software (GraphPad Software, San Diego, CA, USA).
URL in text | … freely available from http://www.cibiv.at/software/pda/ .
In-text name mention only | … were analyzed using MapQTL (4.0) software.
Not even name mentioned | … was carried out using software implemented in the Java programming language.
Types of Mentions
Simpler Mention Kinds
By Strata?
What sort of software mentioned?
Proprietary software more likely to be mentioned “like instrument”
How useful are these mentions?
Not much change across strata
Do mention types work differently?
Other findings
• Only 24% of journals had policies that mentioned software, declining by strata.
  – Policies rarely mention versions.
  – Not clear that these policies are followed.
• Only 13–30% of packages make a specific request for a particular type of citation
  – 32% of mentions didn’t follow the requested citation.
Visible citation formats as “nudge”
• Some disagreement about how important the text of a publication is:
  – Should effort focus on machine readable “meta-data” in publication repositories (not in paper)?
  – Or focus on human readable formats in the paper?
• My position is that human readable will influence practice more quickly
• Formal, well-structured formats and policies act as a “nudge” to shape how authors mention software.
Software archiving
• Strong finding that many pieces of software were not findable.
  – 1 in 10 packages could not be found at all
  – For only 1 in 20 packages could the specific version be found (combination of missing version info and missing versions online)
• Analogous to link-rot for URLs in publications (Koehler, 1999)
• Need to influence how software is archived
  – Is that a role for publishers? Escrow for non-open software?
Part 2
But what are we working to incentivize anyway?
Howison, J., & Herbsleb, J. D. (2013). Incentives and integration in scientific software production. In Proceedings of the ACM Conference on Computer Supported Cooperative Work (pp. 459–470). San Antonio, TX.
Citation and collaboration
• What is the impact on collaboration of credit-giving through citations?
• Can a citation (of any kind) incentivize an ongoing collaboration able to do the work needed to keep a piece of software scientifically functional?
• Could a standard undermine collaboration further?
Can citation incentivize maintenance?
• Software relies on other software
  – Dependencies all the way down
  – Software stacks change quickly (new opportunities, new problems, new libraries)
• Scientists seek to extend the work of others, not just re-execute it.
• Many re-implementations come from frustration with poorly maintained software
  – Software that wasn’t adjusted as its dependencies changed
  – Software that wasn’t updated with newer techniques
A modest proposal
1. Papers have full workflow available
2. Workflows have regression tests running on a continuous integration system
3. Integration system pulls all new versions of dependencies, executes regression tests.
4. On fail (build or tests) the paper is retracted.
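The four steps above could be wired together roughly as follows. This is only a sketch of the idea, not an existing system: the step names and the callables standing in for "pull new dependencies", "build", and "run regression tests" are hypothetical placeholders that a real continuous integration setup would replace with actual commands.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A step is a name plus a callable that returns True on success.
Step = Tuple[str, Callable[[], bool]]


@dataclass
class CheckResult:
    paper_id: str
    passed: bool
    failed_step: str  # empty string when every step passed


def check_paper(paper_id: str, steps: List[Step]) -> CheckResult:
    """Run the steps in order and stop at the first failure."""
    for name, step in steps:
        try:
            ok = step()
        except Exception:
            ok = False  # a crash counts the same as a failed step
        if not ok:
            return CheckResult(paper_id, False, name)
    return CheckResult(paper_id, True, "")


def decide(result: CheckResult) -> str:
    """Step 4 of the proposal: a failed build or test means retraction."""
    return "keep" if result.passed else "retract"


# Hypothetical run: dependencies update and the build succeeds,
# but a regression test fails against the new dependency versions.
steps: List[Step] = [
    ("update_dependencies", lambda: True),
    ("build", lambda: True),
    ("regression_tests", lambda: False),
]
result = check_paper("example-paper", steps)
print(result.failed_step, decide(result))
```

Keeping the decision in its own function makes it easy to swap the consequence of a failure without touching the check itself.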
Howison, J. (2014). Retract bit-rotten publications: Aligning incentives for sustaining scientific software. In Working towards Sustainable Software for Science: Practice and Experiences (SuperComputing 2014 Workshop). New Orleans.
Uh …
• Retraction too strong, you say?
Ok, let’s revisit step 4:
• On fail, the paper is marked “provisionally non-extendable” and authors have some period to fix it before it is marked as “retired”.
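The softened step 4 amounts to a small status rule with a grace period. The sketch below is one way to express it; the 90-day window is an assumption for illustration (the slide says only "some period"), and the "extendable" label for a passing paper is borrowed from the later "full extendable status" wording.

```python
from datetime import date, timedelta
from typing import Optional

# Assumed grace period; the proposal does not specify a length.
GRACE_PERIOD = timedelta(days=90)


def paper_status(
    first_failure: Optional[date],
    today: date,
    currently_passing: bool,
) -> str:
    """Status of a paper's workflow under the revised step 4."""
    if currently_passing or first_failure is None:
        return "extendable"
    if today - first_failure <= GRACE_PERIOD:
        # Authors still have time to fix the workflow.
        return "provisionally non-extendable"
    return "retired"


# Hypothetical timeline for one paper whose tests first failed on 1 Jan 2015.
failed_on = date(2015, 1, 1)
print(paper_status(failed_on, date(2015, 2, 1), currently_passing=False))
print(paper_status(failed_on, date(2015, 6, 1), currently_passing=False))
print(paper_status(failed_on, date(2015, 6, 1), currently_passing=True))
```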
Could others fix papers?
• Why must the original authors be the ones to fix maintenance issues?
  – Attract new resources, motivate integration.
• Re-write Step 4 again:
  – On fail, workflow is marked as “needing work”
  – Anyone can contribute that work
    • Those extending the work, grad students, citizen scientists
  – Anyone that succeeds is added as an Author
Added as an author??!?
• Just for fixing a bug?
Ok, fine. Let’s re-write the second half of step 4 again:
  – Anyone maintaining a workflow and returning a publication to full extendable status is:
    • Added to paper as an acknowledgement
    • Invited to a conference, given a prize
    • Credited in a visible, public system (think GitHub profile)
Takeaways
• Software citation is diverse and fails functions:
  – “Like instrument” and “cite to publication” citations give credit but fail to provide version information
  – Other, informal mentions, better at versions but often fail to give credit
• Software is frequently inaccessible
• Collaboration is counter-motivated by publication
• Bit-rotten papers should create opportunities to earn reputation for scientific contribution.
Extras
Software packages found