Best Practices for Computational Science: Software Infrastructure and Environments for Reproducible and Extensible Research

Scholarly dissemination and communication standards are changing to reflect the increasingly computational nature of scholarly research, primarily by including the sharing of the data and code associated with published results. This paper presents a formalized set of best practice recommendations for computational scientists wishing to disseminate reproducible research, facilitate innovation by enabling data and code re-use, and enable broader communication of the output of digital scientific research. We distinguish two forms of collaboration to motivate choices of software environment for computational scientific research. We also present these Best Practices as a living, evolving document on a wiki.


Introduction
The goal of this article is to coalesce a discussion around best practices for scholarly research that utilizes computational methods, by providing a formalized set of best practice recommendations to guide computational scientists and other stakeholders wishing to disseminate reproducible research, facilitate innovation by enabling data and code re-use, and enable broader communication of the output of computational scientific research. Pervasive digitization is changing the practice of science by enabling massive data collection and storage, by providing the tools to carry out and record analyses of these data, and by providing a mechanism for communicating these new digital scholarly objects through the Internet. The potential is enormous: the technology is available to permit the open communication not only of the results of scientific investigation, but also of the tools and data required to verify, extend, and understand the knowledge.
There is a movement within the computational science community to adapt communication standards to include the data and code associated with published findings, called the Reproducible Research Movement [1]. An ICERM workshop in December of 2012 on "Reproducibility in Computational Experimental Mathematics" [2] produced a workshop report with recommendations for enabling reproducibility and reliability in computational scientific findings [3,4]. The July/August 2012 issue of IEEE Computing in Science and Engineering focused on Reproducible Research [5] and called for "changing the culture" of scientific research [6]. A Roundtable at Yale Law School in 2009 focused on the issue of reproducibility by bringing together computational scientists from many different disciplines and producing a declaration addressing the need for data and code sharing in computational science [7,8]. Over the past few years many editorials and commentaries have continued these efforts [9-13]. The theme is similar: without the data and computer codes that underlie scientific discoveries, published findings are all but impossible to verify. Computational results are frequently of a complexity that makes a complete enumeration of the steps taken to arrive at a result prohibitive in typical scientific publications today. As noted in 2009:

At conferences and in publications, it's now completely acceptable for a researcher to simply say, "here is what I did, and here are my results." Presenters devote almost no time to explaining why the audience should believe that they found and corrected errors in their computations. The presentation's core isn't about the struggle to root out error, as it would be in mature fields, but is instead a sales pitch: an enthusiastic presentation of ideas and a breezy demo of an implementation. Computational science has nothing like the elaborate mechanisms of formal proof in mathematics or meta-analysis in empirical science. Many users of scientific computing aren't even trying to follow a systematic, rigorous discipline that would in principle allow others to verify the claims they make. How dare we imagine that computational science, as routinely practiced, is reliable! [1]

A necessary response to this crisis is the adoption of the practice of reproducible computational research, in which all details of the computations, the underlying data and the code that generated the results, are made conveniently available to others.
In this document we envision a computational environment that facilitates reproducibility as a digital concept, beginning from data and tracing through the computational steps taken to achieve the published results. This distinguishes it from the replication of the experiment from first principles, including for example the regeneration of the raw data and the reimplementation of the data analysis de novo. We introduce the concepts of vertical collaboration and horizontal collaboration, to distinguish between the act of building on previously published research and that of carrying out joint research at the same point in time. We do not try to enumerate optimal environments for all possible research settings; rather we outline use cases with the hope of spurring greater discussion and development in this area of research. Although some best practice documents do exist for digital archivists [14], we know of only one other resource designed to communicate best practices for scientific computing [15]. We encapsulate some of these ideas in a wiki designed to facilitate the development and communication of best practices for computational scientists.

Developing Best Practices
A typical computational scientist today is inundated with new software tools to help with research [16], new requirements for publication [17], and evolving standards as his or her field responds to the changing nature and increasing quantity of available data [18]. Because of the speed of the changes occurring in scientific research, we chose to implement the best practice recommendations given in this paper as a wiki, available at http://wiki.stodden.net/Best_Practices. The hope is that parties with specialized or more complete knowledge will be able to add their expertise to the best practices document, creating a maximally useful resource.
What follows is a series of principles for producing really reproducible computational science, and examples of implementations. We take as a starting point the National Academies of Science 2003 report, "Sharing Publication-Related Data and Materials: Responsibilities of Authorship in the Life Sciences" [19], whose principles are reproduced below. Principle 1 calls for the dissemination of data, software, and all information necessary for a researcher to "verify or replicate the claims" made in the publication. Principles 2 and 3 provide guidance for the implementation of Principle 1. We adapt and extend these ideas into a series of Best Practice principles for computational scientists generally, and include them on the associated wiki. This is not meant to be a comprehensive list of all possible best practices in all circumstances, but a first cut at a generic case, with the means provided through the wiki for modification and improvement.

Large datasets are very likely to come with their own infrastructure. Such data may already reside in a domain-specific repository designed for access, such as the Sloan Digital Sky Survey, the National Institutes of Health's caBIG data sharing portal or its Genome-wide Association Studies, EarthCube, or a number of others. Each of these repositories has or is developing policies on data re-integration to permit uploading of data that has undergone changes; alternatively, it may be possible to link to or cite the version of the dataset(s) you used in your research and provide code that replicates the manipulations you carried out on that snapshot of the data.

Best Practice Principles for Computational Science
Infrastructure for large data is becoming available to researchers beyond these groups of domain-specific data repositories. Both Globus Online and HUBzero provide different types of computational environments for non-domain-specific scientific research, along with their own methods for data availability. Both are geared toward cloud computation, as is the National Science Foundation's XSEDE scientific computing environment. TheDataHub.org is an entirely open source data repository. Many of these infrastructure efforts provide suggested citations and versioning for data, and this is just as crucial as it is in the small data case.
• Streaming Data: These data seem like the most challenging case but are actually likely to fall into one of the above categories. Published results must be obtained on some amount of fixed data, and this particular dataset can be readily shared as above. In these cases it is likely scientifically relevant to validate models on future streams of data, but that is left to the domain of new, potentially publishable research that will share its data when published.
Some domain-specific dissemination platforms include the Machine Learning Open Source Software (MLOSS) repository (which hosts both software and data), the Stanford Microarray Database, and the Protein Data Bank (PDB).
There are exceptions to this principle, including confidential data and proprietary data. Workarounds should be attempted and may exist for confidential data [24] and proprietary data [25].

4. Code and methods must be available and accessible. Input values should be included with the code and scripts that generated the results, along with random number generator seeds if randomization is used. Version control should be utilized for code development, facilitating re-use by others. This discussion can be broken into subdiscussions.
• Version Control for Code / Making the Code Available Externally: These goals can be accomplished together by using a hosted version control system with a public-facing option. There are many advantages to using version control for the code you and your collaborators write during a project, and releasing the code to the wider world using version control is important. Doing so permits others to know precisely which version of the code generated which results, allows others to make modifications and feed them back into the system without disrupting the original code, and, perhaps most importantly, permits a community to develop around the research questions, complete with mature functionality for bug tracking and fixes, new code developments, centralized code dissemination, and collaboration. A project from Mozilla Science, GitHub, and Figshare [29] is a working proof of concept that generates a DOI for a code repository in GitHub. Assigning a Digital Object Identifier helps establish provenance and citation. At the moment there are no @data or @code entry types for BibTeX; in the meantime, best practices suggest using the @misc entry type to create a citation for data or code.

Conflicts between this set of best practices and other requirements may not exist, but in the event that they do we advocate two reactions. The first is, of course, to follow all legally binding requirements; the second is to voice concern to the source of the requirements if they could be improved by moving closer to this set of best practices. This situation can happen in reverse as well: if new ideas arise that improve the reproducibility and reliability of research findings and the advancement of scientific discovery, we hope these make their way into this set of best practices to improve them. This is one reason for implementing the set in a wiki format: to permit new knowledge and practice to surface and be incorporated into our current state of the art.
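Since BibTeX lacks @data and @code entry types, an @misc entry can stand in for a code or data citation, as suggested above. The following is a sketch; every field value (author, title, repository, version, date, DOI) is a placeholder to be replaced with the actual provenance of the cited object:

```bibtex
@misc{SmithAnalysisCode2013,
  author       = {Smith, Jane},
  title        = {Analysis code for ``Example Study''},
  year         = {2013},
  howpublished = {GitHub repository},
  note         = {Version 1.0, accessed 2013-06-01}
}
```

The `note` field carries the version and access date that standard BibTeX fields cannot; a DOI or hash for the cited snapshot can be appended there as well.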

Conclusion
This article attempts to seed a conversation around best practices for computational scientists publishing research, through the traditional medium of the published paper and a community-editable wiki. This can be seen as a response to the Reproducible Research movement, through which computational scientists have been moving toward research practices that include making the data and code underlying published results conveniently available. Open questions remain. This effort has several lofty goals, including clarifying best practices for computational scientists, establishing community standards, providing a central discussion point for evolving best practices, accelerating discoveries by facilitating reproducible computational science and data and code re-use, and supporting the transfer of the technology underlying scientific results, with the aim of increasing the reliability of published findings and addressing the credibility crisis in computational science. This article is intended as a set of discussion points to help advance these goals.

Principle 1. (Chapter 3)
Authors should include in their publications the data, algorithms, or other information that is central or integral to the publication; that is, whatever is necessary to support the major claims of the paper and would enable one skilled in the art to verify or replicate the claims. This is a quid pro quo: in exchange for the credit and acknowledgement that come with publishing in a peer-reviewed journal, authors are expected to provide the information essential to their published findings. (p. 5)

Principle 2. (Chapter 3)
If central or integral information cannot be included in the publication for practical reasons (for example, because a dataset is too large), it should be made freely (without restriction on its use for research purposes and at no cost) and readily accessible through other means (for example, on-line). Moreover, when necessary to enable further research, integral information should be made available in a form that enables it to be manipulated, analyzed, and combined with other scientific data. (p. 5)

Principle 3. (Chapter 3)
If publicly accessible repositories for data have been agreed on by a community of researchers and are in general use, the relevant data should be deposited in one of these repositories by the time of publication. (p. 6)

1. Open licensing should be used for data and code.

2. Setting up an issue tracker and/or an agile planning board, for example, can make communication more efficient, as can using an open notebook as a record of provenance and workflow. For an example of work that follows these processes, refer to [22].

3. Adopt important tools that help enable reproducibility and reuse by others, while minimizing the burden on the researcher. For example, using a version control system such as git or mercurial throughout the project simplifies making the code available at the time of publication.

Data must be available and accessible.
At minimum, provide a version for datasets you generate or collect. If you did not generate or collect the data yourself, provide a link and citation to the source of each dataset you incorporated, including which version of the data you used (if the data source does not provide version information, provide the exact time and date you accessed the data). As of yet no standards or conventions are widely practiced, but this is a very active topic. An additional best practice is to include a DOI (digital object identifier) and a hash for bit-level identification of the data [22a].

ii. Raw Data Availability: Results should be reproduced from the earliest digital data in the experiment, whether that is raw data coming from instruments or observations, or data as accessed from a secondary source. It defeats the purpose to supply a "cleaned" version of the data if it is impossible to access the methodology of the cleaning, for example. The goal is that all data
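A minimal sketch of the bit-level identification recommended above (the function name and chunk size are illustrative choices, not prescribed by the text):

```python
import hashlib

def file_fingerprint(path, algorithm="sha256", chunk_size=1 << 20):
    """Return a hex digest identifying a data file at the bit level.

    Reading in fixed-size chunks keeps memory use constant even for
    very large datasets.
    """
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording this digest alongside the version and access date in the data citation lets readers confirm they hold exactly the bytes the analysis used.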

3rd party data and software should be cited.

If you use data you did not collect from scratch, or code you did not write, however little, cite it. Include the source, the author, and the date and time you accessed the data or code you used. Best practices indicate including a unique identifier in your citation, preferably a SHA-1 hash such as the DataVerse Network's UNF (see http://thedata.