## 1 Introduction

Reproducible research (RR) in computational sciences was pioneered by Claerbout in the Stanford Exploration Project (SEP) [1]. It has since been championed by Claerbout, Donoho, and their collaborators in their research areas of geophysics, signal and image processing [2, 3, 4]. The leading advocates for RR include Gentleman and Peng in biostatistics [5, 6], Koenkera in economics, LeVeque in numerical partial differential equations [7], and Stodden in legal frameworks that enable RR [8]. The defining principle of RR [3] is “When we publish articles containing figures which were generated by computer, we also publish the complete software environment which generates the figures.” Clearly there is an inevitable dependency of RR to software, hardware, code, and data.

This article has two main messages. First, we contend that worthy RR in modern computational sciences have to be built upon strong mathematical, statistical, and approximation theories leading to rigorous analysis of statistical and numerical errors, as well as computational costs. Second, we establish our position that RR can be made substantially more reliable—hence reliable reproducible research (RRR)—by development of supportable scientific software (SSS), which we define as a conceptual framework that encompasses a collection of software development methods for promoting the reliability of reproducing provably correct results in computational sciences. We note that our discussion of the SSS framework is not restricted to any particular programming languages.

While the notions of RRR and SSS are not entirely new, they are not as often observed as hoped. Relatively few computational scientists seem to practice RR as reflected in recent survey results put together by Flemisch [9]. It is also plausible that the community is not as a whole well aware of the importance of reproducibility. When SSS is absent, RR risks remaining an ideal. On the other hand, conscientious adoption of the principles can lead to substantially more trustworthy research results.

More specifically, in terms of theory, we have a relatively new suite of algorithms we call MINRES-QLP Pack, for solving singular or ill-conditioned linear systems of equations or linear least-squares problems with guaranteed accuracy. In terms of practice, we do what we advocate; we borrow a number of strategies and tools from industrial software engineering that lead to more reliable RR. While this paper focuses on the MINRES-QLP Pack to illustrate how to achieve RRR via SSS, we and our collaborators have at least two more suites of accuracy-guaranteed algorithms developed using similar approaches: First is the Guaranteed Automatic Integration Library (GAIL) [10, 11] for numerical integration in one or many dimensions based on the recent groundbreaking work of Hickernell et al. [12, 13]. Second is the Open-Source CIM-EARTH Framework (OSCEF) for solving large-scale computable general equilibrium (CGE) models [14, 15] built on the design of CIM-EARTH [16, 17, 18].

The outline of this paper is as follows. The next section gives a compressed description of the algorithms in the MINRES-QLP Pack. Section 3 recapitulates principles and benefits of RR, including examples of high-quality software and associated highly cited publications. In Section 4, we draw from our computational research experience and highlight a number of strategies and tools for robust software package development. Section 5 covers some existing but rare incentives and our initiatives on conducting a weekly seminar course and organizing themed meetings, which serves to educate or exchange ideas with computational researchers about RRR and SSS. In the last section, we conclude with thoughts for future work.

## 2 MINRES-QLP Pack

Most Krylov methods for solving large linear systems or linear least-squares problems have pervasive applications in science and engineering fields. Nonetheless, they usually assume nonsingular or full-rank matrices (or linear operators that are not explicitly represented as matrices). These methods are generally divided into two classes: Those for symmetric matrices (e.g., conjugate gradient, MINRES, SYMMLQ) and those for unsymmetric matrices (e.g., BCG, GMRES, QMR, BICGSTAB, LSQR, and IDR(s)). Such a division is largely due to the fact that historically the most prevalent matrix structure from practical applications is symmetric. However, other types of symmetry structures, notably complex symmetric, skew symmetric, and skew Hermitian matrices, are becoming increasingly common in modern applications. Often, these are treated as general unsymmetric matrices. In contrast, our algorithms take advantage of the symmetries to minimize computational costs such as memory and arithmetic operations.

On a singular linear system such as

$\left[\begin{array}{cc}1& 1\\ 0& \epsilon \end{array}\right]x=\left[\begin{array}{c}1\\ 1\end{array}\right],$

where ε is the machine precision, the pseduoinverse solution is

${x}^{†}=\left[\begin{array}{c}1\\ 0\end{array}\right].$

Most of MATLAB’s Krylov solvers would abort or return an exploding solution. To carefully handle singularity and exploit various symmetry structures, we have designed a suite of MINRES-QLP [19, 20, 21, 22] algorithms, which can constructively reveal (numerical) singularity and compatibility of a given linear system of equations; users do not have to know these properties a priori. (We are also currently developing two related algorithms known as GMRES-QLP and GMRES-URV for unsymmetric singular square systems; see [19, 23].)

The key automatic steps of the MINRES-QLP algorithm are: First, detect special symmetry with an efficient statistical test. Second (optional), construct a symmetric positive definite preconditioner M whose number of distinct nonzero eigenvalues M–1/2 AM–1/2 is less than that of A and hence number of iterations required in the next stage are reduced. Without loss of generality, let M be the identity matrix in our subsequent discussion. Third, in the kth iteration, for k = 1, …, min{p, K}, where p is the smallest positive integer such that {b, Ab,…, Apb} is linearly dependent, and K is a user-input positive integer, the algorithm searches for kth in the kth Krylov subspace Kk(A, b) = (b, Ab,…, Ak–1b), where xk is the minimum-length element of minimal-residual solutions for the problem

${\text{min}}_{x\in {\kappa }_{k}\left(A,b\right)}{‖Ax-b‖}_{2}.$

It is known that xk exists and is unique. If either residual norms ||rk||2 = ||bAxk||2 or ||Ark||2 is smaller than a scalar multiple of a user-given tolerance, then the algorithm returns xk as an approximant; otherwise the third step is repeated with k incremented by one. We note that the square root M–1/2 is not explicitly constructed and all the quantities such as xk, ||rk||2, and ||Ark||2 are accurately and efficiently computed by short recurrences.

In the MINRES-QLP Pack, our goals are to have MATLAB routines well-documented, well-tested, optimized for speed, accompanied by examples, and stored in a repository [24] where they can be modified, extended, and re-tested as needed by the originators or any public members. Over time we will work to improve existing algorithms and add new methods in the package. If resources permit, we will implement them in other languages such as Fortran 90/95 [21] and PETSc [25], which is a parallel mathematical library capable of peta-scale computations.

## 3 Reproducible research and challenges

RR for modern computational sciences as originally proposed by Claerbout is not simply a form of ideological rhetoric. It actually has rather specific instructions. Claerbout encouraged every author of computational research papers to mark each figure with the tags [ER] for “easily reproducible,” [CR] for “conditionally reproducible,” or [NR] for “non-reproducible” to indicate the extent of reproducibility. To qualify for [ER], a figure is generated and accompanied by programs with parameters and input data, as well as “makefiles executable on Linux and portable to other operating systems.” A conditionally reproducible figure is often time consuming to reproduce from scratch and depends on the availability of computational resources such as memory, commercial applications, or large input datasets. Non-reproducible figures may include images drawn or photographs taken—not computed—to illustrate scientific ideas. Today there is no reason not to extend the tags to other computational results or online resources such as interactive figures, video clips, tables, or online databases produced or referenced by a report [26].

Paraphrasing Claerbout, Buckheit and Donoho [3] stated that “an article about computational result is advertising, not scholarship. The actual scholarship comprises the full software environment, code and data, that produced the result.” Donoho et al. produced multiple MATLAB packages, WaveLab (1995), BeamLab (2004), SparseLab (2007), and MCALab (2008). Some of the most highly cited papers Donoho co-authored such as [27, 28] are accompanied by well-crafted MATLAB software. It is reasonable to posit that reproducible research with quality code contributes significantly to the citation statistics of the associated papers.

Nevertheless, we ask in practice, how reliable is RR? It would be at most as dependable as the strength and availability of its underlying theories, data, code, software, and hardware. Obviously, it is a “decreasing function” of time due to emergence of new software (versions), phasing out of old computer architecture, unaffordability of commercial software, or lack of quality software. Often, authors release only partial data, code fragments, or incomplete environment specification. It is not uncommon to see loss of data or code due to insufficiently frequent backup or unstable transfer processes. Algorithms of suboptimal design, with insufficient documentation, inadequate testing, and infrequent release of bug fixes—all understandable due to limited human resources or expertise—may render the computational software practically unusable even to the original creators not too much later in time. Modern complex applications with big data and big code, as well as emphasis on global team collaboration only add to these difficulties.

The process of software development by itself is often exceedingly challenging for programmers with moderate or even abundant experience. A common software development cycle such as the one shown on the left of Figure 1 is oversimplified. In practice, the “disorder” on the right of the same figure depicts much more accurately the potential multitude of cycles and transitions among the key phases. In the next section, we show how to restore partial order to the process via what we call the SSS principles.

Figure 1

Left: A standard software development cycle. Right: Software development cycles in reality.

## 4 Supportable scientific software: principles and tools

Flemisch’s 2013 survey [9] on reproducible research received feedback from only about 10% of targeted recipients. Among those who responded, more than 70% reported that the efforts to reproduce own or others’ computational results are between average to very high. The survey results also convey that RR is perceived as “unrewarding” hard work. The survey suggested the central roles of strategic tools such as the use of a repository (e.g., SVN, CVS, GIT, Mercurial). The survey results are in line with our understanding and experiences. Despite the burden of RR, we maintain our position and belief in the importance of enhancing its reliability.

We define supportable scientific software (SSS)1 as a conceptual framework that encompasses a collection of software development methods for promoting the reliability of reproducing provably correct results in computational sciences. There are four critical elements of SSS [29]:

• Justified: Scientific software should be based on theoretical guarantees that specify sufficient preconditions leading to success in satisfying error tolerance or uncertainty level. There are lower and upper bounds of computational cost functions of input size n in terms of memory or storage, as well as computational and communication time (these functions are preferably low-degree polynomials of n).
• Efficient: In theory, complexity analysis is an important tool for gauging efficiency. Nevertheless, its nature is asymptotic. Even polynomial-time algorithms can be costly if they have large leading coefficients or are of high degrees. In practice, a profiler can be utilized to measure and pinpoint code lines that consume the most resources, such as time and memory. Code reuse is another measure of efficiency; it can be achieved through refactoring of duplicate code patterns into functions to save development time.
• Robust/Durable: A robust software package is tested under a wide variety of conditions and ranges of inputs. The design of a robust software package should ideally be durable under changing computing hardware and software environments. In addition, we recommend persisting and maintaining source code and research data via repositories with version control.
• Intelligible: At the minimum, we expect clear documentation of key algorithmic function and steps, inputs and outputs, parameters, usage examples, and related references.

From the practice of reproducible research in our work, we affirm that RRR via SSS can be done. Many rudimentary examples already exist even though it is probably extremely hard to find “a general solution” for a given scientific field. The tools we highlight in Table 1 are available with MATLAB and similar counterparts can be found with other modern programming languages such as C++, Java, and Python, enabling researchers to develop functional and polished algorithms.

Table 1

Tools compatible with or available in MATLAB 2014a that can be employed to enhance the defining elements of SSS.

Justified Efficient Durable/Robust Intelligible

Repository

Editor (IDE)

Code Analyzer

Help Report, Contents Report

Searchable HTML documentation

Doctest

Unit Testing Framework

Debugger

Profiler

We recommend the use of a code repository for every project from the beginning. We also employ Integrated Development Environment (IDE) software. In the case of MATLAB, it has its own IDE built in Java, with visually appealing automated reports on code dependency, documentation evaluation, and performance analysis, which together can help pinpoint at line level or character position where the code base can be improved.

We emphasize on the importance of designing simple application programming interfaces (APIs) from the outset. If we were allowed to communicate only one thing to expert or application users about our implementation, our choice would be its API, which consists of function name, required and optional inputs, outputs, and our parsing schemes. Our algorithms parse user inputs and ensure that they fall in the correct ranges; in some cases we supply default values for missing optional parameters and gently correct invalid input values. We craft detailed and readable documentation in both text and searchable HTML formats with specification of APIs, syntax, default input values, examples of usage, and further references.

Lastly, to ensure the correctness of computational results from our software and examples in the documentation, we create unit tests [30], doctests [31], and stress tests. An effective unit test or doctest would be demonstrative of API usage, and run in less than a small fraction of a second. Without these tests in place, our experience is that even minor changes to our code or documentation could quickly and unknowingly depart from the original design and theory. These two kinds of tests effectively safeguard most changes we make to our algorithms. We adhere to the following test practices:

Every time before code is checked into the repository, unit tests and doctests are run.

Every time a bug is found, unit tests are expanded.

All unit tests and doctests have to pass at 100% success rate for each code check-in and software release. The diligent execution of an expanding test suite serves to verify and strengthen the software functions, and to simplify the software development cycles; see Figure 2. On the other hand, stress tests check if the software performs under load.

Figure 2

Test-driven software development cycles.

## 5 Incentives and educational efforts for RRR and SSS

Practitioners of RRR and SSS need more incentives in terms of specialized software citation [32], comprehensive journal review policies for software publication such as ACM TOMS’ [33], and research funding schemes such as NSF’s SSE & SSI program [34] that gives significant weightage to quality scientific software infrastructure at all scales. We urge members of the scientific community to cite high-quality software and online resources; the following are two examples of referencing the software GAIL [10] and MINRES-QLP community website [24] in BibTeX format:

@misc{CDHJZ14gail,
title={{GAIL}: {G}uaranteed {A}utomatic {I}ntegration {L}ibrary (Version 1.3),
{MATLAB} Software},
author={Choi, Sou-Cheng T. and Ding, Yuhan and Hickernell, Fred J. and Jiang, Lan and
Zhang, Yizhi},
year={2014},
}

@misc{MinresqlpWebsite,
author={Choi, Sou-Cheng T.},
title={{MINRES-QLP} Project and Community Website},
year={2013},
notes={Website},
}


To maximize quality software’s exposure and usage, we support free software and open source movement, but are also open to having our software included in quality commercial software. There are also viable open source packages sustained by provision of commercial services.

In Summer 2013, we started a team devoted to development of a MATLAB research software package GAIL [11] for numerical integration with guarantee of accuracy. Our members role play a team of software engineers with specializations in three of the critical software functions, namely testing, documentation, and release. We make use of a private wiki to strengthen team communication, recording key steps of important scientific routine process and procedures; maintaining checklists of day-to-day tasks; and sharing of intermediate results, knowledge, and expertise. We prepare our students of applied mathematics to become practitioners of RRR via SSS. As a beneficial side effect, as they gain software engineering skills that are usually obtainable only in information-technology driven companies, they become more attractive job candidates. In fact, one of our five student members was employed as a part-time software test engineer the following semester.

We say, “Those who can, do and teach.” Principles and novel ideas behind the research work would ideally be propagated in beginning and advanced courses in scientific computing, whereas the practices of engineering supportable software promoted to computational scientists and researchers through workshops such as WSSSPE [35, 36] To this end, we started an experimental seminar course, “Reliable Mathematical Software” (IIT MATH-573) [29] in our home institution. We drew teaching materials from exemplary mathematical software such as Chebfun (http://www2.maths.ox.ac.uk/chebfun). The class met weekly for 75 minutes for a total of 10 weeks; each time we demonstrated one or two tools in Table 1 with motivating scenarios and finished the last two weeks with each student presenting a research algorithm implemented with some of the SSS principles. After the course, a master student completed her thesis and five doctoral students are developing research papers following the principles of RRR via SSS. In addition, we serve the computational community by organizing a two-session minisymposium “Reliable Computational Science” in the upcoming SIAM Annual Meeting 2014 (http://meetings.siam.org/program.cfm?CONFCODE=AN14) and co-authoring a comprehensive summary report for WSSSPE1 [35, 36].

## 6 Conclusions

We agree with the importance of reproducible research at all scales, and recommend to strengthen it with supportable scientific software development practices. In the short term, RRR via SSS may seem more time consuming. In the medium or long run, the investment of time and efforts gives valuable software a better chance to “survive” and make a greater impact in the research area. Some researchers may actually find that our approach of RRR via SSS provides to some extent a structured course for investigating and attacking complex computational problems. More reliable incremental research can be built on supportable software packages, leading to stably upward spirals of progress in scientific knowledge. As a community, we need more time to reflect, brainstorm, and experiment specific operations. We also need more ongoing dialogue on the topic in the community.