Jug: Software for Parallel Reproducible Computation in Python

As computational pipelines become a bigger part of science, it is important to ensure that the results are reproducible, a concern which has come to the fore in recent years. It should be possible to run all developed software automatically, without any user intervention. In addition to being valuable to the wider community, which may wish to reproduce or extend a published analysis, reproducible research practices allow for better control over the project by the original authors themselves. For example, keeping a non-executable record of parameters and command line arguments leads to error-prone analysis and opens up the possibility that, when the results are to be written up for publication, the researcher will no longer be able to even completely describe the process that led to them. For large projects, the use of multiple computational cores (either in a multi-core machine or distributed across a compute cluster) is necessary to obtain results in a useful time frame. Furthermore, it is often the case that, as the project evolves, it becomes necessary to save intermediate results while downstream analyses are designed (or re-designed) and implemented. Under many frameworks, this makes having a single point of entry for the computation increasingly difficult. Jug is a software framework which addresses these issues by caching intermediate results and distributing the computational work as tasks across a network. Jug is written in Python without the use of compiled modules, is completely cross-platform, and is available as free software under the liberal MIT license. Jug is available from: http://github.com/luispedro/jug.


Introduction
The value of reproducible research in computational fields has been recognized in several areas, including fields as different as computational mathematics, signal processing [18,59], neuronal network modeling [45], archeology [41], and climate science [20]. This has led researchers to realize that computational reproducibility is an issue that spans fields [19,21,23,38].
Besides the benefits to the wider scientific community and society, reproducible practices can be advantageous to the individual researcher as the resulting research process is faster and less error prone [40].
Several implementations of reproducible papers (or executable papers) have been proposed towards the goal of reproducing published analyses [3,36,48,64]. These solutions do not necessarily scale to large problems, those that take several days, months, or years of CPU time. For very large problems, specialized solutions are needed to fully leverage high performance computing platforms [15,65]. Nonetheless, there is a range of medium-sized problems that can be successfully tackled on a computer cluster with a small number of nodes or even taking advantage of a single multicore machine. It is for these medium-sized problems that Jug is best suited.
A typical ad hoc approach to this problem is to save intermediate files on disk. A limitation of this approach is that often the design of the computation itself takes several iterations as intermediate steps are improved. Thus, some intermediate results need to be recomputed. This involves a large amount of human management of the state of the computation, breaking it up into pieces, and, when using a cluster, scheduling jobs on the batch computing system.
Jug is a task-based framework, which supports saving and sharing of intermediate results and parallelisation on computer clusters (or multi-core machines).
Intermediate results are cached with a key which takes into consideration all input parameters of that computation. Thus, any change in the parameters immediately triggers a recomputation of all dependent results. The basic model is similar to Make, which has been used before for implementing reproducible research pipelines [53].
However, unlike Make, Jug is written in Python, a general purpose programming language which is widely used in scientific programming [49]. A Task in Jug can consist of any Python function and its arguments. The task can include running external commands and calling routines written in other languages, as Python has many tools to interface with the rest of the system [4,5,6].
A Jugfile (the file which specifies which tasks need to be run, named by analogy to the Makefiles of Make) is a simple Python script with some added notations. Below, we show how only a few small changes are needed to transform a conventional Python script into a Jugfile.

Task-based architecture
Jug is designed around tasks. A task is defined as a Python function and a set of arguments, which may be Python values or the output of other tasks.
For example, consider the following toy problem: given a set of text files, count the number of lines in each, and report the average number of lines. Conceptually, we can already see that each input file can be processed independently (to count the number of lines) with the results combined at the end.
This can be implemented with Jug by creating one Task for each call to the line-counting function and a final Task that takes the list of those tasks as its argument. The resulting code defines the task dependency structure represented in Figure 1. As we can see from the graph, all of the linecount operations can potentially be run in parallel, while the mean operation must wait for the results of all of the other computations. The dependency structure is always a DAG (directed acyclic graph). We will later see how Jug exploits this structure to achieve parallelism.
Such code repeats the construct Task(f, args) several times. Using the TaskGenerator decorator, this can be simplified to a more natural syntax: each function is decorated with @TaskGenerator, and calling it then creates a task rather than running the function. The result is identical to a traditional Python script, except for the @TaskGenerator decorators. With a few limitations (which unfortunately can give rise to complex error messages), scripts can be written as if they were sequential Python scripts.
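To make the task model concrete, the sketch below mimics it using only the standard library. ToyTask, task_generator, and evaluate are hypothetical stand-ins written for this illustration (they are not Jug's Task and TaskGenerator classes), and the input files are made up. It shows how calling a decorated function builds a dependency graph instead of computing a result, and how that graph can be evaluated afterwards.

```python
import os
import tempfile


class ToyTask:
    """A function plus its arguments; arguments may themselves be ToyTasks."""

    def __init__(self, f, *args):
        self.f = f
        self.args = args

    def dependencies(self):
        # Direct dependencies: ToyTasks found in the arguments
        # (looking one list level deep).
        deps = []
        for a in self.args:
            items = a if isinstance(a, (list, tuple)) else [a]
            deps.extend(x for x in items if isinstance(x, ToyTask))
        return deps


def task_generator(f):
    """Decorator: calling f(...) now returns a ToyTask instead of running f."""
    def make_task(*args):
        return ToyTask(f, *args)
    return make_task


def evaluate(obj):
    """Recursively compute a ToyTask (or a list of them), sequentially."""
    if isinstance(obj, ToyTask):
        return obj.f(*[evaluate(a) for a in obj.args])
    if isinstance(obj, (list, tuple)):
        return [evaluate(x) for x in obj]
    return obj


@task_generator
def linecount(fname):
    with open(fname) as handle:
        return sum(1 for _ in handle)


@task_generator
def mean(xs):
    return sum(xs) / len(xs)


# Three made-up input files with 1, 2 and 3 lines.
workdir = tempfile.mkdtemp()
inputs = []
for n in (1, 2, 3):
    path = os.path.join(workdir, "file%d.txt" % n)
    with open(path, "w") as handle:
        handle.write("line\n" * n)
    inputs.append(path)

counts = [linecount(f) for f in inputs]  # three independent tasks
final = mean(counts)                     # depends on all of them

print(len(final.dependencies()))  # number of tasks the mean task waits for
print(evaluate(final))            # average number of lines
```

In this toy version the evaluation is purely sequential; the point is only that the script body reads like ordinary Python while the decorated calls record the graph.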
By default, Jug looks for a file called jugfile.py, but any filename can be used. Generically, we refer to the script being run as the Jugfile.

Jug subcommands
Based on a Jugfile defining the computational structure, Jug is invoked by calling the jug executable with a subcommand. The most important subcommands are execute, status, and shell.
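The execute subcommand is used for actually running the tasks. It performs a more complex version of the following procedure: for each task that does not yet have a saved result, wait until its dependencies are available, attempt to acquire the lock for that task and, if the lock is acquired, run the task, save its result, and release the lock. If run on a single processor, this will just run all of the tasks in order. It is most interesting when it is run on multiple processors. Because of the lock synchronisation, tasks can be run in parallel to take advantage of the multiple cores.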
The actual code is more complex than what is shown above, particularly to ensure that the locking is performed correctly and that the waiting step eventually times out (in order to handle the situation where another process is hung).
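The loop performed by execute can be sketched as follows. This is a simplification under stated assumptions, not Jug's code: MemoryBackend, execute, and the (name, function, dependencies) task tuples are hypothetical, everything runs in a single process, and the waiting and timeout handling is reduced to a comment.

```python
class MemoryBackend:
    """Hypothetical stand-in for a Jug backend, kept in a single process."""

    def __init__(self):
        self.results = {}
        self.locks = set()

    def has(self, name):
        return name in self.results

    def lock(self, name):
        # Returns True only for the first caller; a real backend must do
        # this atomically across processes.
        if name in self.locks:
            return False
        self.locks.add(name)
        return True

    def release(self, name):
        self.locks.discard(name)


def execute(tasks, backend):
    """Run every task whose dependencies are available and which is not locked.

    tasks: list of (name, function, dependency_names), in any order.
    Returns the names of the tasks run by this call.
    """
    ran = []
    pending = list(tasks)
    progress = True
    while pending and progress:
        progress = False
        for task in list(pending):
            name, f, deps = task
            if backend.has(name):
                pending.remove(task)  # computed earlier or by another process
                progress = True
            elif all(backend.has(d) for d in deps) and backend.lock(name):
                try:
                    backend.results[name] = f(*[backend.results[d] for d in deps])
                finally:
                    backend.release(name)
                ran.append(name)
                pending.remove(task)
                progress = True
        # A real implementation would sleep and retry here (with a timeout)
        # instead of stopping when no progress can be made.
    return ran


backend = MemoryBackend()
tasks = [
    ("total", lambda a, b: a + b, ["count_a", "count_b"]),
    ("count_a", lambda: 3, []),
    ("count_b", lambda: 5, []),
]
first_run = execute(tasks, backend)
second_run = execute(tasks, backend)  # everything is cached, nothing runs
print(first_run, backend.results["total"], second_run)
```

The second call illustrates the caching behaviour: since all results are already in the backend, no task is run again.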
The status subcommand prints out a summary of the status of all of the tasks. Figure 2 shows the output of this command using the example Jugfile above. We assume that the Jugfile was called jugfile.py on disk and that there were 20 text files in the directory. We can see that there are 20 tasks ready to run, while the mean task is still waiting for the results of the other tasks.

Backends
A basic feature of Jug is its ability to save and load results so that values computed by one process can be loaded by another. Each task of the form Task(f, args) is represented by a hash of f and args. Jug assumes that the result of a function is uniquely defined by its arguments. Therefore, Jug does not work well with functions which are not pure or which access (non-constant) global variables.
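As an illustration of why this assumption matters, the sketch below derives a cache key from a function's name and its arguments. It is not Jug's actual hashing scheme: task_key and the choice of SHA-1 over a repr-based encoding are assumptions made for this example.

```python
import hashlib


def task_key(f, *args):
    """Return a hexadecimal key derived from the function name and arguments.

    The body of f is not part of the key, mirroring the assumption that the
    result is determined by which function is called and with what arguments.
    """
    h = hashlib.sha1()
    h.update(("%s.%s" % (f.__module__, f.__name__)).encode("utf-8"))
    for a in args:
        h.update(b"\0")
        h.update(repr(a).encode("utf-8"))
    return h.hexdigest()


def linecount(fname):
    with open(fname) as handle:
        return sum(1 for _ in handle)


key_a = task_key(linecount, "a.txt")
key_a_again = task_key(linecount, "a.txt")
key_b = task_key(linecount, "b.txt")

print(key_a == key_a_again)  # same function and arguments: same cache entry
print(key_a == key_b)        # different argument: different cache entry
```

Because such a key ignores anything that is not passed as an argument, a function that reads a mutable global variable could be served a stale cached value, which is the failure mode described above.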
A Jug backend must then support four basic operations:
save: saving a Python object by its hash name.
load: loading a Python object by its hash name.
lock: creating a lock by hash name. Naturally, this lock must be created atomically.
release: releasing the lock.
A few other operations, such as deletion and listing of names, are also supported. The filesystem can support all of the above operations if the backend is coded correctly to avoid race conditions. This is the default backend, identified simply by a directory name. Inside this directory, files are named by a hexadecimal representation of their hashes. Objects are saved using Python's pickle module with zlib compression. As a special case, numpy arrays [62] are saved to disk directly. This special case was introduced as numpy arrays are a very common data type in scientific programming and saving them directly allows for very fast saving and loading (they are represented on disk as a header followed by the binary information they contain).
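The four basic operations can be sketched on top of a plain directory as follows. DirectoryBackend is a hypothetical class written for this illustration and is not Jug's file backend; it assumes that exclusive file creation (os.O_CREAT | os.O_EXCL) is atomic on the filesystem in use, and it omits the special handling of numpy arrays.

```python
import os
import pickle
import tempfile
import zlib


class DirectoryBackend:
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _path(self, name, suffix=""):
        return os.path.join(self.directory, name + suffix)

    def save(self, name, obj):
        # Write to a temporary file first, then rename, so that readers
        # never observe a partially written result.
        data = zlib.compress(pickle.dumps(obj))
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as out:
            out.write(data)
        os.replace(tmp, self._path(name))

    def load(self, name):
        with open(self._path(name), "rb") as handle:
            return pickle.loads(zlib.decompress(handle.read()))

    def lock(self, name):
        # O_EXCL makes creation fail if the lock file already exists.
        try:
            fd = os.open(self._path(name, ".lock"),
                         os.O_CREAT | os.O_EXCL | os.O_WRONLY)
        except FileExistsError:
            return False
        os.close(fd)
        return True

    def release(self, name):
        os.remove(self._path(name, ".lock"))


backend = DirectoryBackend(tempfile.mkdtemp())
name = "0a1b2c"  # stands in for the hexadecimal hash of a task

got_lock = backend.lock(name)        # first attempt succeeds
second_attempt = backend.lock(name)  # lock is held, so this fails
backend.save(name, {"lines": 42})
backend.release(name)
print(got_lock, second_attempt, backend.load(name))
```

Relying on exclusive creation for the lock and on a rename for the save is one way to avoid the race conditions mentioned above; whether these guarantees hold on a given network filesystem is something that would need to be checked.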
Another backend currently included with Jug is a redis backend. Redis is a name-key database system (see Note 1). Redis is particularly recommended for the case where there are many small objects being saved. In this case, keeping each as a separate file on disk would incur a large space penalty, while redis keeps them all in the same file.
Finally, there is an in-memory backend. This was initially developed for testing, but can be useful on its own.

Asynchronous function
Parallelisation is achieved by running more than one Jug process simultaneously. All of the synchronisation is outsourced to the backend. As long as all of the processes can access the backend, there is no need for them to communicate directly. New processes can even join and start working on tasks while the computation is already in progress. This makes Jug usable in batch-based computer cluster environments, which are quite common in research institutions.

Software development with Jug
The output of the computation can be obtained from Jug in several ways. One can write a task that writes the output to a file in the desired format. Alternatively, outputs can be inspected interactively using the shell subcommand. It is expected that the first option will be used for the final version of the computation, whilst the second one is most helpful during development.
The shell subcommand invokes an IPython shell with all the objects in the Jugfile loaded. The IPython console is an enhanced interactive shell for Python [48]. A few functions are added to the namespace; in particular, value will load the results of a task object if they are available. Figure 3 shows a possible interaction session with the jug shell subcommand. While having to explicitly load results may be bothersome, it makes start-up much faster, and the user might not need to load more than a few objects throughout their session. In some cases, loading all of the objects simultaneously might even be impossible due to memory constraints. Furthermore, this allows exploration of the task structure for debugging.
Jug can also be loaded as a library from a Python script and Jug computation outputs can serve as inputs for further computation. This can be performed from inside a Jupyter notebook [33], for example, for interactive exploration of the computational results.

Result invalidation
If a researcher improves an intermediate step in a pipeline (e.g., fixes a bug) and wishes to obtain new results of the computation, then all outputs from that step and downstream must be recomputed, but results from upstream and unrelated processes can be reused. Formally, in the task DAG, affected tasks and their descendants must be recomputed. Without tool support, this can be a very error-prone operation: by not removing all relevant intermediate files, it is easy to generate an irreproducible state where different blocks of the computation output were generated using different versions of the code.
Therefore, Jug adds support for result invalidation. When the results of a task are invalidated, all tasks which (directly or indirectly) depend on them are also invalidated. In the case where the parameters of a task have not changed, only the code implementing it, it is still necessary to manually invalidate tasks. This can be performed using the jug invalidate subcommand, which will invalidate all tasks with the given name as well as others which directly or indirectly depend on them. For finer control, within the jug shell environment, individual tasks can be invalidated with the invalidate function.
An alternative would be to take the code which implements the task into account while computing the task hash. This would mark any results computed with this code as outdated if the code changed. While this would add another layer of protection, it would still be possible to make mistakes. If the function depended on other functions, especially if this was done in a dynamic way, it could be hard to discover all dependencies. Additionally, even minute refactoring of the code would lead to overeager recomputation. This could make the developer wary of making improvements in their code, resulting in overall worse code.
Therefore, as a design choice, Jug asks the user to explicitly invalidate their results, while supporting automatic dependency discovery. The recommendation is still that the user run the full pipeline from start to finish once they are satisfied with the state of the code and before publication, but the pipeline development stage can be more agile.
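The propagation rule can be sketched as a graph traversal. The function invalidate_with_dependents and the dictionary-based graph are hypothetical and only illustrate the rule; they do not reproduce Jug's implementation, and the task names are made up.

```python
from collections import deque


def invalidate_with_dependents(dependencies, results, target):
    """Remove the result of `target` and of every task depending on it.

    dependencies: dict mapping each task name to the names it depends on.
    results: dict mapping task names to cached results (modified in place).
    Returns the set of invalidated task names.
    """
    # Invert the graph: for each task, which tasks use its output?
    dependents = {name: [] for name in dependencies}
    for name, deps in dependencies.items():
        for d in deps:
            dependents[d].append(name)

    # Breadth-first walk over everything downstream of the target.
    invalid = {target}
    queue = deque([target])
    while queue:
        current = queue.popleft()
        for child in dependents[current]:
            if child not in invalid:
                invalid.add(child)
                queue.append(child)

    for name in invalid:
        results.pop(name, None)
    return invalid


dependencies = {
    "parse": [],
    "features": ["parse"],
    "classify": ["features"],
    "report": ["classify"],
    "unrelated": [],
}
results = {name: "cached" for name in dependencies}

invalid = invalidate_with_dependents(dependencies, results, "features")
print(sorted(invalid))   # the task itself and everything downstream
print(sorted(results))   # upstream and unrelated results are kept
```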

Example
This section presents an edited version of the code used in a previously published study of computer vision techniques for bioimage analysis [10] (see Note 2). The code was edited to remove superfluous details; however, the overall logic is preserved, as the original version was already based on Jug. As part of that paper, it was necessary to evaluate the classification accuracy of a machine learning model.
To evaluate classification, the dataset is broken up into 10 pieces and each of the pieces is, in turn, held out of the analysis and then used to evaluate accuracy (the final result is the average of all ten values). This is known as cross-validation.
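A minimal sketch of this scheme in plain Python follows. It is not the code of the study: kfold_indices and the placeholder evaluate_fold function (which returns a made-up value rather than a real accuracy) are hypothetical. Each held-out piece can be evaluated independently of the others, which is what makes the procedure a natural fit for a task-based framework.

```python
def kfold_indices(n_items, n_folds=10):
    """Split range(n_items) into n_folds (train, test) index pairs."""
    folds = [list(range(i, n_items, n_folds)) for i in range(n_folds)]
    pairs = []
    for i, test in enumerate(folds):
        train = [j for k, fold in enumerate(folds) if k != i for j in fold]
        pairs.append((train, test))
    return pairs


def evaluate_fold(train, test):
    # Placeholder: a real pipeline would fit a model on `train` and
    # measure its accuracy on `test`. Here a made-up value is returned.
    return len(train) / (len(train) + len(test))


pairs = kfold_indices(100, n_folds=10)
# Fan-out: one independent evaluation per held-out piece.
accuracies = [evaluate_fold(train, test) for train, test in pairs]
# Fan-in: the final result is the average over all pieces.
final = sum(accuracies) / len(accuracies)
print(len(pairs), final)
```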
The image processing is done with mahotas [9], while the machine learning aspects are handled with scikit-learn [46].
The example starts with a simple Python function to parse the data directory structure and return a list of input files:

    from jug import TaskGenerator, CachedFunction
    import mahotas as mh
    from sklearn.cross_validation import KFold

In this case, we defined a function, output, which will write the final results to a file. Finally, we call the above generators to process all the data. The resulting code is very readable for any programmer.
Figure 4 shows the dependency structure of this example. The main feature is the fan-out/fan-in which is associated with map-reduce frameworks. A more compact representation can be generated by Jug itself, using the jug graph subcommand. Figure 5 shows an example of such auto-generated graphs.

Quality control
Jug follows best practices in the field [60,63] and includes a full test suite (>100 tests) with continuous integration using the Travis service.
The user can run the test suite using the test-jug subcommand.
Jug is available under the MIT software license, which grants the user the right to use, copy, and modify the code almost at will. It is developed using the git version control tool, with the project being hosted on the github platform. The Jug project is available at: https://www.github.com/luispedro/jug and bug reports can be submitted using the github issues system at: https://github.com/luispedro/jug/issues. An open mailing-list (https://groups.google.com/forum/#!forum/jug-users) provides discussion and support.

Dependencies
Basic usage of Jug requires no dependencies beyond Python itself. Some specific subcommands or functionality, however, have additional requirements: the jug shell subcommand relies on IPython; jug webstatus relies on the bottle package; and writing out task completion metadata in YAML format requires the pyyaml library.

Language
Jug is written in Python, with support for Python 2 (versions 2.6 and 2.7) and 3 (versions 3.3 and above). Jug is automatically tested on all these versions.

Reuse potential
Jug is a very generic framework for any computational pipeline. It has been used by the author in several projects [10,11,12,57]. Others have used the framework in other contexts, such as physics [2], machine learning [30,32,52], and meteorology [60,61]; it is also used in the pyfssa package for algorithmic finite-size scaling analysis [55].

Similar tools
Several pipeline tools have been used in scientific computing (for a recent review, see the work of Leipzig [35]). The Make build system has a difficult syntax for any use beyond the most basic, but it is conceptually simple and widely available. Thus, it has been used as the basis of a set of conventions for reproducible research by Schwab et al. [53]. Fomel and Hennenfent [22] proposed a system built on top of Scons which shares a superficially similar design. Scons, like Make, is a build system. It supports spawning parallel jobs using multiple threads. In the original use case, the tasks are delegated to other commands and the operating system can take care of parallelism. If, however, the tasks are computationally intensive in Python, contention for the global interpreter lock will limit the amount of real parallelism (see Note 3). As Make syntax does not directly support complex operations, some researchers have developed alternative domain-specific syntaxes for specifying a workflow graph [8,44].
Ruffus is a Python-based solution which supports parallel execution of pipelines [25]. Using the Ruby programming language, Mishima et al. [42] proposed Pwrake, which supports parallel execution of bioinformatics workflows written (or accessible) in that language. Snakemake [34] also improves over Make by providing more complex rules and automatic interaction with a high performance compute cluster, while providing a domain specific language which can easily be extended with Python. Similarly, Sadedin et al. [51] proposed Bpipe, a tool based on a domain-specific language around shell scripting. One large difference of Jug compared to these tools is that Jug is tightly integrated with Python code. Thus, while it lacks support for directly spawning external processes, it makes it easier to call Python-based functions using a natural syntax (calling external processes is naturally still possible through standard Python code).
For large scale computation, there are several workflow engines which have been used in science [56,58], such as Taverna [31], eHive [54], or Kepler [1,39]. In general, these are generic frameworks which allow the user to specify a computational path, although domain specific solutions also exist, such as Galaxy [24] for bioinformatics.
Also relevant is the IncPy implementation of a Python interpreter with automatic persistent memoization [26,27] (unfortunately, it is no longer maintained). The advantages of that system apply to Jug as well, with a few differences. Their system implements automatic memoization, while, using Jug, the user needs to manually annotate functions to memoize using TaskGenerator. This extra control can be necessary to avoid a proliferation of intermediate results when these are very large (often a very large intermediate output is not necessary, and only a summary must be kept) at the cost of increased overhead for the programmer. Additionally, Jug supports running tasks in parallel, a functionality that is absent from IncPy.
Joblib (bundled with scikit-learn [46], but usable independently of it) and memo [43] use mechanisms similar to Jug for in-process and in-memory memoization and parallelization. Joblib additionally supports memoization to disk, which, like Jug, enables the reuse of partial results in computations. Many of the design choices are similar to Jug, but usage is different. While Joblib is designed to speed up an analysis that is run using a traditional Python driver script, with Jug the user defines a computational graph in Python, but this graph is executed by the Jug machinery. This enables functionality such as jug status and jug graph, making the status of the computation explicit.
Dask [13] provides a generic task execution framework, similar to Jug, in addition to a distributed numeric library (which achieves high performance on a predefined limited set of operations). Like Jug, Dask can coordinate computation on a dependency graph across compute nodes on a cluster. Joblib (mentioned above) can also use Dask as a computational backend. Dask uses a central scheduler dispatching jobs to workers, unlike Jug, where each node is independently running the Jugfile script with only limited communication between the nodes through the result storage backend. The central scheduler enables dynamic control of the workload, as it can assign workers to tasks so as to minimize the expected I/O burden. However, this architecture requires extra setup on the part of the user to ensure that communication between worker nodes and the central scheduler is set up before computation can proceed (while Jug was designed to work well in a shared cluster where the number of available compute nodes may vary throughout the process). Jug also supports saving intermediate results between different runs of the process so that intermediate results are available even if the code for subsequent computations changes. This functionality is not present in Dask.
Peng and Eckel [47] describe cacher, an R framework which uses similar concepts to Jug to allow for results to be distributed. In principle, a very similar system could be built on top of Jug by sharing the backend between users (either in the same research unit or after publication). For simple reproducibility, it would be sufficient for the researcher to share their database upon publication.
To ensure complete reproducibility of a computational result, it is necessary to capture all dependencies in the environment. Sumatra [14] tracks execution of a process and captures all dependencies. Reprozip [50] uses a similar approach to build a single archive with all dependencies which can be used by other users to exactly reproduce the original computational environment. These can be used in combination with Jug to achieve perfect reproducibility.

Conclusions
Jug focuses on the development of the computation as much as its communication. In fact, when it comes to communication, other tools might be better suited as they combine written exposition with computer code. They can be used in combination with Jug as they often do not provide the caching and distribution needed for large projects. While the simplest Jug use employs a single Jugfile, a Jugfile can import the results of another. For example, a first Jugfile might perform all heavy-duty computation and be run on a computing cluster. A second script could be embedded inside an executable paper to generate tables and plots [16]. This faster script could be run as part of building the paper.
Jug improves the pipeline development experience and makes the programming researcher more productive and less error-prone. As many scientists now spend a significant fraction of their time performing computational work [28,49], increasing their productivity in this task can have significant effects. Jug is based on Python, a programming language which is widely used for scientific computation. Thus, it can be used in a known environment without a high learning curve.
Jug does not address the issue of how to build a reproducible analysis environment. However, it can be used in combination with other tools which ensure a reproducible environment, such as container based tools [7,37] or package managers which emphasize reproducibility such as Nix [17] (Jug is available as a nix package in the main nixpkgs repository).
Jug was also not designed to compete with large scale frameworks such as Hadoop or Spark, which scale better to very large projects. However, those frameworks have much higher costs in terms of development time (code must be written especially for the framework) and overhead. They also require that the user learn a new framework. For scientists who want to quickly adapt pre-existing code to run on a cluster or even a multicore machine, Jug provides a better trade-off than those higher-powered alternatives in a familiar programming language.
Notes
1 See the redis webpage, at redis.io, for detailed information about redis.
2 The original code to reproduce the full study is available online at: https://github.com/luispedro/Coelho2013_Bioinformatics.
3 In the most commonly used Python interpreters, there is a lock that prevents more than one thread from simultaneously executing interpreted code, although threads can execute non-interpreted code concurrently, such as when calling an external programme or certain external libraries.