(1) Overview

Introduction

Science is becoming increasingly collaborative and data-intensive [], and many factors are pushing scientists to increase data access and to follow ‘best practices’ in dealing with data and code [, ]. There is also an increased emphasis on transparency, data provenance, and reproducibility from funders, governments, and scientists themselves [, ]. These factors have increased the need for reproducible, programmatic approaches, both to produce large-scale datasets and to handle the increasing data demands of modern models. Examples include the software stack supporting globally gridded soils dataset products [], the CEDS database of anthropogenic emissions of reactive gases and aerosols [], and the Global Land Data Assimilation System []. These and many other regional-to-global datasets are used by a wide variety of earth system models, dynamic global vegetation models, and integrated human-earth system models [].

Modern, integrated human-earth system models are typically complex and require correspondingly detailed input datasets. These models are sophisticated attempts to encapsulate the relationships between the environmental, social, and economic factors thought to drive future global change, and to assess the effectiveness of technologies and policies []. One such model is GCAM, which couples representations of global and regional economies; energy systems; agricultural, water, and land use systems; and global climate []. GCAM’s primary external assumptions include socioeconomic drivers (e.g., population and GDP), technology characterizations (e.g., cost and efficiency), and assumptions about regulations and policies that might influence the human systems represented in the model.

Currently, GCAM requires over 200 Extensible Markup Language (XML) input files, detailing everything from future population projections to historical land allocation to emissions factors. These files create and describe six interdependent model modules: (1) agriculture and land use; (2) energy production, transformation, and consumption; (3) water demands; (4) socio-economic demand drivers; (5) non-CO2 emissions; and (6) GCAM-USA, a state-level representation of the USA region. These modules’ inputs span all model time periods and include historical calibration data, characteristics of hundreds of modeled technologies, future assumptions, and other relevant data.

Earlier versions of the model [] used spreadsheet-exported inputs, but as the data volume increased, a system of scripts was developed to generate and reconcile data []. Spreadsheets do not scale well, however, and in general impede the reproducibility and transparency of data flows; thus there was an acute need for a data system that was open source and transparent, easy to install and use, flexible and robust in its assumptions, and well documented. There are general-purpose R packages that support reproducible, verifiable data processing and scientific research, including madrat (https://cran.r-project.org/package=madrat), drake (https://cran.r-project.org/package=drake), and workflowr (https://cran.r-project.org/package=workflowr). Our specific needs for extensive consistency-checking and error-handling, in addition to providing a platform for data and model exploration and for reproducible, transparent scientific and policy research, led to the development of the system described here.

Implementation and architecture

As noted above, the design requirements for this software centered on clarity, ease of use, robustness, error checking, documentation, and flexibility. These criteria led us to select the R statistical programming language []. R has seen increasing use across many fields of science [], is free and open source, and is straightforward to install. Importantly, R’s package system (along with optional tools such as devtools and roxygen2) offers extensive support for reproducible research []; for example, a package will not pass testing and continuous integration if any user-facing function lacks documentation, or if the package fails any of a wide range of standard or user-defined checks.

Assuming that devtools is installed, the package can be installed and run by:


devtools::install_github("JGCRI/gcamdata")
library(gcamdata)
driver() # build the GCAM input data

The gcamdata system is conceptually organized into three levels of data: raw data (level 0) that are processed and aggregated/disaggregated into generic intermediate categories (level 1), which are then processed further to fit GCAM’s specific structures and model time periods (level 2). There are ~30 major inventory data sources within GCAM’s level 0 data (see data documentation at https://github.com/JGCRI/gcamdata/wiki), and hundreds of additional data sources are used for more specific information. Data sources consist of a blend of top-down inventories, bottom-up estimates, and information describing the characteristics of modeled technologies. The gcamdata package allows users to update raw data and to modify assumptions and mappings in order to generate alternative GCAM input scenarios. Internal consistency is enforced: modifying any calibrated flow estimate requires adjusting all affected sectors and processes so that all modeled flows remain balanced, and this reconciliation is handled automatically by the gcamdata code.

The units of code that handle these processing steps are termed ‘chunks’ and generally consist of a single function that takes inputs (data dependencies) and produces outputs, which are then available for processing by downstream chunks. On startup, chunks must declare all their required inputs, their optional inputs (see below), and their outputs. Two special classes of chunks also exist: “data” chunks, which are responsible only for loading and parsing specific datasets from disk, typically files with nonstandard formats; and “xml” chunks, which construct the actual GCAM input files. The gcamdata code includes a facility for automatic generation of a chunk skeleton, i.e. the basic architecture of a chunk ready for coding; this provides a mechanism for extending the data system’s architecture.
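
To make this structure concrete, the sketch below shows the general shape of a chunk, built around the driver messages listed in Table 1 (DECLARE_INPUTS, DECLARE_OUTPUTS, MAKE). The chunk name, data names, and helper functions (get_data, return_data) are illustrative assumptions, not a definitive rendering of the package’s internal API:

# Illustrative chunk skeleton; the command constants follow Table 1,
# while the helpers get_data() and return_data() are assumed names
module_example_chunk <- function(command, ...) {
  if(command == driver.DECLARE_INPUTS) {
    # Data dependencies: outputs of upstream chunks and/or raw file inputs
    return(c("L100.upstream_data", FILE = "common/example_mapping"))
  } else if(command == driver.DECLARE_OUTPUTS) {
    return(c("L101.example_output"))
  } else if(command == driver.MAKE) {
    all_data <- list(...)[[1]]
    upstream <- get_data(all_data, "L100.upstream_data")
    mapping <- get_data(all_data, "common/example_mapping")
    # ...transform the declared inputs into the declared output...
    L101.example_output <- merge(upstream, mapping)
    return_data(L101.example_output)
  } else {
    stop("Unknown command")
  }
}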

A ‘driver’ routine is invoked by the user to start the data system. This function locates all the available chunks in the package namespace and queries them for their dependencies and outputs, which allows the construction of a full data-dependency graph (Figure 1). The driver then enters its main run loop, in which it calls all chunks whose inputs are currently available and verifies (see below) the chunks’ outputs. Outputs are then added to the main data store or, if they will no longer be needed, written to disk and removed from working memory. This process continues until there are no more chunks to run.
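
In outline, the run loop behaves like the following sketch; every function name here is a conceptual stand-in for the driver’s internal logic, not the package’s actual internals:

# Conceptual sketch of the driver's main run loop; all helper names
# (find_chunks, inputs_available, run_chunk, etc.) are illustrative
driver_sketch <- function() {
  chunks <- find_chunks()     # discover chunk names in the namespace
  all_data <- list()          # the main data store
  while(length(chunks) > 0) {
    # Identify chunks whose declared inputs are all currently available
    runnable <- Filter(function(ch) inputs_available(ch, all_data), chunks)
    if(length(runnable) == 0) stop("Unresolvable dependency or cycle")
    for(ch in runnable) {
      outputs <- run_chunk(ch, all_data)
      verify_outputs(ch, outputs)   # declared outputs and metadata checks
      all_data <- c(all_data, outputs)
    }
    chunks <- setdiff(chunks, runnable)
    # Write out and drop data that no remaining chunk will need
    all_data <- prune_unneeded(all_data, chunks)
  }
}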

Figure 1 

High-level view of the code-data dependencies in the gcamdata package. This plot of the system architecture shows nodes (“chunks”, units of code charged with processing data and producing specific outputs) and edges (data flows between chunks). Nodes are colored by discipline, e.g., agriculture and land use-related code is black, energy system code is blue, etc. For clarity, neither the initial data inputs nor the final XML outputs (i.e. the GCAM input files) are shown; as a result, seemingly isolated nodes or groups of nodes actually contribute data directly to the model.

An important question was how to handle proprietary data [], specifically the International Energy Agency (IEA) Energy Balances [], a data product that cannot be legally included in the open-source gcamdata package. This problem was solved by allowing chunks to have optional inputs, and by including a cached copy of the summarized proprietary data in the package. (Distribution of these summarized versions of the IEA data is permitted under the terms of the data license.) For example, suppose chunk X summarizes the IEA data by GCAM region and technology, producing output Y, which is then used by chunk Z. Y is cached (and included with the gcamdata download) and thus always available to Z, even if the source IEA data are not; when this occurs, a note is added to the downstream metadata indicating that cached data were used. The overall gcamdata license, the Educational Community License (http://opensource.org/licenses/ecl2.php), is close to the Apache license and was chosen to match that of the GCAM model.
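
A chunk handling this situation might look like the sketch below; the OPTIONAL_FILE marker, the fallback helper, and the data names are assumptions for illustration:

# Illustrative chunk with an optional proprietary input; OPTIONAL_FILE
# and the helper functions are assumed names for this sketch
module_iea_summary_chunk <- function(command, ...) {
  if(command == driver.DECLARE_INPUTS) {
    # An optional input: the driver may legitimately fail to find it
    return(c(OPTIONAL_FILE = "energy/IEA_EnergyBalances"))
  } else if(command == driver.DECLARE_OUTPUTS) {
    return(c("Y.iea_summary"))
  } else if(command == driver.MAKE) {
    all_data <- list(...)[[1]]
    iea <- get_data(all_data, "energy/IEA_EnergyBalances")
    if(is.null(iea)) {
      # Proprietary data unavailable: fall back to the cached summary
      # shipped with the package, noting this in the downstream metadata
      Y.iea_summary <- load_cached_summary("Y.iea_summary")
    } else {
      Y.iea_summary <- summarize_by_region_and_tech(iea)
    }
    return_data(Y.iea_summary)
  } else {
    stop("Unknown command")
  }
}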

Data objects in gcamdata are required to have descriptive metadata attached, including title, units, description, source, comments, and dependency information; most of these requirements are enforced on input data as well. This allows us to track data provenance [] throughout the system and to provide data tracing. Because all chunks declare their inputs and outputs to the driver, a full system-wide data map can be constructed (Figure 1), and particular data dependencies, upstream and/or downstream, can be traced through the system (Figure 2). Together, these features allow for easy and informative exploration of the sources and dependencies of any data object in the system.
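
For example, a chunk might attach the required metadata to an output via a chain of small helpers, and a user might then request the kind of trace shown in Figure 2. The helper names below (add_title, add_units, add_comments, add_precursors, dstrace) mirror the metadata fields and tracing capability described above, but are presented as an illustrative sketch:

library(magrittr)  # provides the %>% pipe

# Attaching the required metadata to an output produced earlier in a
# chunk; the add_* helper and precursor names are assumed for this sketch
L100.FAO_ag_Exp_t <- L100.FAO_ag_Exp_t %>%
  add_title("FAO agricultural exports by country, item, and year") %>%
  add_units("tonnes") %>%
  add_comments("Cleaned and reshaped from the raw FAO export data") %>%
  add_precursors("aglu/FAO/FAO_ag_Exp_t")

# Tracing the object's upstream and downstream dependencies (cf. Figure 2)
dstrace("L100.FAO_ag_Exp_t")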

Figure 2 

An example of tracing data flow. Here the user has requested a data trace on a particular data object “L100.FAO_ag_Exp_t” (FAO agricultural exports by country, item, and year). The package prints detailed information about this object and its upstream and downstream dependencies, and graphs these relationships to show data flow (arrows). Raw data inputs are at the top, and the final XML product that flows into the GCAM model is at the bottom. Explanatory notes describe each step.

Quality control

The package includes extensive functional and unit testing (Table 1) that verifies the behavior of the data ‘chunks’, the supporting data system functions, the driver, and the characteristics of data objects; testing considers both the chunks (units of code) and the data (relationships) []. The current unit testing coverage is 92%. The full suite of tests is invoked every time a user builds the R package, at which point the package is also subjected to a battery of standardized R checks (see http://r-pkgs.had.co.nz/check.html). The tests are part of the continuous integration [] of the gcamdata repository (https://github.com/jgcri/gcamdata), meaning that they are invoked for every pull request (PR), and a PR cannot be merged until all tests pass.
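
As an illustration of the style of these checks, a test verifying that a chunk produces exactly its declared outputs, each carrying the required metadata, might look like the following sketch (using the testthat package; the chunk and fixture names are assumed):

library(testthat)

# Illustrative unit test: run a chunk and check that it produces exactly
# its declared outputs, each with the required metadata attributes.
# module_example_chunk and all_data are assumed test fixtures.
test_that("chunk produces exactly its declared outputs", {
  declared <- module_example_chunk(driver.DECLARE_OUTPUTS)
  produced <- module_example_chunk(driver.MAKE, all_data)
  expect_identical(sort(declared), sort(names(produced)))
  for(obj in produced) {
    expect_false(is.null(attr(obj, "title")))
    expect_false(is.null(attr(obj, "units")))
  }
})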

Table 1

Automatic package-level checks performed on the gcamdata data-handling functions (termed “chunks”) and their outputs.

Category: Behavior

  • Chunk responds to required messages from driver (DECLARE_INPUTS, DECLARE_OUTPUTS, MAKE)
  • Chunk doesn’t make forbidden calls (e.g., slow or deprecated R routines)
  • Chunk handles changes in model time settings
  • Chunk (package-level) constants are correctly formatted

Category: Data

  • Chunk declares a (possibly empty) list of inputs that can all be found, either as the product of another chunk or as a file input
  • Chunk declares a valid list of outputs
  • Chunk uses only its declared inputs
  • Chunk produces exactly its declared outputs
  • All file inputs have metadata headers and are encoded (e.g., standard line endings) correctly
  • All chunk outputs have title, description, units, comments, and precursor information attached
  • All declared precursors are in the chunk input list, and each chunk input is the precursor of at least one output
  • Chunk outputs match a known good output set

(2) Availability

Operating system

Mac OS X 10.6 or later; Unix-like operating systems; Windows 7 or later. See https://cran.r-project.org/doc/FAQ/R-FAQ.html#What-machines-does-R-run-on_003f.

Programming language

R (version 3.1 or later).

Additional system requirements

The package uses and processes some large datasets, but efficiently prunes in-memory objects when they are no longer needed. Its memory usage during a run peaks at ~800 MB; the on-disk input data, many of which are compressed, are ~73 MB. The XML files written by the system are ~2.3 GB.

Dependencies

Required dependencies include the R packages assertthat (>=0.2), dplyr (>=0.7.0), magrittr (>=1.5), tibble (>=1.1), tidyr (>=0.7.1), readr (>=1.0.0), and data.table (>=1.10.4).

Optional dependencies include the R packages igraph (>=1.0.1), mockr (>=0.1), testthat (>=1.0.2), and R.utils (>=2.6.0), as well as Python version 3.

List of contributors

Package design and development was led by Ben Bond-Lamberty. Kalyn Dorheim was the verification lead and, with Ryna Cui, Russell Horowitz, and Abigail Snyder, wrote the bulk of the code. Katherine Calvin, Leyang Feng, Rachel Hoesly, Jill Horing, Page Kyle, Robert Link, Pralit Patel, Chris Roney, Aaron Staniszewski, and Sean Turner contributed significantly to coding and/or design. Further contributions were made by Min Chen, Felipe Feijoo Palacios, Corinne Hartin, Mohamad Hejazi, Gokul Iyer, Sonny Kim, Yaling Liu, Cary Lynch, Haewon McJeon, Steve Smith, Stephanie Woldhoff, and Marshall Wise. Katherine Calvin, Corinne Hartin, Gokul Iyer, Haewon McJeon, and Leon Clarke managed the various development teams. Page Kyle developed the original R scripts that grew into gcamdata and made significant contributions in writing this manuscript.

Software location

Archive

Name: Zenodo

Persistent identifier: Version v1.0, DOI: https://doi.org/10.5281/zenodo.1249932

Licence: Educational Community License, Version 2.0 (ECL-2.0). See https://github.com/JGCRI/gcamdata/blob/master/LICENSE

Publisher: Pacific Northwest National Laboratory

Version published: 1.0

Date published: 19/05/2018

Code repository

Name: gcamdata

Identifier: https://github.com/JGCRI/gcamdata/

Licence: Educational Community License, Version 2.0 (ECL-2.0). See https://github.com/JGCRI/gcamdata/blob/master/LICENSE

Date published: 19/05/2018

Language

English

(3) Reuse potential

The gcamdata package maintains good separation between GCAM-specific code and infrastructure code, and reusing the infrastructure as a platform for building data preparation code for other scientific models would be straightforward; in particular, many of the intermediate processing steps and even particular data products are also required by other multi-sectoral models [].

More generally, many of the package’s concepts, structural design, and specific code elements may be broadly interesting to, and reusable by, other model/data teams interested in improving the transparency, reproducibility, and flexibility of their systems. Many parts of the gcamdata package could be repurposed for any data system that involves multiple, potentially interacting, data processing steps. Given the wide diversity of human-earth system models and frameworks in use, and the resulting problems associated with separating model and scenario variability [], standardizing on an open-source data processing platform would be valuable for the many communities in human-earth system modeling.

Key areas of interest and potential reuse include:

  • An object-oriented approach to data processing, with chunks (units of code responsible for a specific data-processing step) called when needed by a controlling driver routine.
  • Chunks are auto-discovered, so it is easy to add new ones. An empty chunk template is included in the gcamdata code.
  • Data objects are passed between chunks. All data objects (including those from input files) must have attached metadata, and this is enforced by extensive checking by the driver.
  • Enforcement of file encoding and structure. For example, our developers variously use Windows, Mac, and Unix, all of which have different line ending conventions, but gcamdata enforces a single standard.
  • Chunks declare their inputs and outputs to a driver through a fixed Application Programming Interface. The resulting chunk and data dependency information allows for extensive visualization, data tracing, etc.
  • A great deal of unit/functional testing. This is standard in the software design world [], but much less so in scientific programming [], and to our knowledge extremely rare in systems designed to process or produce datasets.