(1) Overview

Introduction

The numerical global atmosphere-chemistry model EMAC (ECHAM/MESSy Atmospheric Chemistry) is a modular application used for climate modeling simulations [5]. EMAC uses the chemical kinetic module MECCA [8], utilizing the Kinetic Pre-Processor (KPP) general purpose open-source software tool [4] to calculate the concentrations and the interactions between different chemical species in the atmosphere. Numerically, solving atmospheric chemical kinetics is one of the most computationally intensive tasks in atmospheric chemical transport simulations.

The MECCA sub-model uses KPP to numerically solve ordinary differential equations (ODE) describing atmospheric chemical kinetics. KPP takes as input chemical reactions written in a domain-specific language and produces C and FORTRAN compatible code. The output ODE solver allows for the temporal integration of the full kinetic system. KPP utilizes the sparsity of the Jacobian matrices to increase the efficiency of the solver [9].

In typical climate simulations, chemical kinetics can take up to 90% of execution time [2]. To address this computational challenge, this paper presents a source-to-source parser that transforms the output of KPP from FORTRAN to GPU accelerated code by generating a CUDA [7] compatible solver [3]. The goal is to significantly improve the performance of numerical chemical kinetics (in terms of time-to-solution and problem complexity) in climate simulation models using GPU accelerators.

The software provides a purpose-built acceleration pathway for EMAC, rather than the general closed source solution KPP-A [6], which generates GPU-accelerated code from KPP-language directly. There have also been similar efforts in the WRF-Chem model [6].

Implementation and architecture

The implementation uses a source-to-source parser written in the Python programming language to generate a CUDA [7] compatible solver, by parsing the KPP preprocessor auto-generated FORTRAN code. Each GPU thread calculates the chemical concentrations of an individual cell. The solver uses the temporary arrays between different steps and they are allocated in the stack memory with the exception of the RCONST array that is stored in global memory. The memory required for temporary arrays depends on the configuration of the chemistry and can go up to 50 KB per CUDA thread. The intermediate values are stored in temporary arrays using the double representation. The accelerated code uses 3 KB of shared memory and 0.5 KB of constant memory when indirect arrays are not used. The parser allocation additional constant memory up to 2 KB when indirect accesses are used. All the methods that are available in the KPP numerical library under MECCA are supported.

The computation data structures are subdivided in runtime-specified arrays of columns in the atmosphere, with the memory of each array transferred to the GPU global memory and each grid box calculated on a separate GPU core to achieve massive parallelization, as shown in Figure 1. The CUDA chemical kinetics solver comprises three steps, also presented diagrammatically as a flow chart in Figure 2:

  1. The first step is the calculation of the reaction rate coefficients. The variable values are stored in a global array inside the GPU and used in the computational kernels.
  2. The second step is the most computationally demanding, including mostly linear algebra functions for the ODE solvers. The kernel selects the variation of the Rosenbrock solver method inside the GPU using an array of constant values in the memory.
  3. The third step kernel is used for statistical reduction, and demands limited computational time compared with other kernels.
Figure 1 

Grid partitioning of the atmopshere in the global model. Each GPU thread is assigned the calculation for one cell, stacked in arrays of columns.

Figure 2 

Tasks offload execution GPU.

There are two files required to enable the GPU utilization: i) f2c_alpha.py and ii) kpp_integrate_cuda_prototype.cu. The pre-processor is executed by running python f2c_alpha.py in the messy/util directory.

When offloading to GPUs, the number of cells must not exceed 12288. The application calculates the number of cells by multiplying the number of columns by the number of levels for the atmosphere. The user can specify the number of columns by using the NVL[1] (NPROMA) runtime parameter in the configuration of the EMAC.

Quality control

To ensure the quality of the code, we conduct unit testing by comparing the GPU accelerated with a pure Fortran simulation for one model year, using 155 species and 310 reactions. We compare the output of chemical element concentrations between the CPU only and accelerated version after the first time-step to calculate the relative error. The results show a median difference of 0.00005% with the maximum difference value of 0.54% using identical inputs, ensuring the high accuracy of the chemical solver. This is well within the accuracy criterion, asserting the numerical correctness of the GPU kernel.

Finally, we compare the results of aggregated mass of the CPU-only and GPU-accelerated version, over one year of simulated time. This test aggregates the error created over time using the L1-norm method and investigates the stability of the model. The results show that the median value of the difference in aggregated mass is less than 5%, well within the expected margin of differences stemming from architecture and compiler implementations.

(2) Availability

Operating system

The software is compatible with any operating system that supports the CUDA SDK.

Programming language

The script is written in the Python language and it has been tested with Python version 2.7. The script produces code that supports the the CUDA programming model version 2.0 or later.

Additional system requirements

Each CPU process that offloads to GPU requires a chunk of the GPU VRAM memory, whose size is dependent on the number of species and reaction constants in the MECCA mechanism. The number of GPUs per node and VRAM memory available in each GPU dictates the total number of CPU cores that can run simultaneously. We strongly recommend that at least 2.5 GB of VRAM per CPU process should be available on each GPU. Note that if not enough memory is available, the CUDA runtime will silently fail – without any warning.

Dependencies

The script requires Python version 2.6 or 2.7. The source code produced requires the CUDA SDK 6.5 or newer.

List of contributors

  • Michail Alvanos – Active maintenance and optimization.
  • Theodoros Christoudias – Conception, integration with the ECHAM/MESSy climate model, correctness/accuracy testing.
  • Giannis Ashiotis – Developed an initial version of the parser.

Software location

Archive

Name: MECCA – KPP Fortran to CUDA source-to-source pre-processor.

Persistent identifier: DOI: 10.5281/zenodo.546811

Licence: MIT License.

Publisher: The Cyprus Institute.

Version published: 1.1.

Date published: 18/04/17.

Code repository

Name: GitHub.

identifier: https://github.com/CyIClimate/medina.

Licence: MIT License.

Date published: 12/12/16.

Language

English.

(3) Reuse potential

The source-to-source parser core can be used by other researchers to transform FORTRAN 90 code to CUDA accelerated code. The methodology can be used as template to create high performance computing capabilities on GPUs for scientific code that is currently written in FORTRAN 90.

The CUDA compatible ODE solvers comprises the Runge Kutta methods and the linearly-implicit Rosenbrock family of solvers, and may be used for chemical kinetics and numerical integration outside the field. The software is supported by the developers. Any inquiries should be addressed by e-mail.