The Python package

The Fast Fourier Transform (FFT) is a class of algorithms used to calculate the discrete Fourier transform, which traces its origins back to the groundbreaking work by [

To elaborate, there is only one way to apply domain decomposition for a 2D FFT: splitting the array into narrow strips along one dimension. However, for a 3D FFT, there are two strategies to distribute an array in memory, the 1D (or
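The two strategies (commonly called slab and pencil decompositions) constrain the local array shapes, and hence the maximum number of processes, as follows. This is a schematic sketch with a hypothetical helper function, not code from any FFT library:

```python
import math

def local_shape(global_shape, n_proc, strategy="slab"):
    """Shape of the sub-array owned by each process (hypothetical helper).

    "slab": 1D decomposition, split along the first axis only.
    "pencil": 2D decomposition, split along the first two axes
    with a (near-)square process grid.
    """
    n0, n1, n2 = global_shape
    if strategy == "slab":
        return (n0 // n_proc, n1, n2)
    if strategy == "pencil":
        p0 = int(math.sqrt(n_proc))
        p1 = n_proc // p0
        return (n0 // p0, n1 // p1, n2)
    raise ValueError(strategy)

# A slab decomposition cannot use more processes than n0 = 384:
print(local_shape((384, 1152, 1152), 128, "slab"))     # (3, 1152, 1152)
# A pencil decomposition scales to many more processes:
print(local_shape((384, 1152, 1152), 1024, "pencil"))  # (12, 36, 1152)
```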

Some of the well-known libraries are written in C, C++ and Fortran. The classical

Apart from CPU-based parallelism, General Purpose computing on Graphical Processing Units (GPGPU) is also gaining traction in scientific computing. Scalable libraries written for GPGPU frameworks such as OpenCL and CUDA have emerged, each with its own FFT implementation, namely

Python can easily link these libraries through compiled extensions. For a Python developer, the following packages leverage this approach to perform FFT:

sequential FFT, using:

–

–

–

FFT with MPI, using:

–

–

FFT with GPGPU, using:

–

–
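As a minimal illustration of the sequential case, NumPy itself ships FFT functions that wrap a compiled backend; for example:

```python
import numpy as np

# Sequential 2D FFT round trip with NumPy.
u = np.random.rand(8, 8)
u_hat = np.fft.fft2(u)        # forward transform (complex array)
u_back = np.fft.ifft2(u_hat)  # inverse transform

# The round trip recovers the field up to floating-point error.
assert np.allclose(u_back.real, u)
```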

Although these Python packages are under active development, they suffer from certain drawbacks:

No effort so far to consolidate sequential, MPI and GPGPU-based FFT libraries under a single package with similar syntax.

Quite complicated to use, even for the simplest scenarios. To understand how to use them, a novice user has to, at least, read the

No benchmarks comparing the libraries with each other, nor comparing the Python solutions with solutions based only on a compiled language (such as C, C++ or Fortran).

They provide just the FFT and inverse FFT functions, with no associated mathematical operators.

The Python package

tests,

documentation and tutorials,

benchmarks,

operators for simple tasks (for example, compute the energy or the gradient of a field).
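As an illustration of such an operator, the mean energy of a field can be computed either in physical or in spectral space; a generic NumPy sketch using Parseval's identity (not the package's actual operator API):

```python
import numpy as np

u = np.random.rand(64, 64)
# Normalize so that u_hat[0, 0] is the mean of the field.
u_hat = np.fft.fft2(u) / u.size

energy_phys = 0.5 * np.mean(u ** 2)
energy_spect = 0.5 * np.sum(np.abs(u_hat) ** 2)

# Parseval's identity: both estimates agree.
assert np.allclose(energy_phys, energy_spect)
```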

In the present article, we shall start by describing the implementation of

The two major design goals of

to support multiple FFT libraries under the same umbrella and expose the interface for both C++ and Python code development.

to keep the design of the interfaces as human-centric and easy to use as possible, without sacrificing performance.

Both C++ and Python APIs provided by

The C++ API is implemented as a hierarchy of classes as shown in Figure

Class hierarchy demonstrating object-oriented approach. The sequential classes are shown in red, the CUDA-based classes in magenta and the MPI-based classes in green. The arrows represent inheritance from parent to child class.

Remaining methods which are specific to a library are defined in the corresponding child classes, depicted in coloured boxes in Figure

Let us illustrate with a trivial example, wherein we initialize the FFT with a random physical array, and perform a set of

As suggested by the comments above, in order to switch the FFT library, the user only needs to change the header file and the class name. An added advantage is that the user does not need to bother with the domain decomposition while declaring and allocating the arrays. A few more helper functions are available with the FFT classes, such as functions to compute the mean value and energies in the array. These are demonstrated with examples in the documentation.

Similar to other packages in the FluidDyn project,

As a short example, let us try to calculate the gradient of a plane sine-wave using spectral methods, mathematically described as follows:

where k_{x}, k_{y}

The equivalent pseudo-spectral implementation in
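Since the listing itself is not reproduced here, a generic NumPy sketch of the same computation follows (not fluidfft's API); the spectral derivative uses the standard relation du/dx = IFFT(i * kx * FFT(u)):

```python
import numpy as np

n = 32
L = 2 * np.pi
x = np.linspace(0, L, n, endpoint=False)
X, Y = np.meshgrid(x, x, indexing="ij")
u = np.sin(X)  # plane sine wave along x

# Integer wavenumbers for a domain of length 2*pi.
kx = 2 * np.pi * np.fft.fftfreq(n, d=L / n)
KX = kx[:, np.newaxis]  # broadcast over the y axis

u_hat = np.fft.fft2(u)
dudx = np.fft.ifft2(1j * KX * u_hat).real

# d/dx sin(x) = cos(x), recovered to machine precision.
assert np.allclose(dudx, np.cos(X), atol=1e-10)
```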

A parallelized version of the code above will work out of the box, simply by replacing the FFT class with an MPI-based FFT class, for instance

Thus, we have demonstrated how, by using

The internal architecture of

Flowchart illustrating how the C++ and Python API are built and used for one particular class, viz.

The Python API is built automatically when

A helper function

To summarize,

One C++ class per FFT library derived from a hierarchy of C++ classes as shown in Figure

Python operator classes (2D and 3D) to write code independently of the library used for the computation of the FFT and with some mathematical helper methods. These classes are accompanied by unit test cases.

Command-line utilities (

Scalability of

where n_{p,min} is the minimum number of processes employed for a specific array size and hardware. The subscripts,
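The definition elided above is presumably the conventional strong-scaling speedup normalized by the smallest number of processes used; a hedged reconstruction (the symbol names and the "fastest" qualifier are assumptions):

```latex
S_\alpha(n_p) =
  \frac{\left[ T_\alpha(n_{p,\min}) \right]_{\text{fastest}} \times n_{p,\min}}
       {\left[ T_\alpha(n_p) \right]_{\text{fastest}}}
```

With this normalization, ideal strong scaling corresponds to S(n_p) = n_p.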

To compute strong scaling the utility

On big HPC clusters, we have only focussed on 3D array transforms as benchmark problems, since these are notoriously expensive to compute and require massive parallelization. The physical arrays used in all four 3D MPI-based FFT classes are identical in structure. However, there are subtle differences in terms of how the domain decomposition and the allocation of the transformed array in memory are handled.

Hereafter, for the sake of brevity, the FFT classes will be named in terms of the associated library (for example, the class

Figure

Speedup computed from the median of the elapsed times for 3D fft (384 × 1152 × 1152, left: fft and right: ifft) on Occigen.

The benchmark is not sufficiently accurate to measure the cost of calling the functions from Python (difference between continuous and dashed lines, i.e. between pure C++ and the

Figure

Speedup computed from the median of the elapsed times for 3D fft (1152 × 1152 × 1152, left: fft and right: ifft) on Occigen.

In Figure

Speedup computed from the median of the elapsed times for 3D fft (384 × 1152 × 1152, left: fft and right: ifft) on Beskow.

A striking difference when compared with Figure

The strong scaling results of the cubical array on Beskow are displayed on Figure

Speedup computed from the median of the elapsed times for 3D fft (1152 × 1152 × 1152, left: fft and right: ifft) on Beskow.

In Figure

Speedup computed from the median of the elapsed times for 3D fft (320 × 640 × 640) at LEGI on cluster8.

We have also analysed the performance of 2D MPI-enabled FFT classes on the same machine using an array of shape 2160 × 2160 in Figure

Speedup computed from the median of the elapsed times for 2D fft (2160 × 2160) at LEGI on cluster8.

As a concluding remark on scalability, a general rule of thumb is to use 1D domain decomposition when only very few processors are employed. For massive parallelization, 2D decomposition is required to achieve good speedup without being limited by the number of processors at one's disposal. We have thus shown that the overall performance of the libraries interfaced by

As already mentioned, we use

Note that this implementation is “out-of-place”, meaning that the result is returned by the function and that the input velocity field (
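The elided listing presumably applies the standard divergence-free projection P_ij = delta_ij - k_i k_j / k**2; a generic NumPy sketch of such an out-of-place function (the name and signature are our own) is:

```python
import numpy as np

def proj_out_of_place(vx, vy, vz, kx, ky, kz, inv_k_square_nozero):
    """Project a spectral velocity field onto its divergence-free part,
    returning the result in new arrays ("out-of-place")."""
    tmp = (kx * vx + ky * vy + kz * vz) * inv_k_square_nozero
    return vx - kx * tmp, vy - ky * tmp, vz - kz * tmp

# Minimal check on a small 3D spectral field.
n = 8
k1d = np.fft.fftfreq(n) * n
KX, KY, KZ = np.meshgrid(k1d, k1d, k1d, indexing="ij")
k2 = KX**2 + KY**2 + KZ**2
k2_nozero = k2.copy()
k2_nozero[0, 0, 0] = 1.0  # avoid division by zero for the mean mode
inv_k2 = 1.0 / k2_nozero

rng = np.random.default_rng(0)
vx, vy, vz = (rng.standard_normal((n, n, n)) + 0j for _ in range(3))
px, py, pz = proj_out_of_place(vx, vy, vz, KX, KY, KZ, inv_k2)

# The projected field is divergence-free: k . v == 0 for every mode.
assert np.allclose(KX * px + KY * py + KZ * pz, 0)
```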

The array notation is well adapted to express this simple vector calculus and is less verbose than explicit loops. Since explicit loops with indexing are not required, the computation with Python and

In the top axis of Figure

Elapsed time (smaller is better) for the projection function for different implementations and tools. The shape of the arrays is (128, 128, 65). The dotted lines indicate the times for Fortran for better comparison.

a Fortran code (not shown)

the simplest Python version shown above.

a Python version with three nested explicit loops:

For the version without explicit loops, we present the elapsed time for two cases: (i) simply using Python (yellow bar) and (ii) using the Pythranized function (first cyan bar). For the Python version with explicit loops, we only present the results for (i) the Pythranized function (second cyan bar) and (ii) the result of

We see that

Note that the exact performance differences depend on the hardware, the software versions,

Since allocating memory is expensive and we do not need the non-projected velocity field after the function call, an evident optimization is to put the output in the input arrays. Such an “in-place” version can be written with

As in the first version, we have included the

Note that this code is much longer and clearly less readable than the version without explicit loops. This is, however, the version used in

The elapsed time for these in-place versions and for an equivalent Fortran implementation are displayed in the bottom axis of Figure

From this short and simple microbenchmark, we can infer four main points:

Memory allocation takes time! In Python, memory management is automatic and we tend to forget about it. An important rule for writing efficient code is to reuse already-allocated buffers as much as possible.

Even for this very simple case quite favorable for

For the aforementioned reasons, we have preferred

The package

For

We also try to follow a consistent code style as recommended by PEPs (Python Enhancement Proposals) 8 and 257. This is also inspected using lint checkers such as

Windows and any POSIX based OS, such as GNU/Linux and macOS.

Python 2.7, 3.5 or above. For the next versions, we will

C++ API:

Python API:

Pierre Augier (LEGI): creator of the FluidDyn project [

Cyrille Bonamy (LEGI): C++ code and some methods in the operator classes.

Ashwin Vishnu Mohanan (KTH): command-line utilities, benchmarks, unit tests, continuous integration, and bug fixes.

English

There is no formal support mechanism. However, bug reports can be submitted at the

Most C++ classes also support single-precision.

Detailed steps for installation are provided in the documentation.

Uses an approach similar to guidelines “

Detailed discussion on “

Saved at

The codes and a Makefile used for this benchmark study are available

Here, we use Python 3.6.4 (packaged by conda-forge),

The results with

See

Ashwin Vishnu Mohanan could not have been as involved in this project without the kindness of Erik Lindborg. We are grateful to Bitbucket for providing us with a high-quality forge compatible with Mercurial, free of cost.

The authors have no competing interests to declare.