A Grid for Multidimensional and Multivariate Spatial Representation and Data Processing

Researchers use 2D and 3D spatial models of multivariate data of differing resolutions and formats. It can be challenging to work with multiple datasets, and it is time consuming to set up a robust, performant grid to handle such spatial models. We share ‘agrid’, a Python module which provides a framework for containing multidimensional data and functionality to work with those data. The module provides methods for defining the grid, data import, visualisation, processing capability and export. To facilitate reproducibility, the grid can point to original data sources and provides support for structured metadata. The module is written in an intelligible high level programming language, and uses well documented libraries as numpy, xarray, dask and rasterio.


Introduction
Spatial models are needed to enable numerical problems to be solved in a broad range of scientific applications. Representation of data and modelled properties can be discretizised to a grid. Each cell in the grid can contain a value from measurements or a modelled value, at a position defined in space and time. Cells can also be assigned a value by interpolation of nearby data points or by assumptions. The location of each grid cell is specified along the dimensions by index number or coordinates from e.g. a geographic coordinate system. Grids that represent part of Earth must also be associated with a geodetic datum for reference to the physical world. Cells in a regular grid represent the shape of parallelepipeds, and can be rectilinear or Cartesian. The latter is the special case where the cells are unit squares, or unit cubes. Some data, e.g. surface elevation, can be expressed in only two dimensions. Other parameters can vary in all spatial directions, and time, and need to be represented in a multidimensional grid. The cell size limits the resolution of the model, smaller cells can represent higher frequencies, but a denser and larger grid add exponentially to the computing cost [35]. To populate a grid model, data are generally imported from different sources and in various formats. Images and continuous data are often available as regular raster files, while some observations are provided as points in an irregular grid, or vector data as polygons and lines. Spatial data are published in different data, projections and coordinate systems. Given this variety of formats and reference conventions, it is inevitable that combining data from different sources often presents a challenge.

The computational framework
We share agrid, a framework to produce a regular grid for multidimensional and multivariate spatial modelling, processing and analysis. The extended functionality of the grid addresses many of the challenges in working with spatial 2D and 3D data noted above. Following the principles of Wilson et al. (2014) [39], the code is written in highest possible language level and made readable and intelligible. We use the general-purpose programming language Python 3. Python is equipped with libraries for fast array operations [23,37], basic statistics [20], signal processing and other scientific tools [15], machine learning [24], visualisation [14,26] and discipline specific libraries for e.g. seismology [3,21], astronomy [28] and GIS [8,10,16]. Python also provides interfaces for other languages as R, C and Fortran. All those tools and packages can be reached from the open structure of agrid (Figure 1).
A few related open-source projects provide useful code for the Earth Sciences community; GemPy [4] is a package that facilitates stochastic geomodeling and probabilistic programming. The package uses the linear algebra compiler Theano [2] for efficient computation. Another related project is Verde [36] and the Fatiando tool box, which contains advanced methods for e.g. interpolation. There are also examples of successful projects that connect various data sources with users. Quantarctica [30] makes Antarctic datasets from various sources easily accessible in a Geographical Information System (GIS) application, QGIS [25]. However, even with some 3D functionality in recent upgrades, GIS is predominantly a 2D frame. Another related project is the multidimensional DataCube [18,19]. DataCube pre-processes and presents remote sensing geographical and geophysical attributes for researchers and the broader public. DataCube is mainly targeted for changes (e.g. in Landsat raster data) over time, but has a broad range of possible applications. In comparison, agrid is relatively light, easy to modify, and the dependencies are kept to a minimum. Data held in the agrid environment are not regarded only as a set of values: each observation can include quantified uncertainty, probability or likelihood, and data can also be associated with metadata for provenance. It is advantageous that cells of a grid model can be populated with such allied information, together with the dataset. agrid was initially developed for studies of the Antarctic lithosphere [33,34], and pre-processing of geophysical data for visualisation purposes [22], but with updates as presented here, it can be used in any discipline, geographical region, projection, dimensionality and any resolution. This initial release of the code is presented with tutorial notebooks that demonstrate its usage. The examples given in this paper can be reproduced from the provided SConstruct script [5,6,17]. Subsequent versions of agrid will include additional functionality. We plan, e.g., additional methods for conversion and improved visualisation, support hexagonal 2D grids, curvilinear grid and increased polar and spherical functionality. We hope that colleagues will find this contribution useful, and hopefully encourage scientists to share code and publish reproducible studies.

Implementation and architecture
agrid is structured as a Python module that imports dependencies and defines an agrid class object, Grid(), when imported. When calling Grid(), an object is created that represents the spatial extent of the model space. The grid is initiated with projection, extent and resolution. When the instance of the agrid class object is created, an xarray dataset is defined with dimensions and populated with coordinates. Dimensions includes, but are not limited to, space (X, Y, Z), time (t) and frequency bands (e.g. RGB). Models might also include probability or likelihood. Extent is defined as left, right, up and down, and refers to the rectangular map view. Predefined coordinates are the default units for the projection, e.g. x and y in metres, and degrees in WGS 1984, EPSG:4326. At setup, there is an option of the grid can represent both the corners or the centre points of each cell. The default settings gives a coarse global grid of WGS84 (EPSG:4326), with a resolution of 1° ≈ 111.1 km.
agrid facilitates access to array operations in the spatial domains, as projected grid cells. The data is stored as Figure 1: Components of agrid: accessory methods, the class Grid() and example-specific code (feature methods).
A class object (brown) contains functions for e.g. import and export. It also contains the xarray dataset (gray) and attributes. Various data formats (left) are converted to numpy arrays and incorporated as data arrays in an xarray dataset. Each data array can be associated to coordinates. The dataset also contains metadata (green). Data can be exported or visualized (right). Accessory methods include a download function to link the Grid() class directly to the data source if required, e.g. for dynamic updating. A few example-specific methods are also distributed together with the module [22,33,34]. data arrays in an xarray dataset [12,13]. xarray is built on numpy [23,37] and pandas [20], and provides high level functions for labelled multidimensional datasets. xarray has a structure similar to netCDF file format [27] and netCDF is also used as the native format to store grids. By using dask arrays, only the data used is loaded into memory in chunks [29]. dask also facilitates some parallel computing. Grid cells can be selected with the advanced indexing methods in xarray by geographical coordinates as well as index numbers in the grid. Additional coordinates with different resolution can be created and added to the object at any point. Computations with data grids of different resolution are performed by generating vectors from chunks of the larger array so that the resulting grid sizes are identical. The vectors are unfolded back to the higher resolution grid after the computation. By using this approach, fast numpy operations can be applied on arrays of different shapes and size and there is no need to over-sample low resolution data.
In a research project, agrid can point directly to original data sources. This simplifies the workflow, as development can be done in low resolution or small extent, but larger grids can be used when required and data-sets can easily be swapped. Pre-processing and visualisation can be moved from third part software or stand-alone applications to a condensed workflow (Figure 1 and Listings 1 and 2). This provides overview and facilitates reproducibility and flexibility for the researcher [11].

Example of grid generation and data import
Code in Listing 1 generates a frame of Antarctica, using WGS 84/Antarctic Polar Stereographic projection and a lateral cell size of 10 km × 10 km. The Extent is defined in the default unit of the projection. Coordinate reference system (CRS), is given as an integer and therefore interpreted as an EPSG code. For this example, the 2D grid is Cartesian and quadratic, but the depths slices are defined by the list depths. Due to the convention of indexing arrays as row -column and geographical coordinates as lat -lon, grid coordinates are also given as Y -X for consistency.
The instance of Grid() class contains a number of functions to import data of different types, visualisation and export (Figure 1). Raster data, e.g. GeoTiff, can be imported with a method using rasterio [10] and the underlying gdal [38]. Rasters are warped to fit the extent, resolution and projection of the grid. An imported raster is shown in Figure 2b. Vector data are imported with fiona [9] and geopandas [16] with options for rasterization from agrid.grid import Grid from agrid.acc import download km = 1000  of attribute data and interpolation. Grids or data points can be read from a number of formats and interpolated. A rasterized polygon dataset is shown in Figure 2a and is also used to crop and select data in Figure 2c-d.

Example of visualization and data export
The class also contains functions for visualisation using matplotlib [14] and Cartopy [41] (Figure 2a-c). Map views with e.g. coast lines and coordinates can be produced directly by agrid. Mayavi [26] and the underlying VTK [31] are used for 3D visualisation (Figure 2d). Data can be exported as netCDF, GeoTiff or ASCII formats. JSON format is used to import metadata and export model parameters.

Quality control
The module is published with a number of tutorials to demonstrate the functionality with different data sources, scales and extent. Known limitations exist in the visualization methods for less common projections and some warnings are not handled smoothly. Error handling mainly relies on used dependencies with only limited functionality in agrid itself. Development errors have been ruled out by comparing results from other GIS applications. 2D data that have been imported, processed and exported, have been compared to similar processing in the GIS applications QGIS. Those test cases and additional test code are also available from the project's github repository [32]. The updated issue tracker is likewise available at github.

Operating system
The code is developed and tested in Ubuntu 16.04, 18.04 and macOS High Sierra 10.13.6. It has also been tested on Windows 10.

Additional system requirements
Very low requirements for basic use, but can be scaled up for larger grids. The use of dask arrays relax the need for large RAM.

Dependencies
The class depends on a number of Python packages that can all be installed by package managers, e.g. pip3 or conda: Minimum dependencies: cartopy geopandas matplotlib json numpy pyproj rasterio scipy xarray Additional dependencies used and imported only by some methods: datetime fiona imageio mayavi requests shapely tarfile tqdm zipfile. (3) Reuse potential agrid is deliberately developed for reuse in a broad range of applications. The code is commented and explained to guide and advice modifications. The code could be useful for any spatial processing and analysis in areas such as solid Earth geophysics, geotechnical and environmental applications. For some uses, the complete package might be installed, but with the open architecture, copied snippets or methods can be included into other projects. The MIT license allows for a broad reuse. Functionality and issues may be discussed on the code repository. Python and the used libraries are also supported by large online communities.