Nansat: a Scientist-Orientated Python Package for Geospatial Data Processing

Nansat is a Python toolbox for analysing and processing 2-dimensional geospatial data, such as satellite imagery, output from numerical models, and gridded in-situ data. It is created with strong focus on facilitating research, and development of algorithms and autonomous processing systems. Nansat extends the widely used Geospatial Abstraction Data Library (GDAL) by adding scientific meaning to the datasets through metadata, and by adding common functionality for data analysis and handling (e.g., exporting to various data formats). Nansat uses metadata vocabularies that follow international metadata standards, in particular the Climate and Forecast (CF) conventions, and the NASA Directory Interchange Format (DIF) and Global Change Master Directory (GCMD) keywords. Functionality that is commonly needed in scientific work, such as seamless access to local or remote geospatial data in various file formats, collocation of datasets from different sources and geometries, and visualization, is also built into Nansat. The paper presents Nansat workflows, its functional structure, and examples of typical applications.


Introduction
Geospatial data is information that identifies the geographic location of features and boundaries on Earth, such as natural or constructed features, oceans, and more, usually stored as coordinates and topology, that can be mapped [1].This paper describes a new tool to handle the interpretation and processing of geospatial data from satellite images and numerical models.Typical operations on such data include visualization for human perception of spatial patterns, extraction of geophysical values, pixel-per-pixel or contextual calculation of new geophysical variables based on one or several input datasets, re-gridding to another coordinate system, co-location of several datasets on one spatial grid, subsetting in space, and automatic recognition and description of objects.
With a growing amount of available geospatial data, the complexity of this work is, however, increasing.As an example, a collection of 2-dimensional (2D) datasets acquired over one geographic region at a given point of time will form a 3D cube of data, and time series of such cubes represent 4D datasets.The combination of multidimensional data analysis with significant input data heterogeneity and size (a single satellite may, e.g., provide 20TB of data per day) creates a challenge for researchers [2], and powerful and customized tools are thus required.
A number of analysis and processing tools for gridded data already exist, e.g., ERDAS IMAGINE [3], Bilko [4], and ArcGis [5].Both the European Space Agency (ESA) and the National Aeronautics and Space Administration (NASA) have developed and provide software tools for processing their respective satellite data, including, e.g., the Sentinel Toolbox [6] and SeaDAS [7].However, these tools are written by programmers, from a programmer perspective, in compiled low-level and object-oriented languages [8], such as Java or C/C++ to accomplish highly generic software and reduced execution time.However, scientists often prefer higher-level programming languages in order to minimize the abstraction from their scientific ideas.The nature of scientific work also requires researchers to be able to quickly modify and extend their software tools to be able to analyse new sets of information.The possibility to work from the command line, and create quick and simple visualizations is therefore very important, whereas execution speed is usually less of a concern.Python [9] is a programming language that can meet these needs.It has a large user community, and it is free and open source.Existing Python packages, such as Numerical Python (Numpy) [10], Scientific Python (Scipy) [11] and Matplotlib [12], already provide much of the required flexibility, and code can be speeded up by clever programming practices.Compared to, e.g., Java, it is relatively easy to get started with Python, and the threshold for being able to produce scientific results is lower.As Python is object oriented, it also allows programmers to create reusable and complex software tools.As such, Python is highly suitable for scientists, while overlapping into what programmers want.
Many of the above mentioned tools are available only as Graphical User Interfaces (GUIs).A challenge with such tools is to automate operations that could be performed by computers alone [13,14].In an attempt to solve this problem, scientific users of geospatial data often develop their own software tools which can be more easily integrated into processing chains handling multiple datasets (e.g., an archive of satellite images) [15].With the boom of open source software, the programming paradigm has gradually changed from development of large, strongly coupled packages to modular, unifunctional libraries that can be used together as building blocks for multifunctional applications.The Geospatial Data Abstraction Library (GDAL [16]) is an example of an open source software rarely used alone, but which is included in many applications (even proprietary) and online services such as, e.g., QGIS [17] and GRASS [18].GDAL has a very usable and well documented application programming interface (API), with bindings for many programming languages, and a number of community developed drivers that are capable of reading geospatial data and metadata from many file formats.
The advantage of GDAL is that it provides access to various sources of information using the same universal data model.However, a limitation for scientific users is that GDAL only provides access to the data or metadata stored in the file and does not attach any geophysical meaning to the data.Thus, it may become difficult for a researcher to correctly interpret the content of the dataset with regard to the relevant band name, calibration, physical units, spatial scale, etc.In addition, GDAL provides low-level access to the data, and a scientific user is often required to perform several complicated operations in order to eventually fetch the values of the required geophysical variable at a given resolution and region of interest.Furthermore, the structure of files containing geospatial data varies significantly between its sources, and GDAL does not provide a unified interface.In many datasets, there are specific details about, e.g., subdatasets, ground control points, or the location of metadata, which should be taken into account.
The main goal of Nansat is thus to combine the versatility and power of GDAL with scientific knowledge about the content of geospatial data from various sources in a user-friendly Python package.In order to simplify the data use and interpretation, Nansat adds a priori known scientific metadata following the CF and GCMD conventions [19,20] to the datasets, unifies the access to various data (either in local files or through the OpenDAP protocol) as much as possible, and provides a high level API to enable use in diverse, as well as automated operations.Via GDAL, spatial transformation of data and exporting the data into common formats (e.g., NetCDF-CF, PNG, GeoTIFF) is also facilitated.

Implementation and architecture
In this section, the instantiation of a Nansat object with geospatial satellite data is summarized and schematically presented on Fig. 1 (subsection A), and the Nansat structure including details about classes, mappers and use of third party libraries are given (subsection B).Examples of more advanced processing are provided in section (3) 'Reuse Potential'.

A. Nansat workflow
(1) A user invokes the command n=Nansat (input_ file_name), where input_file_name is a string that includes the full path to the input file containing the geospatial data.(2) The Nansat constructor uses GDAL to open the input file and extract provided metadata.(3) The Nansat constructor generates a list of available mappers based on the content of the nansat.mapperssubpackage.A mapper is a Python class named Mapper, where the structure of the opened file is defined a priori, or where qualified guesses based on, e.g., filenames and metadata following known conventions (e.g., the CF-conventions) are defined.
A mapper maps the structure of the input file to the structure of the Nansat object and provides meaning to the dataset through added metadata.Since there are many file types which can be opened by Nansat, many mapper classes have been developed.
In each mapper class, the structure of the input file (e.g., the number of bands and their content, type and name of the sensor and platform, relevant global metadata) is hardcoded or partly dynamically constructed.The Nansat constructor then loops through the available mappers, and.(4) Each of the mapper constructors tests whether or not the opened file matches the given mapper.(5) If the opened file matches the given mapper, a GDAL Virtual XML file [21] (hereinafter referred to as VRT file), which represents the description of the GDAL data model, is created.The mapper constructor then adds relevant bands from a priori specified bands of data provided in the input file.The Mapper constructor also adds conventional metadata with the standard and short name of geophysical variables, their units, valid minimum and maximum to the VRT file for each band.If scaling of values is needed a corresponding scales and offsets are also added to the VRT file.(6) If the given mapper does not match the provided file, a MapperError, which is a dedicated Python Exception, is immediately raised, and the next mapper is tested.(7) If none of the mappers matches the input file, the Nansat constructor creates the VRT file without scientific metadata.(8) Finally, the Nansat constructor creates a GDAL Dataset object from the VRT file, and adds this to the attributes of the generated Nansat object.A GDAL Dataset is a class that provides various methods for low level access to a set of associated raster bands and metadata [22].

B. Nansat functional structure
Nansat is designed as a modular collection of classes (Fig. 2) to make the code more cohesive and less coupled [23], The Nansat class is the main container for the geospatial data, and performs all high-level operations.An instance of the Nansat class contains information about the geographical reference of the data (e.g., raster size, pixel resolution, type of projection, etc.) and about bands with geophysical variables (e.g., water leaving radiance, normalized radar cross section, chlorophyll concentration).A Nansat instance contains methods for high-level operations on the data, including mainly the following:  parameters in the core VRT file are modified.The actual resampling happens only when the data is requested (e.g., within the method write_figure); and • export: write data to a GeoTIFF or NetCDF [24] file following OGC conventions and standards.
The Domain class is a container for the geographical reference of the raster data.A Domain instance is a GDAL Dataset with information about the geolocation information only (i.e., no bands), containing, e.g., • The type and parameters of a spatial reference system (SRS) (projection), e.g.cylindrical or stereographic, and central meridian and parallel; • The grid width and height (number of pixels or grid cells); • The pixel size in decimal degrees, metres or other units (depends on the selected projection); and • The relation between pixel/line coordinates and geographical coordinates, e.g., as a linear relation or as ground control points (GCP).
There are three ways to store the geo-reference in a GDAL dataset: 1.As a GDAL GeoTransform, which defines a linear relationship between the raster pixels and lines, and the geographical X and Y coordinates; 2. As GCPs (Ground Control Points), which define the relationship between raster pixels and lines, and geographical X and Y coordinates; and 3.As a Geolocation Array, which is a grid of geographical X and Y coordinates for each pixel in a raster dataset.
The relation between row and column coordinates of the raster and geographic (e.g.latitude and longitude) coordinates is defined by the projection type and projection parameters.
The Domain class has methods for basic operations with georeference information: • __init__: (the constructor) creates the georeference based on provided arguments; • get_corners, get_border, get_geolocation_grid: fetch coordinates of the grid corners, the footprint/ coverage of the grid, or the entire raster; and • write_map: create map of the grid footprint, and add coastlines and graticule.
The VRT class is a wrapper around a GDAL vrt file.The vrt file is in XML format, and contains the geo-reference, the global and band metadata, and reference to the data grids (or bands) provided in the source file.The class VRT performs all operations on the VRT files, i.e., copying, modifying, reading, and writing of files, adding bands, setting spatial reference attributes as such projection or coordinate transformation, and more.These operations are performed either directly by methods in the VRT class (e.g., remove_geotransform and get_warped_vrt), or using GDAL functions (e.g., Create, AddBand, SetMetadata, and AutoCreateWarpedVRT).The core of the VRT object is a GDAL Dataset instance generated by the GDAL VRT Driver.
The respective vrt file is located in computer memory, and is accessible using the GDAL VSI functionality (the path to the vrt file is /vsimem).The VRT class can also be used to write binary files with actual data accompanied with a wrapping vrt file that describes its dimensions and data type.This is used when adding data to existing Nansat instances or caching CPU consuming operations.
When Domain is instantiated, the VRT object is created without any bands, only with geo-reference information using the GDAL Dataset methods.On the contrary, when a file is opened with Nansat, the VRT object with both the georeference and the bands is created using the class Mapper.
When Nansat performs operations that change the vrt file, e.g., by reprojecting, cropping, resizing or adding bands, an additional vrt file with a corresponding VRT object is generated.The new VRT object keeps a reference to the original VRT object.This allows to easily and safely undo operations.
The The Nansatmap class generates nice looking maps where several datasets can be displayed together.It extends the Python Basemap [25] class and overrides its constructor for simpler instantiation based on a given Nansat object.Nansatmap also overrides several Basemap methods by taking advantage of a priori knowledge about the spatial extent: • imshow: adds a raster layer to the map in a fast manner and colorizes with a given colormap and value limits; • smooth: spatially smooths the input data; • contour(f): creates (filled) contour lines; • quiver: adds regularly spaced and correctly rotated arrows indicating direction and speed; • draw_grid: adds labeled graticule; and • save: saves to a graphical format at a given resolution.
The Mosaic class extends Nansat and provides capability to create mosaics of several input files using one of two methods: • average: a memory-friendly and multi-threaded method for averaging data.It reads all input files using Nansat, re-projects the raster data of the required bands onto the original Mosaic object, calculates the mean and standard deviation, and adds these as new bands in the Mosaic object.The method tries to get a band with name 'mask' from the input files.The mask should have the following encoding: If the "mask" band is present, it is used to select the valid pixels for averaging (i.e.where mask equals 64).Otherwise it is assumed that all input pixels are valid.
• median: a more memory greedy method working in only one computational thread.Instead of returning the mean, this function calculates the median of specific bands in the input files, and adds the resulting array as a new band in the Mosaic object.

Quality control
Quality control of the code is implemented in two blocks: unit tests and integration tests.Currently, the unit tests cover 78% of the core code [26] and thus the most important functionality.Sample datasets are pre-generated for running the unit tests, and consists of synthetic satellite images with data either on a regular projection or with the projection defined by ground control points (GCPs).The code repository is integrated with the Travis Continuous Integration platform [27] and tests are automatically run on each commit.The integration tests are developed for testing mappers and operation chains on actual data.The integration tests are designed as a separate package based on the Python unittest [28] package, and are run only locally (i.e.,not on TravisCI).This is to prevent downloading of large amounts of satellite data on the test server.Including the integration tests, the coverage of all code including all mappers is 59% (this is not reflected in the Coveralls badge in the readme file at GitHub).

Operating system
Nansat has been tested on OS X and Linux and is also provided in a Vagrant/Ansible configuration for virtual machines (CentOS 7 or Ubuntu 14.04).Thus, it is easy to install the tool also on Windows, although the actual OS would be Ubuntu.

Programming language
Most of Nansat is written in Python 2.7.A small and optional part (GDAL pixel functions [16]) is written in C.

Additional system requirements
The code is 8 MB, and the system requirements largely depend on the application of Nansat.Small satellite images (less than 1000 x 1000 pixels) can be processed with 100 MB of free RAM and a 1.5 GHz processor.Large datasets (e.g.synthetic aperture radar (SAR) images exceeding 10000 x 10000 pixels) require > 4 GB RAM and > 2.5 GHz CPU.

Dependencies
Nansat has both required and optional dependencies.Installation of all the dependencies for full Nansat functionality, i.e., to access data in netCDF, HDF4 and HDF5 files via GDAL, may sometimes be cumbersome.Therefore, we have pre-built fully functional binaries of GDAL for Linux (32 and 64 bits), which are available in the Anaconda Cloud [29,30].To further simplify the installation, we have created a Vagrant/Ansible configuration to install Nansat and all required dependencies on a local virtual machine as referenced in the "Emulation Environment" section.
The basic requirements of Nansat are: • GDAL >= 1.11.4 [16] • Numpy >= 1.10.4[10] • Scipy >= 0.17.1 [11] • Matplotlib >= 1.5.1 [12] • Basemap >= 1.0.8dev[25] • Pillow >= 3.2.0[31] • py-thesaurus-interface = 1.1.1[32] In addition, there are optional dependencies for specific Nansat functionality, in particular: The potential to reuse Nansat for scientific applications is illustrated in three use cases.These examples, plus other tutorials are available in the nansat-lectures repository [33].Nansat is at the time of writing capable of reading data from tens of platforms and instruments in 8 file formats, as specified in the online table at https://github.com/nansencenter/nansat/blob/master/nansat/mappers/ README.yml.Nansat can be extended by inheritance or by adding new mappers, as described in the online manual [34].User support and feedback is realized via GitHub issues and through a mailing list nansat-dev@googlegroups.com.This example covers spatial collocation of two images acquired in May 2016 by the Synthetic Aperture Radar (SAR) instrument onboard the Sentinel-1A [35] satellite, and the Moderate Resolution Imaging Spectroradiometer (MODIS) [36] onboard the Aqua satellite.The two original level-1 datasets have different spatial reference, resolution and coverage, and are collocated and projected onto the same spatial grid, and then visualized using Nansat.The resulting images (Fig. 3) show MODIS  top-of-atmosphere radiances measured at 469, 555 and 858 nm displayed in RGB (Fig. 3A) and a grayscale image of the same area generated from SAR HH (horizontal transmit and horizontal receive) polarized Normalized Radar Cross Section (NRCS) (Fig. 3B) covering parts of the Norwegian Sea.In these images, we see ocean mesoscale eddies as manifested by both the ocean colour (Fig. 3A) and the sea surface roughness (Fig. 3B).   the ESA Ocean Colour Climate Change Initiative) via the OpenDAP protocol.When connection to the data stream is established, Nansat is used to define the transects and retrieve and plot the actual data (Fig. 4).The Matplotlib and Basemap code used for creating Fig. 4 is out-of-scope of the paper and is thus not provided.Institute, collocate the datasets, and create a nice looking map (Fig. 5).
and thus to facilitate independent development of each block.The class Domain describes geographical projection, span and resolution of any kind of gridded data.It contains an instance of the class VRT, which operates on VRT files.The class VRT contains an instance of a GDAL Dataset for the actual reading of data and metadata.VRT is a base class for Mapper classes, which add scientific meaning to various data read by GDAL.A Domain instance also contains an NSR object, which subclasses OGR SpatialReference for work with spatial reference systems.Nansat is the central class in the package and a subclass of Domain.Nansat contains Figure (convenient generation of images) and PointBrowser (interactive digitizer).Nansat is a base class for Mosaic, which performs mosaicking of several files.

•
__get_item__, or [] (square brackets): fetch data as a Numpy ndarray; • write_figure: use the class Figure for writing full resolution data to an RGB or indexed image with, e.g., land mask, automatic correction of brightness/contrast, and logarithmic scaling; • reproject: change the projection of the dataset,and consequently its span and resolution.This is a lazy operation, meaning that only the projection

Figure 1 :
Figure 1: Schematic flowchart of Nansat instantiation.Details of the process are explained in the text.

Figure 2 :
Figure 2: UML diagram of the Nansat package including the most important class methods.White boxes describe third party classes.

Figure
class performs graphical operations.It is used to create figures, append a legend, add land mask, and adjust brightness and colours, as well as saving figures in PNG, JPG or TIFF formats.An instance of Figure can be created in the Nansat write_figure method, or from a Numpy 2D array (or three 2D layers in case of an RGB image).The Figure class uses two Pillow (Python Imaging Library) Image objects, one for the image canvas and one for the legend.Figure has several convenient methods for optimizing the data presentation: • clim_from_histogram: estimates optimal minimum and maximum brightness limits based on the data distribution; • apply_logarithm: applies logarithmic transformation of the original data; • create_legend: applies one of the Matplotlib colormaps, and adds a legend with colorbar and caption; and • array_mod_function: modifies the given band array by a user provided function.

Example 1 :
Open, collocate, and write figure

Figure 3 :
Figure 3: Example of a simple application of Nansat to collocate and visualize synchronous observations acquired by MODIS and Sentinel1A on 25 May 2011.(A) An RGB image from a MODIS/Aqua acquisition over the Norwegian Sea, (B) a grayscale SAR image from Sentinel-1A, and (C) the data coverage.

Example 2 :
Analyse transects using data streamed via the OPeNDAP protocolIn this example Nansat is used to compare two datasets of surface Chlorophyll-A (Chl-A) and Sea Surface Temperature (SST) in the Barents Sea.As shown in the code below, Nansat accesses remote Chl-A and SST datasets streamed from the Norwegian Meteorological Institute and Plymouth Marine Laboratory (PML) (the latter dataset is provided as part of

Figure 4 :
Figure 4: Comparison (A) of Sea Surface Temperature (SST) and Chlorophyll-A (Chl-A) from transect (B) across the Barents Sea (the data is retrieved using Nansat, and the figure is generated using Matplotlib and Basemap).

Figure 5 :
Figure 5: Map of Sea Surface Temperature (SST), sea ice concentration, and sea ice drift streamed from the Norwegian Meteorological Institute, collocated over the Greenland, Barents, and Arctic Seas.The land mass in the centre of the image is Svalbard.