xarray is an open source project and Python package that provides a toolkit and data structures for N-dimensional labeled arrays. Our approach combines an application programming interface (API) inspired by pandas with the Common Data Model for self-described scientific data. Key features of the xarray package include label-based indexing and arithmetic, interoperability with the core scientific Python packages (e.g., pandas, NumPy, Matplotlib), out-of-core computation on datasets that don’t fit into memory, a wide range of serialization and input/output (I/O) options, and advanced multi-dimensional data manipulation tools such as group-by and resampling. xarray, as a data model and analytics toolkit, has been widely adopted in the geoscience community but is also used more broadly for multi-dimensional data analysis in physics, machine learning and finance.
Python has emerged as a leading programming language for both the physical sciences and data sciences. At the core of modern scientific computing and analysis in Python are the NumPy and SciPy packages, which provide a robust N-dimensional array object and the fundamental operations required for science and engineering applications. Much of the success of Python in data science and business analytics is due to pandas, which introduced intuitive and fast tabular data analysis tools to Python, inspired by R’s data.frame. The pandas DataFrame and Series objects provide unparalleled analysis tools for data alignment, resampling, grouping, pivoting, and aggregation in Python.
xarray implements data structures and an analytics toolkit for multi-dimensional labeled arrays strongly inspired by pandas. While pandas includes a data structure called the Panel for three-dimensional data, its fixed-rank design makes it unsuitable for applications that require arrays of arbitrary rank. Additionally, many of the features that make the pandas DataFrame and Series objects so useful are not fully available on the Panel. Our approach with xarray adopts Unidata’s self-describing Common Data Model, on which the network Common Data Form (netCDF) is built [20, 7]. NetCDF provides a well-defined data model for labeled N-dimensional array-oriented scientific data analysis.
xarray builds on top of, and seamlessly interoperates with, the core scientific Python packages, such as NumPy, SciPy, Matplotlib, and pandas. xarray provides a range of backends for serialization and input/output (I/O), including the Pickle, netCDF, OPeNDAP (read-only), GRIB1/2 (read-only), and HDF file formats. Leveraging the dask parallel computing library, xarray can optionally perform efficient parallel, out-of-core analysis on datasets that are too large to fit into memory. Finally, xarray interfaces with existing domain-specific packages such as UV-CDAT, Iris, and Cartopy.
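The interoperability described above can be illustrated with a minimal sketch. The array contents, the station labels, and the variable names are hypothetical; only conversions to NumPy and pandas and a Pickle round trip are shown, since the netCDF, GRIB, and dask paths depend on optional packages that may not be installed.

```python
import pickle

import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical example data: two days of readings at three stations.
readings = xr.DataArray(
    np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]),
    dims=("time", "station"),
    coords={
        "time": pd.date_range("2016-05-01", periods=2, freq="D"),
        "station": ["a", "b", "c"],
    },
    name="reading",
)

# NumPy: the underlying values are exposed as a plain ndarray.
as_numpy = readings.values

# pandas: a 2-D DataArray converts to a Series indexed by a
# (time, station) MultiIndex, and can be rebuilt from that Series.
as_series = readings.to_series()
from_pandas = xr.DataArray.from_series(as_series)

# Pickle: xarray objects can be serialized with the standard library.
restored = pickle.loads(pickle.dumps(readings))
```

Writing to netCDF would follow the same pattern through the `to_netcdf` method, provided a netCDF backend is available.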
Scientific data is inherently labeled. For example, time series data includes timestamps that label individual periods or points in time, spatial data has coordinates (e.g. longitude, latitude, elevation), and model or laboratory experiments are often identified by unique identifiers. Figure 1 provides an example of a labeled dataset. In this case the data is a map of global air temperature from a numeric weather model. The labels on this particular dataset are time (e.g. “2016-05-01”), longitude (x-axis), and latitude (y-axis).
Unlabeled, N-dimensional arrays of numbers (e.g., NumPy’s ndarray) are the most widely used data structure in scientific computing. However, they lack a meaningful representation of the metadata associated with their data. Implementing such functionality is left to individual users and domain-specific packages. As a result, programmers frequently encounter pitfalls in the form of questions like “is the time axis of my array in the first or third index position?” or “does my array of timestamps still align with my data after resampling?”.
The core motivation for developing xarray was to provide labeled data tools for N-dimensional arrays that render such questions moot. Every operation in xarray both relies on and maintains the consistency of labels.
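As a rough illustration of how labels remove the axis-position question, the sketch below compares a positional NumPy reduction with a named xarray reduction, and then adds two arrays whose labels only partly overlap. The data and variable names are invented for the example, and the arithmetic relies on xarray's default behavior of keeping only the labels common to both operands.

```python
import numpy as np
import xarray as xr

# Hypothetical data: 3 time steps on a 2 x 4 latitude/longitude grid.
values = np.arange(24, dtype=float).reshape(3, 2, 4)

# With a bare NumPy array, the caller must remember that time is axis 0.
numpy_time_mean = values.mean(axis=0)

# With a DataArray, the dimension is referred to by name instead.
temperature = xr.DataArray(values, dims=("time", "lat", "lon"))
xarray_time_mean = temperature.mean(dim="time")

# Reordering the dimensions does not change what a named reduction means.
reordered = temperature.transpose("lon", "time", "lat")
reordered_time_mean = reordered.mean(dim="time")

# Arithmetic aligns operands on their coordinate labels, not on position.
first = xr.DataArray([1, 2, 3], dims="x", coords={"x": [0, 1, 2]})
second = xr.DataArray([10, 20, 30], dims="x", coords={"x": [1, 2, 3]})
total = first + second  # only the shared labels x=1 and x=2 remain
```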
The network Common Data Form is a collection of self-describing, machine-independent binary data formats and software tools. These data formats and tools facilitate the creation, access, and sharing of scientific data stored in N-dimensional arrays, along with metadata describing the contents of each array. NetCDF has become very popular in the geoscience community, and there are existing libraries for reading and writing netCDF in many programming languages, including C, Fortran, Python, Java, Matlab, and Julia.
The principal data structure in the netCDF data model is the dataset. Each netCDF dataset contains dimensions, variables, and attributes, each of which is identified by a hierarchy of unique names. The dataset and variable objects may contain attributes that describe the contents, units, history, or other metadata of the object. Standardized conventions, such as the Climate and Forecast (CF) Conventions, allow for the association of coordinate variables with dimensions.
NetCDF forms the basis of the xarray data model and provides a natural and portable serialization format. Building on netCDF, xarray features two main data structures: the DataArray and the Dataset. The API for these data structures is summarized in the following sections and in Figure 2.
The DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:
xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many NumPy functions. Coordinates are ancillary variables used to enable fast label based indexing and alignment, building on the functionality of the pandas Index. DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property, which can be used to further describe data (e.g. by providing units). Names and attributes are strictly for users and user-written code; in general xarray makes no attempt to interpret them, and propagates them only in unambiguous cases. In contrast, xarray does interpret and persist coordinates in operations that transform xarray objects.
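The following sketch shows how these pieces fit together when a DataArray is built by hand. The temperature values, the latitude labels, and the variable names are hypothetical and chosen only for illustration.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical air temperatures for three days at two latitudes.
air = xr.DataArray(
    np.array([[280.0, 281.5], [282.0, 283.5], [284.0, 285.5]]),
    dims=("time", "lat"),
    coords={
        "time": pd.date_range("2016-05-01", periods=3, freq="D"),
        "lat": [10.0, 20.0],
    },
    name="air_temperature",
    attrs={"units": "K"},
)

# Label-based indexing uses coordinate values rather than integer positions.
north = air.sel(lat=20.0)
later = air.sel(time=slice("2016-05-02", "2016-05-03"))

# Dimension names replace the axis argument of NumPy reductions.
mean_over_time = air.mean(dim="time")
```

Here `north` keeps the `time` coordinate of the parent array, which is an instance of coordinates being persisted through an operation.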
The Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArrays) with aligned dimensions. It is designed as an in-memory representation of a netCDF dataset. In addition to the dict-like interface of the dataset itself, which can be used to access any DataArray in a Dataset, datasets have four key properties:
DataArray objects inside a Dataset may have any number of dimensions but are presumed to share a common coordinate system. Coordinates can also have any number of dimensions but denote constant/independent quantities, unlike the varying/dependent quantities that belong in data. Figure 3 illustrates these concepts for an example Dataset containing meteorological data.
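A small Dataset along these lines might be assembled as follows. This is a sketch with made-up station data, not the dataset shown in Figure 3; the variable names and values are assumptions for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical meteorological data for two stations over three days.
weather = xr.Dataset(
    data_vars={
        "temperature": (
            ("time", "station"),
            np.array([[281.0, 275.5], [282.5, 276.0], [283.0, 277.5]]),
        ),
        "precipitation": (
            ("time", "station"),
            np.array([[0.0, 1.2], [3.4, 0.0], [0.1, 0.6]]),
        ),
    },
    coords={
        "time": pd.date_range("2016-05-01", periods=3, freq="D"),
        "station": ["a", "b"],
        # A coordinate that is constant in time: one elevation per station.
        "elevation": ("station", [12.0, 1450.0]),
    },
    attrs={"title": "hypothetical station data"},
)

# Dict-like access returns a DataArray that carries the shared coordinates.
temperature = weather["temperature"]

# Label-based selection applies to every variable in the Dataset at once.
station_a = weather.sel(station="a")
```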
xarray includes a powerful and growing feature set. The following list highlights some of the key features available in xarray. The xarray documentation includes a complete description of available features and their usage.
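As one example of these features, the group-by tools mentioned earlier can be sketched as a split-apply-combine computation of monthly means and of anomalies relative to them. The four-point time series is invented for the example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Hypothetical time series with two samples in each of two months.
dates = pd.to_datetime(["2016-01-01", "2016-01-02", "2016-02-01", "2016-02-02"])
series = xr.DataArray(
    [1.0, 2.0, 3.0, 4.0],
    dims="time",
    coords={"time": dates},
)

# Split by calendar month, apply a mean, and combine into one array
# indexed by a new "month" coordinate.
monthly_mean = series.groupby("time.month").mean(dim="time")

# Group-wise arithmetic subtracts each month's mean from its own samples.
anomalies = series.groupby("time.month") - monthly_mean
```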
xarray is provided with a large test suite comprising over 1,500 unit tests. These tests cover the core xarray functionality as well as features facilitated by optional dependencies. The unit tests are executed automatically on the TravisCI (Linux) and Appveyor (Windows) continuous integration systems. A selection of sample data is also distributed with the source code, allowing users to reproduce any examples in the xarray documentation.
Linux, Windows and Mac OS X.
Python, versions 2.7, 3.4 and later.
xarray is implemented in pure Python and relies on compiled dependencies for speed.
Persistent identifier: https://doi.org/10.5281/zenodo.264282
License: Apache, v2.0
Version published: 0.9.1
Date published: January 30, 2017
xarray was written in a modular, object-oriented way, to build upon and extend the core scientific Python libraries in a domain-agnostic fashion. The xarray documentation is complete with a wide range of examples and a number of tutorials that use real-world datasets that are available in the xarray repository. We have intentionally avoided including domain-specific functionality in the library, leaving that to third-party libraries. xarray has been widely adopted in the geoscience community [e.g. 6, 10, 9], but has also been used in physics [e.g. 3], time series analytics, and finance. The core xarray data structures (the DataArray and the Dataset) are extensible through subclassing or the preferred approach of composition. We also provide an extensible high-level accessor interface to allow users to implement domain-specific methods on xarray data objects.
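A minimal sketch of the accessor interface is given below, using the `register_dataarray_accessor` decorator from xarray. The accessor name `thermo`, the class `ThermoAccessor`, and its `to_celsius` method are hypothetical and exist only for this illustration; the accessor wraps the DataArray by composition rather than subclassing it.

```python
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("thermo")
class ThermoAccessor:
    """Hypothetical domain-specific methods attached to every DataArray."""

    def __init__(self, data_array):
        # The accessor holds a reference to the wrapped object (composition).
        self._obj = data_array

    def to_celsius(self):
        """Return a copy converted from kelvin to degrees Celsius."""
        if self._obj.attrs.get("units") != "K":
            raise ValueError("expected data with attrs['units'] == 'K'")
        celsius = self._obj - 273.15
        # Assign a fresh attrs dict so the original array is left untouched.
        celsius.attrs = dict(self._obj.attrs, units="degC")
        return celsius


# Hypothetical usage: the methods appear under the registered name.
air = xr.DataArray([273.15, 283.15], dims="x", attrs={"units": "K"})
air_celsius = air.thermo.to_celsius()
```

Keeping such methods in a namespaced accessor leaves the core data structures free of domain-specific code, in line with the design goal stated above.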
xarray is developed and supported by a team of volunteers. The primary avenue for user support is StackOverflow, with the “python-xarray” tag. Additionally, we use GitHub for a bug tracker (https://github.com/pydata/xarray/issues) and maintain the “xarray” mailing list on Google Groups (https://groups.google.com/forum/#!forum/xarray).
Initial development of xarray was supported by The Climate Corporation. We thank Matthew Rocklin and Jim Crist for their assistance integrating xarray with Dask, and Todd Small, Francisco Alvarez and Fabien Maussion for their feedback on early drafts of this manuscript.
The authors have no competing interests to declare.
Appveyor. https://ci.appveyor.com. Accessed: 2015-06-12.
xarray documentation. http://xarray.pydata.org. Accessed: 2017-01-30.
pycalphad: Computational thermodynamics. http://pycalphad.readthedocs.io. Accessed: 2015-06-12.
Stack Overflow. http://stackoverflow.com/questions/tagged/python-xarray. Accessed: 2015-06-12.
Travis CI – test and deploy your code with confidence. https://travis-ci.org. Accessed: 2015-06-12.
xgcm: General circulation model postprocessing with xarray. http://xgcm.readthedocs.io. Accessed: 2015-06-12.
Brown, S A, Folk, M, Goucher, G, Rew, R and Dubois, P F (1993). Software for Portable Scientific Data Management. Computers in Physics 7(3): 304, DOI: https://doi.org/10.1063/1.4823180
Dee, D P, Uppala, S M, Simmons, A J, Berrisford, P, Poli, P, Kobayashi, S, Andrae, U, Balmaseda, M A, Balsamo, G, Bauer, P, Bechtold, P, Beljaars, A C M, van de Berg, L, Bidlot, J, Bormann, N, Delsol, C, Dragani, R, Fuentes, M, Geer, A J, Haimberger, L, Healy, S B, Hersbach, H, Hólm, E V, Isaksen, L, Kållberg, P, Köhler, M, Matricardi, M, McNally, A P, Monge-Sanz, B M, Morcrette, J J, Park, B K, Peubey, C, de Rosnay, P, Tavolato, C, Thépaut, J N and Vitart, F (2011). The ERA-Interim reanalysis: configuration and performance of the data assimilation system. Quarterly Journal of the Royal Meteorological Society 137(656): 553–597, DOI: https://doi.org/10.1002/qj.828
Hunter, J D (2007). Matplotlib: A 2D Graphics Environment. Comput. Sci. Eng 9(3): 90–95, DOI: https://doi.org/10.1109/MCSE.2007.55
Rew, R and Davis, G (1990). NetCDF: an interface for scientific data access. IEEE Comput. Grap. Appl 10(4): 76–82, DOI: https://doi.org/10.1109/38.56302
van der Walt, S, Colbert, S C and Varoquaux, G (2011). The NumPy Array: A Structure for Efficient Numerical Computation. Comput. Sci. Eng 13(2): 22–30, DOI: https://doi.org/10.1109/MCSE.2011.37
van Rossum, G, Lehtosalo, J and Langa, L (2016). PEP 484 – type hints. https://www.python.org/dev/peps/pep-0484/. Accessed: 01–24.
Wickham, H (2011). The Split-Apply-Combine Strategy for Data Analysis. Journal of Statistical Software 40(1), DOI: https://doi.org/10.18637/jss.v040.i01