
Quick overview

Here are some quick examples of what you can do with xarray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.

To begin, import numpy, pandas and xarray using their customary abbreviations:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xarray as xr

Create a DataArray

You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:

In [4]: data = xr.DataArray(np.random.randn(2, 3), dims=("x", "y"), coords={"x": [10, 20]})

In [5]: data
Out[5]: 
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y

In this case, we have generated a 2D array, assigned the names x and y to its two dimensions, and associated the coordinate labels ‘10’ and ‘20’ with the two locations along the x dimension. If you supply a pandas Series or DataFrame, metadata is copied directly:

In [6]: xr.DataArray(pd.Series(range(3), index=list("abc"), name="foo"))
Out[6]: 
<xarray.DataArray 'foo' (dim_0: 3)> Size: 24B
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 24B 'a' 'b' 'c'
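As a minimal sketch of what "copied directly" means for a DataFrame: in the example below, the index and column names (x and component, both chosen purely for illustration) are expected to become the dimension names, and the index and column values become the coordinate labels.

```python
import pandas as pd
import xarray as xr

# Hypothetical frame: the names "x" and "component" are illustrative only.
frame = pd.DataFrame(
    {"u": [1.0, 2.0], "v": [3.0, 4.0]},
    index=pd.Index([10, 20], name="x"),
)
frame.columns.name = "component"

# Index/column names become dimension names; their values become coordinates.
from_frame = xr.DataArray(frame)

print(from_frame.dims)
print(from_frame.sel(x=20, component="v").item())
```

An unnamed index, as in the Series example above, falls back to a generated dimension name such as dim_0.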

Here are the key properties for a DataArray:

# like in pandas, values is a numpy array that you can modify in-place
In [7]: data.values
Out[7]: 
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])

In [8]: data.dims
Out[8]: ('x', 'y')

In [9]: data.coords
Out[9]: 
Coordinates:
  * x        (x) int64 16B 10 20

# you can use this dictionary to store arbitrary metadata
In [10]: data.attrs
Out[10]: {}
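The following sketch exercises the same properties on a small array of zeros (the array arr and the name "velocity" are made up for illustration). It assumes a NumPy-backed DataArray, for which writing into values in place is visible through the DataArray itself.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.zeros((2, 3)), dims=("x", "y"), coords={"x": [10, 20]})

# For a NumPy-backed DataArray, .values exposes the underlying array,
# so an in-place write shows up when we index the DataArray afterwards.
arr.values[0, 0] = 1.0

# attrs starts out as an empty dict; name is None until we set it.
arr.attrs["units"] = "metres/sec"
arr.name = "velocity"  # hypothetical name

print(arr.sel(x=10).isel(y=0).item())
print(arr.dims, arr.name, arr.attrs)
```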

Indexing

Xarray supports four kinds of indexing. Since we have assigned coordinate labels to the x dimension we can use label-based indexing along that dimension just like pandas. The four examples below all yield the same result (the value at x=10) but at varying levels of convenience and intuitiveness.

# positional and by integer label, like numpy
In [11]: data[0, :]
Out[11]: 
<xarray.DataArray (y: 3)> Size: 24B
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
    x        int64 8B 10
Dimensions without coordinates: y

# loc or "location": positional and coordinate label, like pandas
In [12]: data.loc[10]
Out[12]: 
<xarray.DataArray (y: 3)> Size: 24B
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
    x        int64 8B 10
Dimensions without coordinates: y

# isel or "integer select":  by dimension name and integer label
In [13]: data.isel(x=0)
Out[13]: 
<xarray.DataArray (y: 3)> Size: 24B
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
    x        int64 8B 10
Dimensions without coordinates: y

# sel or "select": by dimension name and coordinate label
In [14]: data.sel(x=10)
Out[14]: 
<xarray.DataArray (y: 3)> Size: 24B
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
    x        int64 8B 10
Dimensions without coordinates: y

Unlike positional indexing, label-based indexing frees us from having to know how our array is organized. All we need to know is the dimension name and the label we wish to select: data.sel(x=10) works regardless of whether x is the first or second dimension of the array and regardless of whether 10 is the first or second element of x. We told xarray that x is the first dimension when we created data, and xarray keeps track of this so we don’t have to. For more, see Indexing and selecting data.
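To make the difference concrete, here is a small sketch (using deterministic values rather than the random data above) that reorders the dimensions and compares label-based and positional selection. The variable names are illustrative.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6).reshape(2, 3), dims=("x", "y"), coords={"x": [10, 20]}
)
flipped = data.transpose("y", "x")  # same data, dimensions reordered

# Label-based selection gives the same answer for either layout.
by_label = data.sel(x=10)
by_label_flipped = flipped.sel(x=10)
print(by_label.equals(by_label_flipped))

# Positional indexing depends on the layout: index 0 now runs along y.
by_position_flipped = flipped[0]
print(by_position_flipped.dims, by_position_flipped.values)
```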

Attributes

While you’re setting up your DataArray, it’s often a good idea to set metadata attributes. A useful choice is to set data.attrs['long_name'] and data.attrs['units'] since xarray will use these, if present, to automatically label your plots. These special names were chosen following the NetCDF Climate and Forecast (CF) Metadata Conventions. attrs is just a Python dictionary, so you can assign anything you wish.

In [15]: data.attrs["long_name"] = "random velocity"

In [16]: data.attrs["units"] = "metres/sec"

In [17]: data.attrs["description"] = "A random variable created as an example."

In [18]: data.attrs["random_attribute"] = 123

In [19]: data.attrs
Out[19]: 
{'long_name': 'random velocity',
 'units': 'metres/sec',
 'description': 'A random variable created as an example.',
 'random_attribute': 123}

# you can add metadata to coordinates too
In [20]: data.x.attrs["units"] = "x units"
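Attributes can also be supplied when the array is created, or added later without mutating the original. The sketch below uses the attrs argument of the constructor and the assign_attrs() method; the array velocity and its attribute values are made up for illustration.

```python
import numpy as np
import xarray as xr

velocity = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="velocity",  # hypothetical name
    attrs={"long_name": "random velocity", "units": "metres/sec"},
)

# assign_attrs returns a new object with the extra attributes merged in.
described = velocity.assign_attrs(description="An example variable.")

print(velocity.attrs)
print(described.attrs)
```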

Computation

Data arrays work very similarly to numpy ndarrays:

In [21]: data + 10
Out[21]: 
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[10.4691123 ,  9.71713666,  8.4909415 ],
       [ 8.86436763, 11.21211203,  9.82678535]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y

In [22]: np.sin(data)
Out[22]: 
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 0.45209466, -0.27910634, -0.99809483],
       [-0.90680094,  0.9363595 , -0.17234978]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123

# transpose
In [23]: data.T
Out[23]: 
<xarray.DataArray (y: 3, x: 2)> Size: 48B
array([[ 0.4691123 , -1.13563237],
       [-0.28286334,  1.21211203],
       [-1.5090585 , -0.17321465]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123

In [24]: data.sum()
Out[24]: 
<xarray.DataArray ()> Size: 8B
array(-1.41954454)

However, aggregation operations can use dimension names instead of axis numbers:

In [25]: data.mean(dim="x")
Out[25]: 
<xarray.DataArray (y: 3)> Size: 24B
array([-0.33326004,  0.46462434, -0.84113658])
Dimensions without coordinates: y
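For comparison, the sketch below computes the same kind of reduction with a NumPy axis number and with a dimension name, using deterministic values. The named version keeps working after a transpose, whereas the axis number would have to change.

```python
import numpy as np
import xarray as xr

values = np.arange(6.0).reshape(2, 3)
data = xr.DataArray(values, dims=("x", "y"), coords={"x": [10, 20]})

by_axis = values.mean(axis=0)            # NumPy: must know x is axis 0
by_name = data.mean(dim="x")             # xarray: refer to the dimension by name
after_transpose = data.T.mean(dim="x")   # still correct once x is axis 1

print(by_axis)
print(by_name.values)
print(after_transpose.values)
```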

Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:

In [26]: a = xr.DataArray(np.random.randn(3), [data.coords["y"]])

In [27]: b = xr.DataArray(np.random.randn(4), dims="z")

In [28]: a
Out[28]: 
<xarray.DataArray (y: 3)> Size: 24B
array([ 0.11920871, -1.04423597, -0.86184896])
Coordinates:
  * y        (y) int64 24B 0 1 2

In [29]: b
Out[29]: 
<xarray.DataArray (z: 4)> Size: 32B
array([-2.10456922, -0.49492927,  1.07180381,  0.72155516])
Dimensions without coordinates: z

In [30]: a + b
Out[30]: 
<xarray.DataArray (y: 3, z: 4)> Size: 96B
array([[-1.98536051, -0.37572056,  1.19101252,  0.84076387],
       [-3.14880519, -1.53916524,  0.02756784, -0.3226808 ],
       [-2.96641818, -1.35677824,  0.20995484, -0.1402938 ]])
Coordinates:
  * y        (y) int64 24B 0 1 2
Dimensions without coordinates: z
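As a rough illustration of what is being saved here, the sketch below repeats the a + b pattern with small fixed arrays and compares it to the plain NumPy equivalent, where a dummy axis has to be inserted by hand.

```python
import numpy as np
import xarray as xr

a = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims="y")
b = xr.DataArray(np.array([10.0, 20.0, 30.0, 40.0]), dims="z")

# xarray: dimensions are matched by name, so y and z are simply combined.
by_name = a + b

# NumPy: the same result requires inserting a dummy axis explicitly.
by_hand = a.values[:, np.newaxis] + b.values

print(by_name.dims, by_name.shape)
print(np.array_equal(by_name.values, by_hand))
```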

It also means that in most cases you do not need to worry about the order of dimensions:

In [31]: data - data.T
Out[31]: 
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[0., 0., 0.],
       [0., 0., 0.]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y

Operations also align based on index labels:

In [32]: data[:-1] - data[:1]
Out[32]: 
<xarray.DataArray (x: 1, y: 3)> Size: 24B
array([[0., 0., 0.]])
Coordinates:
  * x        (x) int64 8B 10
Dimensions without coordinates: y
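The sketch below shows the same alignment with two arrays whose x labels only partly overlap (the arrays left and right are made up for illustration). Arithmetic keeps only the shared labels, which matches the inner join seen above; xr.align() with join="outer" is one way to keep all labels instead, filling the gaps with NaN.

```python
import numpy as np
import xarray as xr

left = xr.DataArray([1.0, 2.0], dims="x", coords={"x": [10, 20]})
right = xr.DataArray([100.0, 200.0], dims="x", coords={"x": [20, 30]})

# Arithmetic aligns on labels first; only x=20 is present in both.
total = left + right
print(total.x.values, total.values)

# Explicit outer alignment keeps every label and pads with NaN.
left_outer, right_outer = xr.align(left, right, join="outer")
print(left_outer.x.values, left_outer.values)
```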

For more, see Computation.

GroupBy

Xarray supports grouped operations using a very similar API to pandas (see GroupBy: Group and Bin Data):

In [33]: labels = xr.DataArray(["E", "F", "E"], [data.coords["y"]], name="labels")

In [34]: labels
Out[34]: 
<xarray.DataArray 'labels' (y: 3)> Size: 12B
array(['E', 'F', 'E'], dtype='<U1')
Coordinates:
  * y        (y) int64 24B 0 1 2

In [35]: data.groupby(labels).mean("y")
Out[35]: 
<xarray.DataArray (x: 2, labels: 2)> Size: 32B
array([[-0.5199731 , -0.28286334],
       [-0.65442351,  1.21211203]])
Coordinates:
  * x        (x) int64 16B 10 20
  * labels   (labels) object 16B 'E' 'F'
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123

In [36]: data.groupby(labels).map(lambda x: x - x.min())
Out[36]: 
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[1.9781708 , 0.        , 0.        ],
       [0.37342613, 1.49497537, 1.33584385]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
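A variation on the example above, sketched with fixed numbers so the result can be checked by hand: here the group labels are attached to the array as a non-index coordinate along y, so groupby() can be given the coordinate name instead of a separate DataArray.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.array([[1.0, 10.0, 3.0], [5.0, 20.0, 7.0]]),
    dims=("x", "y"),
    # "labels" is an extra coordinate along y, used only for grouping.
    coords={"x": [10, 20], "labels": ("y", ["E", "F", "E"])},
)

# Columns 0 and 2 belong to group E, column 1 to group F.
means = data.groupby("labels").mean("y")

print(means.sel(labels="E").values)
print(means.sel(labels="F").values)
```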

Plotting

Visualizing your datasets is quick and convenient:

In [37]: data.plot()
Out[37]: <matplotlib.collections.QuadMesh at 0x7f0809cef5b0>
[Figure: plot of data produced by data.plot()]

Note the automatic labeling with names and units. Our effort in adding metadata attributes has paid off! Many aspects of these figures are customizable: see Plotting.

pandas

Xarray objects can be easily converted to and from pandas objects using the to_series(), to_dataframe() and to_xarray() methods:

In [38]: series = data.to_series()

In [39]: series
Out[39]: 
x   y
10  0    0.469112
    1   -0.282863
    2   -1.509059
20  0   -1.135632
    1    1.212112
    2   -0.173215
dtype: float64

# convert back
In [40]: series.to_xarray()
Out[40]: 
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) int64 24B 0 1 2
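The to_dataframe() route works along the same lines. The sketch below assumes an unnamed DataArray, in which case a column name has to be supplied through the name argument ("foo" is an arbitrary choice here), and then converts the resulting column back.

```python
import numpy as np
import xarray as xr

data = xr.DataArray(
    np.arange(6.0).reshape(2, 3), dims=("x", "y"), coords={"x": [10, 20]}
)

# An unnamed DataArray needs a column name for the DataFrame.
frame = data.to_dataframe(name="foo")
print(frame)

# Converting the column back restores the (x, y) layout.
round_trip = frame["foo"].to_xarray()
print(round_trip.dims)
```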

Datasets

xarray.Dataset is a dict-like container of aligned DataArray objects. You can think of it as a multi-dimensional generalization of the pandas.DataFrame:

In [41]: ds = xr.Dataset(dict(foo=data, bar=("x", [1, 2]), baz=np.pi))

In [42]: ds
Out[42]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    bar      (x) int64 16B 1 2
    baz      float64 8B 3.142

This creates a dataset with three DataArrays named foo, bar and baz. Use dictionary or dot indexing to pull out Dataset variables as DataArray objects, but note that assignment only works with dictionary indexing:

In [43]: ds["foo"]
Out[43]: 
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123

In [44]: ds.foo
Out[44]: 
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Attributes:
    long_name:         random velocity
    units:             metres/sec
    description:       A random variable created as an example.
    random_attribute:  123
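The assignment restriction can be sketched as follows. The variable names qux and quux are hypothetical; the example assumes that attempting dot assignment raises an AttributeError, which is how current xarray versions are understood to behave.

```python
import xarray as xr

ds = xr.Dataset({"bar": ("x", [1, 2])}, coords={"x": [10, 20]})

# Dictionary-style assignment adds a new variable.
ds["qux"] = ("x", [0.5, 1.5])

# Once it exists, both access styles return the same DataArray.
same = ds.qux.identical(ds["qux"])

# Dot-style assignment is rejected rather than silently creating a variable.
try:
    ds.quux = ("x", [3, 4])
    attribute_assignment_failed = False
except AttributeError:
    attribute_assignment_failed = True

print(same, attribute_assignment_failed)
```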

When creating ds, we specified that foo is identical to data created earlier, bar is one-dimensional with the single dimension x and associated values ‘1’ and ‘2’, and baz is a scalar not associated with any dimension in ds. Variables in a dataset can have different dtypes and even different dimensions, but all dimensions are assumed to refer to points in the same shared coordinate system, i.e. if two variables have dimension x, that dimension must be identical in both variables.

For example, when creating ds, xarray automatically aligns bar with the DataArray foo, i.e. they share the same coordinate system, so that ds.bar['x'] == ds.foo['x'] == ds['x']. Consequently, the following works without explicitly specifying the coordinate x when creating ds['bar']:

In [45]: ds.bar.sel(x=10)
Out[45]: 
<xarray.DataArray 'bar' ()> Size: 8B
array(1)
Coordinates:
    x        int64 8B 10

You can do almost everything you can do with DataArray objects with Dataset objects (including indexing and arithmetic) if you prefer to work with multiple variables at once.
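A brief sketch of what this looks like in practice, using a dataset shaped like ds above but with fixed values: selection and arithmetic are applied to every data variable at once.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    {
        "foo": (("x", "y"), np.arange(6.0).reshape(2, 3)),
        "bar": ("x", [1, 2]),
        "baz": np.pi,
    },
    coords={"x": [10, 20]},
)

# Selecting on x affects every variable that has an x dimension.
subset = ds.sel(x=10)
print(subset["foo"].dims, subset["bar"].item())

# Arithmetic is applied to each data variable.
doubled = ds * 2
print(doubled["bar"].values, doubled["baz"].item())
```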

Read & write netCDF files

NetCDF is the recommended file format for xarray objects. Users from the geosciences will recognize that the Dataset data model looks very similar to a netCDF file (which, in fact, inspired it).

You can directly read and write xarray objects to disk using to_netcdf(), open_dataset() and open_dataarray():

In [46]: ds.to_netcdf("example.nc")

In [47]: reopened = xr.open_dataset("example.nc")

In [48]: reopened
Out[48]: 
<xarray.Dataset> Size: 88B
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 16B 10 20
Dimensions without coordinates: y
Data variables:
    foo      (x, y) float64 48B ...
    bar      (x) int64 16B ...
    baz      float64 8B ...

Datasets are often distributed across multiple files (commonly one file per timestep). Xarray supports this use case with the open_mfdataset() and save_mfdataset() functions. For more, see Reading and writing files.