Internal Design#
This page gives an overview of the internal design of xarray.
In total, the Xarray project defines four key data structures. In order of increasing complexity, they are:
Variable
xarray.DataArray
xarray.Dataset
DataTree
The user guide lists only xarray.DataArray and xarray.Dataset, but Variable is the fundamental object internally, and DataTree is a natural generalisation of xarray.Dataset.
Note
Our Development roadmap includes plans both to document Variable
as fully public API,
and to merge the xarray-datatree package into xarray’s main repository.
Internally, private lazy indexing classes are used to avoid loading more data than necessary,
and flexible index classes (derived from Index) provide performant label-based lookups.
Data Structures#
The Data Structures page in the user guide explains the basics and concentrates on user-facing behavior, whereas this section explains how xarray’s data structure classes actually work internally.
Variable Objects#
The core internal data structure in xarray is the Variable, which is used as the basic building block behind xarray’s Dataset and DataArray types. A Variable consists of:
dims: A tuple of dimension names.
data: The N-dimensional array (typically a NumPy or Dask array) storing the Variable’s data. It must have the same number of dimensions as the length of dims.
attrs: A dictionary of metadata associated with this array. By convention, xarray’s built-in operations never use this metadata.
encoding: Another dictionary used to store information about how this variable’s data is represented on disk. See Reading encoded data for more details.
Variable
has an interface similar to NumPy arrays, but extended to make use
of named dimensions. For example, it uses dim
in preference to an axis
argument for methods like mean
, and supports Broadcasting by dimension name.
However, unlike Dataset
and DataArray
, the basic Variable
does not
include coordinate labels along each axis.
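The pieces described above can be sketched in a few lines. This is a minimal illustration, assuming NumPy-backed data; the variable names, attribute values and encoding entry are made up for the example.

```python
import numpy as np
import xarray as xr

# A Variable bundles dims, data, attrs and encoding. It carries no
# coordinate labels along its axes.
temperature = xr.Variable(
    dims=("time", "x"),
    data=np.arange(6.0).reshape(2, 3),
    attrs={"units": "K"},           # illustrative metadata
    encoding={"dtype": "float32"},  # illustrative on-disk hint
)

# Reductions take a dimension name rather than an axis number.
time_mean = temperature.mean(dim="time")
print(time_mean.dims)  # ('x',)

# Arithmetic broadcasts by dimension name, not by axis position.
offset = xr.Variable(dims=("y",), data=np.array([10.0, 20.0]))
combined = temperature + offset
print(combined.shape)  # (2, 3, 2)
```

Because `offset` only has a `y` dimension, the sum gains a new `y` axis instead of failing on mismatched shapes, as positional NumPy broadcasting would here.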
Variable
is public API, but because of its incomplete support for labeled
data, it is mostly intended for advanced uses, such as in xarray itself, for
writing new backends, or when creating custom indexes.
You can access the variable objects that correspond to xarray objects via the (read-only)
Dataset.variables
and
DataArray.variable
attributes.
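As a small illustration of these two attributes, the sketch below builds a throwaway DataArray (the name "foo" and the coordinate values are arbitrary) and inspects the Variable objects behind it.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="foo",
)
ds = da.to_dataset()

data_var = da.variable  # the Variable holding the array values
all_vars = ds.variables  # mapping of name -> Variable

print(isinstance(data_var, xr.Variable))  # True
print(sorted(all_vars))                   # ['foo', 'x']

# The mapping is read-only: item assignment is not supported.
try:
    all_vars["bar"] = data_var
except TypeError:
    print("Dataset.variables does not support item assignment")
```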
DataArray Objects#
The simplest data structure used by most users is DataArray
.
A DataArray
is a composite object consisting of multiple
Variable
objects which store related data.
A single Variable is referred to as the “data variable”, and is stored under the variable attribute.
A DataArray
inherits all of the properties of this data variable, i.e. dims
, data
, attrs
and encoding
,
all of which are implemented by forwarding on to the underlying Variable
object.
In addition, a DataArray holds further Variable objects in a dict under the private _coords attribute,
each of which is referred to as a “Coordinate Variable”. These coordinate variable objects are only allowed to have dims that are a subset of the data variable’s dims,
and each dim has a specific length. This means that the full size of the dataarray can be represented by a dictionary mapping dimension names to integer sizes.
The underlying data variable has exactly this size, and each attached coordinate variable has sizes that are a subset of the sizes of the data variable.
Another way of saying this is that all coordinate variables must be “alignable” with the data variable.
When a coordinate is accessed by the user (e.g. via the dict-like __getitem__
syntax),
then a new DataArray
is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned.
This is why most users never see the Variable
class underlying each coordinate variable - it is always promoted to a DataArray
before returning.
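The composition described above can be observed from the public API. The following is a minimal sketch with made-up coordinate names (`x_label` is a hypothetical non-index coordinate), not a tour of the private attributes.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.arange(6).reshape(2, 3),
    dims=("x", "y"),
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "x_label": ("x", ["low", "high"]),  # extra coordinate along x
    },
    attrs={"units": "m"},
)

# Properties such as dims and attrs are forwarded to the data variable.
print(da.dims == da.variable.dims)    # True
print(da.attrs == da.variable.attrs)  # True

# The full size is a mapping from dimension name to length.
print(dict(da.sizes))  # {'x': 2, 'y': 3}

# Accessing a coordinate yields a DataArray, with the coordinates that
# share its dimensions re-attached.
x_label = da["x_label"]
print(type(x_label).__name__)  # DataArray
print(sorted(x_label.coords))  # ['x', 'x_label']
```

Note that the `y` coordinate is not attached to `x_label`, since `x_label` has no `y` dimension.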
Lookups are performed by special Index
objects, which are stored in a dict under the private _indexes
attribute.
Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space
(typically via the sel()
method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like data
.
Indexing in array index space (typically performed via the isel()
method) does not require consulting an Index
object.
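To make the distinction concrete, the sketch below selects the same element once by label and once by position. It assumes the default pandas-backed index that xarray creates for a one-dimensional coordinate named after its dimension.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([1.0, 2.0, 3.0]),
    dims="x",
    coords={"x": [10, 20, 30]},
)

# Label-based lookup: the index attached to "x" translates the label 20
# into an integer position.
by_label = da.sel(x=20)

# Position-based lookup: the integer is used directly on the array.
by_position = da.isel(x=1)

print(float(by_label), float(by_position))  # 2.0 2.0

# The same translation, done by hand on the underlying pandas index.
position = da.indexes["x"].get_loc(20)
print(position)  # 1
```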
Finally, a DataArray defines a name attribute, which refers to its data variable but is stored on the wrapping DataArray class.
The name
attribute is primarily used when one or more DataArray
objects are promoted into a Dataset
(e.g. via to_dataset()
).
Note that the underlying Variable
objects are all unnamed, so they can always be referred to uniquely via a
dict-like mapping.
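A short sketch of how the name is used during promotion to a Dataset; the names "temperature" and "pressure" are arbitrary.

```python
import numpy as np
import xarray as xr

named = xr.DataArray(np.arange(3), dims="x", name="temperature")

# The name stored on the DataArray becomes the key of the data variable.
ds = named.to_dataset()
print(list(ds.data_vars))  # ['temperature']

# An unnamed DataArray needs a name supplied at promotion time.
unnamed = xr.DataArray(np.arange(3), dims="x")
ds_explicit = unnamed.to_dataset(name="pressure")
print(list(ds_explicit.data_vars))  # ['pressure']
```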
Dataset Objects#
The Dataset
class is a generalization of the DataArray
class that can hold multiple data variables.
Internally, all data variables and coordinate variables are stored under a single variables dict, and coordinates are
specified by storing their names in a private _coord_names set.
The dataset’s dims are the set of all dims present across any variable, but (as with DataArrays) coordinate
variables cannot have a dimension that is not present on any data variable.
When a data variable or coordinate variable is accessed, a new DataArray
is again constructed from all compatible
coordinates before returning.
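The shared storage and the re-wrapping on access can be seen through the public API. This is a minimal sketch with invented variable names and values.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "x"), np.zeros((2, 3))),
        "elevation": (("x",), np.array([5.0, 6.0, 7.0])),
    },
    coords={"time": [0, 1], "x": [10, 20, 30]},
)

# Data variables and coordinate variables live in one mapping;
# the coordinate names are tracked separately.
print(sorted(ds.variables))  # ['elevation', 'temperature', 'time', 'x']
print(sorted(ds.coords))     # ['time', 'x']

# Accessing a variable builds a new DataArray carrying only the
# coordinates whose dimensions it shares.
elevation = ds["elevation"]
print(type(elevation).__name__)  # DataArray
print(sorted(elevation.coords))  # ['x']
```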
Note
Selecting a variable from a DataArray or Dataset internally wraps the Variable object back up into a DataArray/Dataset,
which is the primary reason we recommend against subclassing Xarray objects. The main problem is that we currently cannot easily guarantee that, for example, selecting
a coordinate variable from your SubclassedDataArray would return an instance of SubclassedDataArray instead
of just an xarray.DataArray. See GH issue for more details.
Lazy Indexing Classes#
Lazy Loading#
If we open a Variable
object from disk using open_dataset()
we can see that the actual values of
the array wrapped by the data variable are not displayed.
In [1]: da = xr.tutorial.open_dataset("air_temperature")["air"]
In [2]: var = da.variable
In [3]: var
Out[3]:
<xarray.Variable (time: 2920, lat: 25, lon: 53)> Size: 31MB
[3869000 values with dtype=float64]
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
We can see the size, and the dtype of the underlying array, but not the actual values. This is because the values have not yet been loaded.
If we look at the private attribute _data
containing the underlying array object, we see
something interesting:
In [4]: var._data
Out[4]: MemoryCachedArray(array=CopyOnWriteArray(array=LazilyIndexedArray(array=_ElementwiseFunctionArray(LazilyIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f0796dd6c00>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _scale_offset_decoding at 0x7f07bbeea700>, scale_factor=np.float64(0.01), add_offset=None, dtype=<class 'numpy.float64'>), dtype=dtype('float64')), key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None))))))
You’re looking at one of xarray’s internal Lazy Indexing Classes. These powerful classes are hidden from the user, but provide important functionality.
Accessing the public data
property loads the underlying array into memory.
In [5]: var.data
Out[5]:
array([[[241.2 , 242.5 , 243.5 , ..., 232.8 , 235.5 , 238.6 ],
[243.8 , 244.5 , 244.7 , ..., 232.8 , 235.3 , 239.3 ],
[250. , 249.8 , 248.89, ..., 233.2 , 236.39, 241.7 ],
...,
[296.6 , 296.2 , 296.4 , ..., 295.4 , 295.1 , 294.7 ],
[295.9 , 296.2 , 296.79, ..., 295.9 , 295.9 , 295.2 ],
[296.29, 296.79, 297.1 , ..., 296.9 , 296.79, 296.6 ]],
[[242.1 , 242.7 , 243.1 , ..., 232. , 233.6 , 235.8 ],
[243.6 , 244.1 , 244.2 , ..., 231. , 232.5 , 235.7 ],
[253.2 , 252.89, 252.1 , ..., 230.8 , 233.39, 238.5 ],
...,
[296.4 , 295.9 , 296.2 , ..., 295.4 , 295.1 , 294.79],
[296.2 , 296.7 , 296.79, ..., 295.6 , 295.5 , 295.1 ],
[296.29, 297.2 , 297.4 , ..., 296.4 , 296.4 , 296.6 ]],
[[242.3 , 242.2 , 242.3 , ..., 234.3 , 236.1 , 238.7 ],
[244.6 , 244.39, 244. , ..., 230.3 , 232. , 235.7 ],
[256.2 , 255.5 , 254.2 , ..., 231.2 , 233.2 , 238.2 ],
...,
[295.6 , 295.4 , 295.4 , ..., 296.29, 295.29, 295. ],
[296.2 , 296.5 , 296.29, ..., 296.4 , 296. , 295.6 ],
[296.4 , 296.29, 296.4 , ..., 297. , 297. , 296.79]],
...,
[[243.49, 242.99, 242.09, ..., 244.19, 244.49, 244.89],
[249.09, 248.99, 248.59, ..., 240.59, 241.29, 242.69],
[262.69, 262.19, 261.69, ..., 239.39, 241.69, 245.19],
...,
[294.79, 295.29, 297.49, ..., 295.49, 295.39, 294.69],
[296.79, 297.89, 298.29, ..., 295.49, 295.49, 294.79],
[298.19, 299.19, 298.79, ..., 296.09, 295.79, 295.79]],
[[245.79, 244.79, 243.49, ..., 243.29, 243.99, 244.79],
[249.89, 249.29, 248.49, ..., 241.29, 242.49, 244.29],
[262.39, 261.79, 261.29, ..., 240.49, 243.09, 246.89],
...,
[293.69, 293.89, 295.39, ..., 295.09, 294.69, 294.29],
[296.29, 297.19, 297.59, ..., 295.29, 295.09, 294.39],
[297.79, 298.39, 298.49, ..., 295.69, 295.49, 295.19]],
[[245.09, 244.29, 243.29, ..., 241.69, 241.49, 241.79],
[249.89, 249.29, 248.39, ..., 239.59, 240.29, 241.69],
[262.99, 262.19, 261.39, ..., 239.89, 242.59, 246.29],
...,
[293.79, 293.69, 295.09, ..., 295.29, 295.09, 294.69],
[296.09, 296.89, 297.19, ..., 295.69, 295.69, 295.19],
[297.69, 298.09, 298.09, ..., 296.49, 296.19, 295.69]]])
This array is now cached, which we can see by accessing the private attribute again:
In [6]: var._data
Out[6]:
MemoryCachedArray(array=NumpyIndexingAdapter(array=array([[[241.2 , 242.5 , 243.5 , ..., 232.8 , 235.5 , 238.6 ],
[243.8 , 244.5 , 244.7 , ..., 232.8 , 235.3 , 239.3 ],
[250. , 249.8 , 248.89, ..., 233.2 , 236.39, 241.7 ],
...,
[296.6 , 296.2 , 296.4 , ..., 295.4 , 295.1 , 294.7 ],
[295.9 , 296.2 , 296.79, ..., 295.9 , 295.9 , 295.2 ],
[296.29, 296.79, 297.1 , ..., 296.9 , 296.79, 296.6 ]],
[[242.1 , 242.7 , 243.1 , ..., 232. , 233.6 , 235.8 ],
[243.6 , 244.1 , 244.2 , ..., 231. , 232.5 , 235.7 ],
[253.2 , 252.89, 252.1 , ..., 230.8 , 233.39, 238.5 ],
...,
[296.4 , 295.9 , 296.2 , ..., 295.4 , 295.1 , 294.79],
[296.2 , 296.7 , 296.79, ..., 295.6 , 295.5 , 295.1 ],
[296.29, 297.2 , 297.4 , ..., 296.4 , 296.4 , 296.6 ]],
[[242.3 , 242.2 , 242.3 , ..., 234.3 , 236.1 , 238.7 ],
[244.6 , 244.39, 244. , ..., 230.3 , 232. , 235.7 ],
[256.2 , 255.5 , 254.2 , ..., 231.2 , 233.2 , 238.2 ],
...,
[295.6 , 295.4 , 295.4 , ..., 296.29, 295.29, 295. ],
[296.2 , 296.5 , 296.29, ..., 296.4 , 296. , 295.6 ],
[296.4 , 296.29, 296.4 , ..., 297. , 297. , 296.79]],
...,
[[243.49, 242.99, 242.09, ..., 244.19, 244.49, 244.89],
[249.09, 248.99, 248.59, ..., 240.59, 241.29, 242.69],
[262.69, 262.19, 261.69, ..., 239.39, 241.69, 245.19],
...,
[294.79, 295.29, 297.49, ..., 295.49, 295.39, 294.69],
[296.79, 297.89, 298.29, ..., 295.49, 295.49, 294.79],
[298.19, 299.19, 298.79, ..., 296.09, 295.79, 295.79]],
[[245.79, 244.79, 243.49, ..., 243.29, 243.99, 244.79],
[249.89, 249.29, 248.49, ..., 241.29, 242.49, 244.29],
[262.39, 261.79, 261.29, ..., 240.49, 243.09, 246.89],
...,
[293.69, 293.89, 295.39, ..., 295.09, 294.69, 294.29],
[296.29, 297.19, 297.59, ..., 295.29, 295.09, 294.39],
[297.79, 298.39, 298.49, ..., 295.69, 295.49, 295.19]],
[[245.09, 244.29, 243.29, ..., 241.69, 241.49, 241.79],
[249.89, 249.29, 248.39, ..., 239.59, 240.29, 241.69],
[262.99, 262.19, 261.39, ..., 239.89, 242.59, 246.29],
...,
[293.79, 293.69, 295.09, ..., 295.29, 295.09, 294.69],
[296.09, 296.89, 297.19, ..., 295.69, 295.69, 295.19],
[297.69, 298.09, 298.09, ..., 296.49, 296.19, 295.69]]])))
Lazy Indexing#
The purpose of these lazy indexing classes is to prevent more data being loaded into memory than is necessary for the subsequent analysis, by deferring loading data until after indexing is performed.
Let’s open the data from disk again.
In [7]: da = xr.tutorial.open_dataset("air_temperature")["air"]
In [8]: var = da.variable
Now, notice how even after subsetting, the data still does not get loaded:
In [9]: var.isel(time=0)
Out[9]:
<xarray.Variable (lat: 25, lon: 53)> Size: 11kB
[1325 values with dtype=float64]
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]
The shape has changed, but the values are still not shown.
Looking at the private attribute again shows how this indexing information was propagated via the hidden lazy indexing classes:
In [10]: var.isel(time=0)._data
Out[10]: MemoryCachedArray(array=CopyOnWriteArray(array=LazilyIndexedArray(array=_ElementwiseFunctionArray(LazilyIndexedArray(array=<xarray.backends.netCDF4_.NetCDF4ArrayWrapper object at 0x7f0796bc8f40>, key=BasicIndexer((slice(None, None, None), slice(None, None, None), slice(None, None, None)))), func=functools.partial(<function _scale_offset_decoding at 0x7f07bbeea700>, scale_factor=np.float64(0.01), add_offset=None, dtype=<class 'numpy.float64'>), dtype=dtype('float64')), key=BasicIndexer((0, slice(None, None, None), slice(None, None, None))))))
Note
Currently only certain indexing operations are lazy, not all array operations. For discussion of making all array operations lazy see GH issue #5081.
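The idea of deferring the read until after indexing can be sketched with a toy class. This is not xarray's implementation: `CountingStore` and `ToyLazyArray` are hypothetical names invented for this example, and the toy only handles tuples of integers and positive-step slices.

```python
import numpy as np


class CountingStore:
    """Hypothetical stand-in for an array on disk; counts values read."""

    def __init__(self, values):
        self._values = np.asarray(values)
        self.shape = self._values.shape
        self.values_read = 0

    def read(self, key):
        result = self._values[key]
        self.values_read += np.size(result)
        return result


class ToyLazyArray:
    """Toy lazy wrapper: indexing records a key, np.asarray() reads."""

    def __init__(self, store, key=None):
        self.store = store
        # One entry per store dimension; full slices select everything.
        self.key = key if key is not None else (slice(None),) * len(store.shape)

    def __getitem__(self, new_key):
        # new_key: one int or positive-step slice per remaining dimension.
        remaining = iter(new_key)
        combined = []
        for size, old in zip(self.store.shape, self.key):
            if isinstance(old, int):
                combined.append(old)  # dimension was already dropped
                continue
            # Compose the old and new selection without touching the data.
            picked = range(size)[old][next(remaining)]
            if isinstance(picked, range):
                picked = slice(picked.start, picked.stop, picked.step)
            combined.append(picked)
        return ToyLazyArray(self.store, tuple(combined))

    def __array__(self, dtype=None, copy=None):
        # Only here is data read, using the single combined key.
        return np.asarray(self.store.read(self.key), dtype=dtype)


store = CountingStore(np.arange(24).reshape(4, 3, 2))
lazy = ToyLazyArray(store)

subset = lazy[1, :, :][0:2, 1]  # two indexing steps, nothing read yet
print(store.values_read)        # 0
print(np.asarray(subset))       # [7 9]
print(store.values_read)        # 2
```

Only the two requested values are read from the store, even though the selection was built up in two steps over a 24-element array.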
Lazy Dask Arrays#
Note that xarray’s implementation of Lazy Indexing classes is completely separate from how dask.array.Array
objects evaluate lazily. Dask-backed xarray objects delay almost all operations until compute()
is called (either explicitly or implicitly via plot()
for example). The exceptions to this
laziness are operations whose output shape is data-dependent, such as when calling where()
.