
N-D labeled arrays and datasets in Python¶
xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures.
Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.
Note
xray is now xarray! See the v0.7.0 release notes for more details. The preferred URL for these docs is now http://xarray.pydata.org.
Documentation¶
What’s New¶
v0.8.2 (18 August 2016)¶
This release includes a number of bug fixes and minor enhancements.
Breaking changes¶
broadcast() and concat() now auto-align inputs, using join='outer'. Previously, these functions raised ValueError for non-aligned inputs. By Guido Imperiale.
Enhancements¶
- New documentation on Transitioning from pandas.Panel to xarray. By Maximilian Roos.
- New Dataset and DataArray methods to_dict() and from_dict() allow easy conversion between dictionaries and xarray objects (GH432); see the sketch after this list and dictionary IO for more details. By Julia Signell.
- Added exclude and indexes optional parameters to align(), and an exclude optional parameter to broadcast(). By Guido Imperiale.
- Better error message when assigning variables without dimensions (GH971). By Stephan Hoyer.
- Better error message when reindex/align fails due to duplicate index values (GH956). By Stephan Hoyer.
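As a minimal sketch of the new dictionary round trip (the variable and coordinate names here are illustrative, not from the release notes):

import xarray as xr

ds = xr.Dataset({'t': ('x', [11.2, 12.5, 9.8])}, coords={'x': [10, 20, 30]})
d = ds.to_dict()               # nested dicts and lists; easy to serialize, e.g. as JSON
ds2 = xr.Dataset.from_dict(d)  # reconstructs an equivalent Dataset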
Bug fixes¶
- Ensure xarray works with h5netcdf v0.3.0 for arrays with dtype=str (GH953). By Stephan Hoyer.
- Dataset.__dir__() (i.e., the method Python calls to get autocomplete options) failed if one of the dataset's keys was not a string (GH852). By Maximilian Roos.
- Dataset constructor can now take arbitrary objects as values (GH647). By Maximilian Roos.
- Clarified copy argument for reindex() and align(), which now consistently always return new xarray objects (GH927).
- Fix open_mfdataset with engine='pynio' (GH936). By Stephan Hoyer.
- groupby_bins sorted bin labels as strings (GH952). By Stephan Hoyer.
- Fix bug introduced by v0.8.0 that broke assignment to datasets when both the left and right side have the same non-unique index values (GH956).
v0.8.1 (5 August 2016)¶
Bug fixes¶
- Fix bug in v0.8.0 that broke assignment to Datasets with non-unique indexes (GH943). By Stephan Hoyer.
v0.8.0 (2 August 2016)¶
This release includes four months of new features and bug fixes, including several breaking changes.
Breaking changes¶
- Dropped support for Python 2.6 (GH855).
- Indexing on multi-index now drops levels, which is consistent with pandas. It also changes the name of the dimension / coordinate when the multi-index is reduced to a single index (GH802).
- Contour plots no longer add a colorbar by default (GH866). Filled contour plots are unchanged.
- DataArray.values and .data now always return a NumPy array-like object, even for 0-dimensional arrays with object dtype (GH867). Previously, .values returned native Python objects in such cases. To convert the values of scalar arrays to Python objects, use the .item() method.
Enhancements¶
- Groupby operations now support grouping over multidimensional variables. A new method called groupby_bins() has also been added to allow users to specify bins for grouping (see the sketch after this list). The new features are described in Multidimensional Grouping and Working with Multidimensional Coordinates. By Ryan Abernathey.
- DataArray and Dataset method where() now supports a drop=True option that clips coordinate elements that are fully masked. By Phillip J. Wolfram.
- New top level merge() function allows for combining variables from any number of Dataset and/or DataArray objects. See Merge for more details. By Stephan Hoyer.
- DataArray and Dataset method resample() now supports the keep_attrs=False option that determines whether variable and dataset attributes are retained in the resampled object. By Jeremy McGibbon.
- Better multi-index support in DataArray and Dataset sel() and loc() methods, which now behave more closely to pandas and also accept dictionaries for indexing based on given level names and labels (see Multi-level indexing). By Benoit Bovy.
- New (experimental) decorators register_dataset_accessor() and register_dataarray_accessor() for registering custom xarray extensions without subclassing. They are described in the new documentation page on xarray Internals. By Stephan Hoyer.
- Round trip boolean datatypes. Previously, writing boolean datatypes to netCDF formats would raise an error since netCDF does not have a bool datatype. This feature reads/writes a dtype attribute to boolean variables in netCDF files. By Joe Hamman.
- 2D plotting methods now have two new keywords (cbar_ax and cbar_kwargs), allowing more control over the colorbar (GH872). By Fabien Maussion.
- New Dataset method filter_by_attrs(), akin to netCDF4.Dataset.get_variables_by_attributes, to easily filter data variables using their attributes. By Filipe Fernandes.
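As a rough sketch of the new groupby_bins() method (the bin edges here are chosen arbitrarily for illustration):

import numpy as np
import xarray as xr

arr = xr.DataArray(np.arange(4), dims='x')
# mean of the values falling into each of two bins along 'x'
arr.groupby_bins('x', bins=[-0.5, 1.5, 3.5]).mean()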
Bug fixes¶
- Attributes were being retained by default for some resampling operations when they should not. With the keep_attrs=False option, they will no longer be retained by default. This may be backwards-incompatible with some scripts, but the attributes may be kept by adding the keep_attrs=True option. By Jeremy McGibbon.
- Concatenating xarray objects along an axis with a MultiIndex or PeriodIndex preserves the nature of the index (GH875). By Stephan Hoyer.
- Fixed bug in arithmetic operations on DataArray objects whose dimensions are numpy structured arrays or recarrays (GH861, GH837). By Maciek Swat.
- decode_cf_timedelta now accepts arrays with ndim > 1 (GH842). This fixes issue GH665. By Filipe Fernandes.
- Fix a bug where xarray.ufuncs that take two arguments would incorrectly use numpy functions instead of dask.array functions (GH876). By Stephan Hoyer.
- Support for pickling functions from xarray.ufuncs (GH901). By Stephan Hoyer.
- Variable.copy(deep=True) no longer converts MultiIndex into a base Index (GH769). By Benoit Bovy.
- Fixes for groupby on dimensions with a multi-index (GH867). By Stephan Hoyer.
- Fix printing datasets with unicode attributes on Python 2 (GH892). By Stephan Hoyer.
- Fixed incorrect test for dask version (GH891). By Stephan Hoyer.
- Fixed dim argument for isel_points/sel_points when a pandas.Index is passed. By Stephan Hoyer.
- contour() now plots the correct number of contours (GH866). By Fabien Maussion.
v0.7.2 (13 March 2016)¶
This release includes two new, entirely backwards compatible features and several bug fixes.
Enhancements¶
- New DataArray method DataArray.dot() for calculating the dot product of two DataArrays along shared dimensions. By Dean Pospisil.
- Rolling window operations on DataArray objects are now supported via a new DataArray.rolling() method. For example:

In [1]: import xarray as xr; import numpy as np

In [2]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5), dims=('x', 'y'))

In [3]: arr
Out[3]:
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
       [ 2.5,  3. ,  3.5,  4. ,  4.5],
       [ 5. ,  5.5,  6. ,  6.5,  7. ]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4

In [4]: arr.rolling(y=3, min_periods=2).mean()
Out[4]:
<xarray.DataArray (x: 3, y: 5)>
array([[  nan,  0.25,  0.5 ,  1.  ,  1.5 ],
       [  nan,  2.75,  3.  ,  3.5 ,  4.  ],
       [  nan,  5.25,  5.5 ,  6.  ,  6.5 ]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4
See Rolling window operations for more details. By Joe Hamman.
Bug fixes¶
- Fixed an issue where plots using pcolormesh and Cartopy axes were being distorted by the inference of the axis interval breaks. This change chooses not to modify the coordinate variables when the axes have the attribute projection, allowing Cartopy to handle the extent of pcolormesh plots (GH781). By Joe Hamman.
- 2D plots now better handle additional coordinates which are not DataArray dimensions (GH788). By Fabien Maussion.
v0.7.1 (16 February 2016)¶
This is a bug fix release that includes two small, backwards compatible enhancements. We recommend that all users upgrade.
Enhancements¶
Bug fixes¶
- Restore checks for shape consistency between data and coordinates in the DataArray constructor (GH758).
- Single dimension variables no longer transpose as part of a broader .transpose. This behavior was causing pandas.PeriodIndex dimensions to lose their type (GH749).
- Dataset labels remain as their native type on .to_dataset. Previously they were coerced to strings (GH745).
- Fixed a bug where replacing a DataArray index coordinate would improperly align the coordinate (GH725).
- DataArray.reindex_like now maintains the dtype of complex numbers when reindexing leads to NaN values (GH738).
- Dataset.rename and DataArray.rename support the old and new names being the same (GH724).
- Fix from_dataframe() for DataFrames with a Categorical column and a MultiIndex index (GH737).
- Fixes to ensure xarray works properly after the upcoming pandas v0.18 and NumPy v1.11 releases.
Acknowledgments¶
The following individuals contributed to this release:
- Edward Richards
- Maximilian Roos
- Rafael Guedes
- Spencer Hill
- Stephan Hoyer
v0.7.0 (21 January 2016)¶
This major release includes redesign of DataArray internals, as well as new methods for reshaping, rolling and shifting data. It includes preliminary support for pandas.MultiIndex, as well as a number of other features and bug fixes, several of which offer improved compatibility with pandas.
New name¶
The project formerly known as “xray” is now “xarray”, pronounced “x-array”! This avoids a namespace conflict with the entire field of x-ray science. Renaming our project seemed like the right thing to do, especially because some scientists who work with actual x-rays are interested in using this project in their work. Thanks for your understanding and patience in this transition. You can now find our documentation and code repository at new URLs:

- http://xarray.pydata.org/
- http://github.com/pydata/xarray/
To ease the transition, we have simultaneously released v0.7.0 of both xray and xarray on the Python Package Index. These packages are identical. For now, import xray still works, except it issues a deprecation warning. This will be the last xray release. Going forward, we recommend switching your import statements to import xarray as xr.
Breaking changes¶
- The internal data model used by DataArray has been rewritten to fix several outstanding issues (GH367, GH634, this stackoverflow report). Internally, DataArray is now implemented in terms of ._variable and ._coords attributes instead of holding variables in a Dataset object. This refactor ensures that if a DataArray has the same name as one of its coordinates, the array and the coordinate no longer share the same data.

In practice, this means that creating a DataArray with the same name as one of its dimensions no longer automatically uses that array to label the corresponding coordinate. You will now need to provide coordinate labels explicitly. Here's the old behavior:

In [5]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[5]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 4 5 6

and the new behavior (compare the values of the x coordinate):

In [6]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[6]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
  * x        (x) int64 0 1 2

- It is no longer possible to convert a DataArray to a Dataset with xray.DataArray.to_dataset() if it is unnamed. This will now raise ValueError. If the array is unnamed, you need to supply the name argument.
Enhancements¶
- Basic support for MultiIndex coordinates on xray objects, including indexing, stack() and unstack():

In [7]: df = pd.DataFrame({'foo': range(3),
   ...:                    'x': ['a', 'b', 'b'],
   ...:                    'y': [0, 0, 1]})
   ...:

In [8]: s = df.set_index(['x', 'y'])['foo']

In [9]: arr = xray.DataArray(s, dims='z')

In [10]: arr
Out[10]:
<xray.DataArray 'foo' (z: 3)>
array([0, 1, 2])
Coordinates:
  * z        (z) object ('a', 0) ('b', 0) ('b', 1)

In [11]: arr.indexes['z']
Out[11]:
MultiIndex(levels=[[u'a', u'b'], [0, 1]],
           labels=[[0, 1, 1], [0, 0, 1]],
           names=[u'x', u'y'])

In [12]: arr.unstack('z')
Out[12]:
<xray.DataArray 'foo' (x: 2, y: 2)>
array([[  0.,  nan],
       [  1.,   2.]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1

In [13]: arr.unstack('z').stack(z=('x', 'y'))
Out[13]:
<xray.DataArray 'foo' (z: 4)>
array([  0.,  nan,   1.,   2.])
Coordinates:
  * z        (z) object ('a', 0) ('a', 1) ('b', 0) ('b', 1)
See Stack and unstack for more details.
Warning
xray's MultiIndex support is still experimental, and we have a long to-do list of desired additions (GH719), including better display of multi-index levels when printing a Dataset, and support for saving datasets with a MultiIndex to a netCDF file. User contributions in this area would be greatly appreciated.

- Support for reading GRIB, HDF4 and other file formats via PyNIO. See Formats supported by PyNIO for more details.
- Better error message when a variable is supplied with the same name as one of its dimensions.
- Plotting: more control on colormap parameters (GH642). vmin and vmax will not be silently ignored anymore. Setting center=False prevents automatic selection of a divergent colormap.
- New shift() and roll() methods for shifting/rotating datasets or arrays along a dimension:

In [14]: array = xray.DataArray([5, 6, 7, 8], dims='x')

In [15]: array.shift(x=2)
Out[15]:
<xarray.DataArray (x: 4)>
array([ nan,  nan,   5.,   6.])
Coordinates:
  * x        (x) int64 0 1 2 3

In [16]: array.roll(x=2)
Out[16]:
<xarray.DataArray (x: 4)>
array([7, 8, 5, 6])
Coordinates:
  * x        (x) int64 2 3 0 1
Notice that shift moves data independently of coordinates, but roll moves both data and coordinates.

- Assigning a pandas object directly as a Dataset variable is now permitted. Its index names correspond to the dims of the Dataset, and its data is aligned.
- Passing a pandas.DataFrame or pandas.Panel to a Dataset constructor is now permitted.
- New function broadcast() for explicitly broadcasting DataArray and Dataset objects against each other. For example:

In [17]: a = xray.DataArray([1, 2, 3], dims='x')

In [18]: b = xray.DataArray([5, 6], dims='y')

In [19]: a
Out[19]:
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) int64 0 1 2

In [20]: b
Out[20]:
<xarray.DataArray (y: 2)>
array([5, 6])
Coordinates:
  * y        (y) int64 0 1

In [21]: a2, b2 = xray.broadcast(a, b)

In [22]: a2
Out[22]:
<xarray.DataArray (x: 3, y: 2)>
array([[1, 1],
       [2, 2],
       [3, 3]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1

In [23]: b2
Out[23]:
<xarray.DataArray (x: 3, y: 2)>
array([[5, 6],
       [5, 6],
       [5, 6]])
Coordinates:
  * y        (y) int64 0 1
  * x        (x) int64 0 1 2
Bug fixes¶
- Fixes for several issues found on DataArray objects with the same name as one of their coordinates (see Breaking changes for more details).
- DataArray.to_masked_array always returns a masked array with the mask being an array (not a scalar value) (GH684).
- Allows for (imperfect) repr of Coords when the underlying index is PeriodIndex (GH645).
- Attempting to assign a Dataset or DataArray variable/attribute using attribute-style syntax (e.g., ds.foo = 42) now raises an error rather than silently failing (GH656, GH714).
- You can now pass pandas objects with non-numpy dtypes (e.g., categorical or datetime64 with a timezone) into xray without an error (GH716).
Acknowledgments¶
The following individuals contributed to this release:
- Antony Lee
- Fabien Maussion
- Joe Hamman
- Maximilian Roos
- Stephan Hoyer
- Takeshi Kanmae
- femtotrader
v0.6.1 (21 October 2015)¶
This release contains a number of bug and compatibility fixes, as well as enhancements to plotting, indexing and writing files to disk.
Note that the minimum required version of dask for use with xray is now version 0.6.
API Changes¶
- The handling of colormaps and discrete color lists for 2D plots in plot() was changed to provide more compatibility with matplotlib's contour and contourf functions (GH538). Now discrete lists of colors should be specified using the colors keyword, rather than cmap.
Enhancements¶
- Faceted plotting through FacetGrid and the plot() method. See Faceting for more details and examples.
- sel() and reindex() now support the tolerance argument for controlling nearest-neighbor selection (GH629):

In [24]: array = xray.DataArray([1, 2, 3], dims='x')

In [25]: array.reindex(x=[0.9, 1.5], method='nearest', tolerance=0.2)
Out[25]:
<xray.DataArray (x: 2)>
array([  2.,  nan])
Coordinates:
  * x        (x) float64 0.9 1.5
This feature requires pandas v0.17 or newer.
- New encoding argument in to_netcdf() for writing netCDF files with compression, as described in the new documentation section on Writing encoded data (see the sketch after this list).
- Add real and imag attributes to Dataset and DataArray (GH553).
- More informative error message with from_dataframe() if the frame has duplicate columns.
- xray now uses deterministic names for dask arrays it creates or opens from disk. This allows xray users to take advantage of dask's nascent support for caching intermediate computation results. See GH555 for an example.
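As a minimal sketch of the new encoding argument (the variable name 'precip' and the compression settings are illustrative):

import numpy as np
import xray

ds = xray.Dataset({'precip': ('time', np.random.rand(100))})
# per-variable compression settings are passed via `encoding`
ds.to_netcdf('out.nc', encoding={'precip': {'zlib': True, 'complevel': 4}})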
Bug fixes¶
- Forwards compatibility with the latest pandas release (v0.17.0). We were using some internal pandas routines for datetime conversion, which unfortunately have now changed upstream (GH569).
- Aggregation functions now correctly skip NaN for data with complex128 dtype (GH554).
- Fixed indexing 0d arrays with unicode dtype (GH568).
- name() and Dataset keys must be a string or None to be written to netCDF (GH533).
- where() now uses dask instead of numpy if either the array or other is a dask array. Previously, if other was a numpy array the method was evaluated eagerly.
- Global attributes are now handled more consistently when loading remote datasets using engine='pydap' (GH574).
- It is now possible to assign to the .data attribute of DataArray objects.
- coordinates attribute is now kept in the encoding dictionary after decoding (GH610).
- Compatibility with numpy 1.10 (GH617).
Acknowledgments¶
The following individuals contributed to this release:
- Ryan Abernathey
- Pete Cable
- Clark Fitzgerald
- Joe Hamman
- Stephan Hoyer
- Scott Sinclair
v0.6.0 (21 August 2015)¶
This release includes numerous bug fixes and enhancements. Highlights include the introduction of a plotting module and the new Dataset and DataArray methods isel_points(), sel_points(), where() and diff(). There are no breaking changes from v0.5.2.
Enhancements¶
- Plotting methods have been implemented on DataArray objects plot() through integration with matplotlib (GH185). For an introduction, see Plotting.
- Variables in netCDF files with multiple missing values are now decoded as NaN after issuing a warning if open_dataset is called with mask_and_scale=True.
- We clarified our rules for when the result from an xray operation is a copy vs. a view (see Copies vs. views for more details).
- Dataset variables are now written to netCDF files in order of appearance when using the netcdf4 backend (GH479).
- Added isel_points() and sel_points() to support pointwise indexing of Datasets and DataArrays (GH475):

In [26]: da = xray.DataArray(np.arange(56).reshape((7, 8)),
   ....:                     coords={'x': list('abcdefg'),
   ....:                             'y': 10 * np.arange(8)},
   ....:                     dims=['x', 'y'])
   ....:

In [27]: da
Out[27]:
<xray.DataArray (x: 7, y: 8)>
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55]])
Coordinates:
  * y        (y) int64 0 10 20 30 40 50 60 70
  * x        (x) |S1 'a' 'b' 'c' 'd' 'e' 'f' 'g'

# we can index by position along each dimension
In [28]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[28]:
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 10 0
    x        (points) |S1 'a' 'b' 'g'
  * points   (points) int64 0 1 2

# or equivalently by label
In [29]: da.sel_points(x=['a', 'b', 'g'], y=[0, 10, 0], dim='points')
Out[29]:
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 10 0
    x        (points) |S1 'a' 'b' 'g'
  * points   (points) int64 0 1 2
- New where() method for masking xray objects according to some criteria. This works particularly well with multi-dimensional data:

In [30]: ds = xray.Dataset(coords={'x': range(100), 'y': range(100)})

In [31]: ds['distance'] = np.sqrt(ds.x ** 2 + ds.y ** 2)

In [32]: ds.distance.where(ds.distance < 100).plot()
Out[32]: <matplotlib.collections.QuadMesh at 0x7fe8eeb78a10>
- Added new methods DataArray.diff and Dataset.diff for finite difference calculations along a given axis.
- New to_masked_array() convenience method for returning a numpy.ma.MaskedArray:

In [33]: da = xray.DataArray(np.random.random_sample(size=(5, 4)))

In [34]: da.where(da < 0.5)
Out[34]:
<xarray.DataArray (dim_0: 5, dim_1: 4)>
array([[ 0.127,    nan,  0.26 ,    nan],
       [ 0.377,  0.336,  0.451,    nan],
       [ 0.123,    nan,  0.373,  0.448],
       [ 0.129,    nan,    nan,  0.352],
       [ 0.229,    nan,    nan,  0.138]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4
  * dim_1    (dim_1) int64 0 1 2 3

In [35]: da.where(da < 0.5).to_masked_array(copy=True)
Out[35]:
masked_array(data =
 [[0.12696983303810094 -- 0.26047600586578334 --]
 [0.37674971618967135 0.33622174433445307 0.45137647047539964 --]
 [0.12310214428849964 -- 0.37301222522143085 0.4479968246859435]
 [0.12944067971751294 -- -- 0.35205353914802473]
 [0.2288873043216132 -- -- 0.1375535565632705]],
             mask =
 [[False  True False  True]
 [False False False  True]
 [False  True False False]
 [False  True  True False]
 [False  True  True False]],
       fill_value = 1e+20)
- Added new flag drop_variables to open_dataset() for excluding variables from being parsed. This may be useful to drop variables with problems or inconsistent values.
Bug fixes¶
- Fixed aggregation functions (e.g., sum and mean) on big-endian arrays when bottleneck is installed (GH489).
- Dataset aggregation functions dropped variables with unsigned integer dtype (GH505).
- .any() and .all() were not lazy when used on xray objects containing dask arrays.
- Fixed an error when attempting to save datetime64 variables to netCDF files when the first element is NaT (GH528).
- Fix pickle on DataArray objects (GH515).
- Fixed unnecessary coercion of float64 to float32 when using netcdf3 and netcdf4_classic formats (GH526).
v0.5.2 (16 July 2015)¶
This release contains bug fixes, several additional options for opening and saving netCDF files, and a backwards incompatible rewrite of the advanced options for xray.concat.
Backwards incompatible changes¶
- The optional arguments concat_over and mode in concat() have been removed and replaced by data_vars and coords. The new arguments are both more easily understood and more robustly implemented, and allowed us to fix a bug where concat accidentally loaded data into memory. If you set values for these optional arguments manually, you will need to update your code. The default behavior should be unchanged.
Enhancements¶
- open_mfdataset() now supports a preprocess argument for preprocessing datasets prior to concatenation (see the sketch after this list). This is useful if datasets cannot be otherwise merged automatically, e.g., if the original datasets have conflicting index coordinates (GH443).
- open_dataset() and open_mfdataset() now use a global thread lock by default for reading from netCDF files with dask. This avoids possible segmentation faults for reading from netCDF4 files when HDF5 is not configured properly for concurrent access (GH444).
- Added support for serializing arrays of complex numbers with engine='h5netcdf'.
- The new save_mfdataset() function allows for saving multiple datasets to disk simultaneously. This is useful when processing large datasets with dask.array. For example, to save a dataset too big to fit into memory to one file per year, we could write:

In [36]: years, datasets = zip(*ds.groupby('time.year'))

In [37]: paths = ['%s.nc' % y for y in years]

In [38]: xray.save_mfdataset(datasets, paths)
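As an illustrative sketch of the new preprocess hook (the file pattern and the coordinate being dropped are hypothetical):

import xray

def fix_coords(ds):
    # e.g., drop a conflicting scalar coordinate before concatenation
    return ds.drop('bad_coord')

combined = xray.open_mfdataset('data/*.nc', preprocess=fix_coords)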
Bug fixes¶
- Fixed min, max, argmin and argmax for arrays with string or unicode types (GH453).
- open_dataset() and open_mfdataset() support supplying chunks as a single integer.
- Fixed a bug in serializing scalar datetime variables to netCDF.
- Fixed a bug that could occur in serialization of 0-dimensional integer arrays.
- Fixed a bug where concatenating DataArrays was not always lazy (GH464).
- When reading datasets with h5netcdf, bytes attributes are decoded to strings. This allows conventions decoding to work properly on Python 3 (GH451).
v0.5.1 (15 June 2015)¶
This minor release fixes a few bugs and an inconsistency with pandas. It also adds the pipe method, copied from pandas.
Enhancements¶
- Added pipe(), replicating the new pandas method in version 0.16.2 (see the sketch after this list). See Transforming datasets for more details.
- assign() and assign_coords() now assign new variables in sorted (alphabetical) order, mirroring the behavior in pandas. Previously, the order was arbitrary.
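A small sketch of method chaining with pipe() (the normalize helper is hypothetical):

import numpy as np
import xray

def normalize(ds, dim):
    return (ds - ds.mean(dim)) / ds.std(dim)

ds = xray.Dataset({'y': ('time', np.random.randn(10))})
result = ds.pipe(normalize, dim='time')  # equivalent to normalize(ds, dim='time')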
v0.5 (1 June 2015)¶
Highlights¶
The headline feature in this release is experimental support for out-of-core computing (data that doesn't fit into memory) with dask. This includes a new top-level function open_mfdataset() that makes it easy to open a collection of netCDF files (using dask) as a single xray.Dataset object. For more on dask, read the blog post introducing xray + dask and the new documentation section Out of core computation with dask.
Dask makes it possible to harness parallelism and manipulate gigantic datasets with xray. It is currently an optional dependency, but it may become required in the future.
Backwards incompatible changes¶
The logic used for choosing which variables are concatenated with concat() has changed. Previously, by default any variables which were equal across a dimension were not concatenated. This led to some surprising behavior, where the behavior of groupby and concat operations could depend on runtime values (GH268). For example:

In [39]: ds = xray.Dataset({'x': 0})

In [40]: xray.concat([ds, ds], dim='y')
Out[40]:
<xray.Dataset>
Dimensions: ()
Coordinates:
  *empty*
Data variables:
    x        int64 0
Now, the default always concatenates data variables:
In [41]: xray.concat([ds, ds], dim='y')
Out[41]:
<xarray.Dataset>
Dimensions:  (y: 2)
Coordinates:
  * y        (y) int64 0 1
Data variables:
    x        (y) int64 0 0
To obtain the old behavior, supply the argument concat_over=[].
Enhancements¶
- New to_array() and enhanced to_dataset() methods make it easy to switch back and forth between arrays and datasets:

In [42]: ds = xray.Dataset({'a': 1, 'b': ('x', [1, 2, 3])},
   ....:                   coords={'c': 42}, attrs={'Conventions': 'None'})
   ....:

In [43]: ds.to_array()
Out[43]:
<xarray.DataArray (variable: 2, x: 3)>
array([[1, 1, 1],
       [1, 2, 3]])
Coordinates:
  * variable  (variable) |S1 'a' 'b'
  * x         (x) int64 0 1 2
    c         int64 42
Attributes:
    Conventions: None

In [44]: ds.to_array().to_dataset(dim='variable')
Out[44]:
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
    c        int64 42
Data variables:
    a        (x) int64 1 1 1
    b        (x) int64 1 2 3
Attributes:
    Conventions: None
- New fillna() method to fill missing values, modeled off the pandas method of the same name:

In [45]: array = xray.DataArray([np.nan, 1, np.nan, 3], dims='x')

In [46]: array.fillna(0)
Out[46]:
<xarray.DataArray (x: 4)>
array([ 0.,  1.,  0.,  3.])
Coordinates:
  * x        (x) int64 0 1 2 3

fillna works on both Dataset and DataArray objects, and uses index based alignment and broadcasting like standard binary operations. It also can be applied by group, as illustrated in Fill missing values with climatology.

- New assign() and assign_coords() methods patterned off the new DataFrame.assign method in pandas:

In [47]: ds = xray.Dataset({'y': ('x', [1, 2, 3])})

In [48]: ds.assign(z = lambda ds: ds.y ** 2)
Out[48]:
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    y        (x) int64 1 2 3
    z        (x) int64 1 4 9

In [49]: ds.assign_coords(z = ('x', ['a', 'b', 'c']))
Out[49]:
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
    z        (x) |S1 'a' 'b' 'c'
Data variables:
    y        (x) int64 1 2 3
These methods return a new Dataset (or DataArray) with updated data or coordinate variables.
- sel() now supports the method parameter, which works like the parameter of the same name on reindex(). It provides a simple interface for doing nearest-neighbor interpolation:

In [50]: ds.sel(x=1.1, method='nearest')
Out[50]:
<xray.Dataset>
Dimensions: ()
Coordinates:
    x        int64 1
Data variables:
    y        int64 2

In [51]: ds.sel(x=[1.1, 2.1], method='pad')
Out[51]:
<xray.Dataset>
Dimensions:  (x: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    y        (x) int64 2 3
See Nearest neighbor lookups for more details.
- You can now control the underlying backend used for accessing remote datasets (via OPeNDAP) by specifying engine='netcdf4' or engine='pydap'.
- xray now provides experimental support for reading and writing netCDF4 files directly via h5py with the h5netcdf package, avoiding the netCDF4-Python package. You will need to install h5netcdf and specify engine='h5netcdf' to try this feature.
- Accessing data from remote datasets now has retrying logic (with exponential backoff) that should make it robust to occasional bad responses from DAP servers.
- You can control the width of the Dataset repr with xray.set_options. It can be used either as a context manager, in which case the default is restored outside the context:

In [52]: ds = xray.Dataset({'x': np.arange(1000)})

In [53]: with xray.set_options(display_width=40):
   ....:     print(ds)
   ....:
<xarray.Dataset>
Dimensions:  (x: 1000)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 ...
Data variables:
    *empty*
Or to set a global option:
In [54]: xray.set_options(display_width=80)
The default value for the display_width option is 80.
Deprecations¶
- The method load_data() has been renamed to the more succinct load().
v0.4.1 (18 March 2015)¶
The release contains bug fixes and several new features. All changes should be fully backwards compatible.
Enhancements¶
New documentation sections on Time series data and Combining multiple files.
- resample() lets you resample a dataset or data array to a new temporal resolution. The syntax is the same as pandas, except you need to supply the time dimension explicitly:

In [55]: time = pd.date_range('2000-01-01', freq='6H', periods=10)

In [56]: array = xray.DataArray(np.arange(10), [('time', time)])

In [57]: array.resample('1D', dim='time')
Out[57]:
<xarray.DataArray (time: 3)>
array([ 1.5,  5.5,  8.5])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03

You can specify how to do the resampling with the how argument, and other options such as closed and label let you control labeling:

In [58]: array.resample('1D', dim='time', how='sum', label='right')
Out[58]:
<xarray.DataArray (time: 3)>
array([ 6, 22, 17])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04

If the desired temporal resolution is higher than the original data (upsampling), xray will insert missing values:

In [59]: array.resample('3H', 'time')
Out[59]:
<xarray.DataArray (time: 19)>
array([  0.,  nan,   1., ...,   8.,  nan,   9.])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T03:00:00 ...
- first and last methods on groupby objects let you take the first or last examples from each group along the grouped axis:

In [60]: array.groupby('time.day').first()
Out[60]:
<xarray.DataArray (day: 3)>
array([0, 4, 8])
Coordinates:
  * day      (day) int64 1 2 3

These methods combine well with resample:

In [61]: array.resample('1D', dim='time', how='first')
Out[61]:
<xarray.DataArray (time: 3)>
array([0, 4, 8])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
- swap_dims() allows for easily swapping one dimension out for another:

In [62]: ds = xray.Dataset({'x': range(3), 'y': ('x', list('abc'))})

In [63]: ds
Out[63]:
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    y        (x) |S1 'a' 'b' 'c'

In [64]: ds.swap_dims({'x': 'y'})
Out[64]:
<xarray.Dataset>
Dimensions:  (y: 3)
Coordinates:
  * y        (y) |S1 'a' 'b' 'c'
    x        (y) int64 0 1 2
Data variables:
    *empty*
This was possible in earlier versions of xray, but required some contortions.
- open_dataset() and to_netcdf() now accept an engine argument to explicitly select which underlying library (netcdf4 or scipy) is used for reading/writing a netCDF file.
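For instance, a sketch with hypothetical file names (the scipy backend handles netCDF3 files):

import xray

# force the scipy backend for both reading and writing
ds = xray.open_dataset('input.nc', engine='scipy')
ds.to_netcdf('output.nc', engine='scipy')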
Bug fixes¶
- Fixed a bug where data netCDF variables read from disk with engine='scipy' could still be associated with the file on disk, even after closing the file (GH341). This manifested itself in warnings about mmapped arrays and segmentation faults (if the data was accessed).
- Silenced spurious warnings about all-NaN slices when using nan-aware aggregation methods (GH344).
- Dataset aggregations with keep_attrs=True now preserve attributes on data variables, not just the dataset itself.
- Tests for xray now pass when run on Windows (GH360).
- Fixed a regression in v0.4 where saving to netCDF could fail with the error ValueError: could not automatically determine time units.
v0.4 (2 March, 2015)¶
This is one of the biggest releases yet for xray: it includes some major changes that may break existing code, along with the usual collection of minor enhancements and bug fixes. On the plus side, this release includes all hitherto planned breaking changes, so the upgrade path for xray should be smoother going forward.
Breaking changes¶
- We now automatically align index labels in arithmetic, dataset construction, merging and updating. This means the need for manually invoking methods like align() and reindex_like() should be vastly reduced.

For arithmetic, we align based on the intersection of labels:

In [65]: lhs = xray.DataArray([1, 2, 3], [('x', [0, 1, 2])])

In [66]: rhs = xray.DataArray([2, 3, 4], [('x', [1, 2, 3])])

In [67]: lhs + rhs
Out[67]:
<xarray.DataArray (x: 2)>
array([4, 6])
Coordinates:
  * x        (x) int64 1 2
For dataset construction and merging, we align based on the union of labels:
In [68]: xray.Dataset({'foo': lhs, 'bar': rhs})
Out[68]:
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
Data variables:
    foo      (x) float64 1.0 2.0 3.0 nan
    bar      (x) float64 nan 2.0 3.0 4.0
For update and __setitem__, we align based on the original object:
In [69]: lhs.coords['rhs'] = rhs

In [70]: lhs
Out[70]:
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) int64 0 1 2
    rhs      (x) float64 nan 2.0 3.0
- Aggregations like mean or median now skip missing values by default:

In [71]: xray.DataArray([1, 2, np.nan, 3]).mean()
Out[71]:
<xarray.DataArray ()>
array(2.0)

You can turn this behavior off by supplying the keyword argument skipna=False.

These operations are lightning fast thanks to integration with bottleneck, which is a new optional dependency for xray (numpy is used if bottleneck is not installed).
- Scalar coordinates no longer conflict with constant arrays with the same value (e.g., in arithmetic, merging datasets and concat), even if they have different shape (GH243). For example, the coordinate c here persists through arithmetic, even though it has different shapes on each DataArray:

In [72]: a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')

In [73]: b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')

In [74]: (a + b).coords
Out[74]:
Coordinates:
    c        (x) int64 0 0
  * x        (x) int64 0 1

This functionality can be controlled through the compat option, which has also been added to the Dataset constructor.

- Datetime shortcuts such as 'time.month' now return a DataArray with the name 'month', not 'time.month' (GH345). This makes it easier to index the resulting arrays when they are used with groupby:

In [75]: time = xray.DataArray(pd.date_range('2000-01-01', periods=365),
   ....:                       dims='time', name='time')
   ....:

In [76]: counts = time.groupby('time.month').count()

In [77]: counts.sel(month=2)
Out[77]:
<xarray.DataArray 'time' ()>
array(29)
Coordinates:
    month    int64 2
Previously, you would need to use something like counts.sel(**{'time.month': 2}), which is much more awkward.

- The season datetime shortcut now returns an array of string labels such as 'DJF':

In [78]: ds = xray.Dataset({'t': pd.date_range('2000-01-01', periods=12, freq='M')})

In [79]: ds['t.season']
Out[79]:
<xarray.DataArray 'season' (t: 12)>
array(['DJF', 'DJF', 'MAM', ..., 'SON', 'SON', 'DJF'],
      dtype='|S3')
Coordinates:
  * t        (t) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 2000-04-30 ...
Previously, it returned numbered seasons 1 through 4.
- We have updated our use of the terms "coordinates" and "variables". What were known in previous versions of xray as "coordinates" and "variables" are now referred to throughout the documentation as "coordinate variables" and "data variables". This brings xray in closer alignment to CF Conventions. The only visible change besides the documentation is that Dataset.vars has been renamed Dataset.data_vars.
- You will need to update your code if you have been ignoring deprecation warnings: methods and attributes that were deprecated in xray v0.3 or earlier (e.g., dimensions, attributes) have gone away.
Enhancements¶
- Support for reindex() with a fill method. This provides a useful shortcut for upsampling:

In [80]: data = xray.DataArray([1, 2, 3], dims='x')

In [81]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[81]:
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
  * x        (x) float64 0.5 1.0 1.5 2.0 2.5

This will be especially useful once pandas 0.16 is released, at which point xray will immediately support reindexing with method='nearest'.
Use functions that return generic ndarrays with DataArray.groupby.apply and Dataset.apply (GH327 and GH329). Thanks Jeff Gerard!
- Consolidated the functionality of dumps (writing a dataset to a netCDF3 bytestring) into to_netcdf() (GH333).
- to_netcdf() now supports writing to groups in netCDF4 files (GH333). It also finally has a full docstring – you should read it!
- open_dataset() and to_netcdf() now work on netCDF3 files when netcdf4-python is not installed, as long as scipy is available (GH333).
- The new Dataset.drop and DataArray.drop methods make it easy to drop explicitly listed variables or index labels:

# drop variables
In [82]: ds = xray.Dataset({'x': 0, 'y': 1})

In [83]: ds.drop('x')
Out[83]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
  *empty*
Data variables:
    y        int64 1

# drop index labels
In [84]: arr = xray.DataArray([1, 2, 3], coords=[('x', list('abc'))])

In [85]: arr.drop(['a', 'c'], dim='x')
Out[85]:
<xarray.DataArray (x: 1)>
array([2])
Coordinates:
  * x        (x) |S1 'b'
- broadcast_equals() has been added to correspond to the new compat option.
- Long attributes are now truncated at 500 characters when printing a dataset (GH338). This should make things more convenient for working with datasets interactively.
Added a new documentation example, Calculating Seasonal Averages from Timeseries of Monthly Means. Thanks Joe Hamman!
Bug fixes¶
- Several bug fixes related to decoding time units from netCDF files (GH316, GH330). Thanks Stefan Pfenninger!
- xray no longer requires decode_coords=False when reading datasets with unparseable coordinate attributes (GH308).
- Fixed DataArray.loc indexing with ... (GH318).
- Fixed an edge case that resulted in an error when reindexing multi-dimensional variables (GH315).
- Slicing with negative step sizes (GH312).
- Invalid conversion of string arrays to numeric dtype (GH305).
- Fixed repr() on dataset objects with non-standard dates (GH347).
Deprecations¶
- dump and dumps have been deprecated in favor of to_netcdf().
- drop_vars has been deprecated in favor of drop().
Future plans¶
The biggest feature I’m excited about working toward in the immediate future is supporting out-of-core operations in xray using Dask, a part of the Blaze project. For a preview of using Dask with weather data, read this blog post by Matthew Rocklin. See GH328 for more details.
v0.3.2 (23 December, 2014)¶
This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.
There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.
Warning
xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.
Backwards incompatible changes¶
- Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:

In [86]: from datetime import datetime

In [87]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
Out[87]:
<xarray.Dataset>
Dimensions:  (t: 1)
Coordinates:
  * t        (t) datetime64[ns] 2000-01-01
Data variables:
    *empty*

- xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are thus accordingly cast to timedelta64[ns] objects when appropriate.
- Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).
Enhancements¶
Due to popular demand, we have added experimental attribute style access as a shortcut for dataset variables, coordinates and attributes:
In [88]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celcius'})})

In [89]: ds.tmin.units
Out[89]: 'celcius'
Tab-completion for these variables should work in editors such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).
You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:
In [90]: array = xray.DataArray(np.zeros(5), dims=['x'])

In [91]: array[dict(x=slice(3))] = 1

In [92]: array
Out[92]:
<xarray.DataArray (x: 5)>
array([ 1.,  1.,  1.,  0.,  0.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
- Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.
- Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.
- We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30 to 50%. The full speed-up requires cyordereddict to be installed.
Bug fixes¶
Future plans¶
- I am contemplating switching to the terms "coordinate variables" and "data variables" instead of the (currently used) "coordinates" and "variables", following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
- I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
- The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I'm pretty sick of writing ds.reduce(np.nanmean, 'time').
. - The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.
If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.
v0.3.1 (22 October, 2014)¶
This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).
We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.
Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.
New features¶
- Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58).
- Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
- Support for reading gzipped netCDF3 files (GH239).
- Reduced memory usage when writing netCDF files (GH251).
- ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
- Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
Bug fixes¶
- Compatibility fixes for pandas v0.15 (GH262).
- Fixes for display and indexing of NaT (not-a-time) (GH238, GH240).
- Fix slicing by label when the argument is a data array (GH250).
- Test data is now shipped with the source distribution (GH253).
- Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
- Order of dimensions preserved with DataArray.to_dataframe (GH260).
v0.3 (21 September 2014)¶
New features¶
- Revamped coordinates: "coordinates" now refer to all arrays that are not used to index a dimension. Coordinates are intended to allow for keeping track of arrays of metadata that describe the grid on which the points in "variable" arrays lie. They are preserved (when unambiguous) even through mathematical operations.
- Dataset math: Dataset objects now support all arithmetic operations directly. Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
- GroupBy math: This provides a convenient shortcut for normalizing by the average value of a group.
- The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
- You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
Backwards incompatible changes¶
Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
Deprecations¶
- Dataset.noncoords is deprecated: use Dataset.vars instead.
- Dataset.select_vars deprecated: index a Dataset with a list of variable names instead.
- DataArray.select_vars and DataArray.drop_vars deprecated: use reset_coords() instead.
v0.2 (14 August 2014)¶
This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:
- There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
- You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
- xray now works on Python 2.6, thanks to Anna Kuznetsova.
- A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
- New load_data() and close() methods for datasets facilitate a lower level of control of data loaded from disk.
v0.1.1 (20 May 2014)¶
xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:
- Python 3 support (GH53)
- Required numpy version relaxed to 1.7 (GH129)
- Return numpy.datetime64 arrays for non-standard calendars (GH126)
- Support for opening datasets associated with NetCDF4 groups (GH127)
- Bug-fixes for concatenating datetime arrays (GH134)
Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.
v0.1 (2 May 2014)¶
Initial release.
Overview: Why xarray?¶
Features¶
Adding dimension names and coordinate indexes to numpy's ndarray makes many powerful array operations possible:
- Apply operations over dimensions by name: x.sum('time').
- Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
- Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
- Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
- Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
- Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
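The following sketch exercises several of these features on a small, randomly generated array (all names are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

x = xr.DataArray(np.random.randn(4, 3),
                 coords={'time': pd.date_range('2014-01-01', periods=4)},
                 dims=('time', 'space'))
x.sum('time')                       # reduce over a named dimension
x.sel(time='2014-01-01')            # select by coordinate label
x.groupby('time.dayofyear').mean()  # split-apply-combine
x.attrs['units'] = 'meters'         # arbitrary metadata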
pandas provides many of these features, but it does not make use of dimension names, and its core data structures are fixed dimensional arrays.
The N-dimensional nature of xarray's data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xarray, you don't need to keep track of the order of an array's dimensions or insert dummy dimensions (e.g., np.newaxis) to align arrays.
Core data structures¶
xarray has two core data structures. Both are fundamentally N-dimensional:
- DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez's datarray project, which prototyped a similar data structure.
- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
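As a brief sketch, a Dataset can hold a 2D data variable alongside a 1D variable that shares one of its dimensions (all names below are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

ds = xr.Dataset({'temperature': (('time', 'station'), np.random.randn(3, 2)),
                 'cloud_cover': ('time', [0.3, 0.7, 0.1])},
                coords={'time': pd.date_range('2014-09-06', periods=3),
                        'station': ['A', 'B']})
ds.sel(time='2014-09-07')  # selects along 'time' in every variable at once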
The value of attaching labels to numpy’s numpy.ndarray
may be
fairly obvious, but the dataset may need more motivation.
The power of the dataset over a plain dictionary is that, in addition to
pulling out arrays by name, it is possible to select or combine data along a
dimension across all arrays simultaneously. Like a
DataFrame
, datasets facilitate array operations with
heterogeneous data – the difference is that the arrays in a dataset can not
only have different data types, but can also have different numbers of
dimensions.
This data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.
xarray distinguishes itself from many tools for working with netCDF data in-so-far as it provides data structures for in-memory analytics that both utilize and preserve labels. You only need to do the tedious work of adding metadata once, not every time you save a file.
Goals and aspirations¶
pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in.
xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).
Importantly, xarray has robust support for converting its objects to and
from a numpy ndarray
or a pandas DataFrame
or Series
, providing
compatibility with the full PyData ecosystem.
Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.
Frequently Asked Questions¶
Why is pandas not enough?¶
pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?
Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.
Pandas does support N-dimensional panels, but the implementation is very limited:
- You need to create a new factory type for each dimensionality.
- You can’t do math between NDPanels with different dimensionality.
- Each dimension in an NDPanel has a name (e.g., 'labels', 'items', 'major_axis', etc.) but the dimension names refer to order, not their meaning. You can't specify an operation to be applied along the "time" axis.
Fundamentally, the N-dimensional panel is limited by its context in pandas's tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrame, and so on. In my experience, it is usually easier to work with a DataFrame with a hierarchical index rather than to use higher dimensional (N > 3) data structures in pandas.
Another use case is handling collections of arrays with different numbers of
dimensions. For example, suppose you have a 2D array and a handful of
associated 1D arrays that share one of the same axes. Storing these in one
pandas object is possible but awkward – you can either upcast all the 1D
arrays to 2D and store everything in a Panel
, or put everything in a
DataFrame
, where the first few columns have a different meaning than the
other columns. In contrast, this sort of data structure fits very naturally in
an xarray Dataset
.
Pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.
How do xarray data structures differ from those found in pandas?¶
The main distinguishing feature of xarray’s DataArray
over labeled arrays in
pandas is that dimensions can have names (e.g., “time”, “latitude”,
“longitude”). Names are much easier to keep track of than axis numbers, and
xarray uses dimension names for indexing, aggregation and broadcasting. Not only
can you write x.sel(time='2000-01-01')
and x.mean(dim='time')
, but
operations like x - x.mean(dim='time')
always work, no matter the order
of the “time” dimension. You never need to reshape arrays (e.g., with
np.newaxis
) to align them for arithmetic operations in xarray.
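A minimal sketch of this order-independence (random data, illustrative names):

import numpy as np
import xarray as xr

x = xr.DataArray(np.random.randn(3, 4), dims=('time', 'space'))
anomaly = x - x.mean(dim='time')        # works regardless of dimension order...
anomaly_t = x.T - x.T.mean(dim='time')  # ...including on the transpose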
Should I use xarray instead of pandas?¶
It’s not an either/or choice! xarray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.
That said, you should only bother with xarray if some aspect of data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.
What is your approach to metadata?¶
We are firm believers in the power of labeled data! In addition to dimensions
and coordinates, xarray supports arbitrary metadata in the form of global
(Dataset) and variable specific (DataArray) attributes (attrs
).
Automatic interpretation of labels is powerful but also reduces flexibility.
With xarray, we draw a firm line between labels that the library understands
(dims
and coords
) and labels for users and user code (attrs
). For
example, we do not automatically interpret and enforce units or CF
conventions. (An exception is serialization to and from netCDF files.)
An implication of this choice is that we do not propagate attrs
through
most operations unless explicitly flagged (some methods have a keep_attrs
option). Similarly, xarray does not check for conflicts between attrs
when
combining arrays and datasets, unless explicitly requested with the option
compat='identical'
. The guiding principle is that metadata should not be
allowed to get in the way.
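For instance, a rough sketch of the keep_attrs flag mentioned above (the attribute values are illustrative):

import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims='x', attrs={'units': 'm'})
da.mean()                 # attrs are dropped by default
da.mean(keep_attrs=True)  # attrs are propagated on request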
How should I cite xarray?¶
If you are using xarray and would like to cite it in academic publication, we would certainly appreciate it. We recommend two citations.
At a minimum, we recommend citing the xarray overview journal article, to be submitted to the Journal of Open Research Software.
Hoyer, S., Hamman, J. (In preparation). Xarray: N-D labeled arrays and datasets in Python. Journal of Open Research Software.
Here’s an example of a BibTeX entry:
@article{hoyer2016xarray,
  title   = {xarray: {N-D} labeled arrays and datasets in {Python}},
  author  = {Hoyer, S. and J. Hamman},
  journal = {in prep, J. Open Res. Software},
  year    = {2016}
}

You may also want to cite a specific version of the xarray package. We provide a Zenodo citation and DOI for this purpose.
Hoyer, S. et al.. (2016). xarray: v0.8.0. Zenodo. 10.5281/zenodo.59499
An example BibTeX entry:
@misc{xarray_v0_8_0,
  author = {Stephan Hoyer and Clark Fitzgerald and Joe Hamman and others},
  title  = {xarray: v0.8.0},
  month  = aug,
  year   = 2016,
  doi    = {10.5281/zenodo.59499},
  url    = {http://dx.doi.org/10.5281/zenodo.59499}
}
Examples¶
Quick overview¶
Here are some quick examples of what you can do with xarray.DataArray
objects. Everything is explained in much more detail in the rest of the
documentation.
To begin, import numpy, pandas and xarray using their customary abbreviations:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import xarray as xr
Create a DataArray¶
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
In [4]: xr.DataArray(np.random.randn(2, 3))
Out[4]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[-1.344, 0.845, 1.076],
[-0.109, 1.644, -1.469]])
Coordinates:
* dim_0 (dim_0) int64 0 1
* dim_1 (dim_1) int64 0 1 2
In [5]: data = xr.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])
In [6]: data
Out[6]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
If you supply a pandas Series
or
DataFrame
, metadata is copied directly:
In [7]: xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
Out[7]:
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
* dim_0 (dim_0) object 'a' 'b' 'c'
Here are the key properties for a DataArray
:
# like in pandas, values is a numpy array that you can modify in-place
In [8]: data.values
Out[8]:
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
In [9]: data.dims
Out[9]: ('x', 'y')
In [10]: data.coords
Out[10]:
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# you can use this dictionary to store arbitrary metadata
In [11]: data.attrs
Out[11]: OrderedDict()
Indexing¶
xarray supports four kinds of indexing. These operations are just as fast as in pandas, because we borrow pandas’ indexing machinery.
# positional and by integer label, like numpy
In [12]: data[[0, 1]]
Out[12]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# positional and by coordinate label, like pandas
In [13]: data.loc['a':'b']
Out[13]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and integer label
In [14]: data.isel(x=slice(2))
Out[14]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and coordinate label
In [15]: data.sel(x=['a', 'b'])
Out[15]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Computation¶
Data arrays work very similarly to numpy ndarrays:
In [16]: data + 10
Out[16]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 10.357, 9.325, 8.223],
[ 9.031, 8.705, 10.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [17]: np.sin(data)
Out[17]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.349, -0.625, -0.979],
[-0.824, -0.962, 0.402]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [18]: data.T
Out[18]:
<xarray.DataArray (y: 3, x: 2)>
array([[ 0.357, -0.969],
[-0.675, -1.295],
[-1.777, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [19]: data.sum()
Out[19]:
<xarray.DataArray ()>
array(-3.9441825539138033)
However, aggregation operations can use dimension names instead of axis numbers:
In [20]: data.mean(dim='x')
Out[20]:
<xarray.DataArray (y: 3)>
array([-0.306, -0.985, -0.682])
Coordinates:
* y (y) int64 -2 0 2
Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:
In [21]: a = xr.DataArray(np.random.randn(3), [data.coords['y']])
In [22]: b = xr.DataArray(np.random.randn(4), dims='z')
In [23]: a
Out[23]:
<xarray.DataArray (y: 3)>
array([ 0.277, -0.472, -0.014])
Coordinates:
* y (y) int64 -2 0 2
In [24]: b
Out[24]:
<xarray.DataArray (z: 4)>
array([-0.363, -0.006, -0.923, 0.896])
Coordinates:
* z (z) int64 0 1 2 3
In [25]: a + b
Out[25]:
<xarray.DataArray (y: 3, z: 4)>
array([[-0.086, 0.271, -0.646, 1.172],
[-0.835, -0.478, -1.395, 0.424],
[-0.377, -0.02 , -0.937, 0.882]])
Coordinates:
* y (y) int64 -2 0 2
* z (z) int64 0 1 2 3
It also means that in most cases you do not need to worry about the order of dimensions:
In [26]: data - data.T
Out[26]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Operations also align based on index labels:
In [27]: data[:-1] - data[:1]
Out[27]:
<xarray.DataArray (x: 1, y: 3)>
array([[ 0., 0., 0.]])
Coordinates:
* x (x) |S1 'a'
* y (y) int64 -2 0 2
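If you need explicit control over how labels are matched, here is a sketch using align() with the data array defined above:
# align() makes automatic label matching explicit and configurable
outer_a, outer_b = xr.align(data[:1], data[1:], join='outer')  # union of 'x' labels
inner_a, inner_b = xr.align(data[:1], data[1:], join='inner')  # intersection (empty here)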
GroupBy¶
xarray supports grouped operations using a very similar API to pandas:
In [28]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
In [29]: labels
Out[29]:
<xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'],
dtype='|S1')
Coordinates:
* y (y) int64 -2 0 2
In [30]: data.groupby(labels).mean('y')
Out[30]:
<xarray.DataArray (x: 2, labels: 2)>
array([[-0.71 , -0.675],
[-0.278, -1.295]])
Coordinates:
* x (x) |S1 'a' 'b'
* labels (labels) object 'E' 'F'
In [31]: data.groupby(labels).apply(lambda x: x - x.min())
Out[31]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 2.134, 0.62 , 0. ],
[ 0.808, 0. , 2.191]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
labels (y) |S1 'E' 'F' 'E'
Convert to pandas¶
A key feature of xarray is robust conversion to and from pandas objects:
In [32]: data.to_series()
Out[32]:
x y
a -2 0.357021
0 -0.674600
2 -1.776904
b -2 -0.968914
0 -1.294524
2 0.413738
dtype: float64
In [33]: data.to_pandas()
Out[33]:
y -2 0 2
x
a 0.357021 -0.674600 -1.776904
b -0.968914 -1.294524 0.413738
Datasets and NetCDF¶
xarray.Dataset
is a dict-like container of DataArray
objects that share
index labels and dimensions. It looks a lot like a netCDF file:
In [34]: ds = data.to_dataset(name='foo')
In [35]: ds
Out[35]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Data variables:
foo (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137
You can do almost everything you can do with DataArray
objects with
Dataset
objects if you prefer to work with multiple variables at once.
Datasets also let you easily read and write netCDF files:
In [36]: ds.to_netcdf('example.nc')
In [37]: xr.open_dataset('example.nc')
Out[37]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int32 -2 0 2
* x (x) |S1 'a' 'b'
Data variables:
foo (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137
Toy weather data¶
Here is an example of how to easily manipulate a toy weather dataset using xarray and other recommended Python libraries:
Shared setup:
import xarray as xr
import numpy as np
import pandas as pd
import seaborn as sns # pandas aware plotting library
np.random.seed(123)
times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)
ds = xr.Dataset({'tmin': (('time', 'location'), tmin_values),
                 'tmax': (('time', 'location'), tmax_values)},
                {'time': times, 'location': ['IA', 'IN', 'IL']})
Examine a dataset with pandas and seaborn¶
In [1]: ds
Out[1]:
<xarray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
tmax (time, location) float64 12.98 3.31 6.779 0.4479 6.373 4.843 ...
tmin (time, location) float64 -8.037 -1.788 -3.932 -9.341 -6.558 ...
In [2]: df = ds.to_dataframe()
In [3]: df.head()
Out[3]:
tmax tmin
location time
IA 2000-01-01 12.980549 -8.037369
2000-01-02 0.447856 -9.341157
2000-01-03 5.322699 -12.139719
2000-01-04 1.889425 -7.492914
2000-01-05 0.791176 -0.447129
In [4]: df.describe()
Out[4]:
tmax tmin
count 2193.000000 2193.000000
mean 20.108232 9.975426
std 11.010569 10.963228
min -3.506234 -13.395763
25% 9.853905 -0.040347
50% 19.967409 10.060403
75% 30.045588 20.083590
max 43.271148 33.456060
In [5]: ds.mean(dim='location').to_dataframe().plot()
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x7fe8ed77c710>

In [6]: sns.pairplot(df.reset_index(), vars=ds.data_vars)
Out[6]: <seaborn.axisgrid.PairGrid at 0x7f0fd2368a10>

Probability of freeze by calendar month¶
In [7]: freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')
In [8]: freeze
Out[8]:
<xarray.DataArray 'tmin' (month: 12, location: 3)>
array([[ 0.952, 0.887, 0.935],
[ 0.842, 0.719, 0.772],
[ 0.242, 0.129, 0.161],
...,
[ 0. , 0.016, 0. ],
[ 0.333, 0.35 , 0.233],
[ 0.935, 0.855, 0.823]])
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
In [9]: freeze.to_pandas().plot()
Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x7fe8ee751590>

Monthly averaging¶
In [10]: monthly_avg = ds.resample('1MS', dim='time', how='mean')
In [11]: monthly_avg.sel(location='IA').to_dataframe().plot(style='s-')
Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x7fe8e46f0c50>

Note that MS
here refers to Month-Start; M
labels Month-End (the last
day of the month).
Calculate monthly anomalies¶
In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.
In [12]: climatology = ds.groupby('time.month').mean('time')
In [13]: anomalies = ds.groupby('time.month') - climatology
In [14]: anomalies.mean('location').to_dataframe()[['tmin', 'tmax']].plot()
Out[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7fe8e46158d0>

Fill missing values with climatology¶
The fillna()
method on grouped objects lets you easily
fill missing values by group:
# throw away the first half of every month
In [15]: some_missing = ds.tmin.sel(time=ds['time.day'] > 15).reindex_like(ds)
In [16]: filled = some_missing.groupby('time.month').fillna(climatology.tmin)
In [17]: both = xr.Dataset({'some_missing': some_missing, 'filled': filled})
In [18]: both
Out[18]:
<xarray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) object 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
month (time) int32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Data variables:
some_missing (time, location) float64 nan nan nan nan nan nan nan nan ...
filled (time, location) float64 -5.163 -4.216 -4.681 -5.163 ...
In [19]: df = both.sel(time='2000').mean('location').reset_coords(drop=True).to_dataframe()
In [20]: df[['filled', 'some_missing']].plot()
Out[20]: <matplotlib.axes._subplots.AxesSubplot at 0x7fe8e44e9c50>

Calculating Seasonal Averages from Timeseries of Monthly Means¶
Author: Joe Hamman
The data for this example can be found in the xray-data repository. This example is also available as an IPython Notebook here.
Suppose we have a netCDF or xray Dataset of monthly mean data and we want to calculate the seasonal average. To do this properly, we need to calculate the weighted average considering that each month has a different number of days.
%matplotlib inline
import numpy as np
import pandas as pd
import xray
from netCDF4 import num2date
import matplotlib.pyplot as plt
print("numpy version : ", np.__version__)
print("pandas version : ", pd.version.version)
print("xray version : ", xray.version.version)
numpy version : 1.9.2
pandas version : 0.16.2
xray version : 0.5.1
Some calendar information so we can support any netCDF calendar.¶
dpm = {'noleap': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '365_day': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'standard': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'proleptic_gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'all_leap': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '366_day': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '360_day': [0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]}
A few calendar functions to determine the number of days in each month¶
If you were only using the standard calendar, it would be easy to use the calendar.monthrange function.
def leap_year(year, calendar='standard'):
    """Determine if year is a leap year"""
    leap = False
    if ((calendar in ['standard', 'gregorian',
                      'proleptic_gregorian', 'julian']) and
            (year % 4 == 0)):
        leap = True
        if ((calendar == 'proleptic_gregorian') and
                (year % 100 == 0) and
                (year % 400 != 0)):
            leap = False
        elif ((calendar in ['standard', 'gregorian']) and
                (year % 100 == 0) and (year % 400 != 0) and
                (year < 1583)):
            leap = False
    return leap
def get_dpm(time, calendar='standard'):
    """
    Return an array of days per month corresponding to the dates in `time`.
    """
    month_length = np.zeros(len(time), dtype=np.int)
    cal_days = dpm[calendar]
    for i, (month, year) in enumerate(zip(time.month, time.year)):
        month_length[i] = cal_days[month]
        # only February gains an extra day in leap years
        if leap_year(year, calendar=calendar) and month == 2:
            month_length[i] += 1
    return month_length
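As a quick, illustrative check of these helpers (assuming the definitions above):
idx = pd.date_range('2000-01-01', periods=3, freq='M')  # Jan, Feb, Mar 2000
print(get_dpm(idx, calendar='standard'))  # [31 29 31] -- 2000 is a leap year
print(leap_year(2000), leap_year(2001))   # True False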
Open the Dataset¶
monthly_mean_file = 'RASM_example_data.nc'
ds = xray.open_dataset(monthly_mean_file, decode_coords=False)
print(ds)
<xray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
comment: Output from the Variable Infiltration Capacity (VIC) model.
nco_openmp_thread_number: 1
NCO: 4.3.7
history: history deleted for brevity
Now for the heavy lifting:¶
We first have to come up with the weights:
- calculate the month length for each monthly data record
- calculate the weights using groupby('time.season')
Finally, we just need to multiply our weights by the Dataset and sum along the time dimension.
# Make a DataArray with the number of days in each month, size = len(time)
month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar='noleap'),
                              coords=[ds.time], name='month_length')
# Calculate the weights by grouping by 'time.season'.
# Conversion to float type ('astype(float)') only necessary for Python 2.x
weights = month_length.groupby('time.season') / month_length.astype(float).groupby('time.season').sum()
# Test that the sum of the weights for each season is 1.0
np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))
# Calculate the weighted average
ds_weighted = (ds * weights).groupby('time.season').sum(dim='time')
print(ds_weighted)
<xray.Dataset>
Dimensions: (season: 4, x: 275, y: 205)
Coordinates:
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* season (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
Tair (season, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
# only used for comparisons
ds_unweighted = ds.groupby('time.season').mean('time')
ds_diff = ds_weighted - ds_unweighted
# Quick plot to show the results
is_null = np.isnan(ds_unweighted['Tair'][0].values)

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(14, 12))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
    plt.sca(axes[i, 0])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_weighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')

    plt.sca(axes[i, 1])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_unweighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')

    plt.sca(axes[i, 2])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_diff['Tair'].sel(season=season).values),
                   vmin=-0.1, vmax=.1, cmap='RdBu_r')
    plt.colorbar(extend='both')

    for j in range(3):
        axes[i, j].axes.get_xaxis().set_ticklabels([])
        axes[i, j].axes.get_yaxis().set_ticklabels([])
        axes[i, j].axes.axis('tight')

    axes[i, 0].set_ylabel(season)

axes[0, 0].set_title('Weighted by DPM')
axes[0, 1].set_title('Equal Weighting')
axes[0, 2].set_title('Difference')

plt.tight_layout()
fig.suptitle('Seasonal Surface Air Temperature', fontsize=16, y=1.02)

# Wrap it into a simple function
def season_mean(ds, calendar='standard'):
    # Make a DataArray of season/year groups
    year_season = xray.DataArray(ds.time.to_index().to_period(freq='Q-NOV').to_timestamp(how='E'),
                                 coords=[ds.time], name='year_season')
    # Make a DataArray with the number of days in each month, size = len(time)
    month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar=calendar),
                                  coords=[ds.time], name='month_length')
    # Calculate the weights by grouping by 'time.season'
    weights = month_length.groupby('time.season') / month_length.groupby('time.season').sum()
    # Test that the sum of the weights for each season is 1.0
    np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))
    # Calculate the weighted average
    return (ds * weights).groupby('time.season').sum(dim='time')
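Illustrative usage of the wrapped function, assuming the monthly dataset ds opened above:
ds_seasonal = season_mean(ds, calendar='noleap')
print(ds_seasonal['Tair'].dims)  # ('season', 'y', 'x'), as in the output above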
Working with Multidimensional Coordinates¶
Author: Ryan Abernathey
Many datasets have physical coordinates which differ from their logical coordinates. Xarray provides several ways to plot and analyze such datasets.
%matplotlib inline
import numpy as np
import pandas as pd
import xarray as xr
import cartopy.crs as ccrs
from matplotlib import pyplot as plt
print("numpy version : ", np.__version__)
print("pandas version : ", pd.__version__)
print("xarray version : ", xr.version.version)
('numpy version : ', '1.11.0')
('pandas version : ', u'0.18.0')
('xarray version : ', '0.7.2-32-gf957eb8')
As an example, consider this dataset from the xarray-data repository.
! curl -L -O https://github.com/pydata/xarray-data/raw/master/RASM_example_data.nc
ds = xr.open_dataset('RASM_example_data.nc')
ds
<xarray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
yc (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
xc (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
comment: Output from the Variable Infiltration Capacity (VIC) model.
nco_openmp_thread_number: 1
NCO: 4.3.7
history: history deleted for brevity
In this example, the logical coordinates are x
and y
, while
the physical coordinates are xc
and yc
, which represent the
latitude and longitude of the data.
print(ds.xc.attrs)
print(ds.yc.attrs)
OrderedDict([(u'long_name', u'longitude of grid cell center'), (u'units', u'degrees_east'), (u'bounds', u'xv')])
OrderedDict([(u'long_name', u'latitude of grid cell center'), (u'units', u'degrees_north'), (u'bounds', u'yv')])
Plotting¶
Let’s examine these coordinate variables by plotting them.
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14,4))
ds.xc.plot(ax=ax1)
ds.yc.plot(ax=ax2)
<matplotlib.collections.QuadMesh at 0x118688fd0>
/Users/rpa/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self._edgecolors == str('face'):

Note that the variables xc
(longitude) and yc
(latitude) are
two-dimensional scalar fields.
If we try to plot the data variable Tair
, by default we get the
logical coordinates.
ds.Tair[0].plot()
<matplotlib.collections.QuadMesh at 0x11b6da890>

In order to visualize the data on a conventional latitude-longitude grid, we can take advantage of xarray’s ability to apply cartopy map projections.
plt.figure(figsize=(14,6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x='xc', y='yc', add_colorbar=False)
ax.coastlines()
ax.set_ylim([0,90]);

Multidimensional Groupby¶
The above example allowed us to visualize the data on a regular latitude-longitude grid. But what if we want to do a calculation that involves grouping over one of these physical coordinates (rather than the logical coordinates), for example, calculating the mean temperature at each latitude? This can be achieved using xarray’s groupby function, which accepts multidimensional variables. By default, groupby will use every unique value in the variable, which is probably not what we want. Instead, we can use the groupby_bins function to specify the output coordinates of the group.
# define two-degree wide latitude bins
lat_bins = np.arange(0,91,2)
# define a label for each bin corresponding to the central latitude
lat_center = np.arange(1,90,2)
# group according to those bins and take the mean
Tair_lat_mean = ds.Tair.groupby_bins('yc', lat_bins, labels=lat_center).mean()
# plot the result
Tair_lat_mean.plot()
[<matplotlib.lines.Line2D at 0x11cb92e90>]

Note that the resulting coordinate for the groupby_bins operation got the _bins suffix appended: yc_bins. This helps us distinguish it from the original multidimensional variable yc.
Installation¶
Optional dependencies¶
For netCDF and IO¶
- netCDF4: recommended if you want to use xarray for reading or writing netCDF files
- scipy: used as a fallback for reading/writing netCDF3
- pydap: used as a fallback for accessing OPeNDAP
- h5netcdf: an alternative library for reading and writing netCDF4 files that does not use the netCDF-C libraries
- pynio: for reading GRIB and other geoscience specific file formats
For accelerating xarray¶
- bottleneck: speeds up NaN-skipping and rolling window aggregations by a large factor
- cyordereddict: speeds up most internal operations with xarray data structures
For parallel computing¶
- dask.array: required for Out of core computation with dask.
Instructions¶
xarray itself is a pure Python package, but its dependencies are not. The easiest way to get everything installed is to use conda. To install xarray with its recommended dependencies using the conda command line tool:
$ conda install xarray dask netCDF4 bottleneck
We recommend using the community-maintained conda-forge channel if you need difficult-to-build dependencies such as cartopy or pynio:
$ conda install -c conda-forge xarray cartopy pynio
New releases may also appear in conda-forge before being updated in the default channel.
If you don’t use conda, be sure you have the required dependencies (numpy and pandas) installed first. Then, install xarray with pip:
$ pip install xarray
To run the test suite after installing xarray, install
py.test (pip install pytest
) and run
py.test xarray
.
Data Structures¶
DataArray¶
xarray.DataArray
is xarray’s implementation of a labeled,
multi-dimensional array. It has several key properties:
- values: a numpy.ndarray holding the array’s values
- dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
- coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata (attributes)
xarray uses dims
and coords
to enable its core metadata-aware operations.
Dimensions provide names that xarray uses instead of the axis
argument found
in many numpy functions. Coordinates enable fast label based indexing and
alignment, building on the functionality of the index
found on a pandas
DataFrame
or Series
.
DataArray objects also can have a name
and can hold arbitrary metadata in
the form of their attrs
property (an ordered dictionary). Names and
attributes are strictly for users and user-written code: xarray makes no attempt
to interpret them, and propagates them only in unambiguous cases (see FAQ,
What is your approach to metadata?).
Creating a DataArray¶
The DataArray
constructor takes:
- data: a multi-dimensional array of values (e.g., a numpy ndarray, Series, DataFrame or Panel)
- coords: a list or dictionary of coordinates
- dims: a list of dimension names. If omitted, dimension names are taken from coords if possible
- attrs: a dictionary of attributes to add to the instance
- name: a string that names the instance
In [1]: data = np.random.rand(4, 3)
In [2]: locs = ['IA', 'IL', 'IN']
In [3]: times = pd.date_range('2000-01-01', periods=4)
In [4]: foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])
In [5]: foo
Out[5]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Only data
is required; all of the other arguments will be filled
in with default values:
In [6]: xr.DataArray(data)
Out[6]:
<xarray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3
* dim_1 (dim_1) int64 0 1 2
As you can see, dimensions and coordinate arrays corresponding to each dimension are always present. This behavior is similar to pandas, which fills in index values in the same way.
Coordinates can take the following forms:
- A list of (dim, ticks[, attrs]) pairs with length equal to the number of dimensions
- A dictionary of {coord_name: coord} where the values are each a scalar value, a 1D array or a tuple. Tuples should be in the same form as above, and multiple dimensions can be supplied with the form (dims, data[, attrs]). Supplying a tuple allows for coordinates other than those corresponding to dimensions (more on these later).
As a list of tuples:
In [7]: xr.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
As a dictionary:
In [8]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
...: 'ranking': ('space', [1, 2, 3])},
...: dims=['time', 'space'])
...:
Out[8]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
ranking (space) int64 1 2 3
* space (space) |S2 'IA' 'IL' 'IN'
const int64 42
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
As a dictionary with coords across multiple dimensions:
In [9]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
...: 'ranking': (('time', 'space'), np.arange(12).reshape(4,3))},
...: dims=['time', 'space'])
...:
Out[9]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
ranking (time, space) int64 0 1 2 3 4 5 6 7 8 9 10 11
* space (space) |S2 'IA' 'IL' 'IN'
const int64 42
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
If you create a DataArray
by supplying a pandas
Series
, DataFrame
or
Panel
, any non-specified arguments in the
DataArray
constructor will be filled in from the pandas object:
In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])
In [11]: df.index.name = 'abc'
In [12]: df.columns.name = 'xyz'
In [13]: df
Out[13]:
xyz x y
abc
a 0 2
b 1 3
In [14]: xr.DataArray(df)
Out[14]:
<xarray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
[1, 3]])
Coordinates:
* abc (abc) object 'a' 'b'
* xyz (xyz) object 'x' 'y'
Xarray supports labeling coordinate values with a pandas.MultiIndex
.
While it handles multi-indexes with unnamed levels, it is recommended that you
explicitly set the names of the levels.
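For example, a minimal sketch of a DataArray labeled by a MultiIndex with named levels:
midx = pd.MultiIndex.from_arrays([['a', 'a', 'b'], [0, 1, 0]],
                                 names=['letter', 'number'])
mda = xr.DataArray(np.arange(3.0), coords=[midx], dims=['loc'])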
DataArray properties¶
Let’s take a look at the important properties on our array:
In [15]: foo.values
Out[15]:
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
In [16]: foo.dims
Out[16]: ('time', 'space')
In [17]: foo.coords
Out[17]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
In [18]: foo.attrs
Out[18]: OrderedDict()
In [19]: print(foo.name)
None
You can even modify values
in-place:
In [20]: foo.values = 1.0 * foo.values
Note
The array values in a DataArray
have a single
(homogeneous) data type. To work with heterogeneous or structured data
types in xarray, use coordinates, or put separate DataArray
objects
in a single Dataset
(see below).
Now fill in some of that missing metadata:
In [21]: foo.name = 'foo'
In [22]: foo.attrs['units'] = 'meters'
In [23]: foo
Out[23]:
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Attributes:
units: meters
The rename()
method is another option, returning a
new data array:
In [24]: foo.rename('bar')
Out[24]:
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Attributes:
units: meters
DataArray Coordinates¶
The coords
property is dict-like. Individual coordinates can be
accessed from the coordinates by name, or even by indexing the data array
itself:
In [25]: foo.coords['time']
Out[25]:
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
'2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
In [26]: foo['time']
Out[26]:
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000', '2000-01-02T00:00:00.000000000',
'2000-01-03T00:00:00.000000000', '2000-01-04T00:00:00.000000000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
These are also DataArray
objects, which contain tick-labels
for each dimension.
Coordinates can also be set or removed by using the dictionary like syntax:
In [27]: foo['ranking'] = ('space', [1, 2, 3])
In [28]: foo.coords
Out[28]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
ranking (space) int64 1 2 3
In [29]: del foo['ranking']
In [30]: foo.coords
Out[30]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Dataset¶
xarray.Dataset
is xarray’s multi-dimensional equivalent of a
DataFrame
. It is a dict-like
container of labeled arrays (DataArray
objects) with aligned
dimensions. It is designed as an in-memory representation of the data model
from the netCDF file format.
In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties:
- dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
- data_vars: a dict-like container of DataArrays corresponding to variables
- coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata
The distinction between whether a variable falls in data or coordinates (borrowed from CF conventions) is mostly semantic, and you can probably get away with ignoring it if you like: dictionary-like access on a dataset will supply variables found in either category. However, xarray does make use of the distinction for indexing and computations. Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data.
Here is an example of how we might structure a dataset for a weather forecast:

In this example, it would be natural to call temperature
and
precipitation
“data variables” and all the other arrays “coordinate
variables” because they label the points along the dimensions. (see [1] for
more background on this example).
Creating a Dataset¶
To make a Dataset
from scratch, supply dictionaries for any
variables (data_vars
), coordinates (coords
) and attributes (attrs
).
data_vars
are supplied as a dictionary with each key as the name of the variable and each
value as one of:
- A DataArray
- A tuple of the form (dims, data[, attrs])
- A pandas object
coords are supplied as a dictionary of {coord_name: coord} where the values are scalar values, arrays or tuples in the form of (dims, data[, attrs]).
Let’s create some fake data for the example we show above:
In [31]: temp = 15 + 8 * np.random.randn(2, 2, 3)
In [32]: precip = 10 * np.random.rand(2, 2, 3)
In [33]: lon = [[-99.83, -99.32], [-99.79, -99.23]]
In [34]: lat = [[42.25, 42.21], [42.63, 42.59]]
# for real use cases, its good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
In [35]: ds = xr.Dataset({'temperature': (['x', 'y', 'time'], temp),
....: 'precipitation': (['x', 'y', 'time'], precip)},
....: coords={'lon': (['x', 'y'], lon),
....: 'lat': (['x', 'y'], lat),
....: 'time': pd.date_range('2014-09-06', periods=3),
....: 'reference_time': pd.Timestamp('2014-09-05')})
....:
In [36]: ds
Out[36]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Notice that we did not explicitly include coordinates for the “x” or “y” dimensions, so they were filled in with arrays of ascending integers of the proper length.
Here we pass xarray.DataArray
objects or a pandas object as values
in the dictionary:
In [37]: xr.Dataset({'bar': foo})
Out[37]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
bar (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...
In [38]: xr.Dataset({'bar': foo.to_pandas()})
Out[38]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN'
Data variables:
bar (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...
Where a pandas object is supplied as a value, the names of its indexes are used as dimension names, and its data is aligned to any existing dimensions.
You can also create a dataset from:
- A pandas.DataFrame or pandas.Panel along its columns and items respectively, by passing it into the xarray.Dataset directly
- A pandas.DataFrame with Dataset.from_dataframe, which will additionally handle MultiIndexes. See Working with pandas
- A netCDF file on disk with open_dataset(). See Serialization and IO.
Dataset contents¶
Dataset
implements the Python dictionary interface, with
values given by xarray.DataArray
objects:
In [39]: 'temperature' in ds
Out[39]: True
In [40]: ds.keys()
Out[40]:
['precipitation',
'temperature',
'lat',
'reference_time',
'lon',
'time',
'x',
'y']
In [41]: ds['temperature']
Out[41]:
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
The valid keys include each listed coordinate and data variable.
Data and coordinate variables are also contained separately in the
data_vars
and coords
dictionary-like attributes:
In [42]: ds.data_vars
Out[42]:
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
In [43]: ds.coords
Out[43]:
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Finally, like data arrays, datasets also store arbitrary metadata in the form of attributes:
In [44]: ds.attrs
Out[44]: OrderedDict()
In [45]: ds.attrs['title'] = 'example attribute'
In [46]: ds
Out[46]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Attributes:
title: example attribute
xarray does not enforce any restrictions on attributes, but serialization to
some file formats may fail if you use objects that are not strings, numbers
or numpy.ndarray
objects.
As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:
In [47]: ds.temperature
Out[47]:
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
This is particularly useful in an exploratory context, because you can tab-complete these variable names with tools like IPython.
Dictionary like methods¶
We can update a dataset in-place using Python’s standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:
In [48]: ds = xr.Dataset()
In [49]: ds['temperature'] = (('x', 'y', 'time'), temp)
In [50]: ds['precipitation'] = (('x', 'y', 'time'), precip)
In [51]: ds.coords['lat'] = (('x', 'y'), lat)
In [52]: ds.coords['lon'] = (('x', 'y'), lon)
In [53]: ds.coords['time'] = pd.date_range('2014-09-06', periods=3)
In [54]: ds.coords['reference_time'] = pd.Timestamp('2014-09-05')
To change the variables in a Dataset
, you can use all the standard dictionary
methods, including values
, items
, __delitem__
, get
and
update()
. Note that assigning a DataArray
or pandas
object to a Dataset
variable using __setitem__
or update
will
automatically align the array(s) to the original
dataset’s indexes.
You can copy a Dataset
by calling the copy()
method. By default, the copy is shallow, so only the container will be copied:
the arrays in the Dataset
will still be stored in the same underlying
numpy.ndarray
objects. You can copy all data by calling
ds.copy(deep=True)
.
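A small sketch of the difference, using the dataset from above:
ds_shallow = ds.copy()        # new container, same underlying arrays
ds_deep = ds.copy(deep=True)  # new container and new arrays

# in-place edits through the shallow copy are visible in the original...
ds_shallow['temperature'].values[0, 0, 0] = -999.0
print(ds['temperature'].values[0, 0, 0])       # -999.0 here too
# ...but the deep copy is unaffected
print(ds_deep['temperature'].values[0, 0, 0])  # still the original value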
Transforming datasets¶
In addition to dictionary-like methods (described above), xarray has additional methods (like pandas) for transforming datasets into new objects.
For removing variables, you can select and drop an explicit list of variables
by indexing with a list of names or using the drop()
method to return a new Dataset
. These operations keep around coordinates:
In [55]: list(ds[['temperature']])
Out[55]: ['temperature', 'reference_time', 'lon', 'y', 'time', 'lat', 'x']
In [56]: list(ds[['x']])
Out[56]: ['x', 'reference_time']
In [57]: list(ds.drop('temperature'))
Out[57]: ['x', 'y', 'time', 'precipitation', 'lat', 'lon', 'reference_time']
If a dimension name is given as an argument to drop
, it also drops all
variables that use that dimension:
In [58]: list(ds.drop('time'))
Out[58]: ['x', 'y', 'lat', 'lon', 'reference_time']
As an alternative to dictionary-like modifications, you can use
assign()
and assign_coords()
.
These methods return a new dataset with additional (or replaced) values:
In [59]: ds.assign(temperature2 = 2 * ds.temperature)
Out[59]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature2 (x, y, time) float64 22.08 47.15 41.54 18.69 13.37 34.35 ...
There is also the pipe()
method that allows you to use
a method call with an external function (e.g., ds.pipe(func)
) instead of
simply calling it (e.g., func(ds)
). This allows you to write pipelines for
transforming your data (using “method chaining”) instead of writing hard-to-follow
nested function calls:
# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
In [60]: plt.plot((2 * ds.temperature.sel(x=0)).mean('y'))
Out[60]: [<matplotlib.lines.Line2D at 0x7fe8ec1ad610>]
In [61]: (ds.temperature
....: .sel(x=0)
....: .pipe(lambda x: 2 * x)
....: .mean('y')
....: .pipe(plt.plot))
....:
Out[61]: [<matplotlib.lines.Line2D at 0x7fe8ec1adb10>]
Both pipe
and assign
replicate the pandas methods of the same names
(DataFrame.pipe
and
DataFrame.assign
).
With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier to understand code, so we encourage using this approach.
Renaming variables¶
Another useful option is the rename()
method to rename
dataset variables:
In [62]: ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
Out[62]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Data variables:
temp (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precip (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
The related swap_dims()
method allows you to swap
dimension and non-dimension variables:
In [63]: ds.coords['day'] = ('time', [6, 7, 8])
In [64]: ds.swap_dims({'time': 'day'})
Out[64]:
<xarray.Dataset>
Dimensions: (day: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
time (day) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
* day (day) int64 6 7 8
Data variables:
temperature (x, y, day) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, day) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
Coordinates¶
Coordinates are ancillary variables stored for DataArray
and Dataset
objects in the coords
attribute:
In [65]: ds.coords
Out[65]:
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Unlike attributes, xarray does interpret and persist coordinates in operations that transform xarray objects.
One dimensional coordinates with a name equal to their sole dimension (marked
by *
when printing a dataset or data array) take on a special meaning in
xarray. They are used for label based indexing and alignment,
like the index
found on a pandas DataFrame
or
Series
. Indeed, these “dimension” coordinates use a
pandas.Index
internally to store their values.
Other than for indexing, xarray does not make any direct use of the values associated with coordinates. Coordinates with names not matching a dimension are not used for alignment or indexing, nor are they required to match when doing arithmetic (see Coordinates).
Modifying coordinates¶
To entirely add or remove coordinate arrays, you can use dictionary like syntax, as shown above.
To convert back and forth between data and coordinates, you can use the
set_coords()
and
reset_coords()
methods:
In [66]: ds.reset_coords()
Out[66]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
In [67]: ds.set_coords(['temperature', 'precipitation'])
Out[67]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Data variables:
*empty*
In [68]: ds['temperature'].reset_coords(drop=True)
Out[68]:
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xarray cannot store a coordinate and a variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.
Coordinates methods¶
Coordinates
objects also have a few useful methods, mostly for converting
them into dataset objects:
In [69]: ds.coords.to_dataset()
Out[69]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
* x (x) int64 0 1
day (time) int64 6 7 8
Data variables:
*empty*
The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):
In [70]: alt = xr.Dataset(coords={'z': [10], 'lat': 0, 'lon': 0})
In [71]: ds.coords.merge(alt.coords)
Out[71]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2, z: 1)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
* z (z) int64 10
Data variables:
*empty*
The coords.merge
method may be useful if you want to implement your own
binary operations that act on xarray objects. In the future, we hope to write
more helper functions so that you can easily make your functions act like
xarray’s built-in arithmetic.
Indexes¶
To convert a coordinate (or any DataArray
) into an actual
pandas.Index
, use the to_index()
method:
In [72]: ds['time'].to_index()
Out[72]: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name=u'time', freq='D')
A useful shortcut is the indexes
property (on both DataArray
and
Dataset
), which lazily constructs a dictionary whose keys are given by each
dimension and whose values are Index
objects:
In [73]: ds.indexes
Out[73]:
y: Int64Index([0, 1], dtype='int64', name=u'y')
x: Int64Index([0, 1], dtype='int64', name=u'x')
time: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name=u'time', freq='D')
[1] Latitude and longitude are 2D arrays because the dataset uses projected coordinates. reference_time refers to the reference time at which the forecast was made, rather than time, which is the valid time for which the forecast applies.
Indexing and selecting data¶
Similarly to pandas objects, xarray objects support both integer and label based lookups along each dimension. However, xarray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.
Thus in total, xarray supports four different kinds of indexing, as described below and summarized in this table:
Dimension lookup | Index lookup | DataArray syntax | Dataset syntax
---|---|---|---
Positional | By integer | arr[:, 0] | not available
Positional | By label | arr.loc[:, 'IA'] | not available
By name | By integer | arr.isel(space=0) or arr[dict(space=0)] | ds.isel(space=0) or ds[dict(space=0)]
By name | By label | arr.sel(space='IA') or arr.loc[dict(space='IA')] | ds.sel(space='IA') or ds.loc[dict(space='IA')]
Positional indexing¶
Indexing a DataArray
directly works (mostly) just like it
does for numpy arrays, except that the returned object is always another
DataArray:
In [1]: arr = xr.DataArray(np.random.rand(4, 3),
...: [('time', pd.date_range('2000-01-01', periods=4)),
...: ('space', ['IA', 'IL', 'IN'])])
...:
In [2]: arr[:2]
Out[2]:
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
In [3]: arr[0, 0]
Out[3]:
<xarray.DataArray ()>
array(0.12696983303810094)
Coordinates:
time datetime64[ns] 2000-01-01
space |S2 'IA'
In [4]: arr[:, [2, 1]]
Out[4]:
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.26 , 0.967],
[ 0.336, 0.377],
[ 0.123, 0.84 ],
[ 0.448, 0.373]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IN' 'IL'
Attributes are persisted in all indexing operations.
Warning
Positional indexing deviates from NumPy when indexing with multiple
arrays like arr[[0, 1], [0, 1]]
, as described in Orthogonal (outer) vs. vectorized indexing.
See Pointwise indexing for how to achieve this functionality in
xarray.
xarray also supports label-based indexing, just like pandas. Because
we use a pandas.Index
under the hood, label based indexing is very
fast. To do label based indexing, use the loc
attribute:
In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]:
<xarray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xarray is inclusive of both the start and stop bounds.
Setting values with label based indexing is also supported:
In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10
In [7]: arr
Out[7]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Indexing with labeled dimensions¶
With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:
Use a dictionary as the argument for positional or label based array indexing:
# index by integer array indices
In [8]: arr[dict(space=0, time=slice(None, 2))]
Out[8]:
<xarray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
Out[9]:
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
Use the sel() and isel() convenience methods:
# index by integer array indices
In [10]: arr.isel(space=0, time=slice(None, 2))
Out[10]:
<xarray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
Out[11]:
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
The arguments to these methods can be any objects that could index the array
along the dimension given by the keyword, e.g., labels for an individual value,
Python slice()
objects or 1-dimensional arrays.
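For example, all of these (using the array from above) are valid indexers:
arr.sel(space='IA')                              # a single label
arr.sel(space=['IA', 'IL'])                      # an array of labels
arr.sel(time=slice('2000-01-01', '2000-01-02'))  # a slice of labels
arr.isel(time=slice(None, 2))                    # a slice of integers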
Note
We would love to be able to do indexing with labeled dimension names inside
brackets, but unfortunately, Python does not yet support indexing with
keyword arguments like arr[space=0].
Warning
Do not try to assign values when using any of the indexing methods isel
,
isel_points
, sel
or sel_points
:
# DO NOT do this
arr.isel(space=0) = 0
Depending on whether the underlying numpy indexing returns a copy or a view, this assignment will either raise an error or fail silently, modifying a temporary copy instead of the original array. Instead, you should use normal index assignment:
# this is safe
arr[dict(space=0)] = 0
Pointwise indexing¶
Pointwise indexing in xarray supports indexing along multiple labeled dimensions
using list-like objects. While isel()
performs
orthogonal indexing, the isel_points()
method
provides similar numpy indexing behavior as if you were using multiple
lists to index an array (e.g. arr[[0, 1], [0, 1]]
):
# index by integer array indices
In [12]: da = xr.DataArray(np.arange(56).reshape((7, 8)), dims=['x', 'y'])
In [13]: da
Out[13]:
<xarray.DataArray (x: 7, y: 8)>
array([[ 0, 1, 2, ..., 5, 6, 7],
[ 8, 9, 10, ..., 13, 14, 15],
[16, 17, 18, ..., 21, 22, 23],
...,
[32, 33, 34, ..., 37, 38, 39],
[40, 41, 42, ..., 45, 46, 47],
[48, 49, 50, ..., 53, 54, 55]])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6
* y (y) int64 0 1 2 3 4 5 6 7
In [14]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0])
Out[14]:
<xarray.DataArray (points: 3)>
array([ 0, 9, 48])
Coordinates:
y (points) int64 0 1 0
x (points) int64 0 1 6
* points (points) int64 0 1 2
There is also sel_points()
, which analogously
allows you to do point-wise indexing by label:
In [15]: times = pd.to_datetime(['2000-01-03', '2000-01-02', '2000-01-01'])
In [16]: arr.sel_points(space=['IA', 'IL', 'IN'], time=times)
Out[16]:
<xarray.DataArray (points: 3)>
array([ 0.451, 0.377, -10. ])
Coordinates:
time (points) datetime64[ns] 2000-01-03 2000-01-02 2000-01-01
space (points) |S2 'IA' 'IL' 'IN'
* points (points) int64 0 1 2
The equivalent pandas method to sel_points
is
lookup()
.
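As a rough sketch of that equivalence, via the DataFrame returned by to_pandas():
df = arr.to_pandas()                  # 2D DataArray -> pandas DataFrame
df.lookup(times, ['IA', 'IL', 'IN'])  # array([ 0.451, 0.377, -10. ])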
Dataset indexing¶
We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:
In [17]: ds = arr.to_dataset(name='foo')
In [18]: ds.isel(space=[0], time=[0])
Out[18]:
<xarray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Data variables:
foo (time, space) float64 0.127
In [19]: ds.sel(time='2000-01-01')
Out[19]:
<xarray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (space) float64 0.127 -10.0 -10.0
In [20]: ds2 = da.to_dataset(name='bar')
In [21]: ds2.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[21]:
<xarray.Dataset>
Dimensions: (points: 3)
Coordinates:
y (points) int64 0 1 0
x (points) int64 0 1 6
* points (points) int64 0 1 2
Data variables:
bar (points) int64 0 9 48
Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with labeled dimensions:
In [22]: ds[dict(space=[0], time=[0])]
Out[22]:
<xarray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Data variables:
foo (time, space) float64 0.127
In [23]: ds.loc[dict(time='2000-01-01')]
Out[23]:
<xarray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (space) float64 0.127 -10.0 -10.0
Using indexing to assign values to a subset of a dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.
Dropping labels¶
The drop() method returns a new object with the listed index labels along a dimension dropped:
In [24]: ds.drop(['IN', 'IL'], dim='space')
Out[24]:
<xarray.Dataset>
Dimensions: (space: 1, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA'
Data variables:
foo (time, space) float64 0.127 0.8972 0.4514 0.543
drop is both a Dataset and DataArray method.
Nearest neighbor lookups¶
The label based selection methods sel(), reindex() and reindex_like() all support method and tolerance keyword arguments. The method parameter enables nearest neighbor (inexact) lookups using 'pad', 'backfill' or 'nearest':
In [25]: data = xr.DataArray([1, 2, 3], dims='x')
In [26]: data.sel(x=[1.1, 1.9], method='nearest')
Out[26]:
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
* x (x) int64 1 2
In [27]: data.sel(x=0.1, method='backfill')
Out[27]:
<xarray.DataArray ()>
array(2)
Coordinates:
x int64 1
In [28]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[28]:
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
* x (x) float64 0.5 1.0 1.5 2.0 2.5
Tolerance limits the maximum distance for valid matches with an inexact lookup:
In [29]: data.reindex(x=[1.1, 1.5], method='nearest', tolerance=0.2)
Out[29]:
<xarray.DataArray (x: 2)>
array([ 2., nan])
Coordinates:
* x (x) float64 1.1 1.5
Using method='nearest' or a scalar argument with .sel() requires pandas version 0.16 or newer. Using tolerance requires pandas version 0.17 or newer.
The method parameter is not yet supported if any of the arguments to .sel() is a slice object:
In [30]: data.sel(x=slice(1, 3), method='nearest')
NotImplementedError
However, you don’t need to use method to do inexact slicing. Slicing already returns all values inside the range (inclusive), as long as the index labels are monotonic increasing:
In [31]: data.sel(x=slice(0.9, 3.1))
Out[31]:
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
* x (x) int64 1 2
Indexing axes with monotonic decreasing labels also works, as long as the slice or .loc arguments are also decreasing:
In [32]: reversed_data = data[::-1]
In [33]: reversed_data.loc[3.1:0.9]
Out[33]:
<xarray.DataArray (x: 2)>
array([3, 2])
Coordinates:
* x (x) int64 2 1
Masking with where¶
Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xarray, use where():
In [34]: arr2 = xr.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])
In [35]: arr2.where(arr2.x + arr2.y < 4)
Out[35]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., nan],
[ 8., 9., nan, nan],
[ 12., nan, nan, nan]])
Coordinates:
* x (x) int64 0 1 2 3
* y (y) int64 0 1 2 3
This is particularly useful for ragged indexing of multi-dimensional data, e.g., to apply a 2D mask to an image. Note that where follows all the usual xarray broadcasting and alignment rules for binary operations (e.g., +) between the object being indexed and the condition, as described in Computation:
In [36]: arr2.where(arr2.y < 2)
Out[36]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0., 1., nan, nan],
[ 4., 5., nan, nan],
[ 8., 9., nan, nan],
[ 12., 13., nan, nan]])
Coordinates:
* x (x) int64 0 1 2 3
* y (y) int64 0 1 2 3
By default where maintains the original size of the data. For cases where the selected data size is much smaller than the original data, use of the option drop=True clips coordinate elements that are fully masked:
In [37]: arr2.where(arr2.y < 2, drop=True)
Out[37]:
<xarray.DataArray (x: 4, y: 2)>
array([[ 0., 1.],
[ 4., 5.],
[ 8., 9.],
[ 12., 13.]])
Coordinates:
* x (x) int64 0 1 2 3
* y (y) int64 0 1
Multi-level indexing¶
Just like pandas, advanced indexing on multi-level indexes is possible with loc and sel. You can slice a multi-index by providing multiple indexers, i.e., a tuple of slices, labels, lists of labels, or any selector allowed by pandas:
In [38]: midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
....: names=('one', 'two'))
....:
In [39]: mda = xr.DataArray(np.random.rand(6, 3),
....: [('x', midx), ('y', range(3))])
....:
In [40]: mda
Out[40]:
<xarray.DataArray (x: 6, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.352, 0.229, 0.777],
[ 0.595, 0.138, 0.853],
[ 0.236, 0.146, 0.59 ],
[ 0.574, 0.061, 0.59 ],
[ 0.245, 0.34 , 0.985]])
Coordinates:
* x (x) object ('a', 0) ('a', 1) ('b', 0) ('b', 1) ('c', 0) ('c', 1)
* y (y) int64 0 1 2
In [41]: mda.sel(x=(list('ab'), [0]))
Out[41]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.595, 0.138, 0.853]])
Coordinates:
* x (x) object ('a', 0) ('b', 0)
* y (y) int64 0 1 2
You can also select multiple elements by providing a list of labels or tuples or a slice of tuples:
In [42]: mda.sel(x=[('a', 0), ('b', 1)])
Out[42]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.236, 0.146, 0.59 ]])
Coordinates:
* x (x) object ('a', 0) ('b', 1)
* y (y) int64 0 1 2
Additionally, xarray supports dictionaries:
In [43]: mda.sel(x={'one': 'a', 'two': 0})
Out[43]:
<xarray.DataArray (y: 3)>
array([ 0.129, 0.86 , 0.82 ])
Coordinates:
x object ('a', 0)
* y (y) int64 0 1 2
In [44]: mda.loc[{'one': 'a'}, ...]
Out[44]:
<xarray.DataArray (two: 2, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.352, 0.229, 0.777]])
Coordinates:
* two (two) int64 0 1
* y (y) int64 0 1 2
Like pandas, xarray handles partial selection on a multi-index (level drop). As shown in the last example above, it also renames the dimension / coordinate when the multi-index is reduced to a single index.
Unlike pandas, xarray does not guess whether you provide index levels or dimensions when using loc in some ambiguous cases. For example, for mda.loc[{'one': 'a', 'two': 0}] and mda.loc['a', 0] xarray always interprets (‘one’, ‘two’) and (‘a’, 0) as the names and labels of the 1st and 2nd dimension, respectively. You must specify all dimensions or use the ellipsis in the loc specifier, e.g. in the example above, mda.loc[{'one': 'a', 'two': 0}, :] or mda.loc[('a', 0), ...].
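As a minimal sketch, both unambiguous forms from the example above can be spelled out explicitly (they select the same data from mda):
# a tuple plus ellipsis is read as a multi-index label along 'x'
mda.loc[('a', 0), ...]
# a dict plus an explicit slice over the remaining 'y' dimension
mda.loc[{'one': 'a', 'two': 0}, :]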
Multi-dimensional indexing¶
xarray does not yet support efficient routines for generalized multi-dimensional indexing or regridding. However, we are definitely interested in adding support for this in the future (see GH475 for the ongoing discussion).
Copies vs. views¶
Whether array indexing returns a view or a copy of the underlying data depends on the nature of the labels. For positional (integer) indexing, xarray follows the same rules as NumPy:
- Positional indexing with only integers and slices returns a view.
- Positional indexing with arrays or lists returns a copy.
The rules for label based indexing are more complex:
- Label-based indexing with only slices returns a view.
- Label-based indexing with arrays returns a copy.
- Label-based indexing with scalars returns a view or a copy, depending on whether the corresponding positional indexer can be represented as an integer or a slice object. The exact rules are determined by pandas.
Whether data is a copy or a view is more predictable in xarray than in pandas, so unlike pandas, xarray does not produce SettingWithCopy warnings. However, you should still avoid assignment with chained indexing.
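As a minimal sketch of the positional rules above (which follow numpy semantics), modifying a slice-based view propagates to the original array, while an array-based copy does not:
import numpy as np
import xarray as xr

data = xr.DataArray(np.arange(4.0), dims='x')

view = data[:2]       # positional indexing with a slice: a view
view[0] = -1.0        # also modifies `data`

copy = data[[0, 1]]   # positional indexing with a list: a copy
copy[0] = 99.0        # leaves `data` unchanged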
Orthogonal (outer) vs. vectorized indexing¶
Indexing with xarray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xarray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s broadcasting rules to vectorize indexers. This means you can do indexing like this, which would require slightly more awkward syntax with numpy arrays:
In [45]: arr[arr['time.day'] > 1, arr['space'] != 'IL']
Out[45]:
<xarray.DataArray (time: 3, space: 2)>
array([[ 0.897, 0.336],
[ 0.451, 0.123],
[ 0.543, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IN'
This is a much simpler model than numpy’s advanced indexing. If you would like to do advanced-style array indexing in xarray, you have several options:
- Pointwise indexing
- Masking with where
- Index the underlying NumPy array directly using .values, e.g.,
In [46]: arr.values[arr.values > 0.5]
Out[46]: array([ 0.897, 0.84 , 0.543])
Align and reindex¶
xarray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinates corresponding to dimensions. The original values are subset to the index labels still found in the new labels, and values corresponding to new labels not found in the original object are in-filled with NaN.
xarray operations that combine multiple objects generally automatically align their arguments to share the same indexes. However, manual alignment can be useful for greater control and for increased performance.
To reindex a particular dimension, use reindex():
In [47]: arr.reindex(space=['IA', 'CA'])
Out[47]:
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.127, nan],
[ 0.897, nan],
[ 0.451, nan],
[ 0.543, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'CA'
The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:
In [48]: foo = arr.rename('foo')
In [49]: baz = (10 * arr[:2, :2]).rename('baz')
In [50]: baz
Out[50]:
<xarray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.27 , -100. ],
[ 8.972, 3.767]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL'
Reindexing foo with baz selects out the first two values along each dimension:
In [51]: foo.reindex_like(baz)
Out[51]:
<xarray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.127, -10. ],
[ 0.897, 0.377]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL'
The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:
In [52]: baz.reindex_like(foo)
Out[52]:
<xarray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.27 , -100. , nan],
[ 8.972, 3.767, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN'
The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:
In [53]: xr.align(foo, baz, join='inner')
Out[53]:
(<xarray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.127, -10. ],
[ 0.897, 0.377]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL',
<xarray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.27 , -100. ],
[ 8.972, 3.767]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL')
In [54]: xr.align(foo, baz, join='outer')
Out[54]:
(<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN',
<xarray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.27 , -100. , nan],
[ 8.972, 3.767, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN')
Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:
In [55]: ds
Out[55]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...
In [56]: ds.reindex_like(baz)
Out[56]:
<xarray.Dataset>
Dimensions: (space: 2, time: 2)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL'
Data variables:
foo (time, space) float64 0.127 -10.0 0.8972 0.3767
In [57]: other = xr.DataArray(['a', 'b', 'c'], dims='other')
# this is a no-op, because there are no shared dimension names
In [58]: ds.reindex_like(other)
Out[58]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...
Computation¶
The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.
Basic array math¶
Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:
In [1]: arr = xr.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr - 3
Out[2]:
<xarray.DataArray (x: 2, y: 3)>
array([[-2.5308877 , -3.28286334, -4.5090585 ],
[-4.13563237, -1.78788797, -3.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In [3]: abs(arr)
Out[3]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , 0.28286334, 1.5090585 ],
[ 1.13563237, 1.21211203, 0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
You can also use any of numpy’s or scipy’s many ufunc functions directly on a DataArray:
In [4]: np.sin(arr)
Out[4]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.45209466, -0.27910634, -0.99809483],
[-0.90680094, 0.9363595 , -0.17234978]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data arrays also implement many numpy.ndarray methods:
In [5]: arr.round(2)
Out[5]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.47, -0.28, -1.51],
[-1.14, 1.21, -0.17]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In [6]: arr.T
Out[6]:
<xarray.DataArray (y: 3, x: 2)>
array([[ 0.4691123 , -1.13563237],
[-0.28286334, 1.21211203],
[-1.5090585 , -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Missing values¶
xarray objects borrow the isnull(), notnull(), count(), dropna() and fillna() methods for working with missing data from pandas:
In [7]: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'])
In [8]: x.isnull()
Out[8]:
<xarray.DataArray (x: 5)>
array([False, False, True, True, False], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [9]: x.notnull()
Out[9]:
<xarray.DataArray (x: 5)>
array([ True, True, False, False, True], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [10]: x.count()
Out[10]:
<xarray.DataArray ()>
array(3)
In [11]: x.dropna(dim='x')
Out[11]:
<xarray.DataArray (x: 3)>
array([ 0., 1., 2.])
Coordinates:
* x (x) int64 0 1 4
In [12]: x.fillna(-1)
Out[12]:
<xarray.DataArray (x: 5)>
array([ 0., 1., -1., -1., 2.])
Coordinates:
* x (x) int64 0 1 2 3 4
Like pandas, xarray uses the float value np.nan (not-a-number) to represent missing values.
Aggregation¶
Aggregation methods have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):
In [13]: arr.sum(dim='x')
Out[13]:
<xarray.DataArray (y: 3)>
array([-0.66652007, 0.92924868, -1.68227315])
Coordinates:
* y (y) int64 10 20 30
In [14]: arr.std(['x', 'y'])
Out[14]:
<xarray.DataArray ()>
array(0.9156385956757354)
In [15]: arr.min()
Out[15]:
<xarray.DataArray ()>
array(-1.5090585031735124)
If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:
In [16]: arr.get_axis_num('y')
Out[16]: 1
These operations automatically skip missing values, like in pandas:
In [17]: xr.DataArray([1, 2, np.nan, 3]).mean()
Out[17]:
<xarray.DataArray ()>
array(2.0)
If desired, you can disable this behavior by invoking the aggregation method with skipna=False.
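For example (a minimal sketch), the missing value then propagates into the result:
# with skipna=False, NaN is not skipped and the mean becomes NaN
xr.DataArray([1, 2, np.nan, 3]).mean(skipna=False)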
Rolling window operations¶
DataArray objects include a rolling() method. This method supports rolling window aggregation:
In [18]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
....: dims=('x', 'y'))
....:
In [19]: arr
Out[19]:
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. , 0.5, 1. , 1.5, 2. ],
[ 2.5, 3. , 3.5, 4. , 4.5],
[ 5. , 5.5, 6. , 6.5, 7. ]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
rolling() is applied along one dimension using the name of the dimension as a key (e.g. y) and the window size as the value (e.g. 3). We get back a Rolling object:
In [20]: arr.rolling(y=3)
Out[20]: DataArrayRolling [window->3,center->False,dim->y]
The label position and minimum number of periods in the rolling window are controlled by the center and min_periods arguments:
In [21]: arr.rolling(y=3, min_periods=2, center=True)
Out[21]: DataArrayRolling [window->3,min_periods->2,center->True,dim->y]
Aggregation and summary methods can be applied directly to the Rolling object:
In [22]: r = arr.rolling(y=3)
In [23]: r.mean()
Out[23]:
<xarray.DataArray (y: 5, x: 3)>
array([[ nan, nan, nan],
[ nan, nan, nan],
[ 0.5, 3. , 5.5],
[ 1. , 3.5, 6. ],
[ 1.5, 4. , 6.5]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
In [24]: r.reduce(np.std)
Out[24]:
<xarray.DataArray (y: 5, x: 3)>
array([[ nan, nan, nan],
[ nan, nan, nan],
[ 0.40824829, 0.40824829, 0.40824829],
[ 0.40824829, 0.40824829, 0.40824829],
[ 0.40824829, 0.40824829, 0.40824829]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Note that rolling window aggregations are much faster (both asymptotically and because they avoid a loop in Python) when bottleneck is installed. Otherwise, we fall back to a slower, pure Python implementation.
Finally, we can manually iterate through Rolling
objects:
In [25]: for label, arr_window in r:
   ....:     # arr_window is a view of arr
   ....:     pass
   ....:
Broadcasting by dimension name¶
DataArray objects automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with np.reshape() or np.newaxis.
This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:
In [26]: a = xr.DataArray([1, 2], [('x', ['a', 'b'])])
In [27]: a
Out[27]:
<xarray.DataArray (x: 2)>
array([1, 2])
Coordinates:
* x (x) |S1 'a' 'b'
In [28]: b = xr.DataArray([-1, -2, -3], [('y', [10, 20, 30])])
In [29]: b
Out[29]:
<xarray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
* y (y) int64 10 20 30
With xarray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:
In [30]: a * b
Out[30]:
<xarray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
[-2, -4, -6]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Moreover, dimensions are always reordered to the order in which they first appeared:
In [31]: c = xr.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])
In [32]: c
Out[32]:
<xarray.DataArray (y: 3, x: 2)>
array([[0, 1],
[2, 3],
[4, 5]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [33]: a + c
Out[33]:
<xarray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
[3, 5, 7]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
This means, for example, that you can always subtract an array from its transpose:
In [34]: c - c.T
Out[34]:
<xarray.DataArray (y: 3, x: 2)>
array([[0, 0],
[0, 0],
[0, 0]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
You can explicitly broadcast xarray data structures by using the broadcast() function:
# a2 and b2 both end up with the union of dimensions ('x', 'y')
a2, b2 = xr.broadcast(a, b)
Automatic alignment¶
xarray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations. Similarly to pandas, this alignment is automatic for binary arithmetic operations. Note that unlike pandas, the result of a binary operation uses the intersection (not the union) of coordinate labels:
In [35]: arr + arr[:1]
Out[35]:
<xarray.DataArray (x: 1, y: 5)>
array([[ 0., 1., 2., 3., 4.]])
Coordinates:
* x (x) int64 0
* y (y) int64 0 1 2 3 4
If the result would be empty, an error is raised instead:
In [36]: arr[:2] + arr[2:]
ValueError: no overlapping labels for some dimensions: ['x']
Before loops or performance critical code, it’s a good idea to align arrays explicitly (e.g., by putting them in the same Dataset or using align()) to avoid the overhead of repeated alignment with each operation. See Align and reindex for more details.
Note
There is no automatic alignment between arguments when performing in-place arithmetic operations such as +=. You will need to use manual alignment. This ensures in-place arithmetic never needs to modify data types.
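A minimal sketch of such manual alignment, reusing align() from Align and reindex (the variable names here are illustrative):
# align both operands onto their common labels first, then operate in-place
left, right = xr.align(arr, arr[:1], join='inner')
left += right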
Coordinates¶
Although index coordinates are aligned, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalar coordinates:
In [37]: arr[0]
Out[37]:
<xarray.DataArray (y: 5)>
array([ 0. , 0.5, 1. , 1.5, 2. ])
Coordinates:
x int64 0
* y (y) int64 0 1 2 3 4
In [38]: arr[1]
Out[38]:
<xarray.DataArray (y: 5)>
array([ 2.5, 3. , 3.5, 4. , 4.5])
Coordinates:
x int64 1
* y (y) int64 0 1 2 3 4
# notice that the scalar coordinate 'x' is silently dropped
In [39]: arr[1] - arr[0]
Out[39]:
<xarray.DataArray (y: 5)>
array([ 2.5, 2.5, 2.5, 2.5, 2.5])
Coordinates:
* y (y) int64 0 1 2 3 4
Still, xarray will persist other coordinates in arithmetic, as long as there are no conflicting values:
# only one argument has the 'x' coordinate
In [40]: arr[0] + 1
Out[40]:
<xarray.DataArray (y: 5)>
array([ 1. , 1.5, 2. , 2.5, 3. ])
Coordinates:
x int64 0
* y (y) int64 0 1 2 3 4
# both arguments have the same 'x' coordinate
In [41]: arr[0] - arr[0]
Out[41]:
<xarray.DataArray (y: 5)>
array([ 0., 0., 0., 0., 0.])
Coordinates:
x int64 0
* y (y) int64 0 1 2 3 4
Math with datasets¶
Datasets support arithmetic operations by automatically looping over all data variables:
In [42]: ds = xr.Dataset({'x_and_y': (('x', 'y'), np.random.randn(3, 5)),
....: 'x_only': ('x', np.random.randn(3))},
....: coords=arr.coords)
....:
In [43]: ds > 0
Out[43]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* y (y) int64 0 1 2 3 4
* x (x) int64 0 1 2
Data variables:
x_only (x) bool True False True
x_and_y (x, y) bool True False False False False True True False False ...
Datasets support most of the same methods found on data arrays:
In [44]: ds.mean(dim='x')
Out[44]:
<xarray.Dataset>
Dimensions: (y: 5)
Coordinates:
* y (y) int64 0 1 2 3 4
Data variables:
x_only float64 -0.2799
x_and_y (y) float64 0.2553 0.08145 -0.4308 -1.411 -0.2989
In [45]: abs(ds)
Out[45]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* y (y) int64 0 1 2 3 4
* x (x) int64 0 1 2
Data variables:
x_only (x) float64 0.1136 1.478 0.525
x_and_y (x, y) float64 0.1192 1.044 0.8618 2.105 0.4949 1.072 0.7216 ...
Unfortunately, a limitation of the current version of numpy means that we cannot override ufuncs for datasets, because datasets cannot be written as a single array [1]. apply() works around this limitation by applying the given function to each variable in the dataset:
In [46]: ds.apply(np.sin)
Out[46]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Data variables:
x_only (x) float64 0.1134 -0.9957 0.5012
x_and_y (x, y) float64 0.1189 -0.8645 -0.759 -0.8609 -0.475 0.8781 ...
Datasets also use looping over variables for broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset:
In [47]: ds + arr
Out[47]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Data variables:
x_only (x, y) float64 0.1136 0.6136 1.114 1.614 2.114 1.022 1.522 ...
x_and_y (x, y) float64 0.1192 -0.5442 0.1382 -0.6046 1.505 3.572 3.722 ...
Arithmetic between two datasets matches data variables of the same name:
In [48]: ds2 = xr.Dataset({'x_and_y': 0, 'x_only': 100})
In [49]: ds - ds2
Out[49]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Data variables:
x_only (x) float64 -99.89 -101.5 -99.48
x_and_y (x, y) float64 0.1192 -1.044 -0.8618 -2.105 -0.4949 1.072 ...
Similarly to index based alignment, the result has the intersection of all matching variables, and ValueError is raised if the result would be empty.
[1] In some future version of NumPy, we should be able to override ufuncs for datasets by making use of __numpy_ufunc__.
GroupBy: split-apply-combine¶
xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:
- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.
Split¶
Let’s create a simple example dataset:
In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
...: coords={'x': [10, 20, 30, 40],
...: 'letters': ('x', list('abba'))})
...:
In [2]: arr = ds['foo']
In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:
In [4]: ds.groupby('letters')
Out[4]: <xarray.core.groupby.DatasetGroupBy at 0x7fe8e40e4d90>
This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:
In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}
You can also iterate over groups in (label, group) pairs:
In [6]: list(ds.groupby('letters'))
Out[6]:
[('a', <xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 40
letters (x) |S1 'a' 'a'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
('b', <xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 20 30
letters (x) |S1 'b' 'b'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]
Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.
Binning¶
Sometimes you don’t want to use all the unique values to determine the groups but instead want to “bin” the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the groupby_bins() method.
In [7]: x_bins = [0,25,50]
In [8]: ds.groupby_bins('x', x_bins).groups
Out[8]: {'(0, 25]': [0, 1], '(25, 50]': [2, 3]}
The binning is implemented via pandas.cut, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with strings using set notation to precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:
In [9]: x_bin_labels = [12.5,37.5]
In [10]: ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups
Out[10]: {12.5: [0, 1], 37.5: [2, 3]}
Apply¶
To apply a function to each group, you can use the flexible apply() method. The resulting objects are automatically concatenated back together along the group axis:
In [11]: def standardize(x):
....: return (x - x.mean()) / x.std()
....:
In [12]: arr.groupby('letters').apply(standardize)
Out[12]:
<xarray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.23 , 1.937, -0.726],
[ 1.42 , -0.46 , -0.607],
[-0.191, 1.214, -1.376],
[ 0.339, -0.302, -0.019]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:
In [13]: arr.groupby('letters').mean(dim='x')
Out[13]:
<xarray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.335, 0.67 , 0.354],
[ 0.674, 0.609, 0.23 ]])
Coordinates:
* y (y) int64 0 1 2
* letters (letters) object 'a' 'b'
Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:
In [14]: ds.groupby('x').std()
Out[14]:
<xarray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Data variables:
foo (x) float64 0.3684 0.2554 0.2931 0.06957
First and last¶
There are two special aggregation operations that are currently only found on groupby objects: first and last. These provide the first or last example of values for each group along the grouped dimension:
In [15]: ds.groupby('letters').first()
Out[15]:
<xarray.Dataset>
Dimensions: (letters: 2, y: 3)
Coordinates:
* y (y) int64 0 1 2
* letters (letters) object 'a' 'b'
Data variables:
foo (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362
By default, they skip missing values (control this with skipna).
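last works analogously, and skipna can be passed to either method (a minimal sketch):
# last value per group; keep missing values with skipna=False
ds.groupby('letters').last()
ds.groupby('letters').first(skipna=False)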
Grouped arithmetic¶
GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:
In [16]: alt = arr.groupby('letters').mean()
In [17]: alt
Out[17]:
<xarray.DataArray 'foo' (letters: 2)>
array([ 0.453, 0.504])
Coordinates:
* letters (letters) object 'a' 'b'
In [18]: ds.groupby('letters') - alt
Out[18]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Data variables:
foo (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 ...
This last line is roughly equivalent to the following:
results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xr.concat(results, dim='x')
Squeezing¶
When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:
In [19]: next(iter(arr.groupby('x')))
Out[19]:
(10, <xarray.DataArray 'foo' (y: 3)>
array([ 0.127, 0.967, 0.26 ])
Coordinates:
x int64 10
letters |S1 'a'
* y (y) int64 0 1 2)
In [20]: next(iter(arr.groupby('x', squeeze=False)))
Out[20]:
(10, <xarray.DataArray 'foo' (x: 1, y: 3)>
array([[ 0.127, 0.967, 0.26 ]])
Coordinates:
* x (x) int64 10
letters (x) |S1 'a'
* y (y) int64 0 1 2)
Although xarray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged. You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
Multidimensional Grouping¶
Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the CF conventions. Xarray supports groupby operations over multidimensional coordinate variables:
In [21]: da = xr.DataArray([[0,1],[2,3]],
....: coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
....: 'lat': (['ny','nx'], [[10,10],[20,20]] ),},
....: dims=['ny','nx'])
....:
In [22]: da
Out[22]:
<xarray.DataArray (ny: 2, nx: 2)>
array([[0, 1],
[2, 3]])
Coordinates:
lat (ny, nx) int64 10 10 20 20
lon (ny, nx) int64 30 40 40 50
* ny (ny) int64 0 1
* nx (nx) int64 0 1
In [23]: da.groupby('lon').sum()
Out[23]:
<xarray.DataArray (lon: 3)>
array([0, 3, 3])
Coordinates:
* lon (lon) int64 30 40 50
In [24]: da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
Out[24]:
<xarray.DataArray (ny: 2, nx: 2)>
array([[ 0. , -0.5],
[ 0.5, 0. ]])
Coordinates:
lat (ny, nx) int64 10 10 20 20
lon (ny, nx) int64 30 40 40 50
* ny (ny) int64 0 1
* nx (nx) int64 0 1
Because multidimensional groups have the ability to generate a very large number of bins, coarse-binning via groupby_bins() may be desirable:
In [25]: da.groupby_bins('lon', [0,45,50]).sum()
Out[25]:
<xarray.DataArray (lon_bins: 2)>
array([3, 3])
Coordinates:
* lon_bins (lon_bins) object '(0, 45]' '(45, 50]'
Reshaping and reorganizing data¶
These methods allow you to reorganize your data by changing dimensions, array shape, or the order of values.
Reordering dimensions¶
To reorder dimensions on a DataArray or across all variables on a Dataset, use transpose() or the .T property:
In [1]: ds = xr.Dataset({'foo': (('x', 'y', 'z'), [[[42]]]), 'bar': (('y', 'z'), [[24]])})
In [2]: ds.transpose('y', 'z', 'x')
Out[2]:
<xarray.Dataset>
Dimensions: (x: 1, y: 1, z: 1)
Coordinates:
* x (x) int64 0
* y (y) int64 0
* z (z) int64 0
Data variables:
foo (y, z, x) int64 42
bar (y, z) int64 24
In [3]: ds.T
Out[3]:
<xarray.Dataset>
Dimensions: (x: 1, y: 1, z: 1)
Coordinates:
* x (x) int64 0
* y (y) int64 0
* z (z) int64 0
Data variables:
foo (z, y, x) int64 42
bar (z, y) int64 24
Converting between datasets and arrays¶
To convert from a Dataset to a DataArray, use to_array():
In [4]: arr = ds.to_array()
In [5]: arr
Out[5]:
<xarray.DataArray (variable: 2, x: 1, y: 1, z: 1)>
array([[[[42]]],
[[[24]]]])
Coordinates:
* y (y) int64 0
* x (x) int64 0
* z (z) int64 0
* variable (variable) |S3 'foo' 'bar'
This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.
To convert back from a DataArray to a Dataset, use to_dataset():
In [6]: arr.to_dataset(dim='variable')
Out[6]:
<xarray.Dataset>
Dimensions: (x: 1, y: 1, z: 1)
Coordinates:
* y (y) int64 0
* x (x) int64 0
* z (z) int64 0
Data variables:
foo (x, y, z) int64 42
bar (x, y, z) int64 24
The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:
In [7]: ds2 = xr.Dataset({'a': 0, 'b': ('x', [3, 4, 5])})
# the input dataset has 4 elements
In [8]: ds2
Out[8]:
<xarray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
Data variables:
a int64 0
b (x) int64 3 4 5
# the resulting array has 6 elements
In [9]: ds2.to_array()
Out[9]:
<xarray.DataArray (variable: 2, x: 3)>
array([[0, 0, 0],
[3, 4, 5]])
Coordinates:
* variable (variable) |S1 'a' 'b'
* x (x) int64 0 1 2
Otherwise, the result could not be represented as an orthogonal array.
If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:
In [10]: arr.to_dataset(name='combined')
Out[10]:
<xarray.Dataset>
Dimensions: (variable: 2, x: 1, y: 1, z: 1)
Coordinates:
* y (y) int64 0
* x (x) int64 0
* z (z) int64 0
* variable (variable) |S3 'foo' 'bar'
Data variables:
combined (variable, x, y, z) int64 42 24
Stack and unstack¶
As part of xarray’s nascent support for pandas.MultiIndex, we have implemented stack() and unstack() methods for combining or splitting dimensions:
In [11]: array = xr.DataArray(np.random.randn(2, 3),
....: coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
....:
In [12]: stacked = array.stack(z=('x', 'y'))
In [13]: stacked
Out[13]:
<xarray.DataArray (z: 6)>
array([ 0.469, -0.283, -1.509, -1.136, 1.212, -0.173])
Coordinates:
* z (z) object ('a', 0) ('a', 1) ('a', 2) ('b', 0) ('b', 1) ('b', 2)
In [14]: stacked.unstack('z')
Out[14]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
[-1.136, 1.212, -0.173]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 0 1 2
These methods are modeled on the pandas.DataFrame methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns.
Like DataFrame.unstack, xarray’s unstack always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with NaN in the resulting object:
In [15]: stacked2 = stacked[::2]
In [16]: stacked2
Out[16]:
<xarray.DataArray (z: 3)>
array([ 0.469, -1.509, 1.212])
Coordinates:
* z (z) object ('a', 0) ('a', 2) ('b', 1)
In [17]: stacked2.unstack('z')
Out[17]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469, nan, -1.509],
[ nan, 1.212, nan]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 0 1 2
However, xarray’s stack has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare:
In [18]: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=['x', 'y'])
In [19]: array.stack(z=('x', 'y'))
Out[19]:
<xarray.DataArray (z: 4)>
array([ nan, 1., 2., 3.])
Coordinates:
* z (z) object (0, 0) (0, 1) (1, 0) (1, 1)
In [20]: array.to_pandas().stack()
Out[20]:
x y
0 1 1.0
1 0 2.0
1 3.0
dtype: float64
We departed from pandas’s behavior here because predictable shapes for new array dimensions are necessary for Out of core computation with dask.
Shift and roll¶
To adjust coordinate labels, you can use the shift() and roll() methods:
In [21]: array = xr.DataArray([1, 2, 3, 4], dims='x')
In [22]: array.shift(x=2)
Out[22]:
<xarray.DataArray (x: 4)>
array([ nan, nan, 1., 2.])
Coordinates:
* x (x) int64 0 1 2 3
In [23]: array.roll(x=2)
Out[23]:
<xarray.DataArray (x: 4)>
array([3, 4, 1, 2])
Coordinates:
* x (x) int64 2 3 0 1
Combining data¶
- For combining datasets or data arrays along a dimension, see concatenate.
- For combining datasets with different variables, see merge.
Concatenate¶
To combine arrays along an existing or new dimension into a larger array, you can use concat(). concat takes an iterable of DataArray or Dataset objects, as well as a dimension name, and concatenates along that dimension:
In [1]: arr = xr.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr[:, :1]
Out[2]:
<xarray.DataArray (x: 2, y: 1)>
array([[ 0.4691123 ],
[-1.13563237]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10
# this resembles how you would use np.concatenate
In [3]: xr.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In addition to combining along an existing dimension, concat can create a new dimension by stacking lower dimensional arrays together:
In [4]: arr[0]
Out[4]:
<xarray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xr.concat([arr[0], arr[1]], 'x')
Out[5]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
If the second argument to concat is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension:
In [6]: xr.concat([arr[0], arr[1]], 'new_dim')
Out[6]:
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 0 1
The second argument to concat can also be an Index or DataArray object as well as a string, in which case it is used to label the values along the new dimension:
In [7]: xr.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[7]:
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 -90 -100
Of course, concat also works on Dataset objects:
In [8]: ds = arr.to_dataset(name='foo')
In [9]: xr.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[9]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
concat() has a number of options which provide deeper control over which variables are concatenated and how it handles conflicting variables between datasets. With the default parameters, xarray will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily using Out of core computation with dask.
Merge¶
To combine variables and coordinates between multiple DataArray and/or Dataset objects, use merge(). It can merge a list of Dataset, DataArray or dictionaries of objects convertible to DataArray objects:
In [10]: xr.merge([ds, ds.rename({'foo': 'bar'})])
Out[10]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [11]: xr.merge([xr.DataArray(n, name='var%d' % n) for n in range(5)])
Out[11]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
var0 int64 0
var1 int64 1
var2 int64 2
var3 int64 3
var4 int64 4
If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the union of all index coordinates:
In [12]: other = xr.Dataset({'bar': ('x', [1, 2, 3, 4]), 'x': list('abcd')})
In [13]: xr.merge([ds, other])
Out[13]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) object 'a' 'b' 'c' 'd'
* y (y) int64 10 20 30
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 nan ...
bar (x) int64 1 2 3 4
This ensures that merge is non-destructive. xarray.MergeError is raised if you attempt to merge two variables with the same name but different values:
In [14]: xr.merge([ds, ds + 1])
MergeError: conflicting values for variable 'foo' on objects to be combined:
first value: <xarray.Variable (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
second value: <xarray.Variable (x: 2, y: 3)>
array([[ 1.4691123 , 0.71713666, -0.5090585 ],
[-0.13563237, 2.21211203, 0.82678535]])
The same non-destructive merging between DataArray index coordinates is used in the Dataset constructor:
In [15]: xr.Dataset({'a': arr[:-1], 'b': arr[1:]})
Out[15]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
Data variables:
a (x, y) float64 0.4691 -0.2829 -1.509 nan nan nan
b (x, y) float64 nan nan nan -1.136 1.212 -0.1732
Update¶
In contrast to merge, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:
In [16]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[16]:
<xarray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.
update also performs automatic alignment if necessary. Unlike merge, it maintains the alignment of the original array instead of merging indexes:
In [17]: ds.update(other)
Out[17]:
<xarray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x) int64 1 2
The exact same alignment logic is used when setting a variable with __setitem__ syntax:
In [18]: ds['baz'] = xr.DataArray([9, 9, 9, 9, 9], coords=[('x', list('abcde'))])
In [19]: ds.baz
Out[19]:
<xarray.DataArray 'baz' (x: 2)>
array([9, 9])
Coordinates:
* x (x) object 'a' 'b'
Equals and identical¶
xarray objects can be compared by using the equals(), identical() and broadcast_equals() methods. These methods are used by the optional compat argument on concat and merge.
equals checks dimension names, indexes and array values:
In [20]: arr.equals(arr.copy())
Out[20]: True
identical also checks attributes, and the name of each object:
In [21]: arr.identical(arr.rename('bar'))
Out[21]: False
broadcast_equals does a more relaxed form of equality check that allows variables to have different dimensions, as long as values are constant along those new dimensions:
In [22]: left = xr.Dataset(coords={'x': 0})
In [23]: right = xr.Dataset({'x': [0, 0, 0]})
In [24]: left.broadcast_equals(right)
Out[24]: True
Like pandas objects, two xarray objects are still equal or identical if they have missing values marked by NaN in the same locations.
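For example (a minimal sketch):
# NaN in the same location does not break equality
nan_arr = xr.DataArray([0.0, np.nan])
nan_arr.equals(nan_arr.copy())  # True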
In contrast, the == operation performs element-wise comparison (like numpy):
In [25]: arr == arr.copy()
Out[25]:
<xarray.DataArray (x: 2, y: 3)>
array([[ True, True, True],
[ True, True, True]], dtype=bool)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Note that NaN does not compare equal to NaN in element-wise comparison; you may need to deal with missing values explicitly.
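One sketch of handling missing values explicitly is to combine the element-wise comparison with isnull(), reusing nan_arr from the sketch above:
# treat matching NaNs as equal in an element-wise comparison
(nan_arr == nan_arr) | (nan_arr.isnull() & nan_arr.isnull())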
Time series data¶
A major use case for xarray is multi-dimensional time-series data. Accordingly, we’ve copied many of the features that make working with time-series data in pandas such a joy to xarray. In most cases, we rely on pandas for the core functionality.
Creating datetime64 data¶
xarray uses the numpy dtypes datetime64[ns] and timedelta64[ns] to represent datetime data, which offer vectorized (if sometimes buggy) operations with numpy and smooth integration with pandas. To convert to or create regular arrays of datetime64 data, we recommend using pandas.to_datetime() and pandas.date_range():
In [1]: pd.to_datetime(['2000-01-01', '2000-02-02'])
Out[1]: DatetimeIndex(['2000-01-01', '2000-02-02'], dtype='datetime64[ns]', freq=None)
In [2]: pd.date_range('2000-01-01', periods=365)
Out[2]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
'2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
'2000-01-09', '2000-01-10',
...
'2000-12-21', '2000-12-22', '2000-12-23', '2000-12-24',
'2000-12-25', '2000-12-26', '2000-12-27', '2000-12-28',
'2000-12-29', '2000-12-30'],
dtype='datetime64[ns]', length=365, freq='D')
Alternatively, you can supply arrays of Python datetime objects. These get converted automatically when used as arguments in xarray objects:
In [3]: import datetime
In [4]: xr.Dataset({'time': datetime.datetime(2000, 1, 1)})
Out[4]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
time datetime64[ns] 2000-01-01
When reading or writing netCDF files, xarray automatically decodes datetime and timedelta arrays using CF conventions (that is, by using a units attribute like 'days since 2000-01-01'). You can manually decode arrays in this form by passing a dataset to decode_cf():
In [5]: attrs = {'units': 'hours since 2000-01-01'}
In [6]: ds = xr.Dataset({'time': ('time', [0, 1, 2, 3], attrs)})
In [7]: xr.decode_cf(ds)
Out[7]:
<xarray.Dataset>
Dimensions: (time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
*empty*
One unfortunate limitation of using datetime64[ns] is that it limits the native representation of dates to those that fall between the years 1678 and 2262. When a netCDF file contains dates outside of these bounds, dates will be returned as arrays of netcdftime.datetime objects.
Datetime indexing¶
xarray borrows powerful indexing machinery from pandas (see Indexing and selecting data).
This allows for several useful and succinct forms of indexing, particularly for datetime64 data. For example, we support indexing with strings for single items and with the slice object:
In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24)
In [9]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})
In [10]: ds.sel(time='2000-01')
Out[10]:
<xarray.Dataset>
Dimensions: (time: 744)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
foo (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10'))
Out[11]:
<xarray.Dataset>
Dimensions: (time: 240)
Coordinates:
* time (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ...
Data variables:
foo (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...
You can also select a particular time by indexing with a datetime.time object:
In [12]: ds.sel(time=datetime.time(12))
Out[12]:
<xarray.Dataset>
Dimensions: (time: 365)
Coordinates:
* time (time) datetime64[ns] 2000-01-01T12:00:00 2000-01-02T12:00:00 ...
Data variables:
foo (time) int64 12 36 60 84 108 132 156 180 204 228 252 276 300 ...
For more details, read the pandas documentation.
Datetime components¶
xarray supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”:
In [13]: ds['time.month']
Out[13]:
<xarray.DataArray 'month' (time: 8760)>
array([ 1, 1, 1, ..., 12, 12, 12], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
In [14]: ds['time.dayofyear']
Out[14]:
<xarray.DataArray 'dayofyear' (time: 8760)>
array([ 1, 1, 1, ..., 365, 365, 365], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
xarray adds 'season' to the list of datetime components supported by pandas:
In [15]: ds['time.season']
Out[15]:
<xarray.DataArray 'season' (time: 8760)>
array(['DJF', 'DJF', 'DJF', ..., 'DJF', 'DJF', 'DJF'],
dtype='|S3')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
The set of valid seasons consists of ‘DJF’, ‘MAM’, ‘JJA’ and ‘SON’, labeled by the first letters of the corresponding months.
You can use these shortcuts with both Datasets and DataArray coordinates.
Resampling and grouped operations¶
Datetime components couple particularly well with grouped operations (see GroupBy: split-apply-combine) for analyzing features that repeat over time. Here’s how to calculate the mean by time of day:
In [16]: ds.groupby('time.hour').mean()
Out[16]:
<xarray.Dataset>
Dimensions: (hour: 24)
Coordinates:
* hour (hour) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
foo (hour) float64 4.368e+03 4.369e+03 4.37e+03 4.371e+03 4.372e+03 ...
For upsampling or downsampling temporal resolutions, xarray offers a resample() method building on the core functionality offered by the pandas method of the same name. Resample uses essentially the same API as resample in pandas.
For example, we can downsample our dataset from hourly to 6-hourly:
In [17]: ds.resample('6H', dim='time', how='mean')
Out[17]:
<xarray.Dataset>
Dimensions: (time: 1460)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
foo (time) float64 2.5 8.5 14.5 20.5 26.5 32.5 38.5 44.5 50.5 56.5 ...
Resample also works for upsampling, in which case intervals without any values are marked by NaN:
In [18]: ds.resample('30Min', 'time')
Out[18]:
<xarray.Dataset>
Dimensions: (time: 17519)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ...
Data variables:
foo (time) float64 0.0 nan 1.0 nan 2.0 nan 3.0 nan 4.0 nan 5.0 nan ...
Of course, all of these resampling and groupby operations work on both Dataset and DataArray objects with any number of additional dimensions.
For more examples of using grouped operations on a time dimension, see Toy weather data.
Working with pandas¶
One of the most important features of xarray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built in to pandas itself or provided by pandas-aware libraries such as Seaborn.
Hierarchical and tidy data¶
Tabular data is easiest to work with when it meets the criteria for tidy data:
- Each column holds a different variable.
- Each row holds a different observation.
In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.
Dataset and DataFrame¶
To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:
In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
...: coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
...: 'along_x': ('x', np.random.randn(2)),
...: 'scalar': 123})
...:
In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) |S1 'a' 'b' 'c'
* x (x) int64 10 20
scalar int64 123
along_x (x) float64 0.1192 -1.044
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [3]: df = ds.to_dataframe()
In [4]: df
Out[4]:
foo scalar along_x
x y
10 a 0.469112 123 0.119209
b -0.282863 123 0.119209
c -1.509059 123 0.119209
20 a -1.135632 123 -1.044236
b 1.212112 123 -1.044236
c -0.173215 123 -1.044236
We see that each variable and coordinate in the Dataset is now a column in the
DataFrame, with the exception of indexes which are in the index.
To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
To create a Dataset from a DataFrame, use the from_dataframe() class method or the equivalent pandas.DataFrame.to_xarray method (pandas v0.18 or later):
In [5]: xr.Dataset.from_dataframe(df)
Out[5]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
scalar (x, y) int64 123 123 123 123 123 123
along_x (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
Notice that the dimensions of variables in the Dataset have now expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.
Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.
DataArray and Series¶
DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:
In [6]: s = ds['foo'].to_series()
In [7]: s
Out[7]:
x y
10 a 0.469112
b -0.282863
c -1.509059
20 a -1.135632
b 1.212112
c -0.173215
Name: foo, dtype: float64
# or equivalently, with Series.to_xarray()
In [8]: xr.DataArray.from_series(s)
Out[8]:
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
[-1.136, 1.212, -0.173]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:
In [9]: s[::2]
Out[9]:
x y
10 a 0.469112
c -1.509059
20 b 1.212112
Name: foo, dtype: float64
In [10]: s[::2].to_xarray()
Out[10]:
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, nan, -1.509],
[ nan, 1.212, nan]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Multi-dimensional data¶
Tidy data is great, but sometimes you want to preserve dimensions instead of automatically stacking them into a MultiIndex. DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality (i.e., a 1D array is converted to a Series, 2D to DataFrame and 3D to Panel):
In [11]: arr = xr.DataArray(np.random.randn(2, 3),
....: coords=[('x', [10, 20]), ('y', ['a', 'b', 'c'])])
....:
In [12]: df = arr.to_pandas()
In [13]: df
Out[13]:
y a b c
x
10 -0.861849 -2.104569 -0.494929
20 1.071804 0.721555 -0.706771
To perform the inverse operation of converting any pandas objects into a data array with the same shape, simply use the DataArray constructor:
In [14]: xr.DataArray(df)
Out[14]:
<xarray.DataArray (x: 2, y: 3)>
array([[-0.862, -2.105, -0.495],
[ 1.072, 0.722, -0.707]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Both the DataArray and Dataset constructors directly convert pandas objects into xarray objects with the same shape. This means that they preserve all use of multi-indexes:
In [15]: index = pd.MultiIndex.from_arrays([['a', 'a', 'b'], [0, 1, 2]],
....: names=['one', 'two'])
....:
In [16]: df = pd.DataFrame({'x': 1, 'y': 2}, index=index)
In [17]: ds = xr.Dataset(df)
In [18]: ds
Out[18]:
<xarray.Dataset>
Dimensions: (dim_0: 3)
Coordinates:
* dim_0 (dim_0) object ('a', 0) ('a', 1) ('b', 2)
Data variables:
x (dim_0) int64 1 1 1
y (dim_0) int64 2 2 2
However, you will need to set dimension names explicitly, either with the dims argument in the DataArray constructor or by calling rename on the new object.
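For example, a minimal sketch (the name 'location' is purely illustrative):
# give the pandas-derived dimension a meaningful name
ds = xr.Dataset(df).rename({'dim_0': 'location'})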
Transitioning from pandas.Panel to xarray¶
Panel, pandas’s data structure for 3D arrays, has always been a second class data structure compared to the Series and DataFrame. To allow pandas developers to focus more on its core functionality built around the DataFrame, pandas plans to eventually deprecate Panel.
xarray has most of Panel’s features, a more explicit API (particularly around indexing), and the ability to scale to >3 dimensions with the same interface.
As discussed elsewhere in the docs, there are two primary data structures in xarray: DataArray and Dataset. You can imagine a DataArray as an n-dimensional pandas Series (i.e. a single typed array), and a Dataset as the DataFrame equivalent (i.e. a dict of aligned DataArray objects).
So you can represent a Panel in two ways:
- As a 3-dimensional DataArray,
- Or as a Dataset containing a number of 2-dimensional DataArray objects.
Let’s take a look:
In [19]: panel = pd.Panel(np.random.rand(2, 3, 4), items=list('ab'), major_axis=list('mno'),
....: minor_axis=pd.date_range(start='2000', periods=4, name='date'))
....:
In [20]: panel
Out[20]:
<class 'pandas.core.panel.Panel'>
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: a to b
Major_axis axis: m to o
Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-04 00:00:00
As a DataArray:
# or equivalently, with Panel.to_xarray()
In [21]: xr.DataArray(panel)
Out[21]:
<xarray.DataArray (dim_0: 2, dim_1: 3, date: 4)>
array([[[ 0.595, 0.138, 0.853, 0.236],
[ 0.146, 0.59 , 0.574, 0.061],
[ 0.59 , 0.245, 0.34 , 0.985]],
[[ 0.92 , 0.038, 0.862, 0.754],
[ 0.405, 0.344, 0.171, 0.395],
[ 0.642, 0.275, 0.462, 0.871]]])
Coordinates:
* dim_0 (dim_0) object 'a' 'b'
* dim_1 (dim_1) object 'm' 'n' 'o'
* date (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
As you can see, there are three dimensions (each is also a coordinate). Two of the axes of the panel were unnamed, so have been assigned dim_0 and dim_1 respectively, while the third retains its name date.
As a Dataset:
In [22]: xr.Dataset(panel)
Out[22]:
<xarray.Dataset>
Dimensions: (date: 4, dim_0: 3)
Coordinates:
* dim_0 (dim_0) object 'm' 'n' 'o'
* date (date) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
Data variables:
a (dim_0, date) float64 0.5948 0.1376 0.8529 0.2355 0.1462 0.5899 ...
b (dim_0, date) float64 0.9195 0.03777 0.8615 0.7536 0.4052 ...
Here, there are two data variables, each representing a DataFrame on panel’s items axis, and labelled as such. Each variable is a 2D array of the respective values along the items dimension.
While the xarray docs are relatively complete, a few items stand out for Panel users:
- A DataArray’s data is stored as a numpy array, and so can only contain a single type. As a result, a Panel that contains DataFrame objects with multiple types will be converted to dtype=object. A Dataset of multiple DataArray objects each with its own dtype will allow original types to be preserved.
- Indexing is similar to pandas, but more explicit and leverages xarray’s naming of dimensions.
- Because of those features, making much higher dimensional data is very practical.
- Variables in Dataset objects can use a subset of its dimensions. For example, you can have one dataset with Person x Score x Time, and another with Person x Score.
- You can use coordinates both for dimensions and for variables which label the data, so you could have a coordinate Age that labels the Person dimension of a Dataset of Person x Score x Time (see the sketch below).
While xarray may take some getting used to, it’s worth it! If anything is unclear, please post an issue on GitHub or StackOverflow, and we’ll endeavor to respond to the specific case or improve the general docs.
Serialization and IO¶
xarray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.
Pickle¶
The simplest way to serialize an xarray object is to use Python’s built-in pickle module:
In [1]: import pickle
In [2]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
...: coords={'x': [10, 20, 30, 40],
...: 'y': pd.date_range('2000-01-01', periods=5),
...: 'z': ('x', list('abcd'))})
...:
# use the highest protocol (-1) because it is way faster than the default
# text based pickle format
In [3]: pkl = pickle.dumps(ds, protocol=-1)
In [4]: pickle.loads(pkl)
Out[4]:
<xarray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int64 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
Pickle support is important because it doesn’t require any external libraries
and lets you use xarray objects with Python modules like
multiprocessing
. However, there are two important caveats:
- To simplify serialization, xarray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
- Pickle will only work as long as the internal data structure of xarray objects remains unchanged. Because the internal design of xarray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xarray will work in future versions.
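The same calls work with a file on disk; a minimal sketch (the filename is arbitrary):
# round-trip the dataset through a pickle file
with open('saved_dataset.pkl', 'wb') as f:
    pickle.dump(ds, f, protocol=-1)
with open('saved_dataset.pkl', 'rb') as f:
    ds_roundtripped = pickle.load(f)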
Dictionary¶
Serializing an xarray object to a Python dictionary is also simple. We can convert a Dataset (or a DataArray) to a dict using to_dict():
In [5]: d = ds.to_dict()
In [6]: d
Out[6]:
{'attrs': {},
'coords': {'x': {'attrs': {}, 'data': [10, 20, 30, 40], 'dims': ('x',)},
'y': {'attrs': {},
'data': [datetime.datetime(2000, 1, 1, 0, 0),
datetime.datetime(2000, 1, 2, 0, 0),
datetime.datetime(2000, 1, 3, 0, 0),
datetime.datetime(2000, 1, 4, 0, 0),
datetime.datetime(2000, 1, 5, 0, 0)],
'dims': ('y',)},
'z': {'attrs': {}, 'data': ['a', 'b', 'c', 'd'], 'dims': ('x',)}},
'data_vars': {'foo': {'attrs': {},
'data': [[0.12696983303810094,
0.966717838482003,
0.26047600586578334,
0.8972365243645735,
0.37674971618967135],
[0.33622174433445307,
0.45137647047539964,
0.8402550832613813,
0.12310214428849964,
0.5430262020470384],
[0.37301222522143085,
0.4479968246859435,
0.12944067971751294,
0.8598787065799693,
0.8203883631195572],
[0.35205353914802473,
0.2288873043216132,
0.7767837505077176,
0.5947835894851238,
0.1375535565632705]],
'dims': ('x', 'y')}},
'dims': {'x': 4, 'y': 5}}
We can create a new xarray object from a dict using from_dict():
In [7]: ds_dict = xr.Dataset.from_dict(d)
In [8]: ds_dict
Out[8]:
<xarray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int64 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
Dictionary support allows for flexible use of xarray objects. It doesn’t require external libraries, and dicts can easily be pickled or converted to JSON or GeoJSON. Note that all array values are converted to (nested) lists, so the resulting dicts might be quite large.
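For example, a sketch of dumping the dict to a JSON file (datetime values are not JSON-serializable by default, so we stringify them with default=str; the filename is arbitrary):
import json

with open('saved_dataset.json', 'w') as f:
    json.dump(d, f, default=str)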
netCDF¶
Currently, the only disk based serialization format that xarray directly supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xarray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.
Reading and writing netCDF files with xarray requires the netCDF4-Python library or scipy to be installed.
We can save a Dataset to disk using the Dataset.to_netcdf method:
In [9]: ds.to_netcdf('saved_on_disk.nc')
By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the format and engine arguments.
We can load netCDF files to create a new Dataset using open_dataset():
In [10]: ds_disk = xr.open_dataset('saved_on_disk.nc')
In [11]: ds_disk
Out[11]:
<xarray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int32 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
A dataset can also be loaded from or written to a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument. When writing multiple groups in one file, pass mode='a' to to_netcdf to ensure that each call does not delete the file.
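A sketch of how this might look (the group names and filename are arbitrary, and we assume to_netcdf accepts a group argument mirroring open_dataset):
# write two datasets into separate groups of one file
ds.to_netcdf('groups.nc', group='foo/bar')
ds.to_netcdf('groups.nc', group='baz', mode='a')  # mode='a' keeps existing groups
# read one group back
ds_bar = xr.open_dataset('groups.nc', group='/foo/bar')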
Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.
It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched.
Tip
xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a dataset entirely into memory by invoking the load() method.
Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:
# this automatically closes the dataset after use
In [12]: with xr.open_dataset('saved_on_disk.nc') as ds:
....: print(ds.keys())
....:
['y', 'x', 'foo', 'z']
Although xarray provides reasonable support for incremental reads of files on disk, it does not support incremental writes, which can be a useful strategy for dealing with datasets too big to fit into memory. Instead, xarray integrates with dask.array (see Out of core computation with dask), which provides a fully featured engine for streaming computation.
Reading encoded data¶
NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “add_offset” attributes). If the argument decode_cf=True (default) is given to open_dataset, xarray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
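For example, a minimal sketch of both options:
# turn off all CF decoding, or only the datetime decoding
raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)
no_times = xr.open_dataset('saved_on_disk.nc', decode_times=False)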
You can view this encoding information (among others) in the DataArray.encoding attribute:
In [13]: ds_disk['y'].encoding
Out[13]:
{'calendar': u'proleptic_gregorian',
'chunksizes': None,
'complevel': 0,
'contiguous': True,
'dtype': dtype('float64'),
'fletcher32': False,
'least_significant_digit': None,
'shuffle': False,
'source': 'saved_on_disk.nc',
'units': u'days since 2000-01-01 00:00:00',
'zlib': False}
Note that all operations that manipulate variables other than indexing will remove encoding information.
Writing encoded data¶
Conversely, you can customize how xarray writes netCDF files on disk by providing explicit encodings for each dataset variable. The encoding argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xarray to faithfully read encoded data back into memory.
It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xarray will write data to disk using a default encoding, or the options in the encoding attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.
In the file on disk, these encodings are saved as attributes on each variable, which allows xarray and other CF-compliant tools that work with netCDF files to read the data correctly.
Scaling and type conversions¶
These encoding options work on any version of the netCDF file format:
- dtype: Any valid NumPy dtype or string convertible to a dtype, e.g., 'int16' or 'float32'. This controls the type of the data written on disk.
- _FillValue: Values of NaN in xarray variables are remapped to this value when saved on disk. This is important when converting floating point arrays with missing values to integers on disk, because NaN is not a valid value for integer dtypes.
- scale_factor and add_offset: Used to convert from encoded data on disk to the decoded data in memory, according to the formula decoded = scale_factor * encoded + add_offset.
These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable foo with a precision of 0.1 in 16-bit integers while converting NaN to -9999, we would use encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}. Compression and decompression with such discretization is extremely fast.
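Putting it together, the full call might look like this sketch (the filename is arbitrary):
ds.to_netcdf('discretized.nc',
             encoding={'foo': {'dtype': 'int16',
                               'scale_factor': 0.1,
                               '_FillValue': -9999}})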
Chunk based compression¶
zlib, complevel, fletcher32, contiguous and chunksizes can be used for enabling netCDF4/HDF5’s chunk based compression, as described in the documentation for createVariable for netCDF4-Python. This only works for netCDF4 files and thus requires using format='netCDF4' and either engine='netcdf4' or engine='h5netcdf'.
Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5’s compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.
Time units¶
The units and calendar attributes control how xarray serializes datetime64 and timedelta64 arrays to datasets on disk as numeric values. The units encoding should be a string like 'days since 1900-01-01' for datetime64 data or a string like 'days' for timedelta64 data. calendar should be one of the calendar types supported by netCDF4-python: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.
By default, xarray uses the ‘proleptic_gregorian’ calendar and units of the smallest time difference between values, with a reference time of the first time value.
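For example, a sketch of pinning the on-disk units and calendar of the datetime coordinate y from the dataset above (assuming the encoding argument also applies to coordinate variables):
ds.to_netcdf('times.nc',
             encoding={'y': {'units': 'days since 2000-01-01',
                             'calendar': 'proleptic_gregorian'}})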
OPeNDAP¶
xarray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.
For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by IRI at Columbia:
In [14]: remote_data = xr.open_dataset(
....: 'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
....: decode_times=False)
....:
In [15]: remote_data
Out[15]:
<xarray.Dataset>
Dimensions: (T: 1422, X: 1405, Y: 621)
Coordinates:
* X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
* Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
Data variables:
ppt (T, Y, X) float64 ...
tdmean (T, Y, X) float64 ...
tmax (T, Y, X) float64 ...
tmin (T, Y, X) float64 ...
Attributes:
Conventions: IRIDL
expires: 1375315200
Note
Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xarray’s automatic decoding to fail. The way to work around this is to either set decode_cf=False in open_dataset to turn off all use of CF conventions, or by only disabling the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xarray does not expect (the integer 360 instead of a string like '360_day').
We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:
In [16]: tmax = remote_data['tmax'][:500, ::3, ::3]
In [17]: tmax
Out[17]:
<xarray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Coordinates:
* Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ...
* X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ...
Attributes:
pointwidth: 120
standard_name: air_temperature
units: Celsius_scale
expires: 1443657600
# the data is downloaded automatically when we make the plot
In [18]: tmax[0].plot()

Formats supported by PyNIO¶
xarray can also read GRIB, HDF4 and other file formats supported by PyNIO, if PyNIO is installed. To use PyNIO to read such files, supply engine='pynio' to open_dataset().
We recommend installing PyNIO via conda:
conda install -c dbrown pynio
Combining multiple files¶
NetCDF files are often encountered in collections, e.g., with different files
corresponding to different model runs. xarray can straightforwardly combine such
files into a single Dataset by making use of concat()
.
Note
Version 0.5 includes experimental support for manipulating datasets that
don’t fit into memory with dask. If you have dask installed, you can open
multiple files simultaneously using open_mfdataset()
:
xr.open_mfdataset('my/files/*.nc')
This function automatically concatenates and merges multiple files into a single xarray dataset. For more details, see Reading and writing data.
For example, here’s how we could approximate MFDataset from the netCDF4 library:
from glob import glob
import xarray as xr

def read_netcdfs(files, dim):
    # glob expands paths with * to a list of files, like the unix shell
    paths = sorted(glob(files))
    datasets = [xr.open_dataset(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

read_netcdfs('/all/my/files/*.nc', dim='time')
This function will work in many cases, but it’s not very robust. First, it never closes files, which means it will fail once you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.
Here’s a slightly more sophisticated example of how to remedy these deficiencies:
def read_netcdfs(files, dim, transform_func=None):
    def process_one_path(path):
        # use a context manager, to ensure the file gets closed after use
        with xr.open_dataset(path) as ds:
            # transform_func should do some sort of selection or
            # aggregation
            if transform_func is not None:
                ds = transform_func(ds)
            # load all data from the transformed dataset, to ensure we can
            # use it after closing each original file
            ds.load()
            return ds

    paths = sorted(glob(files))
    datasets = [process_one_path(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

# here we suppose we only care about the combined mean of each file;
# you might also use indexing operations like .sel to subset datasets
read_netcdfs('/all/my/files/*.nc', dim='time',
             transform_func=lambda ds: ds.mean())
This pattern works well and is very robust. We’ve used similar code to process tens of thousands of files constituting 100s of GB of data.
Out of core computation with dask¶
xarray integrates with dask to support streaming computation on datasets that don’t fit into memory.
Currently, dask is an entirely optional feature for xarray. However, the benefits of using dask are sufficiently strong that dask may become a required dependency in a future version of xarray.
For a full example of how to use xarray’s dask integration, read the blog post introducing xarray and dask.
What is a dask array?¶

Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.
Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
The actual computation is controlled by a multi-processing or thread pool, which allows dask to take full advantage of the multiple processors available on most modern computers.
For more details on dask, read its documentation.
Reading and writing data¶
The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function.
In [1]: ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...
In this example latitude and longitude do not appear in the chunks dict, so only one chunk will be used along those dimensions. It is also entirely equivalent to open a dataset using open_dataset and then chunk the data using the chunk method, e.g., xr.open_dataset('example-data.nc').chunk({'time': 10}).
To open multiple files simultaneously, use open_mfdataset():
xr.open_mfdataset('my/files/*.nc')
This function will automatically concatenate and merge datasets into one in the simple cases that it understands (see auto_combine() for the full disclaimer). By default, open_mfdataset will chunk each netCDF file into a single dask array; again, supply the chunks argument to control the size of the resulting dask arrays. In more complex cases, you can open each file individually using open_dataset and merge the result, as described in Combining data.
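For example, a sketch combining both:
# concatenate many files and chunk each one along its time axis
ds = xr.open_mfdataset('my/files/*.nc', chunks={'time': 10})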
You’ll notice that printing a dataset still shows a preview of array values, even if they are actually dask arrays. We can do this quickly with dask because we only need to compute the first few values (typically from the first block). To reveal the true nature of an array, print a DataArray:
In [3]: ds.temperature
Out[3]:
<xarray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
dask.array<example..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
Once you’ve manipulated a dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way.
Using dask with xarray¶
Nearly all existing xarray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with dask arrays. When you load data as a dask array in an xarray data structure, almost all xarray operations will keep it as a dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Converting a dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xarray will automatically load coordinate labels into memory.
The easiest way to convert an xarray data structure from lazy dask arrays into eager, in-memory numpy arrays is to use the load() method:
In [4]: ds.load()
Out[4]:
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...
You can also access values, which will always be a numpy array:
In [5]: ds.temperature.values
Out[5]:
array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, ..., 8.726e-01, -1.538e+00],
...
# truncated for brevity
Explicit conversion by wrapping a DataArray with np.asarray also works:
In [6]: np.asarray(ds.temperature)
Out[6]:
array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, ..., 8.726e-01, -1.538e+00],
...
With the current version of dask, there is no automatic alignment of chunks when performing operations between dask arrays with different chunk sizes. If your computation involves multiple dask arrays with different chunks, you may need to explicitly rechunk each array to ensure compatibility. With xarray, both converting data to dask arrays and changing the chunk sizes of dask arrays are done with the chunk() method:
In [7]: rechunked = ds.chunk({'latitude': 100, 'longitude': 100})
You can view the size of existing chunks on an array by viewing the chunks attribute:
In [8]: rechunked.chunks
Out[8]: Frozen(SortedKeysDict({'latitude': (100, 80), 'longitude': (100, 100, 100, 60), 'time': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5)}))
If chunk sizes are not consistent between all the arrays in a dataset along a particular dimension, an exception is raised when you try to access .chunks.
Note
In the future, we would like to enable automatic alignment of dask chunksizes (but not the other way around). We might also require that all arrays in a dataset share the same chunking alignment. Neither of these are currently done.
NumPy ufuncs like np.sin currently only work on eagerly evaluated arrays (this will change with the next major NumPy release). We have provided replacements that also work on all xarray objects, including those that store lazy dask arrays, in the xarray.ufuncs module:
In [9]: import xarray.ufuncs as xu
In [10]: xu.sin(rechunked)
Out[10]:
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4521 -0.2791 -0.9981 ...
To access dask arrays directly, use the new DataArray.data attribute. This attribute exposes array data either as a dask array or as a numpy array, depending on whether it has been loaded into dask or not:
In [11]: ds.temperature.data
Out[11]: dask.array<xarray-..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Note
In the future, we may extend .data
to support other “computable” array
backends beyond dask and numpy (e.g., to support sparse arrays).
Chunking and performance¶
The chunks parameter has critical performance implications when using dask arrays. If your chunks are too small, queueing up operations will be extremely slow, because dask will translate each operation into a huge number of operations mapped across chunks. Computation on dask arrays with small chunks can also be slow, because each operation on a chunk has some fixed overhead from the Python interpreter and the dask task executor.
Conversely, if your chunks are too big, some of your computation may be wasted, because dask only computes results one chunk at a time.
A good rule of thumb is to create arrays with a minimum chunk size of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunk sizes.
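As a back-of-the-envelope sketch with the example dataset from above (the chunk size here is illustrative):
# 100 time steps x 180 lats x 360 lons = 6,480,000 elements per chunk,
# roughly 50 MB as float64
ds = xr.open_dataset('example-data.nc', chunks={'time': 100})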
Plotting¶
Introduction¶
Labeled data enables expressive computations. These same labels can also be used to easily create informative plots.
xarray’s plotting capabilities are centered around xarray.DataArray objects. To plot xarray.Dataset objects, simply access the relevant DataArrays, i.e. dset['var1'].
Here we focus mostly on arrays 2d or larger. If your data fits
nicely into a pandas DataFrame then you’re better off using one of the more
developed tools there.
xarray plotting functionality is a thin wrapper around the popular matplotlib library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two. Matplotlib must be installed before xarray can plot.
For more extensive plotting applications consider the following projects:
- Seaborn: “provides a high-level interface for drawing attractive statistical graphics.” Integrates well with pandas.
- Holoviews: “Composable, declarative data structures for building even complex visualizations easily.” Works for 2d datasets.
- Cartopy: Provides cartographic tools.
Imports¶
The following imports are necessary for all of the examples.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import matplotlib.pyplot as plt
In [4]: import xarray as xr
For these examples we’ll use the North American air temperature dataset.
In [5]: airtemps = xr.tutorial.load_dataset('air_temperature')
In [6]: airtemps
Out[6]:
<xarray.Dataset>
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
* time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ...
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
Data variables:
air (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
platform: Model
Conventions: COARDS
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html
description: Data is from NMC initialized reanalysis
(4x/day). These are the 0.9950 sigma level values.
title: 4x daily NMC reanalysis (1948)
# Convert to celsius
In [7]: air = airtemps.air - 273.15
One Dimension¶
Simple Example¶
xarray uses the coordinate name to label the x axis.
In [8]: air1d = air.isel(lat=10, lon=10)
In [9]: air1d.plot()
Out[9]: [<matplotlib.lines.Line2D at 0x7fe8eda48e10>]

Additional Arguments¶
Additional arguments are passed directly to the matplotlib function which does the work. For example, xarray.plot.line() calls matplotlib.pyplot.plot, passing in the index and the array values as x and y, respectively. So to make a line plot with blue triangles a matplotlib format string can be used:
In [10]: air1d[:200].plot.line('b-^')
Out[10]: [<matplotlib.lines.Line2D at 0x7fe8edc96b10>]

Note
Not all xarray plotting methods support passing positional arguments to the wrapped matplotlib functions, but they do all support keyword arguments.
Keyword arguments work the same way, and are more explicit.
In [11]: air1d[:200].plot.line(color='purple', marker='o')
Out[11]: [<matplotlib.lines.Line2D at 0x7fe8e469d350>]

Adding to Existing Axis¶
To add the plot to an existing axis pass in the axis as a keyword argument ax. This works for all xarray plotting methods. In this example axes is an array consisting of the left and right axes created by plt.subplots.
In [12]: fig, axes = plt.subplots(ncols=2)
In [13]: axes
Out[13]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8ee491a10>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8e4383e50>], dtype=object)
In [14]: air1d.plot(ax=axes[0])
Out[14]: [<matplotlib.lines.Line2D at 0x7fe8ec10c990>]
In [15]: air1d.plot.hist(ax=axes[1])
Out[15]:
(array([ 9., 38., 255., 584., 542., 489., 368., 258., 327., 50.]),
array([ 0.95 , 2.719, 4.488, ..., 15.102, 16.871, 18.64 ]),
<a list of 10 Patch objects>)
In [16]: plt.tight_layout()
In [17]: plt.show()

On the right is a histogram created by xarray.plot.hist().
Two Dimensions¶
Simple Example¶
The default method xarray.DataArray.plot() sees that the data is 2 dimensional and calls xarray.plot.pcolormesh().
In [18]: air2d = air.isel(time=500)
In [19]: air2d.plot()
Out[19]: <matplotlib.collections.QuadMesh at 0x7fe8ee573890>

All 2d plots in xarray allow the use of the keyword arguments yincrease and xincrease.
In [20]: air2d.plot(yincrease=False)
Out[20]: <matplotlib.collections.QuadMesh at 0x7fe8eefc5ed0>

Note
We use xarray.plot.pcolormesh() as the default two-dimensional plot method because it is more flexible than xarray.plot.imshow(). However, for large arrays, imshow can be much faster than pcolormesh. If speed is important to you and you are plotting a regular mesh, consider using imshow.
Missing Values¶
xarray plots data with missing values.
In [21]: bad_air2d = air2d.copy()
In [22]: bad_air2d[dict(lat=slice(0, 10), lon=slice(0, 25))] = np.nan
In [23]: bad_air2d.plot()
Out[23]: <matplotlib.collections.QuadMesh at 0x7fe8eefe3150>

Nonuniform Coordinates¶
It’s not necessary for the coordinates to be evenly spaced. Both xarray.plot.pcolormesh() (default) and xarray.plot.contourf() can produce plots with nonuniform coordinates.
In [24]: b = air2d.copy()
# Apply a nonlinear transformation to one of the coords
In [25]: b.coords['lat'] = np.log(b.coords['lat'])
In [26]: b.plot()
Out[26]: <matplotlib.collections.QuadMesh at 0x7fe8e3f5ab50>

Calling Matplotlib¶
Since this is a thin wrapper around matplotlib, all the functionality of matplotlib is available.
In [27]: air2d.plot(cmap=plt.cm.Blues)
Out[27]: <matplotlib.collections.QuadMesh at 0x7fe8e337de10>
In [28]: plt.title('These colors prove North America\nhas fallen in the ocean')
Out[28]: <matplotlib.text.Text at 0x7fe8e33a00d0>
In [29]: plt.ylabel('latitude')
Out[29]: <matplotlib.text.Text at 0x7fe8e44b0e10>
In [30]: plt.xlabel('longitude')
Out[30]: <matplotlib.text.Text at 0x7fe8e3400b90>
In [31]: plt.tight_layout()
In [32]: plt.show()

Note
xarray methods update label information and generally play around with the axes. So any kind of updates to the plot should be done after the call to xarray’s plot. In the example below, plt.xlabel effectively does nothing, since air2d.plot() updates the xlabel.
In [33]: plt.xlabel('Never gonna see this.')
Out[33]: <matplotlib.text.Text at 0x7fe8e32eb210>
In [34]: air2d.plot()
Out[34]: <matplotlib.collections.QuadMesh at 0x7fe8e3260d50>
In [35]: plt.show()

Colormaps¶
xarray borrows logic from Seaborn to infer what kind of color map to use. For example, consider the original data in Kelvins rather than Celsius:
In [36]: airtemps.air.isel(time=0).plot()
Out[36]: <matplotlib.collections.QuadMesh at 0x7fe8e343cc90>

The Celsius data contain 0, so a diverging color map was used. The Kelvins do not have 0, so the default color map was used.
Robust¶
Outliers often have an extreme effect on the output of the plot. Here we add two bad data points. This affects the color scale, washing out the plot.
In [37]: air_outliers = airtemps.air.isel(time=0).copy()
In [38]: air_outliers[0, 0] = 100
In [39]: air_outliers[-1, -1] = 400
In [40]: air_outliers.plot()
Out[40]: <matplotlib.collections.QuadMesh at 0x7fe8e346b6d0>

This plot shows that we have outliers. The easy way to visualize the data without the outliers is to pass the parameter robust=True. This will use the 2nd and 98th percentiles of the data to compute the color limits.
In [41]: air_outliers.plot(robust=True)
Out[41]: <matplotlib.collections.QuadMesh at 0x7fe8e2fe3450>

Observe that the ranges of the color bar have changed. The arrows on the color bar indicate that the colors include data points outside the bounds.
Discrete Colormaps¶
It is often useful, when visualizing 2d data, to use a discrete colormap, rather than the default continuous colormaps that matplotlib uses. The levels keyword argument can be used to generate plots with discrete colormaps. For example, to make a plot with 8 discrete color intervals:
In [42]: air2d.plot(levels=8)
Out[42]: <matplotlib.collections.QuadMesh at 0x7fe8e2ee8150>

It is also possible to use a list of levels to specify the boundaries of the discrete colormap:
In [43]: air2d.plot(levels=[0, 12, 18, 30])
Out[43]: <matplotlib.collections.QuadMesh at 0x7fe8e2e95250>

You can also specify a list of discrete colors through the colors argument:
In [44]: flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
In [45]: air2d.plot(levels=[0, 12, 18, 30], colors=flatui)
Out[45]: <matplotlib.collections.QuadMesh at 0x7fe8df1615d0>

Finally, if you have Seaborn installed, you can also specify a seaborn color palette to the cmap argument. Note that levels must be specified with seaborn color palettes if using imshow or pcolormesh (but not with contour or contourf, since levels are chosen automatically).
In [46]: air2d.plot(levels=10, cmap='husl')
Out[46]: <matplotlib.collections.QuadMesh at 0x7fe8df070250>

Faceting¶
Faceting here refers to splitting an array along one or two dimensions and plotting each group. xarray’s basic plotting is useful for plotting two dimensional arrays. What about three or four dimensional arrays? That’s where facets become helpful.
Consider the temperature data set. There are 4 observations per day for two years, which makes for 2920 values along the time dimension. One way to visualize this data is to make a separate plot for each time period.
The faceted dimension should not have too many values; faceting on the time dimension will produce 2920 plots. That’s too much to be helpful. To handle this situation, try performing an operation that reduces the size of the data in some way. For example, we could compute the average air temperature for each month and reduce the size of this dimension from 2920 -> 12 (a sketch of this reduction appears after the coordinates below). A simpler way is to just take a slice on that dimension. So let’s use a slice to pick 6 times throughout the first year.
In [47]: t = air.isel(time=slice(0, 365 * 4, 250))
In [48]: t.coords
Out[48]:
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
* time (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00 2013-05-06 ...
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
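As an aside, the monthly-mean reduction mentioned above could be written with a grouped mean over the 'time.month' virtual coordinate (a sketch):
# reduce the time dimension from 2920 values to 12 monthly means
monthly_means = air.groupby('time.month').mean('time')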
Simple Example¶
The easiest way to create faceted plots is to pass in row or col arguments to the xarray plotting methods/functions. This returns a xarray.plot.FacetGrid object.
In [49]: g_simple = t.plot(x='lon', y='lat', col='time', col_wrap=3)

4 dimensional¶
For 4 dimensional arrays we can use the rows and columns of the grids. Here we create a 4 dimensional array by taking the original data and adding a fixed amount. Now we can see how the temperature maps would compare if one were much hotter.
In [50]: t2 = t.isel(time=slice(0, 2))
In [51]: t4d = xr.concat([t2, t2 + 40], pd.Index(['normal', 'hot'], name='fourth_dim'))
# This is a 4d array
In [52]: t4d.coords
Out[52]:
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 ...
* time (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 ...
* fourth_dim (fourth_dim) object 'normal' 'hot'
In [53]: t4d.plot(x='lon', y='lat', col='time', row='fourth_dim')
Out[53]: <xarray.plot.facetgrid.FacetGrid at 0x7fe8df112590>

Other features¶
Faceted plotting supports other arguments common to xarray 2d plots.
In [54]: hasoutliers = t.isel(time=slice(0, 5)).copy()
In [55]: hasoutliers[0, 0, 0] = -100
In [56]: hasoutliers[-1, -1, -1] = 400
In [57]: g = hasoutliers.plot.pcolormesh('lon', 'lat', col='time', col_wrap=3,
....: robust=True, cmap='viridis')
....:

FacetGrid Objects¶
xarray.plot.FacetGrid is used to control the behavior of the multiple plots. It borrows an API and code from Seaborn. The structure is contained within the axes and name_dicts attributes, both 2d NumPy object arrays.
In [58]: g.axes
Out[58]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8de963510>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8de832690>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8de944b90>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8dec87250>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8ded394d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7fe8dee56890>]], dtype=object)
In [59]: g.name_dicts
Out[59]:
array([[{'time': numpy.datetime64('2013-01-01T00:00:00.000000000')},
{'time': numpy.datetime64('2013-03-04T12:00:00.000000000')},
{'time': numpy.datetime64('2013-05-06T00:00:00.000000000')}],
[{'time': numpy.datetime64('2013-07-07T12:00:00.000000000')},
{'time': numpy.datetime64('2013-09-08T00:00:00.000000000')}, None]], dtype=object)
It’s possible to select the xarray.DataArray or xarray.Dataset corresponding to the FacetGrid through the name_dicts.
In [60]: g.data.loc[g.name_dicts[0, 0]]
Out[60]:
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[-100. , -30.65, -29.65, ..., -40.35, -37.65, -34.55],
[ -29.35, -28.65, -28.45, ..., -40.35, -37.85, -33.85],
[ -23.15, -23.35, -24.26, ..., -39.95, -36.76, -31.45],
...,
[ 23.45, 23.05, 23.25, ..., 22.25, 21.95, 21.55],
[ 22.75, 23.05, 23.64, ..., 22.75, 22.75, 22.05],
[ 23.14, 23.64, 23.95, ..., 23.75, 23.64, 23.45]])
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
time datetime64[ns] 2013-01-01
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
Here is an example of using the lower level API and then modifying the axes after they have been plotted.
In [61]: g = t.plot.imshow('lon', 'lat', col='time', col_wrap=3, robust=True)
In [62]: for i, ax in enumerate(g.axes.flat):
....: ax.set_title('Air Temperature %d' % i)
....:
In [63]: bottomright = g.axes[-1, -1]
In [64]: bottomright.annotate('bottom right', (240, 40))
Out[64]: <matplotlib.text.Annotation at 0x7fe8de9b31d0>
In [65]: plt.show()

TODO: add an example of using the map method to plot dataset variables (e.g., with plt.quiver).
Maps¶
To follow this section you’ll need to have Cartopy installed and working.
This script will plot the air temperature on a map.
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
air = (xr.tutorial
.load_dataset('air_temperature')
.air
.isel(time=0))
ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ax.set_global()
air.plot.contourf(ax=ax, transform=ccrs.PlateCarree())
ax.coastlines()
plt.savefig('cartopy_example.png')
Here is the resulting image:

Details¶
Ways to Use¶
There are three ways to use the xarray plotting functionality:
- Use plot as a convenience method for a DataArray.
- Access a specific plotting method from the plot attribute of a DataArray.
- Directly from the xarray plot submodule.
These are provided for user convenience; they all call the same code.
In [66]: import xarray.plot as xplt
In [67]: da = xr.DataArray(range(5))
In [68]: fig, axes = plt.subplots(ncols=2, nrows=2)
In [69]: da.plot(ax=axes[0, 0])
Out[69]: [<matplotlib.lines.Line2D at 0x7fe8de21c7d0>]
In [70]: da.plot.line(ax=axes[0, 1])
Out[70]: [<matplotlib.lines.Line2D at 0x7fe8de68d890>]
In [71]: xplt.plot(da, ax=axes[1, 0])
Out[71]: [<matplotlib.lines.Line2D at 0x7fe8de16ff90>]
In [72]: xplt.line(da, ax=axes[1, 1])
Out[72]: [<matplotlib.lines.Line2D at 0x7fe8ec179b90>]
In [73]: plt.tight_layout()
In [74]: plt.show()

Here the output is the same. Since the data is 1 dimensional the line plot was used.
The convenience method xarray.DataArray.plot() dispatches to an appropriate plotting function based on the dimensions of the DataArray and whether the coordinates are sorted and uniformly spaced. This table describes what gets plotted:
Dimensions | Plotting function |
1 | xarray.plot.line() |
2 | xarray.plot.pcolormesh() |
Anything else | xarray.plot.hist() |
Coordinates¶
If you’d like to find out what’s really going on in the coordinate system, read on.
In [75]: a0 = xr.DataArray(np.zeros((4, 3, 2)), dims=('y', 'x', 'z'),
....: name='temperature')
....:
In [76]: a0[0, 0, 0] = 1
In [77]: a = a0.isel(z=0)
In [78]: a
Out[78]:
<xarray.DataArray 'temperature' (y: 4, x: 3)>
array([[ 1., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Coordinates:
* y (y) int64 0 1 2 3
* x (x) int64 0 1 2
z int64 0
The plot will produce an image corresponding to the values of the array. Hence the top left pixel will be a different color than the others. Before reading on, you may want to look at the coordinates and think carefully about what the limits, labels, and orientation for each of the axes should be.
In [79]: a.plot()
Out[79]: <matplotlib.collections.QuadMesh at 0x7fe8de270c10>

It may seem strange that the values on the y axis are decreasing with -0.5 on the top. This is because the pixels are centered over their coordinates, and the axis labels and ranges correspond to the values of the coordinates.
API reference¶
This page provides an auto-generated summary of xarray’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
Top-level functions¶
align (*objects[, join, copy]) |
Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes. |
broadcast (*args, **kwargs) |
Explicitly broadcast any number of DataArray or Dataset objects against one another. |
concat (objs[, dim, data_vars, coords, ...]) |
Concatenate xarray objects along a new or existing dimension. |
merge (objects[, compat, join]) |
Merge any number of xarray objects into a single Dataset as variables. |
set_options (**kwargs) |
Set global state within a controlled context |
Dataset¶
Creating a dataset¶
Dataset ([data_vars, coords, attrs, compat]) |
A multi-dimensional, in memory, array database. |
decode_cf (obj[, concat_characters, ...]) |
Decode the given Dataset or Datastore according to CF conventions into a new Dataset. |
Attributes¶
Dataset.dims |
Mapping from dimension names to lengths. |
Dataset.data_vars |
Dictionary of xarray.DataArray objects corresponding to data variables |
Dataset.coords |
Dictionary of xarray.DataArray objects corresponding to coordinate |
Dataset.attrs |
Dictionary of global attributes on this dataset |
Dictionary interface¶
Datasets implement the mapping interface with keys given by variable names and values given by DataArray objects.
Dataset.__getitem__ (key) |
Access variables or coordinates this dataset as a DataArray . |
Dataset.__setitem__ (key, value) |
Add an array to this dataset. |
Dataset.__delitem__ (key) |
Remove a variable from this dataset. |
Dataset.update (other[, inplace]) |
Update this dataset’s variables with those from another dataset. |
Dataset.iteritems (...) |
Dataset.itervalues (...) |
Dataset contents¶
Dataset.copy ([deep]) |
Returns a copy of this dataset. |
Dataset.assign (**kwargs) |
Assign new data variables to a Dataset, returning a new object with all the original variables in addition to the new ones. |
Dataset.assign_coords (**kwargs) |
Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates. |
Dataset.pipe (func, *args, **kwargs) |
Apply func(self, *args, **kwargs) |
Dataset.merge (other[, inplace, ...]) |
Merge the arrays of two datasets into a single dataset. |
Dataset.rename (name_dict[, inplace]) |
Returns a new object with renamed variables and dimensions. |
Dataset.swap_dims (dims_dict[, inplace]) |
Returns a new object with swapped dimensions. |
Dataset.drop (labels[, dim]) |
Drop variables or index labels from this dataset. |
Dataset.set_coords (names[, inplace]) |
Given names of one or more variables, set them as coordinates |
Dataset.reset_coords ([names, drop, inplace]) |
Given names of coordinates, reset them to become variables |
Comparisons¶
Dataset.equals (other) |
Two Datasets are equal if they have matching variables and coordinates, all of which are equal. |
Dataset.identical (other) |
Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates. |
Dataset.broadcast_equals (other) |
Two Datasets are broadcast equal if they are equal after broadcasting all variables against each other. |
Indexing¶
Dataset.loc |
Attribute for location based indexing. |
Dataset.isel (**indexers) |
Returns a new dataset with each array indexed along the specified dimension(s). |
Dataset.sel ([method, tolerance]) |
Returns a new dataset with each array indexed by tick labels along the specified dimension(s). |
Dataset.isel_points ([dim]) |
Returns a new dataset with each array indexed pointwise along the specified dimension(s). |
Dataset.sel_points ([dim, method, tolerance]) |
Returns a new dataset with each array indexed pointwise by tick labels along the specified dimension(s). |
Dataset.squeeze ([dim]) |
Returns a new dataset with squeezed data. |
Dataset.reindex ([indexers, method, ...]) |
Conform this object onto a new set of indexes, filling in missing values with NaN. |
Dataset.reindex_like (other[, method, ...]) |
Conform this object onto the indexes of another object, filling in missing values with NaN. |
Computation¶
Dataset.apply (func[, keep_attrs, args]) |
Apply a function over the data variables in this dataset. |
Dataset.reduce (func[, dim, keep_attrs, ...]) |
Reduce this dataset by applying func along some dimension(s). |
Dataset.groupby (group[, squeeze]) |
Returns a GroupBy object for performing grouped operations. |
Dataset.groupby_bins (group, bins[, right, ...]) |
Returns a GroupBy object for performing grouped operations. |
Dataset.resample (freq, dim[, how, skipna, ...]) |
Resample this object to a new temporal resolution. |
Dataset.diff (dim[, n, label]) |
Calculate the n-th order discrete difference along given axis. |
Aggregation:
all
any
argmax
argmin
max
mean
median
min
prod
sum
std
var
Missing values:
isnull
notnull
count
dropna
fillna
where
ndarray methods:
argsort
clip
conj
conjugate
imag
round
real
T
Grouped operations:
assign
assign_coords
first
last
fillna
where
Reshaping and reorganizing¶
Dataset.transpose (*dims) |
Return a new Dataset object with all array dimensions transposed. |
Dataset.stack (**dimensions) |
Stack any number of existing dimensions into a single new dimension. |
Dataset.unstack (dim) |
Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions. |
Dataset.shift (**shifts) |
Shift this dataset by an offset along one or more dimensions. |
Dataset.roll (**shifts) |
Roll this dataset by an offset along one or more dimensions. |
DataArray¶
DataArray (data[, coords, dims, name, attrs, ...]) |
N-dimensional array with labeled coordinates and dimensions. |
Attributes¶
DataArray.values |
The array’s data as a numpy.ndarray |
DataArray.data |
The array’s data as a dask or numpy array |
DataArray.coords |
Dictionary-like container of coordinate arrays. |
DataArray.dims |
Dimension names associated with this array. |
DataArray.name |
The name of this array. |
DataArray.attrs |
Dictionary storing arbitrary metadata with this array. |
DataArray.encoding |
Dictionary of format-specific settings for how this array should be serialized. |
DataArray contents¶
DataArray.assign_coords (**kwargs) |
Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates. |
DataArray.rename (new_name_or_name_dict) |
Returns a new DataArray with renamed coordinates and/or a new name. |
DataArray.swap_dims (dims_dict) |
Returns a new DataArray with swapped dimensions. |
DataArray.drop (labels[, dim]) |
Drop coordinates or index labels from this DataArray. |
DataArray.reset_coords ([names, drop, inplace]) |
Given names of coordinates, reset them to become variables. |
DataArray.copy ([deep]) |
Returns a copy of this array. |
Indexing¶
DataArray.__getitem__ (key) |
|
DataArray.__setitem__ (key, value) |
|
DataArray.loc |
Attribute for location based indexing like pandas. |
DataArray.isel (**indexers) |
Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s). |
DataArray.sel ([method, tolerance]) |
Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s). |
DataArray.isel_points ([dim]) |
Return a new DataArray whose dataset is given by pointwise integer indexing along the specified dimension(s). |
DataArray.sel_points ([dim, method, tolerance]) |
Return a new DataArray whose dataset is given by pointwise selection of index labels along the specified dimension(s). |
DataArray.squeeze ([dim]) |
Return a new DataArray object with squeezed data. |
DataArray.reindex ([method, tolerance, copy]) |
Conform this object onto a new set of indexes, filling in missing values with NaN. |
DataArray.reindex_like (other[, method, ...]) |
Conform this object onto the indexes of another object, filling in missing values with NaN. |
Comparisons¶
DataArray.equals (other) |
True if two DataArrays have the same dimensions, coordinates and values; otherwise False. |
DataArray.identical (other) |
Like equals, but also checks the array name and attributes, and attributes on all coordinates. |
DataArray.broadcast_equals (other) |
Two DataArrays are broadcast equal if they are equal after broadcasting them against each other such that they have the same dimensions. |
Computation¶
DataArray.reduce (func[, dim, axis, keep_attrs]) |
Reduce this array by applying func along some dimension(s). |
DataArray.groupby (group[, squeeze]) |
Returns a GroupBy object for performing grouped operations. |
DataArray.groupby_bins (group, bins[, right, ...]) |
Returns a GroupBy object for performing grouped operations. |
DataArray.rolling ([min_periods, center]) |
Rolling window object. |
DataArray.resample (freq, dim[, how, skipna, ...]) |
Resample this object to a new temporal resolution. |
DataArray.get_axis_num (dim) |
Return axis number(s) corresponding to dimension(s) in this array. |
DataArray.diff (dim[, n, label]) |
Calculate the n-th order discrete difference along given axis. |
DataArray.dot (other) |
Perform dot product of two DataArrays along their shared dims. |
Aggregation:
all
any
argmax
argmin
max
mean
median
min
prod
sum
std
var
Missing values:
isnull
notnull
count
dropna
fillna
where
ndarray methods:
argsort
clip
conj
conjugate
imag
searchsorted
round
real
T
Grouped operations:
assign_coords
first
last
fillna
where
Reshaping and reorganizing¶
DataArray.transpose (*dims) |
Return a new DataArray object with transposed dimensions. |
DataArray.stack (**dimensions) |
Stack any number of existing dimensions into a single new dimension. |
DataArray.unstack (dim) |
Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions. |
DataArray.shift (**shifts) |
Shift this array by an offset along one or more dimensions. |
DataArray.roll (**shifts) |
Roll this array by an offset along one or more dimensions. |
Universal functions¶
These functions are copied from NumPy, but extended to work on NumPy arrays, dask arrays and all xarray objects. You can find them in the xarray.ufuncs module:
angle
arccos
arccosh
arcsin
arcsinh
arctan
arctan2
arctanh
ceil
conj
copysign
cos
cosh
deg2rad
degrees
exp
expm1
fabs
fix
floor
fmax
fmin
fmod
frexp
hypot
imag
iscomplex
isfinite
isinf
isnan
isreal
ldexp
log
log10
log1p
log2
logaddexp
logaddexp2
logical_and
logical_not
logical_or
logical_xor
maximum
minimum
nextafter
rad2deg
radians
real
rint
sign
signbit
sin
sinh
sqrt
square
tan
tanh
trunc
IO / Conversion¶
Dataset methods¶
open_dataset (filename_or_obj[, group, ...]) |
Load and decode a dataset from a file or file-like object. |
open_mfdataset (paths[, chunks, concat_dim, ...]) |
Open multiple files as a single dataset. |
Dataset.to_netcdf ([path, mode, format, ...]) |
Write dataset contents to a netCDF file. |
save_mfdataset (datasets, paths[, mode, ...]) |
Write multiple datasets to disk as netCDF files simultaneously. |
Dataset.to_array ([dim, name]) |
Convert this dataset into an xarray.DataArray |
Dataset.to_dataframe () |
Convert this dataset into a pandas.DataFrame. |
Dataset.to_dict () |
Convert this dataset to a dictionary following xarray naming conventions. |
Dataset.from_dataframe (dataframe) |
Convert a pandas.DataFrame into an xarray.Dataset |
Dataset.from_dict (d) |
Convert a dictionary into an xarray.Dataset. |
Dataset.close () |
Close any files linked to this dataset |
Dataset.load () |
Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset. |
Dataset.chunk ([chunks, name_prefix, token, lock]) |
Coerce all arrays in this dataset into dask arrays with the given chunks. |
Dataset.filter_by_attrs (**kwargs) |
Returns a Dataset with variables that match specific conditions. |
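A brief round-trip sketch using these methods (the file name example.nc and the variable name are hypothetical):

import numpy as np
import xarray as xr

ds = xr.Dataset({'temperature': (('x',), np.arange(4.0))})
ds.to_netcdf('example.nc')                # write to a hypothetical path
reopened = xr.open_dataset('example.nc')  # lazily load it back
df = reopened.to_dataframe()              # round-trip through pandas
restored = xr.Dataset.from_dataframe(df)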
DataArray methods¶
DataArray.to_dataset ([dim, name]) |
Convert a DataArray to a Dataset. |
DataArray.to_pandas () |
Convert this array into a pandas object with the same shape. |
DataArray.to_series () |
Convert this array into a pandas.Series. |
DataArray.to_dataframe ([name]) |
Convert this array and its coordinates into a tidy pandas.DataFrame. |
DataArray.to_index () |
Convert this variable to a pandas.Index. |
DataArray.to_masked_array ([copy]) |
Convert this array into a numpy.ma.MaskedArray |
DataArray.to_cdms2 () |
Convert this array into a cdms2.Variable |
DataArray.to_dict () |
Convert this xarray.DataArray into a dictionary following xarray naming conventions. |
DataArray.from_series (series) |
Convert a pandas.Series into an xarray.DataArray. |
DataArray.from_cdms2 (variable) |
Convert a cdms2.Variable into an xarray.DataArray |
DataArray.from_dict (d) |
Convert a dictionary into an xarray.DataArray |
DataArray.load () |
Manually trigger loading of this array’s data from disk or a remote source into memory and return this array. |
DataArray.chunk ([chunks]) |
Coerce this array’s data into a dask array with the given chunks. |
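For example, a small sketch of converting to and from pandas objects (names and values are illustrative):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3.0), coords={'x': ['a', 'b', 'c']},
                  dims='x', name='foo')
s = da.to_series()                    # pandas.Series indexed by 'x'
round_tripped = xr.DataArray.from_series(s)
masked = da.to_masked_array()         # numpy.ma.MaskedArray of the values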
Plotting¶
plot.plot (darray[, row, col, col_wrap, ax, ...]) |
Default plot of DataArray using matplotlib.pyplot. |
plot.contourf (darray[, x, y, ax, row, col, ...]) |
Filled contour plot of 2d DataArray |
plot.contour (darray[, x, y, ax, row, col, ...]) |
Contour plot of 2d DataArray |
plot.hist (darray[, ax]) |
Histogram of DataArray |
plot.imshow (darray[, x, y, ax, row, col, ...]) |
Image plot of 2d DataArray using matplotlib.pyplot |
plot.line (darray, *args, **kwargs) |
Line plot of 1 dimensional DataArray index against values |
plot.pcolormesh (darray[, x, y, ax, row, ...]) |
Pseudocolor plot of 2d DataArray |
plot.FacetGrid (data[, col, row, col_wrap, ...]) |
Initialize the matplotlib figure and FacetGrid object. |
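A minimal plotting sketch (the random data is illustrative); the same functions are also exposed as methods on DataArray.plot:

import numpy as np
import xarray as xr
import matplotlib.pyplot as plt

da = xr.DataArray(np.random.randn(10, 15), dims=('y', 'x'))
da.plot()        # the default plot dispatches to pcolormesh for 2d data
plt.figure()
da.plot.hist()   # histogram of all values in the array
plt.show()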
Advanced API¶
Variable (dims, data[, attrs, encoding, fastpath]) |
A netcdf-like variable consisting of dimensions, data and attributes which describe a single Array. |
Coordinate (name, data[, attrs, encoding, ...]) |
Wrapper around pandas.Index that adds xarray specific functionality. |
register_dataset_accessor (name) |
Register a custom property on xarray.Dataset objects. |
register_dataarray_accessor (name) |
Register a custom accessor on xarray.DataArray objects. |
These backends provide a low-level interface for lazily loading data from external file formats or protocols, and can be manually invoked to create arguments for the load_store and dump_to_store Dataset methods (a brief usage sketch follows the table):
backends.NetCDF4DataStore (filename[, mode, ...]) |
Store for reading and writing data via the Python-NetCDF4 library. |
backends.H5NetCDFStore (filename[, mode, ...]) |
Store for reading and writing data via h5netcdf |
backends.PydapDataStore (url) |
Store for accessing OpenDAP datasets with pydap. |
backends.ScipyDataStore (filename_or_obj[, ...]) |
Store for reading and writing data via scipy.io.netcdf. |
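For instance, a sketch of opening a store directly (the file path is hypothetical, and this assumes the load_store classmethod mentioned above):

import xarray as xr
from xarray.backends import NetCDF4DataStore

store = NetCDF4DataStore('example.nc', mode='r')  # low-level store object
ds = xr.Dataset.load_store(store)                 # decode it into a Dataset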
xarray Internals¶
xarray builds upon two of the foundational libraries of the scientific Python stack, NumPy and pandas. It is written in pure Python (no C or Cython extensions), which makes it easy to develop and extend. Instead, we push compiled code to optional dependencies.
Variable objects¶
The core internal data structure in xarray is the Variable, which is used as the basic building block behind xarray’s Dataset and DataArray types. A Variable consists of:
- dims: A tuple of dimension names.
- data: The N-dimensional array (typically, a NumPy or Dask array) storing the Variable’s data. It must have the same number of dimensions as the length of dims.
- attrs: An ordered dictionary of metadata associated with this array. By convention, xarray’s built-in operations never use this metadata.
- encoding: Another ordered dictionary used to store information about how this variable’s data is represented on disk. See Reading encoded data for more details.
Variable has an interface similar to NumPy arrays, but extended to make use of named dimensions. For example, it uses dim in preference to an axis argument for methods like mean, and supports Broadcasting by dimension name. However, unlike Dataset and DataArray, the basic Variable does not include coordinate labels along each axis.
Variable is public API, but because of its incomplete support for labeled data, it is mostly intended for advanced uses, such as in xarray itself or for writing new backends. You can access the variable objects that correspond to xarray objects via the (read-only) Dataset.variables and DataArray.variable attributes.
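For example, a minimal sketch of constructing and accessing Variable objects (names and values are illustrative):

import numpy as np
import xarray as xr

# Build a Variable directly from dims, data and attrs
var = xr.Variable(('x', 'y'), np.zeros((2, 3)), attrs={'units': 'm'})
var.mean(dim='y')    # reductions take dimension names rather than axis numbers

# The same objects underlie Dataset and DataArray
ds = xr.Dataset({'a': (('x',), [1.0, 2.0])})
ds.variables['a']    # the read-only mapping described above
ds['a'].variable     # the Variable behind a DataArray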
Extending xarray¶
xarray is designed as a general-purpose library, and hence tries to avoid including overly domain-specific methods. But inevitably, the need for more domain-specific logic arises.
One standard solution to this problem is to subclass Dataset and/or DataArray to add domain specific functionality. However, inheritance is not very robust. It’s easy to inadvertently use internal APIs when subclassing, which means that your code may break when xarray upgrades. Furthermore, many builtin methods will only return native xarray objects.
The standard advice is to use composition over inheritance, but reimplementing an API as large as xarray’s on your own objects can be an onerous task, even if most methods are only forwarding to xarray implementations.
To resolve this dilemma, xarray has the experimental register_dataset_accessor() and register_dataarray_accessor() decorators for adding custom “accessors” on xarray objects. Here’s how you might use these decorators to write a custom “geo” accessor implementing a geography-specific extension to xarray:
import xarray as xr

@xr.register_dataset_accessor('geo')
class GeoAccessor(object):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        self._center = None

    @property
    def center(self):
        """Return the geographic center point of this dataset."""
        if self._center is None:
            # we can use a cache on our accessor objects, because accessors
            # themselves are cached on instances that access them
            lon = self._obj.longitude
            lat = self._obj.latitude
            self._center = (float(lon.mean()), float(lat.mean()))
        return self._center

    def plot(self):
        """Plot data on a map."""
        return 'plotting!'
This achieves the same result as if the Dataset class had a cached property defined that returns an instance of your class:
class Dataset:
    ...
    @property
    def geo(self):
        return GeoAccessor(self)
However, using the register accessor decorators is preferable to simply adding your own ad-hoc property (i.e., Dataset.geo = property(...)), for two reasons:
- It ensures that the name of your property does not conflict with any other attributes or methods.
- Instances of the accessor class will be cached on the xarray object that creates them. This means you can save state on them (e.g., to cache computed properties).
Back in an interactive IPython session (with NumPy imported as np), we can use these properties:
In [1]: ds = xr.Dataset({'longitude': np.linspace(0, 10),
   ...:                  'latitude': np.linspace(0, 20)})
   ...:
In [2]: ds.geo.center
Out[2]: (5.0, 10.0)
In [3]: ds.geo.plot()
Out[3]: 'plotting!'
The intent here is that libraries that extend xarray could add such an accessor to implement subclass-specific functionality rather than using actual subclasses or patching in a large number of domain-specific methods.
To help users keep things straight, please let us know if you plan to write a new accessor for an open source library. In the future, we will maintain a list of accessors and the libraries that implement them on this page.
Here are several existing libraries that build functionality upon xarray. They may be useful points of reference for your work:
- xgcm: General Circulation Model Postprocessing. Uses subclassing and custom xarray backends.
- PyGDX: Python 3 package for accessing data stored in GAMS Data eXchange (GDX) files. Also uses a custom subclass.
- windspharm: Spherical harmonic wind analysis in Python.
- eofs: EOF analysis in Python.
See also¶
- Stephan Hoyer’s SciPy2015 talk introducing xarray to a general audience.
- Stephan Hoyer’s 2015 Unidata Users Workshop talk and tutorial (with answers) introducing xarray to users familiar with netCDF.
- Nicolas Fauchereau’s tutorial on xarray for netCDF users.
Get in touch¶
- Ask usage questions on StackOverflow.
- Report bugs, suggest features or view the source code on GitHub.
- For less well-defined questions or ideas, use the mailing list.
- You can also try our chatroom on Gitter.
License¶
xarray is available under the open source Apache License.
History¶
xarray is an evolution of an internal tool developed at The Climate Corporation. It was originally written by Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo and was released as open source in May 2014. The project was renamed from “xray” in January 2016.