
N-D labeled arrays and datasets in Python¶
xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures.
Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.
Note
xray is now xarray! See the v0.7.0 release notes for more details. The preferred URL for these docs is now http://xarray.pydata.org.
Documentation¶
What’s New¶
v0.8.0 (2 August 2016)¶
This release includes four months of new features and bug fixes, including several breaking changes.
Breaking changes¶
- Dropped support for Python 2.6 (GH855).
- Indexing on a multi-index now drops levels, which is consistent with pandas. It also changes the name of the dimension / coordinate when the multi-index is reduced to a single index (GH802).
- Contour plots no longer add a colorbar by default (GH866). Filled contour plots are unchanged.
- DataArray.values and .data now always return a NumPy array-like object, even for 0-dimensional arrays with object dtype (GH867). Previously, .values returned native Python objects in such cases. To convert the values of scalar arrays to Python objects, use the .item() method.
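To illustrate the change, here is a minimal sketch (the array contents are purely illustrative):

import numpy as np
import xarray as xr

arr = xr.DataArray(np.array('foo', dtype=object))  # 0-dimensional, object dtype
arr.values         # now a 0-d NumPy array, not the bare Python string
arr.values.item()  # 'foo' -- .item() recovers the native Python object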
Enhancements¶
- Groupby operations now support grouping over multidimensional variables. A new method called groupby_bins() has also been added to allow users to specify bins for grouping (see the sketch after this list). The new features are described in Multidimensional Grouping and Working with Multidimensional Coordinates. By Ryan Abernathey.
- DataArray and Dataset method where() now supports a drop=True option that clips coordinate elements that are fully masked. By Phillip J. Wolfram.
- New top level merge() function allows for combining variables from any number of Dataset and/or DataArray variables. See Merge for more details. By Stephan Hoyer.
- DataArray and Dataset method resample() now supports the keep_attrs=False option that determines whether variable and dataset attributes are retained in the resampled object. By Jeremy McGibbon.
- Better multi-index support in DataArray and Dataset sel() and loc() methods, which now behave more closely to pandas and which also accept dictionaries for indexing based on given level names and labels (see Multi-level indexing). By Benoit Bovy.
- New (experimental) decorators register_dataset_accessor() and register_dataarray_accessor() for registering custom xarray extensions without subclassing. They are described in the new documentation page on xarray Internals. By Stephan Hoyer.
- Round trip boolean datatypes. Previously, writing boolean datatypes to netCDF formats would raise an error since netCDF does not have a bool datatype. This feature reads/writes a dtype attribute to boolean variables in netCDF files. By Joe Hamman.
- 2D plotting methods now have two new keywords (cbar_ax and cbar_kwargs), allowing more control on the colorbar (GH872). By Fabien Maussion.
- New Dataset method filter_by_attrs(), akin to netCDF4.Dataset.get_variables_by_attributes, to easily filter data variables using their attributes. By Filipe Fernandes.
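As a quick illustration of the new groupby_bins() method mentioned above, here is a minimal sketch (the bin edges are illustrative):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(10), dims='x', coords={'x': np.arange(10)})
# bin the 'x' coordinate into two intervals and average within each bin
da.groupby_bins('x', bins=[0, 5, 10]).mean()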
Bug fixes¶
- Attributes were being retained by default for some resampling operations when they should not have been. With the keep_attrs=False option, they will no longer be retained by default. This may be backwards-incompatible with some scripts, but the attributes may be kept by adding the keep_attrs=True option. By Jeremy McGibbon.
- Concatenating xarray objects along an axis with a MultiIndex or PeriodIndex preserves the nature of the index (GH875). By Stephan Hoyer.
- Fixed a bug in arithmetic operations on DataArray objects whose dimensions are numpy structured arrays or recarrays (GH861, GH837). By Maciek Swat.
- decode_cf_timedelta now accepts arrays with ndim > 1 (GH842). This fixes issue GH665. By Filipe Fernandes.
- Fixed a bug where xarray.ufuncs that take two arguments would incorrectly use numpy functions instead of dask.array functions (GH876). By Stephan Hoyer.
- Support for pickling functions from xarray.ufuncs (GH901). By Stephan Hoyer.
- Variable.copy(deep=True) no longer converts MultiIndex into a base Index (GH769). By Benoit Bovy.
- Fixes for groupby on dimensions with a multi-index (GH867). By Stephan Hoyer.
- Fix printing datasets with unicode attributes on Python 2 (GH892). By Stephan Hoyer.
- Fixed an incorrect test for the dask version (GH891). By Stephan Hoyer.
- Fixed the dim argument for isel_points/sel_points when a pandas.Index is passed. By Stephan Hoyer.
- contour() now plots the correct number of contours (GH866). By Fabien Maussion.
v0.7.2 (13 March 2016)¶
This release includes two new, entirely backwards compatible features and several bug fixes.
Enhancements¶
New DataArray method DataArray.dot() for calculating the dot product of two DataArrays along shared dimensions. By Dean Pospisil.
Rolling window operations on DataArray objects are now supported via a new DataArray.rolling() method. For example:
In [1]: import xarray as xr; import numpy as np

In [2]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5), dims=('x', 'y'))

In [3]: arr
Out[3]:
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. , 0.5, 1. , 1.5, 2. ],
[ 2.5, 3. , 3.5, 4. , 4.5],
[ 5. , 5.5, 6. , 6.5, 7. ]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4

In [4]: arr.rolling(y=3, min_periods=2).mean()
Out[4]:
<xarray.DataArray (x: 3, y: 5)>
array([[ nan, 0.25, 0.5 , 1. , 1.5 ],
[ nan, 2.75, 3. , 3.5 , 4. ],
[ nan, 5.25, 5.5 , 6. , 6.5 ]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
See Rolling window operations for more details. By Joe Hamman.
Bug fixes¶
- Fixed an issue where plots using pcolormesh and Cartopy axes were being distorted by the inference of the axis interval breaks. This change chooses not to modify the coordinate variables when the axes have the attribute projection, allowing Cartopy to handle the extent of pcolormesh plots (GH781). By Joe Hamman.
- 2D plots now better handle additional coordinates which are not DataArray dimensions (GH788). By Fabien Maussion.
v0.7.1 (16 February 2016)¶
This is a bug fix release that includes two small, backwards compatible enhancements. We recommend that all users upgrade.
Enhancements¶
Bug fixes¶
- Restore checks for shape consistency between data and coordinates in the DataArray constructor (GH758).
- Single dimension variables no longer transpose as part of a broader .transpose. This behavior was causing pandas.PeriodIndex dimensions to lose their type (GH749).
- Dataset labels remain as their native type on .to_dataset. Previously they were coerced to strings (GH745).
- Fixed a bug where replacing a DataArray index coordinate would improperly align the coordinate (GH725).
- DataArray.reindex_like now maintains the dtype of complex numbers when reindexing leads to NaN values (GH738).
- Dataset.rename and DataArray.rename support the old and new names being the same (GH724).
- Fix from_dataframe() for DataFrames with a Categorical column and a MultiIndex index (GH737).
- Fixes to ensure xarray works properly after the upcoming pandas v0.18 and NumPy v1.11 releases.
Acknowledgments¶
The following individuals contributed to this release:
- Edward Richards
- Maximilian Roos
- Rafael Guedes
- Spencer Hill
- Stephan Hoyer
v0.7.0 (21 January 2016)¶
This major release includes a redesign of DataArray internals, as well as new methods for reshaping, rolling and shifting data. It includes preliminary support for pandas.MultiIndex, as well as a number of other features and bug fixes, several of which offer improved compatibility with pandas.
New name¶
The project formerly known as “xray” is now “xarray”, pronounced “x-array”! This avoids a namespace conflict with the entire field of x-ray science. Renaming our project seemed like the right thing to do, especially because some scientists who work with actual x-rays are interested in using this project in their work. Thanks for your understanding and patience in this transition. You can now find our documentation and code repository at new URLs:
To ease the transition, we have simultaneously released v0.7.0 of both xray and xarray on the Python Package Index. These packages are identical. For now, import xray still works, except it issues a deprecation warning. This will be the last xray release. Going forward, we recommend switching your import statements to import xarray as xr.
Breaking changes¶
The internal data model used by DataArray has been rewritten to fix several outstanding issues (GH367, GH634, this stackoverflow report). Internally, DataArray is now implemented in terms of ._variable and ._coords attributes instead of holding variables in a Dataset object.
This refactor ensures that if a DataArray has the same name as one of its coordinates, the array and the coordinate no longer share the same data.
In practice, this means that creating a DataArray with the same name as one of its dimensions no longer automatically uses that array to label the corresponding coordinate. You will now need to provide coordinate labels explicitly. Here’s the old behavior:
In [5]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[5]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
* x (x) int64 4 5 6
and the new behavior (compare the values of the x coordinate):
In [6]: xray.DataArray([4, 5, 6], dims='x', name='x')
Out[6]:
<xray.DataArray 'x' (x: 3)>
array([4, 5, 6])
Coordinates:
* x (x) int64 0 1 2
It is no longer possible to convert an unnamed DataArray to a Dataset with xray.DataArray.to_dataset(); doing so now raises ValueError. If the array is unnamed, you need to supply the name argument.
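A minimal sketch of the new requirement (the names are illustrative):

named = xray.DataArray([4, 5, 6], dims='x', name='foo')
named.to_dataset()               # works: the data variable is called 'foo'

unnamed = xray.DataArray([4, 5, 6], dims='x')
# unnamed.to_dataset()           # would now raise ValueError
unnamed.to_dataset(name='foo')   # supply the name explicitly instead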
Enhancements¶
Basic support for MultiIndex coordinates on xray objects, including indexing, stack() and unstack():
In [7]: df = pd.DataFrame({'foo': range(3),
   ...:                    'x': ['a', 'b', 'b'],
   ...:                    'y': [0, 0, 1]})
   ...:

In [8]: s = df.set_index(['x', 'y'])['foo']

In [9]: arr = xray.DataArray(s, dims='z')

In [10]: arr
Out[10]:
<xray.DataArray 'foo' (z: 3)>
array([0, 1, 2])
Coordinates:
* z (z) object ('a', 0) ('b', 0) ('b', 1)

In [11]: arr.indexes['z']
Out[11]:
MultiIndex(levels=[[u'a', u'b'], [0, 1]],
           labels=[[0, 1, 1], [0, 0, 1]],
           names=[u'x', u'y'])

In [12]: arr.unstack('z')
Out[12]:
<xray.DataArray 'foo' (x: 2, y: 2)>
array([[ 0., nan],
[ 1., 2.]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 0 1

In [13]: arr.unstack('z').stack(z=('x', 'y'))
Out[13]:
<xray.DataArray 'foo' (z: 4)>
array([ 0., nan, 1., 2.])
Coordinates:
* z (z) object ('a', 0) ('a', 1) ('b', 0) ('b', 1)
See Stack and unstack for more details.
Warning
xray’s MultiIndex support is still experimental, and we have a long to-do list of desired additions (GH719), including better display of multi-index levels when printing a Dataset, and support for saving datasets with a MultiIndex to a netCDF file. User contributions in this area would be greatly appreciated.
Support for reading GRIB, HDF4 and other file formats via PyNIO. See Formats supported by PyNIO for more details.
Better error message when a variable is supplied with the same name as one of its dimensions.
Plotting: more control on colormap parameters (GH642). vmin and vmax will not be silently ignored anymore. Setting center=False prevents automatic selection of a divergent colormap.
New shift() and roll() methods for shifting/rotating datasets or arrays along a dimension:
In [14]: array = xray.DataArray([5, 6, 7, 8], dims='x')

In [15]: array.shift(x=2)
Out[15]:
<xarray.DataArray (x: 4)>
array([ nan, nan, 5., 6.])
Coordinates:
* x (x) int64 0 1 2 3

In [16]: array.roll(x=2)
Out[16]:
<xarray.DataArray (x: 4)>
array([7, 8, 5, 6])
Coordinates:
* x (x) int64 2 3 0 1
Notice that shift moves data independently of coordinates, but roll moves both data and coordinates.
Assigning a pandas object directly as a Dataset variable is now permitted. Its index names correspond to the dims of the Dataset, and its data is aligned.
Passing a pandas.DataFrame or pandas.Panel to a Dataset constructor is now permitted.
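A minimal sketch of assigning a pandas object directly (the names are illustrative):

import pandas as pd

s = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c'], name='x'))
ds = xray.Dataset()
ds['foo'] = s   # the index name 'x' becomes the dimension; values are aligned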
New function broadcast() for explicitly broadcasting DataArray and Dataset objects against each other. For example:
In [17]: a = xray.DataArray([1, 2, 3], dims='x')

In [18]: b = xray.DataArray([5, 6], dims='y')

In [19]: a
Out[19]:
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
* x (x) int64 0 1 2

In [20]: b
Out[20]:
<xarray.DataArray (y: 2)>
array([5, 6])
Coordinates:
* y (y) int64 0 1

In [21]: a2, b2 = xray.broadcast(a, b)

In [22]: a2
Out[22]:
<xarray.DataArray (x: 3, y: 2)>
array([[1, 1],
[2, 2],
[3, 3]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1

In [23]: b2
Out[23]:
<xarray.DataArray (x: 3, y: 2)>
array([[5, 6],
[5, 6],
[5, 6]])
Coordinates:
* y (y) int64 0 1
* x (x) int64 0 1 2
Bug fixes¶
- Fixes for several issues found on DataArray objects with the same name as one of their coordinates (see Breaking changes for more details).
- DataArray.to_masked_array always returns a masked array whose mask is an array (not a scalar value) (GH684).
- Allows for (imperfect) repr of Coords when the underlying index is a PeriodIndex (GH645).
- Attempting to assign a Dataset or DataArray variable/attribute using attribute-style syntax (e.g., ds.foo = 42) now raises an error rather than silently failing (GH656, GH714).
- You can now pass pandas objects with non-numpy dtypes (e.g., categorical or datetime64 with a timezone) into xray without an error (GH716).
Acknowledgments¶
The following individuals contributed to this release:
- Antony Lee
- Fabien Maussion
- Joe Hamman
- Maximilian Roos
- Stephan Hoyer
- Takeshi Kanmae
- femtotrader
v0.6.1 (21 October 2015)¶
This release contains a number of bug and compatibility fixes, as well as enhancements to plotting, indexing and writing files to disk.
Note that the minimum required version of dask for use with xray is now version 0.6.
API Changes¶
- The handling of colormaps and discrete color lists for 2D plots in plot() was changed to provide more compatibility with matplotlib’s contour and contourf functions (GH538). Discrete lists of colors should now be specified using the colors keyword, rather than cmap.
Enhancements¶
Faceted plotting through FacetGrid and the plot() method. See Faceting for more details and examples.
sel() and reindex() now support the tolerance argument for controlling nearest-neighbor selection (GH629):
In [24]: array = xray.DataArray([1, 2, 3], dims='x')

In [25]: array.reindex(x=[0.9, 1.5], method='nearest', tolerance=0.2)
Out[25]:
<xray.DataArray (x: 2)>
array([ 2., nan])
Coordinates:
* x (x) float64 0.9 1.5
This feature requires pandas v0.17 or newer.
New encoding argument in to_netcdf() for writing netCDF files with compression, as described in the new documentation section on Writing encoded data.
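For example, a sketch of writing a compressed file (the variable and file names are hypothetical):

import numpy as np
import xray

ds = xray.Dataset({'tmin': ('time', np.random.randn(365))})
# per-variable encoding: zlib compression at level 4
ds.to_netcdf('compressed.nc', encoding={'tmin': {'zlib': True, 'complevel': 4}})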
Add real and imag attributes to Dataset and DataArray (GH553).
More informative error message with from_dataframe() if the frame has duplicate columns.
xray now uses deterministic names for dask arrays it creates or opens from disk. This allows xray users to take advantage of dask’s nascent support for caching intermediate computation results. See GH555 for an example.
Bug fixes¶
- Forwards compatibility with the latest pandas release (v0.17.0). We were using some internal pandas routines for datetime conversion, which unfortunately have now changed upstream (GH569).
- Aggregation functions now correctly skip NaN for data for complex128 dtype (GH554).
- Fixed indexing 0d arrays with unicode dtype (GH568).
- DataArray.name and Dataset keys must be a string or None to be written to netCDF (GH533).
- where() now uses dask instead of numpy if either the array or other is a dask array. Previously, if other was a numpy array the method was evaluated eagerly.
- Global attributes are now handled more consistently when loading remote datasets using engine='pydap' (GH574).
- It is now possible to assign to the .data attribute of DataArray objects.
- coordinates attribute is now kept in the encoding dictionary after decoding (GH610).
- Compatibility with numpy 1.10 (GH617).
Acknowledgments¶
The following individuals contributed to this release:
- Ryan Abernathey
- Pete Cable
- Clark Fitzgerald
- Joe Hamman
- Stephan Hoyer
- Scott Sinclair
v0.6.0 (21 August 2015)¶
This release includes numerous bug fixes and enhancements. Highlights include the introduction of a plotting module and the new Dataset and DataArray methods isel_points(), sel_points(), where() and diff(). There are no breaking changes from v0.5.2.
Enhancements¶
Plotting methods have been implemented on DataArray objects via plot(), through integration with matplotlib (GH185). For an introduction, see Plotting.
Variables in netCDF files with multiple missing values are now decoded as NaN after issuing a warning if open_dataset is called with mask_and_scale=True.
We clarified our rules for when the result from an xray operation is a copy vs. a view (see Copies vs. views for more details).
Dataset variables are now written to netCDF files in order of appearance when using the netcdf4 backend (GH479).
Added isel_points() and sel_points() to support pointwise indexing of Datasets and DataArrays (GH475).
In [26]: da = xray.DataArray(np.arange(56).reshape((7, 8)),
   ....:                     coords={'x': list('abcdefg'),
   ....:                             'y': 10 * np.arange(8)},
   ....:                     dims=['x', 'y'])
   ....:

In [27]: da
Out[27]:
<xray.DataArray (x: 7, y: 8)>
array([[ 0, 1, 2, 3, 4, 5, 6, 7],
[ 8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55]])
Coordinates:
* y (y) int64 0 10 20 30 40 50 60 70
* x (x) |S1 'a' 'b' 'c' 'd' 'e' 'f' 'g'

# we can index by position along each dimension
In [28]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[28]:
<xray.DataArray (points: 3)>
array([ 0, 9, 48])
Coordinates:
y (points) int64 0 10 0
x (points) |S1 'a' 'b' 'g'
* points (points) int64 0 1 2

# or equivalently by label
In [29]: da.sel_points(x=['a', 'b', 'g'], y=[0, 10, 0], dim='points')
Out[29]:
<xray.DataArray (points: 3)>
array([ 0, 9, 48])
Coordinates:
y (points) int64 0 10 0
x (points) |S1 'a' 'b' 'g'
* points (points) int64 0 1 2
New where() method for masking xray objects according to some criteria. This works particularly well with multi-dimensional data:
In [30]: ds = xray.Dataset(coords={'x': range(100), 'y': range(100)})

In [31]: ds['distance'] = np.sqrt(ds.x ** 2 + ds.y ** 2)

In [32]: ds.distance.where(ds.distance < 100).plot()
Out[32]: <matplotlib.collections.QuadMesh at 0x7f23b4eac950>
Added new methods DataArray.diff and Dataset.diff for finite difference calculations along a given axis.
New to_masked_array() convenience method for returning a numpy.ma.MaskedArray.
In [33]: da = xray.DataArray(np.random.random_sample(size=(5, 4)))

In [34]: da.where(da < 0.5)
Out[34]:
<xarray.DataArray (dim_0: 5, dim_1: 4)>
array([[ 0.127, nan, 0.26 , nan],
[ 0.377, 0.336, 0.451, nan],
[ 0.123, nan, 0.373, 0.448],
[ 0.129, nan, nan, 0.352],
[ 0.229, nan, nan, 0.138]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3 4
* dim_1 (dim_1) int64 0 1 2 3

In [35]: da.where(da < 0.5).to_masked_array(copy=True)
Out[35]:
masked_array(data =
 [[0.12696983303810094 -- 0.26047600586578334 --]
 [0.37674971618967135 0.33622174433445307 0.45137647047539964 --]
 [0.12310214428849964 -- 0.37301222522143085 0.4479968246859435]
 [0.12944067971751294 -- -- 0.35205353914802473]
 [0.2288873043216132 -- -- 0.1375535565632705]],
             mask =
 [[False  True False  True]
 [False False False  True]
 [False  True False False]
 [False  True  True False]
 [False  True  True False]],
       fill_value = 1e+20)
Added new flag “drop_variables” to open_dataset() for excluding variables from being parsed. This may be useful to drop variables with problems or inconsistent values.
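For example, a brief sketch (the file and variable names are hypothetical):

import xray

# skip variables that have problems or inconsistent values
ds = xray.open_dataset('data.nc', drop_variables=['bad_var'])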
Bug fixes¶
- Fixed aggregation functions (e.g., sum and mean) on big-endian arrays when bottleneck is installed (GH489).
- Dataset aggregation functions dropped variables with unsigned integer dtype (GH505).
- .any() and .all() were not lazy when used on xray objects containing dask arrays.
- Fixed an error when attempting to save datetime64 variables to netCDF files when the first element is NaT (GH528).
- Fix pickle on DataArray objects (GH515).
- Fixed unnecessary coercion of float64 to float32 when using netcdf3 and netcdf4_classic formats (GH526).
v0.5.2 (16 July 2015)¶
This release contains bug fixes, several additional options for opening and saving netCDF files, and a backwards incompatible rewrite of the advanced options for xray.concat.
Backwards incompatible changes¶
- The optional arguments concat_over and mode in concat() have been removed and replaced by data_vars and coords. The new arguments are both more easily understood and more robustly implemented, and allowed us to fix a bug where concat accidentally loaded data into memory. If you set values for these optional arguments manually, you will need to update your code. The default behavior should be unchanged.
Enhancements¶
open_mfdataset() now supports a preprocess argument for preprocessing datasets prior to concatenation. This is useful if datasets cannot be otherwise merged automatically, e.g., if the original datasets have conflicting index coordinates (GH443).
open_dataset() and open_mfdataset() now use a global thread lock by default for reading from netCDF files with dask. This avoids possible segmentation faults for reading from netCDF4 files when HDF5 is not configured properly for concurrent access (GH444).
Added support for serializing arrays of complex numbers with engine=’h5netcdf’.
The new save_mfdataset() function allows for saving multiple datasets to disk simultaneously. This is useful when processing large datasets with dask.array. For example, to save a dataset too big to fit into memory to one file per year, we could write:
In [36]: years, datasets = zip(*ds.groupby('time.year'))

In [37]: paths = ['%s.nc' % y for y in years]

In [38]: xray.save_mfdataset(datasets, paths)
Bug fixes¶
- Fixed min, max, argmin and argmax for arrays with string or unicode types (GH453).
- open_dataset() and open_mfdataset() support supplying chunks as a single integer.
- Fixed a bug in serializing a scalar datetime variable to netCDF.
- Fixed a bug that could occur in serialization of 0-dimensional integer arrays.
- Fixed a bug where concatenating DataArrays was not always lazy (GH464).
- When reading datasets with h5netcdf, bytes attributes are decoded to strings. This allows conventions decoding to work properly on Python 3 (GH451).
v0.5.1 (15 June 2015)¶
This minor release fixes a few bugs and an inconsistency with pandas. It also adds the pipe method, copied from pandas.
Enhancements¶
- Added pipe(), replicating the new pandas method in version 0.16.2. See Transforming datasets for more details.
- assign() and assign_coords() now assign new variables in sorted (alphabetical) order, mirroring the behavior in pandas. Previously, the order was arbitrary.
v0.5 (1 June 2015)¶
Highlights¶
The headline feature in this release is experimental support for out-of-core computing (data that doesn’t fit into memory) with dask. This includes a new top-level function open_mfdataset() that makes it easy to open a collection of netCDF files (using dask) as a single xray.Dataset object. For more on dask, read the blog post introducing xray + dask and the new documentation section Out of core computation with dask.
Dask makes it possible to harness parallelism and manipulate gigantic datasets with xray. It is currently an optional dependency, but it may become required in the future.
Backwards incompatible changes¶
The logic used for choosing which variables are concatenated with concat() has changed. Previously, by default any variables which were equal across a dimension were not concatenated. This led to some surprising behavior, where the behavior of groupby and concat operations could depend on runtime values (GH268). For example:
In [39]: ds = xray.Dataset({'x': 0})

In [40]: xray.concat([ds, ds], dim='y')
Out[40]:
<xray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
x int64 0
Now, the default always concatenates data variables:
In [41]: xray.concat([ds, ds], dim='y')
Out[41]:
<xarray.Dataset>
Dimensions: (y: 2)
Coordinates:
* y (y) int64 0 1
Data variables:
x (y) int64 0 0
To obtain the old behavior, supply the argument concat_over=[].
Enhancements¶
New to_array() and enhanced to_dataset() methods make it easy to switch back and forth between arrays and datasets:
In [42]: ds = xray.Dataset({'a': 1, 'b': ('x', [1, 2, 3])},
   ....:                   coords={'c': 42}, attrs={'Conventions': 'None'})
   ....:

In [43]: ds.to_array()
Out[43]:
<xarray.DataArray (variable: 2, x: 3)>
array([[1, 1, 1],
[1, 2, 3]])
Coordinates:
* variable (variable) |S1 'a' 'b'
* x (x) int64 0 1 2
c int64 42
Attributes:
Conventions: None

In [44]: ds.to_array().to_dataset(dim='variable')
Out[44]:
<xarray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
c int64 42
Data variables:
a (x) int64 1 1 1
b (x) int64 1 2 3
Attributes:
Conventions: None
New fillna() method to fill missing values, modeled off the pandas method of the same name:
In [45]: array = xray.DataArray([np.nan, 1, np.nan, 3], dims='x')

In [46]: array.fillna(0)
Out[46]:
<xarray.DataArray (x: 4)>
array([ 0., 1., 0., 3.])
Coordinates:
* x (x) int64 0 1 2 3
fillna works on both Dataset and DataArray objects, and uses index based alignment and broadcasting like standard binary operations. It also can be applied by group, as illustrated in Fill missing values with climatology.
New assign() and assign_coords() methods patterned off the new DataFrame.assign method in pandas:
In [47]: ds = xray.Dataset({'y': ('x', [1, 2, 3])})

In [48]: ds.assign(z = lambda ds: ds.y ** 2)
Out[48]:
<xarray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
Data variables:
y (x) int64 1 2 3
z (x) int64 1 4 9

In [49]: ds.assign_coords(z = ('x', ['a', 'b', 'c']))
Out[49]:
<xarray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
z (x) |S1 'a' 'b' 'c'
Data variables:
y (x) int64 1 2 3
These methods return a new Dataset (or DataArray) with updated data or coordinate variables.
sel() now supports the method parameter, which works like the parameter of the same name on reindex(). It provides a simple interface for doing nearest-neighbor interpolation:
In [50]: ds.sel(x=1.1, method='nearest')
Out[50]:
<xray.Dataset>
Dimensions: ()
Coordinates:
x int64 1
Data variables:
y int64 2

In [51]: ds.sel(x=[1.1, 2.1], method='pad')
Out[51]:
<xray.Dataset>
Dimensions: (x: 2)
Coordinates:
* x (x) int64 1 2
Data variables:
y (x) int64 2 3
See Nearest neighbor lookups for more details.
You can now control the underlying backend used for accessing remote datasets (via OPeNDAP) by specifying engine='netcdf4' or engine='pydap'.
xray now provides experimental support for reading and writing netCDF4 files directly via h5py with the h5netcdf package, avoiding the netCDF4-Python package. You will need to install h5netcdf and specify engine='h5netcdf' to try this feature.
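A short sketch of selecting this backend explicitly (the file names are hypothetical):

import xray

ds = xray.open_dataset('data.nc', engine='h5netcdf')  # read via h5py
ds.to_netcdf('copy.nc', engine='h5netcdf')            # write via h5py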
Accessing data from remote datasets now has retrying logic (with exponential backoff) that should make it robust to occasional bad responses from DAP servers.
You can control the width of the Dataset repr with xray.set_options. It can be used either as a context manager, in which case the default is restored outside the context:
In [52]: ds = xray.Dataset({'x': np.arange(1000)})

In [53]: with xray.set_options(display_width=40):
   ....:     print(ds)
   ....:
<xarray.Dataset>
Dimensions: (x: 1000)
Coordinates:
* x (x) int64 0 1 2 3 4 5 6 ...
Data variables:
*empty*
Or to set a global option:
In [54]: xray.set_options(display_width=80)
The default value for the display_width option is 80.
Deprecations¶
- The method load_data() has been renamed to the more succinct load().
v0.4.1 (18 March 2015)¶
The release contains bug fixes and several new features. All changes should be fully backwards compatible.
Enhancements¶
New documentation sections on Time series data and Combining multiple files.
resample() lets you resample a dataset or data array to a new temporal resolution. The syntax is the same as pandas, except you need to supply the time dimension explicitly:
In [55]: time = pd.date_range('2000-01-01', freq='6H', periods=10)

In [56]: array = xray.DataArray(np.arange(10), [('time', time)])

In [57]: array.resample('1D', dim='time')
Out[57]:
<xarray.DataArray (time: 3)>
array([ 1.5, 5.5, 8.5])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
You can specify how to do the resampling with the how argument and other options such as closed and label let you control labeling:
In [58]: array.resample('1D', dim='time', how='sum', label='right')
Out[58]:
<xarray.DataArray (time: 3)>
array([ 6, 22, 17])
Coordinates:
* time (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
If the desired temporal resolution is higher than the original data (upsampling), xray will insert missing values:
In [59]: array.resample('3H', 'time')
Out[59]:
<xarray.DataArray (time: 19)>
array([ 0., nan, 1., ..., 8., nan, 9.])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T03:00:00 ...
first and last methods on groupby objects let you take the first or last examples from each group along the grouped axis:
In [60]: array.groupby('time.day').first()
Out[60]:
<xarray.DataArray (day: 3)>
array([0, 4, 8])
Coordinates:
* day (day) int64 1 2 3
These methods combine well with resample:
In [61]: array.resample('1D', dim='time', how='first')
Out[61]:
<xarray.DataArray (time: 3)>
array([0, 4, 8])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
swap_dims() allows for easily swapping one dimension out for another:
In [62]: ds = xray.Dataset({'x': range(3), 'y': ('x', list('abc'))})

In [63]: ds
Out[63]:
<xarray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
Data variables:
y (x) |S1 'a' 'b' 'c'

In [64]: ds.swap_dims({'x': 'y'})
Out[64]:
<xarray.Dataset>
Dimensions: (y: 3)
Coordinates:
* y (y) |S1 'a' 'b' 'c'
x (y) int64 0 1 2
Data variables:
*empty*
This was possible in earlier versions of xray, but required some contortions.
open_dataset() and to_netcdf() now accept an engine argument to explicitly select which underlying library (netcdf4 or scipy) is used for reading/writing a netCDF file.
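For example, a minimal sketch (the file names are hypothetical):

import xray

ds = xray.open_dataset('data.nc', engine='scipy')  # force the scipy backend
ds.to_netcdf('out.nc', engine='netcdf4')           # write with netcdf4-python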
Bug fixes¶
- Fixed a bug where netCDF variables read from disk with engine='scipy' could still be associated with the file on disk, even after closing the file (GH341). This manifested itself in warnings about mmapped arrays and segmentation faults (if the data was accessed).
- Silenced spurious warnings about all-NaN slices when using nan-aware aggregation methods (GH344).
- Dataset aggregations with keep_attrs=True now preserve attributes on data variables, not just the dataset itself.
- Tests for xray now pass when run on Windows (GH360).
- Fixed a regression in v0.4 where saving to netCDF could fail with the error ValueError: could not automatically determine time units.
v0.4 (2 March, 2015)¶
This is one of the biggest releases yet for xray: it includes some major changes that may break existing code, along with the usual collection of minor enhancements and bug fixes. On the plus side, this release includes all hitherto planned breaking changes, so the upgrade path for xray should be smoother going forward.
Breaking changes¶
We now automatically align index labels in arithmetic, dataset construction, merging and updating. This means the need for manually invoking methods like align() and reindex_like() should be vastly reduced.
For arithmetic, we align based on the intersection of labels:
In [65]: lhs = xray.DataArray([1, 2, 3], [('x', [0, 1, 2])])

In [66]: rhs = xray.DataArray([2, 3, 4], [('x', [1, 2, 3])])

In [67]: lhs + rhs
Out[67]:
<xarray.DataArray (x: 2)>
array([4, 6])
Coordinates:
* x (x) int64 1 2
For dataset construction and merging, we align based on the union of labels:
In [68]: xray.Dataset({'foo': lhs, 'bar': rhs})
Out[68]:
<xarray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 0 1 2 3
Data variables:
foo (x) float64 1.0 2.0 3.0 nan
bar (x) float64 nan 2.0 3.0 4.0
For update and __setitem__, we align based on the original object:
In [69]: lhs.coords['rhs'] = rhs

In [70]: lhs
Out[70]:
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
* x (x) int64 0 1 2
rhs (x) float64 nan 2.0 3.0
Aggregations like mean or median now skip missing values by default:
In [71]: xray.DataArray([1, 2, np.nan, 3]).mean()
Out[71]:
<xarray.DataArray ()>
array(2.0)
You can turn this behavior off by supplying the keyword argument skipna=False.
These operations are lightning fast thanks to integration with bottleneck, which is a new optional dependency for xray (numpy is used if bottleneck is not installed).
Scalar coordinates no longer conflict with constant arrays with the same value (e.g., in arithmetic, merging datasets and concat), even if they have different shape (GH243). For example, the coordinate c here persists through arithmetic, even though it has different shapes on each DataArray:
In [72]: a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')

In [73]: b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')

In [74]: (a + b).coords
Out[74]:
Coordinates:
c (x) int64 0 0
* x (x) int64 0 1
This functionality can be controlled through the compat option, which has also been added to the Dataset constructor.
Datetime shortcuts such as 'time.month' now return a DataArray with the name 'month', not 'time.month' (GH345). This makes it easier to index the resulting arrays when they are used with groupby:
In [75]: time = xray.DataArray(pd.date_range('2000-01-01', periods=365),
   ....:                       dims='time', name='time')
   ....:

In [76]: counts = time.groupby('time.month').count()

In [77]: counts.sel(month=2)
Out[77]:
<xarray.DataArray 'time' ()>
array(29)
Coordinates:
month int64 2
Previously, you would need to use something like counts.sel(**{'time.month': 2}), which is much more awkward.
The season datetime shortcut now returns an array of string labels such as ‘DJF’:
In [78]: ds = xray.Dataset({'t': pd.date_range('2000-01-01', periods=12, freq='M')})

In [79]: ds['t.season']
Out[79]:
<xarray.DataArray 'season' (t: 12)>
array(['DJF', 'DJF', 'MAM', ..., 'SON', 'SON', 'DJF'],
      dtype='|S3')
Coordinates:
* t (t) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 2000-04-30 ...
Previously, it returned numbered seasons 1 through 4.
We have updated our use of the terms “coordinates” and “variables”. What were known in previous versions of xray as “coordinates” and “variables” are now referred to throughout the documentation as “coordinate variables” and “data variables”. This brings xray in closer alignment to CF Conventions. The only visible change besides the documentation is that Dataset.vars has been renamed Dataset.data_vars.
You will need to update your code if you have been ignoring deprecation warnings: methods and attributes that were deprecated in xray v0.3 or earlier (e.g., dimensions, attributes) have gone away.
Enhancements¶
Support for reindex() with a fill method. This provides a useful shortcut for upsampling:
In [80]: data = xray.DataArray([1, 2, 3], dims='x')

In [81]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[81]:
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
* x (x) float64 0.5 1.0 1.5 2.0 2.5
This will be especially useful once pandas 0.16 is released, at which point xray will immediately support reindexing with method='nearest'.
Use functions that return generic ndarrays with DataArray.groupby.apply and Dataset.apply (GH327 and GH329). Thanks Jeff Gerard!
Consolidated the functionality of dumps (writing a dataset to a netCDF3 bytestring) into to_netcdf() (GH333).
to_netcdf() now supports writing to groups in netCDF4 files (GH333). It also finally has a full docstring – you should read it!
open_dataset() and to_netcdf() now work on netCDF3 files when netcdf4-python is not installed as long as scipy is available (GH333).
The new Dataset.drop and DataArray.drop methods make it easy to drop explicitly listed variables or index labels:
# drop variables
In [82]: ds = xray.Dataset({'x': 0, 'y': 1})

In [83]: ds.drop('x')
Out[83]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
y int64 1

# drop index labels
In [84]: arr = xray.DataArray([1, 2, 3], coords=[('x', list('abc'))])

In [85]: arr.drop(['a', 'c'], dim='x')
Out[85]:
<xarray.DataArray (x: 1)>
array([2])
Coordinates:
* x (x) |S1 'b'
broadcast_equals() has been added to correspond to the new compat option.
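A brief sketch of how broadcast_equals() differs from equals() (the values are illustrative):

scalar = xray.DataArray(0)                   # 0-dimensional
constant = xray.DataArray([0, 0], dims='x')  # constant 1-d array

scalar.equals(constant)            # False: shapes differ
scalar.broadcast_equals(constant)  # True: equal after broadcasting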
Long attributes are now truncated at 500 characters when printing a dataset (GH338). This should make things more convenient for working with datasets interactively.
Added a new documentation example, Calculating Seasonal Averages from Timeseries of Monthly Means. Thanks Joe Hamman!
Bug fixes¶
- Several bug fixes related to decoding time units from netCDF files (GH316, GH330). Thanks Stefan Pfenninger!
- xray no longer requires decode_coords=False when reading datasets with unparseable coordinate attributes (GH308).
- Fixed DataArray.loc indexing with ... (GH318).
- Fixed an edge case that resulted in an error when reindexing multi-dimensional variables (GH315).
- Fixed slicing with negative step sizes (GH312).
- Fixed invalid conversion of string arrays to numeric dtype (GH305).
- Fixed repr() on dataset objects with non-standard dates (GH347).
Deprecations¶
- dump and dumps have been deprecated in favor of to_netcdf().
- drop_vars has been deprecated in favor of drop().
Future plans¶
The biggest feature I’m excited about working toward in the immediate future is supporting out-of-core operations in xray using Dask, a part of the Blaze project. For a preview of using Dask with weather data, read this blog post by Matthew Rocklin. See GH328 for more details.
v0.3.2 (23 December, 2014)¶
This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.
There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.
Warning
xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.
Backwards incompatible changes¶
Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:
In [86]: from datetime import datetime

In [87]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
Out[87]:
<xarray.Dataset>
Dimensions: (t: 1)
Coordinates:
* t (t) datetime64[ns] 2000-01-01
Data variables:
*empty*
xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are thus accordingly cast to timedelta64[ns] objects when appropriate.
Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).
Enhancements¶
Due to popular demand, we have added experimental attribute style access as a shortcut for dataset variables, coordinates and attributes:
In [88]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celcius'})})

In [89]: ds.tmin.units
Out[89]: 'celcius'
Tab-completion for these variables should work in editors such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).
You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:
In [90]: array = xray.DataArray(np.zeros(5), dims=['x'])

In [91]: array[dict(x=slice(3))] = 1

In [92]: array
Out[92]:
<xarray.DataArray (x: 5)>
array([ 1., 1., 1., 0., 0.])
Coordinates:
* x (x) int64 0 1 2 3 4
Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.
Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.
We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30-50%. The full speed up requires cyordereddict to be installed.
Bug fixes¶
Future plans¶
- I am contemplating switching to the terms “coordinate variables” and “data variables” instead of the (currently used) “coordinates” and “variables”, following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
- I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
- The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I’m pretty sick of writing ds.reduce(np.nanmean, 'time').
- The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.
If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.
v0.3.1 (22 October, 2014)¶
This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).
We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.
Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.
New features¶
- Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58).
- Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
- Support for reading gzipped netCDF3 files (GH239).
- Reduced memory usage when writing netCDF files (GH251).
- ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
- Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
Bug fixes¶
- Compatibility fixes for pandas v0.15 (GH262).
- Fixes for display and indexing of NaT (not-a-time) (GH238, GH240).
- Fix slicing by label when an argument is a data array (GH250).
- Test data is now shipped with the source distribution (GH253).
- Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
- Order of dimensions preserved with DataArray.to_dataframe (GH260).
v0.3 (21 September 2014)¶
New features¶
- Revamped coordinates: “coordinates” now refer to all arrays that are not used to index a dimension. Coordinates are intended to allow for keeping track of arrays of metadata that describe the grid on which the points in “variable” arrays lie. They are preserved (when unambiguous) even through mathematical operations.
- Dataset math: Dataset objects now support all arithmetic operations directly (see the sketch after this list). Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
- GroupBy math, which provides a convenient shortcut for normalizing by the average value of a group.
- The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
- You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
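As a brief sketch of the dataset math mentioned above (the variable names are illustrative):

ds = xray.Dataset({'a': ('x', [1, 2, 3]), 'b': ('x', [10, 20, 30])})
ds + 1          # dataset-scalar: applied to every data variable
ds - ds.mean()  # dataset-dataset: variables are paired by name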
Backwards incompatible changes¶
- Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
Deprecations¶
- Dataset.noncoords is deprecated: use Dataset.vars instead.
- Dataset.select_vars deprecated: index a Dataset with a list of variable names instead.
- DataArray.select_vars and DataArray.drop_vars deprecated: use reset_coords() instead.
v0.2 (14 August 2014)¶
This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:
- There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
- You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
- xray now works on Python 2.6, thanks to Anna Kuznetsova.
- A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
- New load_data() and close() methods for datasets facilitate lower-level control of data loaded from disk.
v0.1.1 (20 May 2014)¶
xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:
- Python 3 support (GH53)
- Required numpy version relaxed to 1.7 (GH129)
- Return numpy.datetime64 arrays for non-standard calendars (GH126)
- Support for opening datasets associated with NetCDF4 groups (GH127)
- Bug-fixes for concatenating datetime arrays (GH134)
Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.
v0.1 (2 May 2014)¶
Initial release.
Overview: Why xarray?¶
Features¶
Adding dimension names and coordinate indexes to numpy’s ndarray makes many powerful array operations possible:
- Apply operations over dimensions by name: x.sum('time').
- Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
- Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
- Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
- Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
- Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
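A compact sketch combining several of these features (the array, dimension names and dates are illustrative):

import numpy as np
import pandas as pd
import xarray as xr

x = xr.DataArray(np.random.randn(4, 3),
                 coords={'time': pd.date_range('2014-01-01', periods=4)},
                 dims=('time', 'space'))
x.sum('time')                       # aggregate over a named dimension
x.sel(time='2014-01-01')            # select by label
x.groupby('time.dayofyear').mean()  # split-apply-combine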
pandas provides many of these features, but it does not make use of dimension names, and its core data structures are fixed dimensional arrays.
The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions (e.g., np.newaxis) to align arrays.
Core data structures¶
xarray has two core data structures. Both are fundamentally N-dimensional:
- DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.
The value of attaching labels to numpy’s numpy.ndarray may be fairly obvious, but the dataset may need more motivation.
The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.
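A minimal sketch of that idea (the variable names are illustrative):

import numpy as np
import xarray as xr

ds = xr.Dataset({'temperature': (('x', 'y'), np.zeros((3, 4))),
                 'station': ('x', ['a', 'b', 'c'])})
ds.isel(x=0)  # one selection applies to both arrays, despite different ndim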
This data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.
xarray distinguishes itself from many tools for working with netCDF data in-so-far as it provides data structures for in-memory analytics that both utilize and preserve labels. You only need to do the tedious work of adding metadata once, not every time you save a file.
Goals and aspirations¶
pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in.
xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).
Importantly, xarray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.
Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.
Frequently Asked Questions¶
Why is pandas not enough?¶
pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier Python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?
Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.
Pandas does support N-dimensional panels, but the implementation is very limited:
- You need to create a new factory type for each dimensionality.
- You can’t do math between NDPanels with different dimensionality.
- Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.) but the dimension names refer to order, not their meaning. You can’t specify an operation to be applied along the “time” axis.
Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. In my experience, it is usually easier to work with a DataFrame with a hierarchical index than to use higher dimensional (N > 3) data structures in pandas.
Another use case is handling collections of arrays with different numbers of dimensions. For example, suppose you have a 2D array and a handful of associated 1D arrays that share one of the same axes. Storing these in one pandas object is possible but awkward – you can either upcast all the 1D arrays to 2D and store everything in a Panel, or put everything in a DataFrame, where the first few columns have a different meaning than the other columns. In contrast, this sort of data structure fits very naturally in an xarray Dataset.
Pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.
How do xarray data structures differ from those found in pandas?¶
The main distinguishing feature of xarray’s DataArray over labeled arrays in pandas is that dimensions can have names (e.g., “time”, “latitude”, “longitude”). Names are much easier to keep track of than axis numbers, and xarray uses dimension names for indexing, aggregation and broadcasting. Not only can you write x.sel(time='2000-01-01') and x.mean(dim='time'), but operations like x - x.mean(dim='time') always work, no matter the order of the “time” dimension. You never need to reshape arrays (e.g., with np.newaxis) to align them for arithmetic operations in xarray.
Should I use xarray instead of pandas?¶
It’s not an either/or choice! xarray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.
That said, you should only bother with xarray if some aspect of data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.
What is your approach to metadata?¶
We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xarray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).
Automatic interpretation of labels is powerful but also reduces flexibility. With xarray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to and from netCDF files.)
An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
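For example, a sketch of the keep_attrs option (the attribute names are illustrative):

import xarray as xr

da = xr.DataArray([1.0, 2.0, 3.0], dims='x', attrs={'units': 'm'})
da.mean().attrs                 # empty: attrs are dropped by default
da.mean(keep_attrs=True).attrs  # {'units': 'm'} is retained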
Examples¶
Quick overview¶
Here are some quick examples of what you can do with xarray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.
To begin, import numpy, pandas and xarray using their customary abbreviations:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import xarray as xr
Create a DataArray¶
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
In [4]: xr.DataArray(np.random.randn(2, 3))
Out[4]:
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[-1.344, 0.845, 1.076],
[-0.109, 1.644, -1.469]])
Coordinates:
* dim_0 (dim_0) int64 0 1
* dim_1 (dim_1) int64 0 1 2
In [5]: data = xr.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])
In [6]: data
Out[6]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
If you supply a pandas Series or DataFrame, metadata is copied directly:
In [7]: xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
Out[7]:
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
* dim_0 (dim_0) object 'a' 'b' 'c'
Here are the key properties for a DataArray:
# like in pandas, values is a numpy array that you can modify in-place
In [8]: data.values
Out[8]:
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
In [9]: data.dims
Out[9]: ('x', 'y')
In [10]: data.coords
Out[10]:
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# you can use this dictionary to store arbitrary metadata
In [11]: data.attrs
Out[11]: OrderedDict()
Indexing¶
xarray supports four kinds of indexing. These operations are just as fast as in pandas, because we borrow pandas’ indexing machinery.
# positional and by integer label, like numpy
In [12]: data[[0, 1]]
Out[12]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# positional and by coordinate label, like pandas
In [13]: data.loc['a':'b']
Out[13]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and integer label
In [14]: data.isel(x=slice(2))
Out[14]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and coordinate label
In [15]: data.sel(x=['a', 'b'])
Out[15]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Computation¶
Data arrays work very similarly to numpy ndarrays:
In [16]: data + 10
Out[16]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 10.357, 9.325, 8.223],
[ 9.031, 8.705, 10.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [17]: np.sin(data)
Out[17]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.349, -0.625, -0.979],
[-0.824, -0.962, 0.402]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [18]: data.T
Out[18]:
<xarray.DataArray (y: 3, x: 2)>
array([[ 0.357, -0.969],
[-0.675, -1.295],
[-1.777, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [19]: data.sum()
Out[19]:
<xarray.DataArray ()>
array(-3.9441825539138033)
However, aggregation operations can use dimension names instead of axis numbers:
In [20]: data.mean(dim='x')
Out[20]:
<xarray.DataArray (y: 3)>
array([-0.306, -0.985, -0.682])
Coordinates:
* y (y) int64 -2 0 2
Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:
In [21]: a = xr.DataArray(np.random.randn(3), [data.coords['y']])
In [22]: b = xr.DataArray(np.random.randn(4), dims='z')
In [23]: a
Out[23]:
<xarray.DataArray (y: 3)>
array([ 0.277, -0.472, -0.014])
Coordinates:
* y (y) int64 -2 0 2
In [24]: b
Out[24]:
<xarray.DataArray (z: 4)>
array([-0.363, -0.006, -0.923, 0.896])
Coordinates:
* z (z) int64 0 1 2 3
In [25]: a + b
Out[25]:
<xarray.DataArray (y: 3, z: 4)>
array([[-0.086, 0.271, -0.646, 1.172],
[-0.835, -0.478, -1.395, 0.424],
[-0.377, -0.02 , -0.937, 0.882]])
Coordinates:
* y (y) int64 -2 0 2
* z (z) int64 0 1 2 3
It also means that in most cases you do not need to worry about the order of dimensions:
In [26]: data - data.T
Out[26]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Operations also align based on index labels:
In [27]: data[:-1] - data[:1]
Out[27]:
<xarray.DataArray (x: 1, y: 3)>
array([[ 0., 0., 0.]])
Coordinates:
* x (x) |S1 'a'
* y (y) int64 -2 0 2
GroupBy¶
xarray supports grouped operations using a very similar API to pandas:
In [28]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
In [29]: labels
Out[29]:
<xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'],
dtype='|S1')
Coordinates:
* y (y) int64 -2 0 2
In [30]: data.groupby(labels).mean('y')
Out[30]:
<xarray.DataArray (x: 2, labels: 2)>
array([[-0.71 , -0.675],
[-0.278, -1.295]])
Coordinates:
* x (x) |S1 'a' 'b'
* labels (labels) object 'E' 'F'
In [31]: data.groupby(labels).apply(lambda x: x - x.min())
Out[31]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 2.134, 0.62 , 0. ],
[ 0.808, 0. , 2.191]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
labels (y) |S1 'E' 'F' 'E'
Convert to pandas¶
A key feature of xarray is robust conversion to and from pandas objects:
In [32]: data.to_series()
Out[32]:
x y
a -2 0.357021
0 -0.674600
2 -1.776904
b -2 -0.968914
0 -1.294524
2 0.413738
dtype: float64
In [33]: data.to_pandas()
Out[33]:
y -2 0 2
x
a 0.357021 -0.674600 -1.776904
b -0.968914 -1.294524 0.413738
Datasets and NetCDF¶
xarray.Dataset is a dict-like container of DataArray objects that share index labels and dimensions. It looks a lot like a netCDF file:
In [34]: ds = data.to_dataset(name='foo')
In [35]: ds
Out[35]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Data variables:
foo (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137
If you prefer to work with multiple variables at once, almost everything you can do with DataArray objects is also possible with Dataset objects.
Datasets also let you easily read and write netCDF files:
In [36]: ds.to_netcdf('example.nc')
In [37]: xr.open_dataset('example.nc')
Out[37]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int32 -2 0 2
* x (x) |S1 'a' 'b'
Data variables:
foo (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137
Toy weather data¶
Here is an example of how to easily manipulate a toy weather dataset using xarray and other recommended Python libraries:
Shared setup:
import xarray as xr
import numpy as np
import pandas as pd
import seaborn as sns # pandas aware plotting library
np.random.seed(123)
times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)
ds = xr.Dataset({'tmin': (('time', 'location'), tmin_values),
'tmax': (('time', 'location'), tmax_values)},
{'time': times, 'location': ['IA', 'IN', 'IL']})
Examine a dataset with pandas and seaborn¶
In [1]: ds
Out[1]:
<xarray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
tmax (time, location) float64 12.98 3.31 6.779 0.4479 6.373 4.843 ...
tmin (time, location) float64 -8.037 -1.788 -3.932 -9.341 -6.558 ...
In [2]: df = ds.to_dataframe()
In [3]: df.head()
Out[3]:
tmax tmin
location time
IA 2000-01-01 12.980549 -8.037369
2000-01-02 0.447856 -9.341157
2000-01-03 5.322699 -12.139719
2000-01-04 1.889425 -7.492914
2000-01-05 0.791176 -0.447129
In [4]: df.describe()
Out[4]:
tmax tmin
count 2193.000000 2193.000000
mean 20.108232 9.975426
std 11.010569 10.963228
min -3.506234 -13.395763
25% 9.853905 -0.040347
50% 19.967409 10.060403
75% 30.045588 20.083590
max 43.271148 33.456060
In [5]: ds.mean(dim='location').to_dataframe().plot()
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x7f23b42bcb10>
In [6]: sns.pairplot(df.reset_index(), vars=ds.data_vars)
Out[6]: <seaborn.axisgrid.PairGrid at 0x7f0fd2368a10>
Probability of freeze by calendar month¶
In [7]: freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')
In [8]: freeze
Out[8]:
<xarray.DataArray 'tmin' (month: 12, location: 3)>
array([[ 0.952, 0.887, 0.935],
[ 0.842, 0.719, 0.772],
[ 0.242, 0.129, 0.161],
...,
[ 0. , 0.016, 0. ],
[ 0.333, 0.35 , 0.233],
[ 0.935, 0.855, 0.823]])
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
In [9]: freeze.to_pandas().plot()
Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x7f23aff96150>
Monthly averaging¶
In [10]: monthly_avg = ds.resample('1MS', dim='time', how='mean')
In [11]: monthly_avg.sel(location='IA').to_dataframe().plot(style='s-')
Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x7f23b40ef410>
Note that MS here refers to Month-Start; M labels Month-End (the last day of the month).
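To see the difference, you can compare the two frequencies directly in pandas (a small illustrative check, not part of the original example):

import pandas as pd

print(pd.date_range('2000-01-01', periods=3, freq='MS'))
# DatetimeIndex(['2000-01-01', '2000-02-01', '2000-03-01'], dtype='datetime64[ns]', freq='MS')
print(pd.date_range('2000-01-01', periods=3, freq='M'))
# DatetimeIndex(['2000-01-31', '2000-02-29', '2000-03-31'], dtype='datetime64[ns]', freq='M')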
Calculate monthly anomalies¶
In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.
In [12]: climatology = ds.groupby('time.month').mean('time')
In [13]: anomalies = ds.groupby('time.month') - climatology
In [14]: anomalies.mean('location').to_dataframe()[['tmin', 'tmax']].plot()
Out[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7f23affd9f90>
Fill missing values with climatology¶
The fillna() method on grouped objects lets you easily fill missing values by group:
# throw away the first half of every month
In [15]: some_missing = ds.tmin.sel(time=ds['time.day'] > 15).reindex_like(ds)
In [16]: filled = some_missing.groupby('time.month').fillna(climatology.tmin)
In [17]: both = xr.Dataset({'some_missing': some_missing, 'filled': filled})
In [18]: both
Out[18]:
<xarray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) object 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
month (time) int32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Data variables:
some_missing (time, location) float64 nan nan nan nan nan nan nan nan ...
filled (time, location) float64 -5.163 -4.216 -4.681 -5.163 ...
In [19]: df = both.sel(time='2000').mean('location').reset_coords(drop=True).to_dataframe()
In [20]: df[['filled', 'some_missing']].plot()
Out[20]: <matplotlib.axes._subplots.AxesSubplot at 0x7f23a3cd7d10>
Calculating Seasonal Averages from Timeseries of Monthly Means¶
Author: Joe Hamman
The data for this example can be found in the xray-data repository. This example is also available as an IPython Notebook here.
Suppose we have a netCDF or xray Dataset of monthly mean data and we want to calculate the seasonal average. To do this properly, we need to calculate the weighted average considering that each month has a different number of days.
%matplotlib inline
import numpy as np
import pandas as pd
import xray
from netCDF4 import num2date
import matplotlib.pyplot as plt
print("numpy version : ", np.__version__)
print("pandas version : ", pd.version.version)
print("xray version : ", xray.version.version)
numpy version : 1.9.2
pandas version : 0.16.2
xray version : 0.5.1
Some calendar information so we can support any netCDF calendar.¶
dpm = {'noleap': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'365_day': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'standard': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'proleptic_gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'all_leap': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'366_day': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'360_day': [0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]}
A few calendar functions to determine the number of days in each month¶
If you were just using the standard calendar, it would be easy to use the calendar.monthrange function.
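For comparison, here is what that standard-calendar shortcut looks like (a sketch using only the Python standard library):

import calendar

# calendar.monthrange returns (weekday of the first day, number of days)
_, ndays = calendar.monthrange(2000, 2)
print(ndays)  # 29, since 2000 is a leap year in the standard calendar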
def leap_year(year, calendar='standard'):
    """Determine if year is a leap year"""
    leap = False
    if ((calendar in ['standard', 'gregorian',
                      'proleptic_gregorian', 'julian']) and
            (year % 4 == 0)):
        leap = True
        if ((calendar == 'proleptic_gregorian') and
                (year % 100 == 0) and
                (year % 400 != 0)):
            leap = False
        elif ((calendar in ['standard', 'gregorian']) and
                (year % 100 == 0) and (year % 400 != 0) and
                (year > 1582)):
            # the Gregorian century rule only applies after the 1582 reform;
            # before that, the mixed calendar follows Julian leap year rules
            leap = False
    return leap
def get_dpm(time, calendar='standard'):
    """
    Return an array of days per month corresponding to the months in `time`
    """
    month_length = np.zeros(len(time), dtype=int)
    cal_days = dpm[calendar]
    for i, (month, year) in enumerate(zip(time.month, time.year)):
        month_length[i] = cal_days[month]
        # only February gains a day in leap years
        if leap_year(year, calendar=calendar) and month == 2:
            month_length[i] += 1
    return month_length
Open the Dataset¶
monthly_mean_file = 'RASM_example_data.nc'
ds = xray.open_dataset(monthly_mean_file, decode_coords=False)
print(ds)
<xray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
comment: Output from the Variable Infiltration Capacity (VIC) model.
nco_openmp_thread_number: 1
NCO: 4.3.7
history: history deleted for brevity
Now for the heavy lifting:¶
We first have to come up with the weights:
- calculate the month lengths for each monthly data record
- calculate the weights using groupby('time.season')
Finally, we just need to multiply our weights by the Dataset and sum along the time dimension.
# Make a DataArray with the number of days in each month, size = len(time)
month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar='noleap'),
                              coords=[ds.time], name='month_length')

# Calculate the weights by grouping by 'time.season'.
# Conversion to float type ('astype(float)') only necessary for Python 2.x
weights = month_length.groupby('time.season') / month_length.astype(float).groupby('time.season').sum()
# Test that the sum of the weights for each season is 1.0
np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))
# Calculate the weighted average
ds_weighted = (ds * weights).groupby('time.season').sum(dim='time')
print(ds_weighted)
<xray.Dataset>
Dimensions: (season: 4, x: 275, y: 205)
Coordinates:
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* season (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
Tair (season, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
# only used for comparisons
ds_unweighted = ds.groupby('time.season').mean('time')
ds_diff = ds_weighted - ds_unweighted
# Quick plot to show the results
is_null = np.isnan(ds_unweighted['Tair'][0].values)
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(14,12))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
    plt.sca(axes[i, 0])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_weighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')

    plt.sca(axes[i, 1])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_unweighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')

    plt.sca(axes[i, 2])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_diff['Tair'].sel(season=season).values),
                   vmin=-0.1, vmax=.1, cmap='RdBu_r')
    plt.colorbar(extend='both')

    for j in range(3):
        axes[i, j].axes.get_xaxis().set_ticklabels([])
        axes[i, j].axes.get_yaxis().set_ticklabels([])
        axes[i, j].axes.axis('tight')
    axes[i, 0].set_ylabel(season)

axes[0, 0].set_title('Weighted by DPM')
axes[0, 1].set_title('Equal Weighting')
axes[0, 2].set_title('Difference')
plt.tight_layout()
fig.suptitle('Seasonal Surface Air Temperature', fontsize=16, y=1.02)
# Wrap it into a simple function
def season_mean(ds, calendar='standard'):
    # Make a DataArray of season/year groups
    year_season = xray.DataArray(ds.time.to_index().to_period(freq='Q-NOV').to_timestamp(how='E'),
                                 coords=[ds.time], name='year_season')

    # Make a DataArray with the number of days in each month, size = len(time)
    month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar=calendar),
                                  coords=[ds.time], name='month_length')
    # Calculate the weights by grouping by 'time.season'
    weights = month_length.groupby('time.season') / month_length.groupby('time.season').sum()

    # Test that the sum of the weights for each season is 1.0
    np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))

    # Calculate the weighted average
    return (ds * weights).groupby('time.season').sum(dim='time')
Working with Multidimensional Coordinates¶
Author: Ryan Abernathey
Many datasets have physical coordinates which differ from their logical coordinates. Xarray provides several ways to plot and analyze such datasets.
%matplotlib inline
import numpy as np
import pandas as pd
import xarray as xr
import cartopy.crs as ccrs
from matplotlib import pyplot as plt
print("numpy version : ", np.__version__)
print("pandas version : ", pd.__version__)
print("xarray version : ", xr.version.version)
('numpy version : ', '1.11.0')
('pandas version : ', u'0.18.0')
('xarray version : ', '0.7.2-32-gf957eb8')
As an example, consider this dataset from the xarray-data repository.
! curl -L -O https://github.com/pydata/xarray-data/raw/master/RASM_example_data.nc
ds = xr.open_dataset('RASM_example_data.nc')
ds
<xarray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
yc (y, x) float64 16.53 16.78 17.02 17.27 17.51 17.76 18.0 18.25 ...
xc (y, x) float64 189.2 189.4 189.6 189.7 189.9 190.1 190.2 190.4 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
comment: Output from the Variable Infiltration Capacity (VIC) model.
nco_openmp_thread_number: 1
NCO: 4.3.7
history: history deleted for brevity
In this example, the logical coordinates are x and y, while the physical coordinates are xc and yc, which represent the longitude and latitude of the data.
print(ds.xc.attrs)
print(ds.yc.attrs)
OrderedDict([(u'long_name', u'longitude of grid cell center'), (u'units', u'degrees_east'), (u'bounds', u'xv')])
OrderedDict([(u'long_name', u'latitude of grid cell center'), (u'units', u'degrees_north'), (u'bounds', u'yv')])
Plotting¶
Let’s examine these coordinate variables by plotting them.
fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(14,4))
ds.xc.plot(ax=ax1)
ds.yc.plot(ax=ax2)
<matplotlib.collections.QuadMesh at 0x118688fd0>
/Users/rpa/anaconda/lib/python2.7/site-packages/matplotlib/collections.py:590: FutureWarning: elementwise comparison failed; returning scalar instead, but in the future will perform elementwise comparison
if self._edgecolors == str('face'):
Note that the variables xc (longitude) and yc (latitude) are two-dimensional scalar fields.
If we try to plot the data variable Tair, by default we get the logical coordinates.
ds.Tair[0].plot()
<matplotlib.collections.QuadMesh at 0x11b6da890>
In order to visualize the data on a conventional latitude-longitude grid, we can take advantage of xarray’s ability to apply cartopy map projections.
plt.figure(figsize=(14,6))
ax = plt.axes(projection=ccrs.PlateCarree())
ax.set_global()
ds.Tair[0].plot.pcolormesh(ax=ax, transform=ccrs.PlateCarree(), x='xc', y='yc', add_colorbar=False)
ax.coastlines()
ax.set_ylim([0,90]);
Multidimensional Groupby¶
The above example allowed us to visualize the data on a regular latitude-longitude grid. But what if we want to do a calculation that involves grouping over one of these physical coordinates (rather than the logical coordinates), for example, calculating the mean temperature at each latitude. This can be achieved using xarray’s groupby function, which accepts multidimensional variables. By default, groupby will use every unique value in the variable, which is probably not what we want. Instead, we can use the groupby_bins function to specify the output coordinates of the group.
# define two-degree wide latitude bins
lat_bins = np.arange(0,91,2)
# define a label for each bin corresponding to the central latitude
lat_center = np.arange(1,90,2)
# group according to those bins and take the mean
Tair_lat_mean = ds.Tair.groupby_bins('yc', lat_bins, labels=lat_center).mean()
# plot the result
Tair_lat_mean.plot()
[<matplotlib.lines.Line2D at 0x11cb92e90>]
Note that the resulting coordinate for the groupby_bins operation got the _bins suffix appended: yc_bins. This helps us distinguish it from the original multidimensional variable yc.
Installation¶
Optional dependencies¶
For netCDF and IO¶
For accelerating xarray¶
- bottleneck: speeds up NaN-skipping and rolling window aggregations by a large factor
- cyordereddict: speeds up most internal operations with xarray data structures
For parallel computing¶
- dask.array: required for Out of core computation with dask.
For plotting¶
- matplotlib: required for Plotting.
- cartopy: recommended for Maps.
Instructions¶
xarray itself is a pure Python package, but its dependencies are not. The easiest way to get them installed is to use conda. You can then install xarray with its recommended dependencies with the conda command line tool:
$ conda install xarray dask netCDF4 bottleneck
Note
You might consider using the conda-forge channel, as it has been shown to be more up-to-date and to better handle package dependencies.
If you don’t use conda, be sure you have the required dependencies (numpy and pandas) installed first. Then, install xarray with pip:
$ pip install xarray
To run the test suite after installing xarray, install py.test and run py.test xarray.
Data Structures¶
DataArray¶
xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:
- values: a numpy.ndarray holding the array’s values
- dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
- coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata (attributes)
xarray uses dims and coords to enable its core metadata aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.
DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases (see FAQ, What is your approach to metadata?).
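For example, reducing over a named dimension is equivalent to reducing over the corresponding positional axis, but is self-describing (a small sketch with hypothetical data):

import numpy as np
import xarray as xr

arr = xr.DataArray(np.random.rand(4, 3), dims=['time', 'space'])

# equivalent reductions: the named version does not depend on axis order
by_name = arr.mean(dim='time')
by_axis = arr.mean(axis=0)
assert (by_name == by_axis).all()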
Creating a DataArray¶
The DataArray constructor takes:
- data: a multi-dimensional array of values (e.g., a numpy ndarray, Series, DataFrame or Panel)
- coords: a list or dictionary of coordinates
- dims: a list of dimension names. If omitted, dimension names are taken from coords if possible
- attrs: a dictionary of attributes to add to the instance
- name: a string that names the instance
In [1]: data = np.random.rand(4, 3)
In [2]: locs = ['IA', 'IL', 'IN']
In [3]: times = pd.date_range('2000-01-01', periods=4)
In [4]: foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])
In [5]: foo
Out[5]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Only data is required; all of the other arguments will be filled in with default values:
In [6]: xr.DataArray(data)
Out[6]:
<xarray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3
* dim_1 (dim_1) int64 0 1 2
As you can see, dimensions and coordinate arrays corresponding to each dimension are always present. This behavior is similar to pandas, which fills in index values in the same way.
Coordinates can take the following forms:
- A list of (dim, ticks[, attrs]) pairs with length equal to the number of dimensions
- A dictionary of {coord_name: coord} where the values are each a scalar value, a 1D array or a tuple. Tuples should be in the same form as above, and multiple dimensions can be supplied with the form (dims, data[, attrs]). Supplying a coordinate as a tuple allows for coordinates other than those corresponding to dimensions (more on these later).
As a list of tuples:
In [7]: xr.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
As a dictionary:
In [8]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
...: 'ranking': ('space', [1, 2, 3])},
...: dims=['time', 'space'])
...:
Out[8]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
ranking (space) int64 1 2 3
* space (space) |S2 'IA' 'IL' 'IN'
const int64 42
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
As a dictionary with coords across multiple dimensions:
In [9]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
...: 'ranking': (('time', 'space'), np.arange(12).reshape(4,3))},
...: dims=['time', 'space'])
...:
Out[9]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
ranking (time, space) int64 0 1 2 3 4 5 6 7 8 9 10 11
* space (space) |S2 'IA' 'IL' 'IN'
const int64 42
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
If you create a DataArray by supplying a pandas Series, DataFrame or Panel, any non-specified arguments in the DataArray constructor will be filled in from the pandas object:
In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])
In [11]: df.index.name = 'abc'
In [12]: df.columns.name = 'xyz'
In [13]: df
Out[13]:
xyz x y
abc
a 0 2
b 1 3
In [14]: xr.DataArray(df)
Out[14]:
<xarray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
[1, 3]])
Coordinates:
* abc (abc) object 'a' 'b'
* xyz (xyz) object 'x' 'y'
Xarray supports labeling coordinate values with a pandas.MultiIndex. While it handles multi-indexes with unnamed levels, it is recommended that you explicitly set the names of the levels.
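For example, you might build the index with named levels before passing it to the constructor (a brief sketch; the level names here are hypothetical):

import numpy as np
import pandas as pd
import xarray as xr

midx = pd.MultiIndex.from_product([['a', 'b'], [0, 1]],
                                  names=('letter', 'number'))
da = xr.DataArray(np.random.rand(4), coords=[('x', midx)])
# the named levels can then be used for selection, e.g. da.sel(x=('a', 0))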
DataArray properties¶
Let’s take a look at the important properties on our array:
In [15]: foo.values
Out[15]:
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
In [16]: foo.dims
Out[16]: ('time', 'space')
In [17]: foo.coords
Out[17]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
In [18]: foo.attrs
Out[18]: OrderedDict()
In [19]: print(foo.name)
None
You can even modify values inplace:
In [20]: foo.values = 1.0 * foo.values
Note
The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xarray, use coordinates, or put separate DataArray objects in a single Dataset (see below).
Now fill in some of that missing metadata:
In [21]: foo.name = 'foo'
In [22]: foo.attrs['units'] = 'meters'
In [23]: foo
Out[23]:
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Attributes:
units: meters
The rename() method is another option, returning a new data array:
In [24]: foo.rename('bar')
Out[24]:
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Attributes:
units: meters
DataArray Coordinates¶
The coords property is dict-like. Individual coordinates can be accessed by name, or even by indexing the data array itself:
In [25]: foo.coords['time']
Out[25]:
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000+0000', '2000-01-02T00:00:00.000000000+0000',
'2000-01-03T00:00:00.000000000+0000', '2000-01-04T00:00:00.000000000+0000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
In [26]: foo['time']
Out[26]:
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000+0000', '2000-01-02T00:00:00.000000000+0000',
'2000-01-03T00:00:00.000000000+0000', '2000-01-04T00:00:00.000000000+0000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
These are also DataArray objects, which contain tick-labels for each dimension.
Coordinates can also be set or removed by using the dictionary like syntax:
In [27]: foo['ranking'] = ('space', [1, 2, 3])
In [28]: foo.coords
Out[28]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
ranking (space) int64 1 2 3
In [29]: del foo['ranking']
In [30]: foo.coords
Out[30]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Dataset¶
xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties:
- dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
- data_vars: a dict-like container of DataArrays corresponding to variables
- coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata
The distinction between whether a variable falls in data or coordinates (borrowed from CF conventions) is mostly semantic, and you can probably get away with ignoring it if you like: dictionary-like access on a dataset will supply variables found in either category. However, xarray does make use of the distinction for indexing and computations. Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data.
Here is an example of how we might structure a dataset for a weather forecast:
[Figure: diagram of an example weather-forecast Dataset, with temperature and precipitation as data variables and latitude, longitude, reference_time and time as coordinate variables.]
In this example, it would be natural to call temperature and precipitation “data variables” and all the other arrays “coordinate variables” because they label the points along the dimensions. (see [1] for more background on this example).
Creating a Dataset¶
To make a Dataset from scratch, supply dictionaries for any variables (data_vars), coordinates (coords) and attributes (attrs).
data_vars are supplied as a dictionary with each key as the name of the variable and each value as one of:
- A DataArray
- A tuple of the form (dims, data[, attrs])
- A pandas object
coords are supplied as dictionary of {coord_name: coord} where the values are scalar values, arrays or tuples in the form of (dims, data[, attrs]).
Let’s create some fake data for the example we show above:
In [31]: temp = 15 + 8 * np.random.randn(2, 2, 3)
In [32]: precip = 10 * np.random.rand(2, 2, 3)
In [33]: lon = [[-99.83, -99.32], [-99.79, -99.23]]
In [34]: lat = [[42.25, 42.21], [42.63, 42.59]]
# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
In [35]: ds = xr.Dataset({'temperature': (['x', 'y', 'time'], temp),
....: 'precipitation': (['x', 'y', 'time'], precip)},
....: coords={'lon': (['x', 'y'], lon),
....: 'lat': (['x', 'y'], lat),
....: 'time': pd.date_range('2014-09-06', periods=3),
....: 'reference_time': pd.Timestamp('2014-09-05')})
....:
In [36]: ds
Out[36]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Notice that we did not explicitly include coordinates for the “x” or “y” dimensions, so they were filled in with arrays of ascending integers of the proper length.
Here we pass xarray.DataArray objects or a pandas object as values in the dictionary:
In [37]: xr.Dataset({'bar': foo})
Out[37]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
bar (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...
In [38]: xr.Dataset({'bar': foo.to_pandas()})
Out[38]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN'
Data variables:
bar (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...
Where a pandas object is supplied as a value, the names of its indexes are used as dimension names, and its data is aligned to any existing dimensions.
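A short sketch of this dimension-name inference (the names here are hypothetical):

import pandas as pd
import xarray as xr

s = pd.Series([1, 2, 3], index=pd.Index(['a', 'b', 'c'], name='letters'))
# the Series index name becomes the dimension name of the new variable
print(xr.Dataset({'foo': s})['foo'].dims)  # ('letters',)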
You can also create a dataset from:
- A pandas.DataFrame or pandas.Panel along its columns and items respectively, by passing it into the xarray.Dataset directly
- A pandas.DataFrame with Dataset.from_dataframe, which will additionally handle MultiIndexes. See Working with pandas
- A netCDF file on disk with open_dataset(). See Serialization and IO.
- A netCDF file on disk with open_dataset(). See Serialization and IO.
Dataset contents¶
Dataset implements the Python dictionary interface, with values given by xarray.DataArray objects:
In [39]: 'temperature' in ds
Out[39]: True
In [40]: ds.keys()
Out[40]:
['precipitation',
'temperature',
'lat',
'reference_time',
'lon',
'time',
'x',
'y']
In [41]: ds['temperature']
Out[41]:
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
The valid keys include each listed coordinate and data variable.
Data and coordinate variables are also contained separately in the data_vars and coords dictionary-like attributes:
In [42]: ds.data_vars
Out[42]:
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
In [43]: ds.coords
Out[43]:
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Finally, like data arrays, datasets also store arbitrary metadata in the form of attributes:
In [44]: ds.attrs
Out[44]: OrderedDict()
In [45]: ds.attrs['title'] = 'example attribute'
In [46]: ds
Out[46]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Attributes:
title: example attribute
xarray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or numpy.ndarray objects.
As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:
In [47]: ds.temperature
Out[47]:
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
This is particularly useful in an exploratory context, because you can tab-complete these variable names with tools like IPython.
Dictionary like methods¶
We can update a dataset in-place using Python’s standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:
In [48]: ds = xr.Dataset()
In [49]: ds['temperature'] = (('x', 'y', 'time'), temp)
In [50]: ds['precipitation'] = (('x', 'y', 'time'), precip)
In [51]: ds.coords['lat'] = (('x', 'y'), lat)
In [52]: ds.coords['lon'] = (('x', 'y'), lon)
In [53]: ds.coords['time'] = pd.date_range('2014-09-06', periods=3)
In [54]: ds.coords['reference_time'] = pd.Timestamp('2014-09-05')
To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update(). Note that assigning a DataArray or pandas object to a Dataset variable using __setitem__ or update will automatically align the array(s) to the original dataset’s indexes.
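A sketch of this automatic alignment on assignment (the variable names are hypothetical):

import pandas as pd
import xarray as xr

ds2 = xr.Dataset(coords={'space': ['IA', 'IL', 'IN']})
s = pd.Series([1.0, 2.0, 3.0],
              index=pd.Index(['IA', 'IL', 'TX'], name='space'))
# the Series is aligned to the existing 'space' index: the unmatched 'TX'
# label is dropped and the missing 'IN' entry becomes NaN
ds2['population'] = s
print(ds2['population'].values)  # [ 1.  2.  nan]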
You can copy a Dataset by calling the copy() method. By default, the copy is shallow, so only the container will be copied: the arrays in the Dataset will still be stored in the same underlying numpy.ndarray objects. You can copy all data by calling ds.copy(deep=True).
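A quick sketch of the difference between shallow and deep copies:

import numpy as np
import xarray as xr

ds3 = xr.Dataset({'a': ('x', np.zeros(3))})
shallow = ds3.copy()
deep = ds3.copy(deep=True)

ds3['a'].values[0] = 1.0
print(shallow['a'].values[0])  # 1.0 -- shares the same underlying array
print(deep['a'].values[0])     # 0.0 -- has its own copy of the data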
Transforming datasets¶
In addition to dictionary-like methods (described above), xarray has additional methods (like pandas) for transforming datasets into new objects.
For removing variables, you can select and drop an explicit list of variables by indexing with a list of names or using the drop() method to return a new Dataset. These operations keep around coordinates:
In [55]: list(ds[['temperature']])
Out[55]: ['temperature', 'reference_time', 'lon', 'y', 'time', 'lat', 'x']
In [56]: list(ds[['x']])
Out[56]: ['x', 'reference_time']
In [57]: list(ds.drop('temperature'))
Out[57]: ['x', 'y', 'time', 'precipitation', 'lat', 'lon', 'reference_time']
If a dimension name is given as an argument to drop, it also drops all variables that use that dimension:
In [58]: list(ds.drop('time'))
Out[58]: ['x', 'y', 'lat', 'lon', 'reference_time']
As an alternative to dictionary-like modifications, you can use assign() and assign_coords(). These methods return a new dataset with additional (or replaced) values:
In [59]: ds.assign(temperature2 = 2 * ds.temperature)
Out[59]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature2 (x, y, time) float64 22.08 47.15 41.54 18.69 13.37 34.35 ...
There is also the pipe() method that allows you to use a method call with an external function (e.g., ds.pipe(func)) instead of simply calling it (e.g., func(ds)). This allows you to write pipelines for transforming your data (using “method chaining”) instead of writing hard-to-follow nested function calls:
# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
In [60]: plt.plot((2 * ds.temperature.sel(x=0)).mean('y'))
Out[60]: [<matplotlib.lines.Line2D at 0x7f23ac296190>]
In [61]: (ds.temperature
....: .sel(x=0)
....: .pipe(lambda x: 2 * x)
....: .mean('y')
....: .pipe(plt.plot))
....:
Out[61]: [<matplotlib.lines.Line2D at 0x7f23ac296650>]
Both pipe and assign replicate the pandas methods of the same names (DataFrame.pipe and DataFrame.assign).
With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier to understand code, so we encourage using this approach.
Renaming variables¶
Another useful option is the rename() method to rename dataset variables:
In [62]: ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
Out[62]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Data variables:
temp (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precip (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
The related swap_dims() method allows you to swap dimension and non-dimension variables:
In [63]: ds.coords['day'] = ('time', [6, 7, 8])
In [64]: ds.swap_dims({'time': 'day'})
Out[64]:
<xarray.Dataset>
Dimensions: (day: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
time (day) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
* day (day) int64 6 7 8
Data variables:
temperature (x, y, day) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, day) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
Coordinates¶
Coordinates are ancillary variables stored for DataArray and Dataset objects in the coords attribute:
In [65]: ds.coords
Out[65]:
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Unlike attributes, xarray does interpret and persist coordinates in operations that transform xarray objects.
One dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array) take on a special meaning in xarray. They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these “dimension” coordinates use a pandas.Index internally to store their values.
Other than for indexing, xarray does not make any direct use of the values associated with coordinates. Coordinates with names not matching a dimension are not used for alignment or indexing, nor are they required to match when doing arithmetic (see Coordinates).
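A sketch of one consequence of this rule (illustrative only; the exact handling of conflicting coordinates has varied between versions): non-dimension coordinates with conflicting values do not prevent arithmetic.

import xarray as xr

a = xr.DataArray([1, 2], dims='x', coords={'note': 'first'})
b = xr.DataArray([10, 20], dims='x', coords={'note': 'second'})

# 'note' is a scalar, non-dimension coordinate, so the mismatch does not
# block the operation; conflicting coordinates are dropped from the result
print((a + b).coords)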
Modifying coordinates¶
To entirely add or remove coordinate arrays, you can use dictionary like syntax, as shown above.
To convert back and forth between data and coordinates, you can use the set_coords() and reset_coords() methods:
In [66]: ds.reset_coords()
Out[66]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
In [67]: ds.set_coords(['temperature', 'precipitation'])
Out[67]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Data variables:
*empty*
In [68]: ds['temperature'].reset_coords(drop=True)
Out[68]:
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xarray cannot store a coordinate and a variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.
Coordinates methods¶
Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:
In [69]: ds.coords.to_dataset()
Out[69]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
* x (x) int64 0 1
day (time) int64 6 7 8
Data variables:
*empty*
The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):
In [70]: alt = xr.Dataset(coords={'z': [10], 'lat': 0, 'lon': 0})
In [71]: ds.coords.merge(alt.coords)
Out[71]:
<xarray.Dataset>
Dimensions: (time: 3, x: 2, y: 2, z: 1)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
* z (z) int64 10
Data variables:
*empty*
The coords.merge method may be useful if you want to implement your own binary operations that act on xarray objects. In the future, we hope to write more helper functions so that you can easily make your functions act like xarray’s built-in arithmetic.
Indexes¶
To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:
In [72]: ds['time'].to_index()
Out[72]: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name=u'time', freq='D')
A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects:
In [73]: ds.indexes
Out[73]:
y: Int64Index([0, 1], dtype='int64', name=u'y')
x: Int64Index([0, 1], dtype='int64', name=u'x')
time: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name=u'time', freq='D')
[1] Latitude and longitude are 2D arrays because the dataset uses projected coordinates. reference_time refers to the reference time at which the forecast was made, rather than time, which is the valid time for which the forecast applies.
Indexing and selecting data¶
Similarly to pandas objects, xarray objects support both integer and label based lookups along each dimension. However, xarray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.
Thus in total, xarray supports four different kinds of indexing, as described below and summarized in this table:
Dimension lookup | Index lookup | DataArray syntax | Dataset syntax
---|---|---|---
Positional | By integer | arr[:, 0] | not available
Positional | By label | arr.loc[:, 'IA'] | not available
By name | By integer | arr.isel(space=0) or arr[dict(space=0)] | ds.isel(space=0) or ds[dict(space=0)]
By name | By label | arr.sel(space='IA') or arr.loc[dict(space='IA')] | ds.sel(space='IA') or ds.loc[dict(space='IA')]
Positional indexing¶
Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:
In [1]: arr = xr.DataArray(np.random.rand(4, 3),
...: [('time', pd.date_range('2000-01-01', periods=4)),
...: ('space', ['IA', 'IL', 'IN'])])
...:
In [2]: arr[:2]
Out[2]:
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
In [3]: arr[0, 0]
Out[3]:
<xarray.DataArray ()>
array(0.12696983303810094)
Coordinates:
time datetime64[ns] 2000-01-01
space |S2 'IA'
In [4]: arr[:, [2, 1]]
Out[4]:
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.26 , 0.967],
[ 0.336, 0.377],
[ 0.123, 0.84 ],
[ 0.448, 0.373]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IN' 'IL'
Attributes are persisted in all indexing operations.
Warning
Positional indexing deviates from NumPy behavior when indexing with multiple arrays like arr[[0, 1], [0, 1]], as described in Orthogonal (outer) vs. vectorized indexing. See Pointwise indexing for how to achieve this functionality in xarray.
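A small sketch of the contrast described in this warning (the array values are hypothetical):

import numpy as np
import xarray as xr

np_arr = np.arange(4).reshape(2, 2)
xr_arr = xr.DataArray(np_arr, dims=['x', 'y'])

print(np_arr[[0, 1], [0, 1]])         # numpy is pointwise: [0 3]
print(xr_arr[[0, 1], [0, 1]].values)  # xarray is orthogonal (outer):
                                      # [[0 1]
                                      #  [2 3]]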
xarray also supports label-based indexing, just like pandas. Because we use a pandas.Index under the hood, label based indexing is very fast. To do label based indexing, use the loc attribute:
In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]:
<xarray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xarray is inclusive of both the start and stop bounds.
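For instance, a boolean array can be combined with a list of labels in loc (a brief sketch, independent of the arr defined above):

import numpy as np
import pandas as pd
import xarray as xr

arr_b = xr.DataArray(np.random.rand(4, 3),
                     [('time', pd.date_range('2000-01-01', periods=4)),
                      ('space', ['IA', 'IL', 'IN'])])

# boolean array along 'time', list of labels along 'space'
is_early = arr_b['time'].to_index().day <= 2  # plain numpy boolean array
print(arr_b.loc[is_early, ['IA', 'IN']])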
Setting values with label based indexing is also supported:
In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10
In [7]: arr
Out[7]:
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Indexing with labeled dimensions¶
With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:
Use a dictionary as the argument for array positional or label based array indexing:
# index by integer array indices
In [8]: arr[dict(space=0, time=slice(None, 2))]
Out[8]:
<xarray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
Out[9]:
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
Use the sel() and isel() convenience methods:
# index by integer array indices
In [10]: arr.isel(space=0, time=slice(None, 2))
Out[10]:
<xarray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
Out[11]:
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python slice() objects or 1-dimensional arrays.
Note
We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, Python does not yet support indexing with keyword arguments like arr[space=0].
Warning
Do not try to assign values when using any of the indexing methods isel, isel_points, sel or sel_points:
# DO NOT do this
arr.isel(space=0) = 0
Depending on whether the underlying numpy indexing returns a copy or a view, the assignment will either fail or, worse, fail silently without modifying the original array. Instead, you should use normal index assignment:
# this is safe
arr[dict(space=0)] = 0
Pointwise indexing¶
xarray pointwise indexing supports indexing along multiple labeled dimensions using list-like objects. While isel() performs orthogonal indexing, the isel_points() method provides pointwise indexing, analogous to the numpy behavior of indexing an array with multiple lists (e.g., arr[[0, 1], [0, 1]]):
# index by integer array indices
In [12]: da = xr.DataArray(np.arange(56).reshape((7, 8)), dims=['x', 'y'])
In [13]: da
Out[13]:
<xarray.DataArray (x: 7, y: 8)>
array([[ 0, 1, 2, ..., 5, 6, 7],
[ 8, 9, 10, ..., 13, 14, 15],
[16, 17, 18, ..., 21, 22, 23],
...,
[32, 33, 34, ..., 37, 38, 39],
[40, 41, 42, ..., 45, 46, 47],
[48, 49, 50, ..., 53, 54, 55]])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6
* y (y) int64 0 1 2 3 4 5 6 7
In [14]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0])
Out[14]:
<xarray.DataArray (points: 3)>
array([ 0, 9, 48])
Coordinates:
y (points) int64 0 1 0
x (points) int64 0 1 6
* points (points) int64 0 1 2
There is also sel_points(), which analogously allows you to do point-wise indexing by label:
In [15]: times = pd.to_datetime(['2000-01-03', '2000-01-02', '2000-01-01'])
In [16]: arr.sel_points(space=['IA', 'IL', 'IN'], time=times)
Out[16]:
<xarray.DataArray (points: 3)>
array([ 0.451, 0.377, -10. ])
Coordinates:
time (points) datetime64[ns] 2000-01-03 2000-01-02 2000-01-01
space (points) |S2 'IA' 'IL' 'IN'
* points (points) int64 0 1 2
The equivalent pandas method to sel_points is lookup().
Dataset indexing¶
We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:
In [17]: ds = arr.to_dataset(name='foo')
In [18]: ds.isel(space=[0], time=[0])
Out[18]:
<xarray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Data variables:
foo (time, space) float64 0.127
In [19]: ds.sel(time='2000-01-01')
Out[19]:
<xarray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (space) float64 0.127 -10.0 -10.0
In [20]: ds2 = da.to_dataset(name='bar')
In [21]: ds2.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[21]:
<xarray.Dataset>
Dimensions: (points: 3)
Coordinates:
y (points) int64 0 1 0
x (points) int64 0 1 6
* points (points) int64 0 1 2
Data variables:
bar (points) int64 0 9 48
Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with labeled dimensions:
In [22]: ds[dict(space=[0], time=[0])]
Out[22]:
<xarray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Data variables:
foo (time, space) float64 0.127
In [23]: ds.loc[dict(time='2000-01-01')]
Out[23]:
<xarray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (space) float64 0.127 -10.0 -10.0
Using indexing to assign values to a subset of dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.
Dropping labels¶
The drop() method returns a new object with the listed index labels along a dimension dropped:
In [24]: ds.drop(['IN', 'IL'], dim='space')
Out[24]:
<xarray.Dataset>
Dimensions: (space: 1, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA'
Data variables:
foo (time, space) float64 0.127 0.8972 0.4514 0.543
drop is both a Dataset and DataArray method.
Nearest neighbor lookups¶
The label based selection methods sel(), reindex() and reindex_like() all support method and tolerance keyword arguments. The method parameter enables nearest neighbor (inexact) lookups via the options 'pad', 'backfill' or 'nearest':
In [25]: data = xr.DataArray([1, 2, 3], dims='x')
In [26]: data.sel(x=[1.1, 1.9], method='nearest')
Out[26]:
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
* x (x) int64 1 2
In [27]: data.sel(x=0.1, method='backfill')
Out[27]:
<xarray.DataArray ()>
array(2)
Coordinates:
x int64 1
In [28]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[28]:
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
* x (x) float64 0.5 1.0 1.5 2.0 2.5
Tolerance limits the maximum distance for valid matches with an inexact lookup:
In [29]: data.reindex(x=[1.1, 1.5], method='nearest', tolerance=0.2)
Out[29]:
<xarray.DataArray (x: 2)>
array([ 2., nan])
Coordinates:
* x (x) float64 1.1 1.5
Using method='nearest' or a scalar argument with .sel() requires pandas version 0.16 or newer. Using tolerance requires pandas version 0.17 or newer.
The method parameter is not yet supported if any of the arguments to .sel() is a slice object:
In [30]: data.sel(x=slice(1, 3), method='nearest')
NotImplementedError
However, you don’t need to use method to do inexact slicing. Slicing already returns all values inside the range (inclusive), as long as the index labels are monotonic increasing:
In [31]: data.sel(x=slice(0.9, 3.1))
Out[31]:
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
* x (x) int64 1 2
Indexing axes with monotonic decreasing labels also works, as long as the slice or .loc arguments are also decreasing:
In [32]: reversed_data = data[::-1]
In [33]: reversed_data.loc[3.1:0.9]
Out[33]:
<xarray.DataArray (x: 2)>
array([3, 2])
Coordinates:
* x (x) int64 2 1
Masking with where¶
Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xarray, use where():
In [34]: arr2 = xr.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])
In [35]: arr2.where(arr2.x + arr2.y < 4)
Out[35]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., nan],
[ 8., 9., nan, nan],
[ 12., nan, nan, nan]])
Coordinates:
* x (x) int64 0 1 2 3
* y (y) int64 0 1 2 3
This is particularly useful for ragged indexing of multi-dimensional data, e.g., to apply a 2D mask to an image. Note that where follows all the usual xarray broadcasting and alignment rules for binary operations (e.g., +) between the object being indexed and the condition, as described in Computation:
In [36]: arr2.where(arr2.y < 2)
Out[36]:
<xarray.DataArray (x: 4, y: 4)>
array([[ 0., 1., nan, nan],
[ 4., 5., nan, nan],
[ 8., 9., nan, nan],
[ 12., 13., nan, nan]])
Coordinates:
* x (x) int64 0 1 2 3
* y (y) int64 0 1 2 3
By default where maintains the original size of the data. For cases where the selected data size is much smaller than the original data, use of the option drop=True clips coordinate elements that are fully masked:
In [37]: arr2.where(arr2.y < 2, drop=True)
Out[37]:
<xarray.DataArray (x: 4, y: 2)>
array([[ 0., 1.],
[ 4., 5.],
[ 8., 9.],
[ 12., 13.]])
Coordinates:
* x (x) int64 0 1 2 3
* y (y) int64 0 1
Multi-level indexing¶
Just like pandas, advanced indexing on multi-level indexes is possible with loc and sel. You can slice a multi-index by providing multiple indexers, i.e., a tuple of slices, labels, lists of labels, or any selector allowed by pandas:
In [38]: midx = pd.MultiIndex.from_product([list('abc'), [0, 1]],
....: names=('one', 'two'))
....:
In [39]: mda = xr.DataArray(np.random.rand(6, 3),
....: [('x', midx), ('y', range(3))])
....:
In [40]: mda
Out[40]:
<xarray.DataArray (x: 6, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.352, 0.229, 0.777],
[ 0.595, 0.138, 0.853],
[ 0.236, 0.146, 0.59 ],
[ 0.574, 0.061, 0.59 ],
[ 0.245, 0.34 , 0.985]])
Coordinates:
* x (x) object ('a', 0) ('a', 1) ('b', 0) ('b', 1) ('c', 0) ('c', 1)
* y (y) int64 0 1 2
In [41]: mda.sel(x=(list('ab'), [0]))
Out[41]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.595, 0.138, 0.853]])
Coordinates:
* x (x) object ('a', 0) ('b', 0)
* y (y) int64 0 1 2
You can also select multiple elements by providing a list of labels or tuples or a slice of tuples:
In [42]: mda.sel(x=[('a', 0), ('b', 1)])
Out[42]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.236, 0.146, 0.59 ]])
Coordinates:
* x (x) object ('a', 0) ('b', 1)
* y (y) int64 0 1 2
Additionally, xarray supports dictionaries:
In [43]: mda.sel(x={'one': 'a', 'two': 0})
Out[43]:
<xarray.DataArray (y: 3)>
array([ 0.129, 0.86 , 0.82 ])
Coordinates:
x object ('a', 0)
* y (y) int64 0 1 2
In [44]: mda.loc[{'one': 'a'}, ...]
Out[44]:
<xarray.DataArray (two: 2, y: 3)>
array([[ 0.129, 0.86 , 0.82 ],
[ 0.352, 0.229, 0.777]])
Coordinates:
* two (two) int64 0 1
* y (y) int64 0 1 2
Like pandas, xarray handles partial selection on a multi-index (level drop). As shown in the last example above, it also renames the dimension / coordinate when the multi-index is reduced to a single index.
Unlike pandas, xarray does not guess whether you provide index levels or dimensions when using loc in some ambiguous cases. For example, for mda.loc[{'one': 'a', 'two': 0}] and mda.loc['a', 0] xarray always interprets (‘one’, ‘two’) and (‘a’, 0) as the names and labels of the 1st and 2nd dimension, respectively. You must specify all dimensions or use the ellipsis in the loc specifier, e.g. in the example above, mda.loc[{'one': 'a', 'two': 0}, :] or mda.loc[('a', 0), ...].
Multi-dimensional indexing¶
xarray does not yet support efficient routines for generalized multi-dimensional indexing or regridding. However, we are definitely interested in adding support for this in the future (see GH475 for the ongoing discussion).
Copies vs. views¶
Whether array indexing returns a view or a copy of the underlying data depends on the nature of the labels. For positional (integer) indexing, xarray follows the same rules as NumPy:
- Positional indexing with only integers and slices returns a view.
- Positional indexing with arrays or lists returns a copy.
The rules for label based indexing are more complex:
- Label-based indexing with only slices returns a view.
- Label-based indexing with arrays returns a copy.
- Label-based indexing with scalars returns a view or a copy, depending upon if the corresponding positional indexer can be represented as an integer or a slice object. The exact rules are determined by pandas.
Whether data is a copy or a view is more predictable in xarray than in pandas, so unlike pandas, xarray does not produce SettingWithCopy warnings. However, you should still avoid assignment with chained indexing.
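To illustrate (a minimal sketch with hypothetical values): the first step of a chained assignment may return a copy, in which case the write is silently lost:
import numpy as np
import xarray as xr

da = xr.DataArray(np.zeros((2, 2)), coords=[('x', ['a', 'b']), ('y', [0, 1])])
# avoid: the intermediate da.loc['a'] may be a copy, so the write can be lost
da.loc['a'][0] = 1
# prefer: assign through a single indexing operation
da.loc['a', 0] = 1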
Orthogonal (outer) vs. vectorized indexing¶
Indexing with xarray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xarray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s broadcasting rules to vectorize indexers. This means you can do indexing like this, which would require slightly more awkward syntax with numpy arrays:
In [45]: arr[arr['time.day'] > 1, arr['space'] != 'IL']
Out[45]:
<xarray.DataArray (time: 3, space: 2)>
array([[ 0.897, 0.336],
[ 0.451, 0.123],
[ 0.543, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IN'
This is a much simpler model than numpy’s advanced indexing. If you would like to do advanced-style array indexing in xarray, you have several options:
- Pointwise indexing (see the sketch after this list)
- Masking with where
- Index the underlying NumPy array directly using .values, e.g.,
In [46]: arr.values[arr.values > 0.5]
Out[46]: array([ 0.897, 0.84 , 0.543])
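As a minimal sketch of the pointwise option, the isel_points() method shown earlier also works on DataArray objects (the array here is hypothetical):
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(12).reshape(3, 4), dims=['x', 'y'])
# select the points (0, 0), (1, 2) and (2, 3) along a new 'points' dimension
da.isel_points(x=[0, 1, 2], y=[0, 2, 3], dim='points')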
Align and reindex¶
xarray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinates corresponding to dimensions. Values corresponding to index labels found in both the original and the new coordinates are retained, and values corresponding to new labels not found in the original object are filled in with NaN.
xarray operations that combine multiple objects generally automatically align their arguments to share the same indexes. However, manual alignment can be useful for greater control and for increased performance.
To reindex a particular dimension, use reindex():
In [47]: arr.reindex(space=['IA', 'CA'])
Out[47]:
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.127, nan],
[ 0.897, nan],
[ 0.451, nan],
[ 0.543, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'CA'
The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:
In [48]: foo = arr.rename('foo')
In [49]: baz = (10 * arr[:2, :2]).rename('baz')
In [50]: baz
Out[50]:
<xarray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.27 , -100. ],
[ 8.972, 3.767]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL'
Reindexing foo with baz selects out the first two values along each dimension:
In [51]: foo.reindex_like(baz)
Out[51]:
<xarray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.127, -10. ],
[ 0.897, 0.377]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL'
The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:
In [52]: baz.reindex_like(foo)
Out[52]:
<xarray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.27 , -100. , nan],
[ 8.972, 3.767, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN'
The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:
In [53]: xr.align(foo, baz, join='inner')
Out[53]:
(<xarray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.127, -10. ],
[ 0.897, 0.377]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL',
<xarray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.27 , -100. ],
[ 8.972, 3.767]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL')
In [54]: xr.align(foo, baz, join='outer')
Out[54]:
(<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN',
<xarray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.27 , -100. , nan],
[ 8.972, 3.767, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN')
Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:
In [55]: ds
Out[55]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...
In [56]: ds.reindex_like(baz)
Out[56]:
<xarray.Dataset>
Dimensions: (space: 2, time: 2)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL'
Data variables:
foo (time, space) float64 0.127 -10.0 0.8972 0.3767
In [57]: other = xr.DataArray(['a', 'b', 'c'], dims='other')
# this is a no-op, because there are no shared dimension names
In [58]: ds.reindex_like(other)
Out[58]:
<xarray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
foo (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...
Computation¶
The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.
Basic array math¶
Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:
In [1]: arr = xr.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr - 3
Out[2]:
<xarray.DataArray (x: 2, y: 3)>
array([[-2.5308877 , -3.28286334, -4.5090585 ],
[-4.13563237, -1.78788797, -3.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In [3]: abs(arr)
Out[3]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , 0.28286334, 1.5090585 ],
[ 1.13563237, 1.21211203, 0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
You can also use any of numpy’s or scipy’s many ufuncs directly on a DataArray:
In [4]: np.sin(arr)
Out[4]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.45209466, -0.27910634, -0.99809483],
[-0.90680094, 0.9363595 , -0.17234978]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data arrays also implement many numpy.ndarray methods:
In [5]: arr.round(2)
Out[5]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.47, -0.28, -1.51],
[-1.14, 1.21, -0.17]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In [6]: arr.T
Out[6]:
<xarray.DataArray (y: 3, x: 2)>
array([[ 0.4691123 , -1.13563237],
[-0.28286334, 1.21211203],
[-1.5090585 , -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Missing values¶
xarray objects borrow the isnull(), notnull(), count(), dropna() and fillna() methods for working with missing data from pandas:
In [7]: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'])
In [8]: x.isnull()
Out[8]:
<xarray.DataArray (x: 5)>
array([False, False, True, True, False], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [9]: x.notnull()
Out[9]:
<xarray.DataArray (x: 5)>
array([ True, True, False, False, True], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [10]: x.count()
Out[10]:
<xarray.DataArray ()>
array(3)
In [11]: x.dropna(dim='x')
Out[11]:
<xarray.DataArray (x: 3)>
array([ 0., 1., 2.])
Coordinates:
* x (x) int64 0 1 4
In [12]: x.fillna(-1)
Out[12]:
<xarray.DataArray (x: 5)>
array([ 0., 1., -1., -1., 2.])
Coordinates:
* x (x) int64 0 1 2 3 4
Like pandas, xarray uses the float value np.nan (not-a-number) to represent missing values.
Aggregation¶
Aggregation methods have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):
In [13]: arr.sum(dim='x')
Out[13]:
<xarray.DataArray (y: 3)>
array([-0.66652007, 0.92924868, -1.68227315])
Coordinates:
* y (y) int64 10 20 30
In [14]: arr.std(['x', 'y'])
Out[14]:
<xarray.DataArray ()>
array(0.9156385956757354)
In [15]: arr.min()
Out[15]:
<xarray.DataArray ()>
array(-1.5090585031735124)
If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:
In [16]: arr.get_axis_num('y')
Out[16]: 1
These operations automatically skip missing values, like in pandas:
In [17]: xr.DataArray([1, 2, np.nan, 3]).mean()
Out[17]:
<xarray.DataArray ()>
array(2.0)
If desired, you can disable this behavior by invoking the aggregation method with skipna=False.
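For example, repeating the computation above with skipna=False propagates the missing value (a minimal sketch):
import numpy as np
import xarray as xr

# with skipna=False the NaN is not skipped, so the mean itself is nan
xr.DataArray([1, 2, np.nan, 3]).mean(skipna=False)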
Rolling window operations¶
DataArray objects include a rolling() method. This method supports rolling window aggregation:
In [18]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
....: dims=('x', 'y'))
....:
In [19]: arr
Out[19]:
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. , 0.5, 1. , 1.5, 2. ],
[ 2.5, 3. , 3.5, 4. , 4.5],
[ 5. , 5.5, 6. , 6.5, 7. ]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
rolling() is applied along one dimension using the name of the dimension as a key (e.g. y) and the window size as the value (e.g. 3). We get back a Rolling object:
In [20]: arr.rolling(y=3)
Out[20]: DataArrayRolling [window->3,center->False,dim->y]
The label position and minimum number of periods in the rolling window are controlled by the center and min_periods arguments:
In [21]: arr.rolling(y=3, min_periods=2, center=True)
Out[21]: DataArrayRolling [window->3,min_periods->2,center->True,dim->y]
Aggregation and summary methods can be applied directly to the Rolling object:
In [22]: r = arr.rolling(y=3)
In [23]: r.mean()
Out[23]:
<xarray.DataArray (y: 5, x: 3)>
array([[ nan, nan, nan],
[ nan, nan, nan],
[ 0.5, 3. , 5.5],
[ 1. , 3.5, 6. ],
[ 1.5, 4. , 6.5]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
In [24]: r.reduce(np.std)
Out[24]:
<xarray.DataArray (y: 5, x: 3)>
array([[ nan, nan, nan],
[ nan, nan, nan],
[ 0.40824829, 0.40824829, 0.40824829],
[ 0.40824829, 0.40824829, 0.40824829],
[ 0.40824829, 0.40824829, 0.40824829]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Note that rolling window aggregations are much faster (both asymptotically and because they avoid a loop in Python) when bottleneck is installed. Otherwise, we fall back to a slower, pure Python implementation.
Finally, we can manually iterate through Rolling objects:
In [25]: for label, arr_window in r:
   ....:     # arr_window is a view of arr
   ....:     pass
   ....:
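For instance, a minimal sketch (assuming a one-dimensional array for simplicity, since iteration yields (label, window) pairs) that computes one aggregate per window by hand:
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(5.0), dims='y')
# collect the mean of each rolling window manually
window_means = [window.mean() for label, window in da.rolling(y=3)]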
Broadcasting by dimension name¶
DataArray objects automatically align themselves (“broadcasting” in numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with np.reshape() or np.newaxis.
This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:
In [26]: a = xr.DataArray([1, 2], [('x', ['a', 'b'])])
In [27]: a
Out[27]:
<xarray.DataArray (x: 2)>
array([1, 2])
Coordinates:
* x (x) |S1 'a' 'b'
In [28]: b = xr.DataArray([-1, -2, -3], [('y', [10, 20, 30])])
In [29]: b
Out[29]:
<xarray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
* y (y) int64 10 20 30
With xarray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:
In [30]: a * b
Out[30]:
<xarray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
[-2, -4, -6]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Moreover, dimensions are always reordered to the order in which they first appeared:
In [31]: c = xr.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])
In [32]: c
Out[32]:
<xarray.DataArray (y: 3, x: 2)>
array([[0, 1],
[2, 3],
[4, 5]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [33]: a + c
Out[33]:
<xarray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
[3, 5, 7]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
This means, for example, that you can always subtract an array from its transpose:
In [34]: c - c.T
Out[34]:
<xarray.DataArray (y: 3, x: 2)>
array([[0, 0],
[0, 0],
[0, 0]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
You can explicitly broadcast xarray data structures using the broadcast() function:
a2, b2 = xr.broadcast(a, b)
# a2 and b2 now share the dimensions ('x', 'y')
Automatic alignment¶
xarray enforces alignment between index coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations.
Like pandas, this alignment is automatic for arithmetic operations. Note that unlike pandas, the result of a binary operation is indexed by the intersection (not the union) of the coordinate labels:
In [35]: arr + arr[:1]
Out[35]:
<xarray.DataArray (x: 1, y: 5)>
array([[ 0., 1., 2., 3., 4.]])
Coordinates:
* x (x) int64 0
* y (y) int64 0 1 2 3 4
If the result would be empty, an error is raised instead:
In [36]: arr[:2] + arr[2:]
ValueError: no overlapping labels for some dimensions: ['x']
Before loops or performance critical code, it’s a good idea to align arrays explicitly (e.g., by putting them in the same Dataset or using align()) to avoid the overhead of repeated alignment with each operation. See Align and reindex for more details.
Note
There is no automatic alignment between arguments when performing in-place arithmetic operations such as +=. You will need to use manual alignment. This ensures in-place arithmetic never needs to modify data types.
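A minimal sketch with hypothetical arrays: align explicitly once, after which in-place arithmetic (and repeated operations in a loop) work without any automatic alignment:
import xarray as xr

p = xr.DataArray([1.0, 2.0, 3.0], [('x', [0, 1, 2])])
q = xr.DataArray([10.0, 20.0], [('x', [1, 2])])
# after aligning, both arrays share identical 'x' indexes
p2, q2 = xr.align(p, q, join='inner')
p2 += q2  # in-place arithmetic is now safe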
Coordinates¶
Although index coordinates are aligned, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalar coordinates:
In [37]: arr[0]
Out[37]:
<xarray.DataArray (y: 5)>
array([ 0. , 0.5, 1. , 1.5, 2. ])
Coordinates:
x int64 0
* y (y) int64 0 1 2 3 4
In [38]: arr[1]
Out[38]:
<xarray.DataArray (y: 5)>
array([ 2.5, 3. , 3.5, 4. , 4.5])
Coordinates:
x int64 1
* y (y) int64 0 1 2 3 4
# notice that the scalar coordinate 'x' is silently dropped
In [39]: arr[1] - arr[0]
Out[39]:
<xarray.DataArray (y: 5)>
array([ 2.5, 2.5, 2.5, 2.5, 2.5])
Coordinates:
* y (y) int64 0 1 2 3 4
Still, xarray will persist other coordinates in arithmetic, as long as there are no conflicting values:
# only one argument has the 'x' coordinate
In [40]: arr[0] + 1
Out[40]:
<xarray.DataArray (y: 5)>
array([ 1. , 1.5, 2. , 2.5, 3. ])
Coordinates:
x int64 0
* y (y) int64 0 1 2 3 4
# both arguments have the same 'x' coordinate
In [41]: arr[0] - arr[0]
Out[41]:
<xarray.DataArray (y: 5)>
array([ 0., 0., 0., 0., 0.])
Coordinates:
x int64 0
* y (y) int64 0 1 2 3 4
Math with datasets¶
Datasets support arithmetic operations by automatically looping over all data variables:
In [42]: ds = xr.Dataset({'x_and_y': (('x', 'y'), np.random.randn(3, 5)),
....: 'x_only': ('x', np.random.randn(3))},
....: coords=arr.coords)
....:
In [43]: ds > 0
Out[43]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* y (y) int64 0 1 2 3 4
* x (x) int64 0 1 2
Data variables:
x_only (x) bool True False True
x_and_y (x, y) bool True False False False False True True False False ...
Datasets support most of the same methods found on data arrays:
In [44]: ds.mean(dim='x')
Out[44]:
<xarray.Dataset>
Dimensions: (y: 5)
Coordinates:
* y (y) int64 0 1 2 3 4
Data variables:
x_only float64 -0.2799
x_and_y (y) float64 0.2553 0.08145 -0.4308 -1.411 -0.2989
In [45]: abs(ds)
Out[45]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* y (y) int64 0 1 2 3 4
* x (x) int64 0 1 2
Data variables:
x_only (x) float64 0.1136 1.478 0.525
x_and_y (x, y) float64 0.1192 1.044 0.8618 2.105 0.4949 1.072 0.7216 ...
Unfortunately, a limitation of the current version of numpy means that we cannot override ufuncs for datasets, because datasets cannot be written as a single array [1]. apply() works around this limitation by applying the given function to each variable in the dataset:
In [46]: ds.apply(np.sin)
Out[46]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Data variables:
x_only (x) float64 0.1134 -0.9957 0.5012
x_and_y (x, y) float64 0.1189 -0.8645 -0.759 -0.8609 -0.475 0.8781 ...
Datasets also use looping over variables for broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset:
In [47]: ds + arr
Out[47]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Data variables:
x_only (x, y) float64 0.1136 0.6136 1.114 1.614 2.114 1.022 1.522 ...
x_and_y (x, y) float64 0.1192 -0.5442 0.1382 -0.6046 1.505 3.572 3.722 ...
Arithmetic between two datasets matches data variables of the same name:
In [48]: ds2 = xr.Dataset({'x_and_y': 0, 'x_only': 100})
In [49]: ds - ds2
Out[49]:
<xarray.Dataset>
Dimensions: (x: 3, y: 5)
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3 4
Data variables:
x_only (x) float64 -99.89 -101.5 -99.48
x_and_y (x, y) float64 0.1192 -1.044 -0.8618 -2.105 -0.4949 1.072 ...
Similarly to index based alignment, the result has the intersection of all matching variables, and ValueError is raised if the result would be empty.
[1] In some future version of NumPy, we should be able to override ufuncs for datasets by making use of __numpy_ufunc__.
GroupBy: split-apply-combine¶
xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:
- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
Group by operations work on both Dataset and DataArray objects. Most of the examples focus on grouping by a single one-dimensional variable, although support for grouping over a multi-dimensional variable has recently been implemented. Note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.
Split¶
Let’s create a simple example dataset:
In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
...: coords={'x': [10, 20, 30, 40],
...: 'letters': ('x', list('abba'))})
...:
In [2]: arr = ds['foo']
In [3]: ds
Out[3]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:
In [4]: ds.groupby('letters')
Out[4]: <xarray.core.groupby.DatasetGroupBy at 0x7f23b4560610>
This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:
In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}
You can also iterate over groups in (label, group) pairs:
In [6]: list(ds.groupby('letters'))
Out[6]:
[('a', <xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 40
letters (x) |S1 'a' 'a'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
('b', <xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 20 30
letters (x) |S1 'b' 'b'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]
Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.
Binning¶
Sometimes you don’t want to use all the unique values to determine the groups but instead want to “bin” the data into coarser groups. You could always create a customized coordinate, but xarray facilitates this via the groupby_bins() method.
In [7]: x_bins = [0,25,50]
In [8]: ds.groupby_bins('x', x_bins).groups
Out[8]: {'(0, 25]': [0, 1], '(25, 50]': [2, 3]}
The binning is implemented via pandas.cut, whose documentation details how the bins are assigned. As seen in the example above, by default, the bins are labeled with strings using set notation to precisely identify the bin limits. To override this behavior, you can specify the bin labels explicitly. Here we choose float labels which identify the bin centers:
In [9]: x_bin_labels = [12.5,37.5]
In [10]: ds.groupby_bins('x', x_bins, labels=x_bin_labels).groups
Out[10]: {12.5: [0, 1], 37.5: [2, 3]}
Apply¶
To apply a function to each group, you can use the flexible apply() method. The resulting objects are automatically concatenated back together along the group axis:
In [11]: def standardize(x):
....: return (x - x.mean()) / x.std()
....:
In [12]: arr.groupby('letters').apply(standardize)
Out[12]:
<xarray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.23 , 1.937, -0.726],
[ 1.42 , -0.46 , -0.607],
[-0.191, 1.214, -1.376],
[ 0.339, -0.302, -0.019]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:
In [13]: arr.groupby('letters').mean(dim='x')
Out[13]:
<xarray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.335, 0.67 , 0.354],
[ 0.674, 0.609, 0.23 ]])
Coordinates:
* y (y) int64 0 1 2
* letters (letters) object 'a' 'b'
Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:
In [14]: ds.groupby('x').std()
Out[14]:
<xarray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Data variables:
foo (x) float64 0.3684 0.2554 0.2931 0.06957
First and last¶
There are two special aggregation operations that are currently only found on groupby objects: first and last. These provide the first or last occurrence of values for each group along the grouped dimension:
In [15]: ds.groupby('letters').first()
Out[15]:
<xarray.Dataset>
Dimensions: (letters: 2, y: 3)
Coordinates:
* y (y) int64 0 1 2
* letters (letters) object 'a' 'b'
Data variables:
foo (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362
By default, they skip missing values (control this with skipna).
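For example (a minimal sketch reusing the dataset above), pass skipna=False to keep missing values:
ds.groupby('letters').first(skipna=False)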
Grouped arithmetic¶
GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:
In [16]: alt = arr.groupby('letters').mean()
In [17]: alt
Out[17]:
<xarray.DataArray 'foo' (letters: 2)>
array([ 0.453, 0.504])
Coordinates:
* letters (letters) object 'a' 'b'
In [18]: ds.groupby('letters') - alt
Out[18]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Data variables:
foo (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 ...
This last line is roughly equivalent to the following:
results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xr.concat(results, dim='x')
Squeezing¶
When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:
In [19]: next(iter(arr.groupby('x')))
Out[19]:
(10, <xarray.DataArray 'foo' (y: 3)>
array([ 0.127, 0.967, 0.26 ])
Coordinates:
x int64 10
letters |S1 'a'
* y (y) int64 0 1 2)
In [20]: next(iter(arr.groupby('x', squeeze=False)))
Out[20]:
(10, <xarray.DataArray 'foo' (x: 1, y: 3)>
array([[ 0.127, 0.967, 0.26 ]])
Coordinates:
* x (x) int64 10
letters (x) |S1 'a'
* y (y) int64 0 1 2)
Although xarray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.
You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
Multidimensional Grouping¶
Many datasets have a multidimensional coordinate variable (e.g. longitude) which is different from the logical grid dimensions (e.g. nx, ny). Such variables are valid under the CF conventions. xarray supports groupby operations over multidimensional coordinate variables:
In [21]: da = xr.DataArray([[0,1],[2,3]],
....: coords={'lon': (['ny','nx'], [[30,40],[40,50]] ),
....: 'lat': (['ny','nx'], [[10,10],[20,20]] ),},
....: dims=['ny','nx'])
....:
In [22]: da
Out[22]:
<xarray.DataArray (ny: 2, nx: 2)>
array([[0, 1],
[2, 3]])
Coordinates:
lat (ny, nx) int64 10 10 20 20
lon (ny, nx) int64 30 40 40 50
* ny (ny) int64 0 1
* nx (nx) int64 0 1
In [23]: da.groupby('lon').sum()
Out[23]:
<xarray.DataArray (lon: 3)>
array([0, 3, 3])
Coordinates:
* lon (lon) int64 30 40 50
In [24]: da.groupby('lon').apply(lambda x: x - x.mean(), shortcut=False)
Out[24]:
<xarray.DataArray (ny: 2, nx: 2)>
array([[ 0. , -0.5],
[ 0.5, 0. ]])
Coordinates:
lat (ny, nx) int64 10 10 20 20
lon (ny, nx) int64 30 40 40 50
* ny (ny) int64 0 1
* nx (nx) int64 0 1
Because multidimensional groups have the ability to generate a very large number of bins, coarse-binning via groupby_bins() may be desirable:
In [25]: da.groupby_bins('lon', [0,45,50]).sum()
Out[25]:
<xarray.DataArray (lon_bins: 2)>
array([3, 3])
Coordinates:
* lon_bins (lon_bins) object '(0, 45]' '(45, 50]'
Reshaping and reorganizing data¶
These methods allow you to reorganize your data by reordering dimensions, converting between Dataset and DataArray objects, stacking and unstacking dimensions, and shifting or rolling values along a dimension.
Reordering dimensions¶
To reorder dimensions on a DataArray or across all variables on a Dataset, use transpose() or the .T property:
In [1]: ds = xr.Dataset({'foo': (('x', 'y', 'z'), [[[42]]]), 'bar': (('y', 'z'), [[24]])})
In [2]: ds.transpose('y', 'z', 'x')
Out[2]:
<xarray.Dataset>
Dimensions: (x: 1, y: 1, z: 1)
Coordinates:
* x (x) int64 0
* y (y) int64 0
* z (z) int64 0
Data variables:
foo (y, z, x) int64 42
bar (y, z) int64 24
In [3]: ds.T
Out[3]:
<xarray.Dataset>
Dimensions: (x: 1, y: 1, z: 1)
Coordinates:
* x (x) int64 0
* y (y) int64 0
* z (z) int64 0
Data variables:
foo (z, y, x) int64 42
bar (z, y) int64 24
Converting between datasets and arrays¶
To convert from a Dataset to a DataArray, use to_array():
In [4]: arr = ds.to_array()
In [5]: arr
Out[5]:
<xarray.DataArray (variable: 2, x: 1, y: 1, z: 1)>
array([[[[42]]],
[[[24]]]])
Coordinates:
* y (y) int64 0
* x (x) int64 0
* z (z) int64 0
* variable (variable) |S3 'foo' 'bar'
This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.
To convert back from a DataArray to a Dataset, use to_dataset():
In [6]: arr.to_dataset(dim='variable')
Out[6]:
<xarray.Dataset>
Dimensions: (x: 1, y: 1, z: 1)
Coordinates:
* y (y) int64 0
* x (x) int64 0
* z (z) int64 0
Data variables:
foo (x, y, z) int64 42
bar (x, y, z) int64 24
The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:
In [7]: ds2 = xr.Dataset({'a': 0, 'b': ('x', [3, 4, 5])})
# the input dataset has 4 elements
In [8]: ds2
Out[8]:
<xarray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
Data variables:
a int64 0
b (x) int64 3 4 5
# the resulting array has 6 elements
In [9]: ds2.to_array()
Out[9]:
<xarray.DataArray (variable: 2, x: 3)>
array([[0, 0, 0],
[3, 4, 5]])
Coordinates:
* variable (variable) |S1 'a' 'b'
* x (x) int64 0 1 2
Otherwise, the result could not be represented as an orthogonal array.
If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:
In [10]: arr.to_dataset(name='combined')
Out[10]:
<xarray.Dataset>
Dimensions: (variable: 2, x: 1, y: 1, z: 1)
Coordinates:
* y (y) int64 0
* x (x) int64 0
* z (z) int64 0
* variable (variable) |S3 'foo' 'bar'
Data variables:
combined (variable, x, y, z) int64 42 24
Stack and unstack¶
As part of xarray’s nascent support for pandas.MultiIndex, we have implemented the stack() and unstack() methods for combining or splitting dimensions:
In [11]: array = xr.DataArray(np.random.randn(2, 3),
....: coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
....:
In [12]: stacked = array.stack(z=('x', 'y'))
In [13]: stacked
Out[13]:
<xarray.DataArray (z: 6)>
array([ 0.469, -0.283, -1.509, -1.136, 1.212, -0.173])
Coordinates:
* z (z) object ('a', 0) ('a', 1) ('a', 2) ('b', 0) ('b', 1) ('b', 2)
In [14]: stacked.unstack('z')
Out[14]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
[-1.136, 1.212, -0.173]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 0 1 2
These methods are modeled on the pandas.DataFrame methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns.
Like DataFrame.unstack, xarray’s unstack always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with NaN in the resulting object:
In [15]: stacked2 = stacked[::2]
In [16]: stacked2
Out[16]:
<xarray.DataArray (z: 3)>
array([ 0.469, -1.509, 1.212])
Coordinates:
* z (z) object ('a', 0) ('a', 2) ('b', 1)
In [17]: stacked2.unstack('z')
Out[17]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469, nan, -1.509],
[ nan, 1.212, nan]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 0 1 2
However, xarray’s stack has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare:
In [18]: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=['x', 'y'])
In [19]: array.stack(z=('x', 'y'))
Out[19]:
<xarray.DataArray (z: 4)>
array([ nan, 1., 2., 3.])
Coordinates:
* z (z) object (0, 0) (0, 1) (1, 0) (1, 1)
In [20]: array.to_pandas().stack()
Out[20]:
x y
0 1 1
1 0 2
1 3
dtype: float64
We departed from pandas’s behavior here because predictable shapes for new array dimensions are necessary for Out of core computation with dask.
Shift and roll¶
To adjust coordinate labels, you can use the shift() and roll() methods:
In [21]: array = xr.DataArray([1, 2, 3, 4], dims='x')
In [22]: array.shift(x=2)
Out[22]:
<xarray.DataArray (x: 4)>
array([ nan, nan, 1., 2.])
Coordinates:
* x (x) int64 0 1 2 3
In [23]: array.roll(x=2)
Out[23]:
<xarray.DataArray (x: 4)>
array([3, 4, 1, 2])
Coordinates:
* x (x) int64 2 3 0 1
Combining data¶
- For combining datasets or data arrays along a dimension, see concatenate.
- For combining datasets with different variables, see merge.
Concatenate¶
To combine arrays along an existing or new dimension into a larger array, you can use concat(). concat takes an iterable of DataArray or Dataset objects, as well as a dimension name, and concatenates along that dimension:
In [1]: arr = xr.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr[:, :1]
Out[2]:
<xarray.DataArray (x: 2, y: 1)>
array([[ 0.4691123 ],
[-1.13563237]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10
# this resembles how you would use np.concatenate
In [3]: xr.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In addition to combining along an existing dimension, concat can create a new dimension by stacking lower dimensional arrays together:
In [4]: arr[0]
Out[4]:
<xarray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xr.concat([arr[0], arr[1]], 'x')
Out[5]:
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
If the second argument to concat is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension:
In [6]: xr.concat([arr[0], arr[1]], 'new_dim')
Out[6]:
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 0 1
The second argument to concat can also be an Index or DataArray object as well as a string, in which case it is used to label the values along the new dimension:
In [7]: xr.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[7]:
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 -90 -100
Of course, concat also works on Dataset objects:
In [8]: ds = arr.to_dataset(name='foo')
In [9]: xr.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[9]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
concat() has a number of options which provide deeper control over which variables are concatenated and how it handles conflicting variables between datasets. With the default parameters, xarray will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily using Out of core computation with dask.
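A hedged sketch of some of these options (the datasets here are hypothetical; see the concat() docstring for the authoritative list):
import numpy as np
import xarray as xr

dsets = [xr.Dataset({'tmp': ('time', np.random.randn(2)), 'lat': 45.0},
                    coords={'time': [i * 2, i * 2 + 1]})
         for i in range(3)]
combined = xr.concat(dsets, dim='time',
                     data_vars='minimal',  # only concatenate variables that already contain 'time'
                     coords='minimal',     # likewise for non-index coordinates
                     compat='equals')      # require all other variables to be equal across datasets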
Merge¶
To combine variables and coordinates between multiple DataArray and/or Dataset objects, use merge(). It can merge a list of Dataset and DataArray objects, or dictionaries of objects convertible to DataArray objects:
In [10]: xr.merge([ds, ds.rename({'foo': 'bar'})])
Out[10]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [11]: xr.merge([xr.DataArray(n, name='var%d' % n) for n in range(5)])
Out[11]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
var0 int64 0
var1 int64 1
var2 int64 2
var3 int64 3
var4 int64 4
If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the union of all index coordinates:
In [12]: other = xr.Dataset({'bar': ('x', [1, 2, 3, 4]), 'x': list('abcd')})
In [13]: xr.merge([ds, other])
Out[13]:
<xarray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) object 'a' 'b' 'c' 'd'
* y (y) int64 10 20 30
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 nan ...
bar (x) int64 1 2 3 4
This ensures that merge is non-destructive. xarray.MergeError is raised if you attempt to merge two variables with the same name but different values:
In [14]: xr.merge([ds, ds + 1])
MergeError: conflicting values for variable 'foo' on objects to be combined:
first value: <xarray.Variable (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
second value: <xarray.Variable (x: 2, y: 3)>
array([[ 1.4691123 , 0.71713666, -0.5090585 ],
[-0.13563237, 2.21211203, 0.82678535]])
The same non-destructive merging between DataArray index coordinates is used in the Dataset constructor:
In [15]: xr.Dataset({'a': arr[:-1], 'b': arr[1:]})
Out[15]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
Data variables:
a (x, y) float64 0.4691 -0.2829 -1.509 nan nan nan
b (x, y) float64 nan nan nan -1.136 1.212 -0.1732
Update¶
In contrast to merge, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:
In [16]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[16]:
<xarray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.
update also performs automatic alignment if necessary. Unlike merge, it maintains the alignment of the original array instead of merging indexes:
In [17]: ds.update(other)
Out[17]:
<xarray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x) int64 1 2
The exact same alignment logic is used when setting a variable with __setitem__ syntax:
In [18]: ds['baz'] = xr.DataArray([9, 9, 9, 9, 9], coords=[('x', list('abcde'))])
In [19]: ds.baz
Out[19]:
<xarray.DataArray 'baz' (x: 2)>
array([9, 9])
Coordinates:
* x (x) object 'a' 'b'
Equals and identical¶
xarray objects can be compared by using the equals(), identical() and broadcast_equals() methods. These methods are used by the optional compat argument on concat and merge.
equals checks dimension names, indexes and array values:
In [20]: arr.equals(arr.copy())
Out[20]: True
identical also checks attributes, and the name of each object:
In [21]: arr.identical(arr.rename('bar'))
Out[21]: False
broadcast_equals does a more relaxed form of equality check that allows variables to have different dimensions, as long as values are constant along those new dimensions:
In [22]: left = xr.Dataset(coords={'x': 0})
In [23]: right = xr.Dataset({'x': [0, 0, 0]})
In [24]: left.broadcast_equals(right)
Out[24]: True
Like pandas objects, two xarray objects are still equal or identical if they have missing values marked by NaN in the same locations.
In contrast, the == operation performs element-wise comparison (like numpy):
In [25]: arr == arr.copy()
Out[25]:
<xarray.DataArray (x: 2, y: 3)>
array([[ True, True, True],
[ True, True, True]], dtype=bool)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Note that NaN does not compare equal to NaN in element-wise comparison; you may need to deal with missing values explicitly.
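For example (a minimal sketch), you can combine == with isnull() to treat missing values in matching positions as equal:
import numpy as np
import xarray as xr

a = xr.DataArray([1.0, np.nan])
b = xr.DataArray([1.0, np.nan])
a == b                                # the NaN position compares as False
(a == b) | (a.isnull() & b.isnull())  # True everywhere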
Time series data¶
A major use case for xarray is multi-dimensional time-series data. Accordingly, we’ve copied many of the features that make working with time-series data in pandas such a joy to xarray. In most cases, we rely on pandas for the core functionality.
Creating datetime64 data¶
xarray uses the numpy dtypes datetime64[ns] and timedelta64[ns] to represent datetime data, which offer vectorized (if sometimes buggy) operations with numpy and smooth integration with pandas.
To convert to or create regular arrays of datetime64 data, we recommend using pandas.to_datetime() and pandas.date_range():
In [1]: pd.to_datetime(['2000-01-01', '2000-02-02'])
Out[1]: DatetimeIndex(['2000-01-01', '2000-02-02'], dtype='datetime64[ns]', freq=None)
In [2]: pd.date_range('2000-01-01', periods=365)
Out[2]:
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
'2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
'2000-01-09', '2000-01-10',
...
'2000-12-21', '2000-12-22', '2000-12-23', '2000-12-24',
'2000-12-25', '2000-12-26', '2000-12-27', '2000-12-28',
'2000-12-29', '2000-12-30'],
dtype='datetime64[ns]', length=365, freq='D')
Alternatively, you can supply arrays of Python datetime objects. These get converted automatically when used as arguments in xarray objects:
In [3]: import datetime
In [4]: xr.Dataset({'time': datetime.datetime(2000, 1, 1)})
Out[4]:
<xarray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
time datetime64[ns] 2000-01-01
When reading or writing netCDF files, xarray automatically decodes datetime and timedelta arrays using CF conventions (that is, by using a units attribute like 'days since 2000-01-01').
You can manually decode arrays in this form by passing a dataset to decode_cf():
In [5]: attrs = {'units': 'hours since 2000-01-01'}
In [6]: ds = xr.Dataset({'time': ('time', [0, 1, 2, 3], attrs)})
In [7]: xr.decode_cf(ds)
Out[7]:
<xarray.Dataset>
Dimensions: (time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
*empty*
One unfortunate limitation of using datetime64[ns] is that it limits the native representation of dates to those that fall between the years 1678 and 2262. When a netCDF file contains dates outside of these bounds, dates will be returned as arrays of netcdftime.datetime objects.
Datetime indexing¶
xarray borrows powerful indexing machinery from pandas (see Indexing and selecting data).
This allows for several useful and succinct forms of indexing, particularly for datetime64 data. For example, we support indexing with strings for single items and with the slice object:
In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24)
In [9]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})
In [10]: ds.sel(time='2000-01')
Out[10]:
<xarray.Dataset>
Dimensions: (time: 744)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
foo (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10'))
Out[11]:
<xarray.Dataset>
Dimensions: (time: 240)
Coordinates:
* time (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ...
Data variables:
foo (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...
You can also select a particular time by indexing with a datetime.time object:
In [12]: ds.sel(time=datetime.time(12))
Out[12]:
<xarray.Dataset>
Dimensions: (time: 365)
Coordinates:
* time (time) datetime64[ns] 2000-01-01T12:00:00 2000-01-02T12:00:00 ...
Data variables:
foo (time) int64 12 36 60 84 108 132 156 180 204 228 252 276 300 ...
For more details, read the pandas documentation.
Datetime components¶
xarray supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”:
In [13]: ds['time.month']
Out[13]:
<xarray.DataArray 'month' (time: 8760)>
array([ 1, 1, 1, ..., 12, 12, 12], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
In [14]: ds['time.dayofyear']
Out[14]:
<xarray.DataArray 'dayofyear' (time: 8760)>
array([ 1, 1, 1, ..., 365, 365, 365], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
xarray adds 'season' to the list of datetime components supported by pandas:
In [15]: ds['time.season']
Out[15]:
<xarray.DataArray 'season' (time: 8760)>
array(['DJF', 'DJF', 'DJF', ..., 'DJF', 'DJF', 'DJF'],
dtype='|S3')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
The set of valid seasons consists of ‘DJF’, ‘MAM’, ‘JJA’ and ‘SON’, labeled by the first letters of the corresponding months.
You can use these shortcuts with both Datasets and DataArray coordinates.
Resampling and grouped operations¶
Datetime components couple particularly well with grouped operations (see GroupBy: split-apply-combine) for analyzing features that repeat over time. Here’s how to calculate the mean by time of day:
In [16]: ds.groupby('time.hour').mean()
Out[16]:
<xarray.Dataset>
Dimensions: (hour: 24)
Coordinates:
* hour (hour) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
foo (hour) float64 4.368e+03 4.369e+03 4.37e+03 4.371e+03 4.372e+03 ...
For upsampling or downsampling temporal resolutions, xarray offers a resample() method building on the core functionality offered by the pandas method of the same name. Resample uses essentially the same API as resample in pandas.
For example, we can downsample our dataset from hourly to 6-hourly:
In [17]: ds.resample('6H', dim='time', how='mean')
Out[17]:
<xarray.Dataset>
Dimensions: (time: 1460)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
foo (time) float64 2.5 8.5 14.5 20.5 26.5 32.5 38.5 44.5 50.5 56.5 ...
Resample also works for upsampling, in which case intervals without any values are marked by NaN:
In [18]: ds.resample('30Min', 'time')
Out[18]:
<xarray.Dataset>
Dimensions: (time: 17519)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ...
Data variables:
foo (time) float64 0.0 nan 1.0 nan 2.0 nan 3.0 nan 4.0 nan 5.0 nan ...
Of course, all of these resampling and groupby operations work on both Dataset and DataArray objects with any number of additional dimensions.
For more examples of using grouped operations on a time dimension, see Toy weather data.
Working with pandas¶
One of the most important features of xarray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built in to pandas itself or provided by the pandas aware libraries such as Seaborn.
Hierarchical and tidy data¶
Tabular data is easiest to work with when it meets the criteria for tidy data:
- Each column holds a different variable.
- Each row holds a different observation.
In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.
Dataset and DataFrame¶
To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:
In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
...: coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
...: 'along_x': ('x', np.random.randn(2)),
...: 'scalar': 123})
...:
In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) |S1 'a' 'b' 'c'
* x (x) int64 10 20
scalar int64 123
along_x (x) float64 0.1192 -1.044
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [3]: df = ds.to_dataframe()
In [4]: df
Out[4]:
foo scalar along_x
x y
10 a 0.469112 123 0.119209
b -0.282863 123 0.119209
c -1.509059 123 0.119209
20 a -1.135632 123 -1.044236
b 1.212112 123 -1.044236
c -0.173215 123 -1.044236
We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of indexes which are in the index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
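For example (a minimal sketch reusing df from above):
# pivot the 'y' index level into the columns
df.unstack('y')
# or turn both index levels into ordinary columns
df.reset_index()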
To create a Dataset from a DataFrame, use the from_dataframe() class method:
In [5]: xr.Dataset.from_dataframe(df)
Out[5]:
<xarray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
scalar (x, y) int64 123 123 123 123 123 123
along_x (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
Notice that the dimensions of variables in the Dataset have now expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.
Likewise, all the coordinates (other than indexes) ended up as variables, because pandas has no notion of non-index coordinates.
DataArray and Series¶
DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:
In [6]: s = ds['foo'].to_series()
In [7]: s
Out[7]:
x y
10 a 0.469112
b -0.282863
c -1.509059
20 a -1.135632
b 1.212112
c -0.173215
Name: foo, dtype: float64
In [8]: xr.DataArray.from_series(s)
Out[8]:
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
[-1.136, 1.212, -0.173]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:
In [9]: s[::2]
Out[9]:
x y
10 a 0.469112
c -1.509059
20 b 1.212112
Name: foo, dtype: float64
In [10]: xr.DataArray.from_series(s[::2])
Out[10]:
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, nan, -1.509],
[ nan, 1.212, nan]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Multi-dimensional data¶
DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality (i.e., a 1D array is converted to a Series, 2D to DataFrame and 3D to Panel):
In [11]: arr = xr.DataArray(np.random.randn(2, 3),
....: coords=[('x', [10, 20]), ('y', ['a', 'b', 'c'])])
....:
In [12]: df = arr.to_pandas()
In [13]: df
Out[13]:
y a b c
x
10 -0.861849 -2.104569 -0.494929
20 1.071804 0.721555 -0.706771
To perform the inverse operation of converting any pandas objects into a data array with the same shape, simply use the DataArray constructor:
In [14]: xr.DataArray(df)
Out[14]:
<xarray.DataArray (x: 2, y: 3)>
array([[-0.862, -2.105, -0.495],
[ 1.072, 0.722, -0.707]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
xarray objects do not yet support hierarchical indexes, so if your data has a hierarchical index, you will either need to unstack it first or use the from_series() or from_dataframe() constructors described above.
Serialization and IO¶
xarray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.
Pickle¶
The simplest way to serialize an xarray object is to use Python’s built-in pickle module:
In [1]: import cPickle as pickle
In [2]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
...: coords={'x': [10, 20, 30, 40],
...: 'y': pd.date_range('2000-01-01', periods=5),
...: 'z': ('x', list('abcd'))})
...:
# use the highest protocol (-1) because it is way faster than the default
# text based pickle format
In [3]: pkl = pickle.dumps(ds, protocol=-1)
In [4]: pickle.loads(pkl)
Out[4]:
<xarray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int64 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
Pickle support is important because it doesn’t require any external libraries and lets you use xarray objects with Python modules like multiprocessing. However, there are two important caveats:
- To simplify serialization, xarray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
- Pickle will only work as long as the internal data structure of xarray objects remains unchanged. Because the internal design of xarray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xarray will work in future versions.
netCDF¶
Currently, the only disk based serialization format that xarray directly supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xarray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.
Reading and writing netCDF files with xarray requires the netCDF4-Python library or scipy to be installed.
We can save a Dataset to disk using the Dataset.to_netcdf method:
In [5]: ds.to_netcdf('saved_on_disk.nc')
By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the format and engine arguments.
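For example (a hedged sketch with a hypothetical file name), to write a netCDF3 file via scipy:
ds.to_netcdf('saved_as_netcdf3.nc', format='NETCDF3_64BIT', engine='scipy')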
We can load netCDF files to create a new Dataset using open_dataset():
In [6]: ds_disk = xr.open_dataset('saved_on_disk.nc')
In [7]: ds_disk
Out[7]:
<xarray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int32 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
A dataset can also be loaded or written to a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument. When writing multiple groups in one file, pass mode='a' to to_netcdf to ensure that each call does not delete the file.
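A sketch of this workflow, using hypothetical group names:
# the first call creates the file; mode='a' makes later calls append to it
ds.to_netcdf('groups.nc', group='foo/bar')
ds.to_netcdf('groups.nc', group='foo/baz', mode='a')
# read one group back by passing the same path-like string
ds_bar = xr.open_dataset('groups.nc', group='foo/bar')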
Data is always loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.
It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched.
Tip
xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a dataset entirely into memory by invoking the load() method.
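For example:
ds = xr.open_dataset('saved_on_disk.nc')
ds.load()  # pull all array values into memory before intensive computation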
Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:
# this automatically closes the dataset after use
In [8]: with xr.open_dataset('saved_on_disk.nc') as ds:
...: print(ds.keys())
...:
['y', 'x', 'foo', 'z']
Although xarray provides reasonable support for incremental reads of files on disk, it does not support incremental writes, which can be a useful strategy for dealing with datasets too big to fit into memory. Instead, xarray integrates with dask.array (see Out of core computation with dask), which provides a fully featured engine for streaming computation.
Reading encoded data¶
NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “add_offset” attributes). If the argument decode_cf=True (default) is given to open_dataset, xarray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
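For example, both of the following are documented open_dataset arguments; the first disables all CF decoding, the second only the time decoding:
raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)
no_times = xr.open_dataset('saved_on_disk.nc', decode_times=False)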
You can view this encoding information (among others) in the DataArray.encoding attribute:
In [9]: ds_disk['y'].encoding
Out[9]:
{'calendar': u'proleptic_gregorian',
'chunksizes': None,
'complevel': 0,
'contiguous': True,
'dtype': dtype('float64'),
'fletcher32': False,
'least_significant_digit': None,
'shuffle': False,
'source': 'saved_on_disk.nc',
'units': u'days since 2000-01-01 00:00:00',
'zlib': False}
Note that all operations that manipulate variables other than indexing will remove encoding information.
Writing encoded data¶
Conversely, you can customize how xarray writes netCDF files on disk by providing explicit encodings for each dataset variable. The encoding argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xarray to faithfully read encoded data back into memory.
It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xarray will write data to disk using a default encoding, or the options in the encoding attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.
In the file on disk, these encodings are saved as attributes on each variable, which allows xarray and other CF-compliant tools for working with netCDF files to read the data correctly.
Scaling and type conversions¶
These encoding options work on any version of the netCDF file format:
- dtype: Any valid NumPy dtype or string convertible to a dtype, e.g., 'int16' or 'float32'. This controls the type of the data written on disk.
- _FillValue: Values of NaN in xarray variables are remapped to this value when saved on disk. This is important when converting floating point data with missing values to integers on disk, because NaN is not a valid value for integer dtypes.
- scale_factor and add_offset: Used to convert from encoded data on disk to the decoded data in memory, according to the formula decoded = scale_factor * encoded + add_offset.
These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable foo with a precision of 0.1 in 16-bit integers while converting NaN to -9999, we would use encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}. Compression and decompression with such discretization is extremely fast.
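Spelled out as a call to to_netcdf, that example looks like this (a sketch reusing the dataset from above; note this applies to floating point variables with missing values):
ds.to_netcdf('discretized.nc',
             encoding={'foo': {'dtype': 'int16',
                               'scale_factor': 0.1,
                               '_FillValue': -9999}})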
Chunk based compression¶
zlib, complevel, fletcher32, contiguous and chunksizes can be used for enabling netCDF4/HDF5’s chunk based compression, as described in the documentation for createVariable for netCDF4-Python. This only works for netCDF4 files and thus requires using format='netCDF4' and either engine='netcdf4' or engine='h5netcdf'.
Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5’s compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.
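A sketch of enabling zlib compression on a single variable, using the options named above:
ds.to_netcdf('gzipped.nc', format='netCDF4', engine='netcdf4',
             encoding={'foo': {'zlib': True, 'complevel': 4}})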
Time units¶
The units and calendar attributes control how xarray serializes datetime64 and timedelta64 arrays to datasets on disk as numeric values. The units encoding should be a string like 'days since 1900-01-01' for datetime64 data or a string like 'days' for timedelta64 data. calendar should be one of the calendar types supported by netCDF4-python: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.
By default, xarray uses the ‘proleptic_gregorian’ calendar and units of the smallest time difference between values, with a reference time of the first time value.
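To override these defaults, supply units and calendar through encoding like any other option (a sketch for the time coordinate 'y' in the dataset above):
ds.to_netcdf('times.nc',
             encoding={'y': {'units': 'days since 2000-01-01',
                             'calendar': 'proleptic_gregorian'}})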
OPeNDAP¶
xarray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.
For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by IRI at Columbia:
In [10]: remote_data = xr.open_dataset(
....: 'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
....: decode_times=False)
....:
In [11]: remote_data
Out[11]:
<xarray.Dataset>
Dimensions: (T: 1422, X: 1405, Y: 621)
Coordinates:
* X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
* Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
Data variables:
ppt (T, Y, X) float64 ...
tdmean (T, Y, X) float64 ...
tmax (T, Y, X) float64 ...
tmin (T, Y, X) float64 ...
Attributes:
Conventions: IRIDL
expires: 1375315200
Note
Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xarray’s automatic decoding to fail. The way to work around this is either to set decode_cf=False in open_dataset to turn off all use of CF conventions, or to disable only the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xarray does not expect (the integer 360 instead of a string like '360_day').
We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:
In [12]: tmax = remote_data['tmax'][:500, ::3, ::3]
In [13]: tmax
Out[13]:
<xarray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Coordinates:
* Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ...
* X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ...
Attributes:
pointwidth: 120
standard_name: air_temperature
units: Celsius_scale
expires: 1443657600
# the data is downloaded automatically when we make the plot
In [14]: tmax[0].plot()

Formats supported by PyNIO¶
xarray can also read GRIB, HDF4 and other file formats supported by PyNIO, if PyNIO is installed. To use PyNIO to read such files, supply engine='pynio' to open_dataset().
We recommend installing PyNIO via conda:
conda install -c dbrown pynio
Combining multiple files¶
NetCDF files are often encountered in collections, e.g., with different files corresponding to different model runs. xarray can straightforwardly combine such files into a single Dataset by making use of concat().
Note
Version 0.5 includes experimental support for manipulating datasets that don’t fit into memory with dask. If you have dask installed, you can open multiple files simultaneously using open_mfdataset():
xr.open_mfdataset('my/files/*.nc')
This function automatically concatenates and merges multiple files into a single xarray Dataset. For more details, see Reading and writing data.
For example, here’s how we could approximate MFDataset from the netCDF4 library:
from glob import glob
import xarray as xr

def read_netcdfs(files, dim):
    # glob expands paths with * to a list of files, like the unix shell
    paths = sorted(glob(files))
    datasets = [xr.open_dataset(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

read_netcdfs('/all/my/files/*.nc', dim='time')
This function will work in many cases, but it’s not very robust. First, it never closes files, which means it will fail once you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.
Here’s a slightly more sophisticated example of how to remedy these deficiencies:
def read_netcdfs(files, dim, transform_func=None):
    def process_one_path(path):
        # use a context manager, to ensure the file gets closed after use
        with xr.open_dataset(path) as ds:
            # transform_func should do some sort of selection or
            # aggregation
            if transform_func is not None:
                ds = transform_func(ds)
            # load all data from the transformed dataset, to ensure we can
            # use it after closing each original file
            ds.load()
            return ds

    paths = sorted(glob(files))
    datasets = [process_one_path(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined
# here we suppose we only care about the combined mean of each file;
# you might also use indexing operations like .sel to subset datasets
read_netcdfs('/all/my/files/*.nc', dim='time',
             transform_func=lambda ds: ds.mean())
This pattern works well and is very robust. We’ve used similar code to process tens of thousands of files constituting 100s of GB of data.
Out of core computation with dask¶
xarray integrates with dask to support streaming computation on datasets that don’t fit into memory.
Currently, dask is an entirely optional feature for xarray. However, the benefits of using dask are sufficiently strong that dask may become a required dependency in a future version of xarray.
For a full example of how to use xarray’s dask integration, read the blog post introducing xarray and dask.
What is a dask array?¶

Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.
Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
The actual computation is controlled by a multi-processing or thread pool, which allows dask to take full advantage of the multiple processors available on most modern computers.
For more details on dask, read its documentation.
Reading and writing data¶
The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function.
In [1]: ds = xr.open_dataset('example-data.nc', chunks={'time': 10})
In [2]: ds
Out[2]:
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...
In this example latitude and longitude do not appear in the chunks dict, so only one chunk will be used along those dimensions. It is also entirely equivalent to open a dataset using open_dataset and then chunk the data using the chunk() method, e.g., xr.open_dataset('example-data.nc').chunk({'time': 10}).
To open multiple files simultaneously, use open_mfdataset():
xr.open_mfdataset('my/files/*.nc')
This function will automatically concatenate and merge datasets into one in the simple cases that it understands (see auto_combine() for the full disclaimer). By default, open_mfdataset will chunk each netCDF file into a single dask array; again, supply the chunks argument to control the size of the resulting dask arrays. In more complex cases, you can open each file individually using open_dataset and merge the result, as described in Combining data.
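For example, to combine the files and control the chunking in a single step (a sketch; chunks works as described for open_dataset above):
ds = xr.open_mfdataset('my/files/*.nc', chunks={'time': 10})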
You’ll notice that printing a dataset still shows a preview of array values, even if they are actually dask arrays. We can do this quickly with dask because we only need to compute the first few values (typically from the first block). To reveal the true nature of an array, print a DataArray:
In [3]: ds.temperature
Out[3]:
<xarray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
dask.array<example..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
Once you’ve manipulated a dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way.
Using dask with xarray¶
Nearly all existing xarray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with dask arrays. When you load data as a dask array in an xarray data structure, almost all xarray operations will keep it as a dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Converting a dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xarray will automatically load coordinate labels into memory.
The easiest way to convert an xarray data structure from lazy dask arrays into eager, in-memory numpy arrays is to use the load() method:
In [4]: ds.load()
Out[4]:
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...
You can also access values, which will always be a numpy array:
In [5]: ds.temperature.values
Out[5]:
array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, ..., 8.726e-01, -1.538e+00],
...
# truncated for brevity
Explicit conversion by wrapping a DataArray with np.asarray also works:
In [6]: np.asarray(ds.temperature)
Out[6]:
array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, ..., 8.726e-01, -1.538e+00],
...
With the current version of dask, there is no automatic alignment of chunks when performing operations between dask arrays with different chunk sizes. If your computation involves multiple dask arrays with different chunks, you may need to explicitly rechunk each array to ensure compatibility. With xarray, both converting data to dask arrays and changing the chunk sizes of existing dask arrays are done with the chunk() method:
In [7]: rechunked = ds.chunk({'latitude': 100, 'longitude': 100})
You can view the size of existing chunks on an array by viewing the chunks attribute:
In [8]: rechunked.chunks
Out[8]: Frozen(SortedKeysDict({'latitude': (100, 80), 'longitude': (100, 100, 100, 60), 'time': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5)}))
If the arrays in a dataset do not have consistent chunk sizes along a particular dimension, an exception is raised when you try to access .chunks.
Note
In the future, we would like to enable automatic alignment of dask chunksizes (but not the other way around). We might also require that all arrays in a dataset share the same chunking alignment. Neither of these are currently done.
NumPy ufuncs like np.sin currently only work on eagerly evaluated arrays (this will change with the next major NumPy release). We have provided replacements that also work on all xarray objects, including those that store lazy dask arrays, in the xarray.ufuncs module:
In [9]: import xarray.ufuncs as xu
In [10]: xu.sin(rechunked)
Out[10]:
<xarray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4521 -0.2791 -0.9981 ...
To access dask arrays directly, use the new DataArray.data attribute. This attribute exposes array data either as a dask array or as a numpy array, depending on whether it has been loaded into dask or not:
In [11]: ds.temperature.data
Out[11]: dask.array<xarray-..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Note
In the future, we may extend .data to support other “computable” array backends beyond dask and numpy (e.g., to support sparse arrays).
Chunking and performance¶
The chunks parameter has critical performance implications when using dask arrays. If your chunks are too small, queueing up operations will be extremely slow, because dask will translate each operation into a huge number of tasks mapped across chunks. Computation on dask arrays with small chunks can also be slow, because each operation on a chunk has some fixed overhead from the Python interpreter and the dask task executor.
Conversely, if your chunks are too big, some of your computation may be wasted, because dask only computes results one chunk at a time.
A good rule of thumb is to create arrays with a minimum chunksize of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunksizes.
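As a rough illustration with the example dataset above, chunks of 100 time steps give 100 x 180 x 360 = 6.48 million elements per chunk, comfortably above the one-million guideline:
ds = xr.open_dataset('example-data.nc', chunks={'time': 100})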
Plotting¶
Introduction¶
Labeled data enables expressive computations. These same labels can also be used to easily create informative plots.
xarray’s plotting capabilities are centered around xarray.DataArray objects. To plot xarray.Dataset objects, simply access the relevant DataArrays, i.e., dset['var1']. Here we focus mostly on arrays 2d or larger. If your data fits nicely into a pandas DataFrame then you’re better off using one of the more developed tools there.
xarray plotting functionality is a thin wrapper around the popular matplotlib library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two. Matplotlib must be installed before xarray can plot.
For more extensive plotting applications consider the following projects:
- Seaborn: “provides a high-level interface for drawing attractive statistical graphics.” Integrates well with pandas.
- Holoviews: “Composable, declarative data structures for building even complex visualizations easily.” Works for 2d datasets.
- Cartopy: Provides cartographic tools.
Imports¶
The following imports are necessary for all of the examples.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import matplotlib.pyplot as plt
In [4]: import xarray as xr
For these examples we’ll use the North American air temperature dataset.
In [5]: airtemps = xr.tutorial.load_dataset('air_temperature')
In [6]: airtemps
Out[6]:
<xarray.Dataset>
Dimensions: (lat: 25, lon: 53, time: 2920)
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
* time (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ...
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
Data variables:
air (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
platform: Model
Conventions: COARDS
references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html
description: Data is from NMC initialized reanalysis
(4x/day). These are the 0.9950 sigma level values.
title: 4x daily NMC reanalysis (1948)
# Convert to celsius
In [7]: air = airtemps.air - 273.15
One Dimension¶
Simple Example¶
xarray uses the coordinate name to label the x axis.
In [8]: air1d = air.isel(lat=10, lon=10)
In [9]: air1d.plot()
Out[9]: [<matplotlib.lines.Line2D at 0x7f23a3b8d0d0>]

Additional Arguments¶
Additional arguments are passed directly to the matplotlib function which does the work. For example, xarray.plot.line() calls matplotlib.pyplot.plot passing in the index and the array values as x and y, respectively. So to make a line plot with blue triangles, you can use a matplotlib format string:
In [10]: air1d[:200].plot.line('b-^')
Out[10]: [<matplotlib.lines.Line2D at 0x7f23a39f6e10>]

Note
Not all xarray plotting methods support passing positional arguments to the wrapped matplotlib functions, but they do all support keyword arguments.
Keyword arguments work the same way, and are more explicit.
In [11]: air1d[:200].plot.line(color='purple', marker='o')
Out[11]: [<matplotlib.lines.Line2D at 0x7f23a3d4ce90>]

Adding to Existing Axis¶
To add the plot to an existing axis pass in the axis as a keyword argument ax. This works for all xarray plotting methods. In this example axes is an array consisting of the left and right axes created by plt.subplots.
In [12]: fig, axes = plt.subplots(ncols=2)
In [13]: axes
Out[13]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f23a3858910>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f23a3d3b990>], dtype=object)
In [14]: air1d.plot(ax=axes[0])
Out[14]: [<matplotlib.lines.Line2D at 0x7f23a3ccd350>]
In [15]: air1d.plot.hist(ax=axes[1])
Out[15]:
(array([ 9., 38., 255., 584., 542., 489., 368., 258., 327., 50.]),
array([ 0.95 , 2.719, 4.488, ..., 15.102, 16.871, 18.64 ]),
<a list of 10 Patch objects>)
In [16]: plt.tight_layout()
In [17]: plt.show()

On the right is a histogram created by xarray.plot.hist().
Two Dimensions¶
Simple Example¶
The default method xarray.DataArray.plot() sees that the data is 2 dimensional and calls xarray.plot.pcolormesh().
In [18]: air2d = air.isel(time=500)
In [19]: air2d.plot()
Out[19]: <matplotlib.collections.QuadMesh at 0x7f23b4864550>

All 2d plots in xarray allow the use of the keyword arguments yincrease and xincrease.
In [20]: air2d.plot(yincrease=False)
Out[20]: <matplotlib.collections.QuadMesh at 0x7f23a348f390>

Note
We use xarray.plot.pcolormesh() as the default two-dimensional plot method because it is more flexible than xarray.plot.imshow(). However, for large arrays, imshow can be much faster than pcolormesh. If speed is important to you and you are plotting a regular mesh, consider using imshow.
Missing Values¶
xarray plots data with missing values.
In [21]: bad_air2d = air2d.copy()
In [22]: bad_air2d[dict(lat=slice(0, 10), lon=slice(0, 25))] = np.nan
In [23]: bad_air2d.plot()
Out[23]: <matplotlib.collections.QuadMesh at 0x7f23a2b62ed0>

Nonuniform Coordinates¶
It’s not necessary for the coordinates to be evenly spaced. Both xarray.plot.pcolormesh() (default) and xarray.plot.contourf() can produce plots with nonuniform coordinates.
In [24]: b = air2d.copy()
# Apply a nonlinear transformation to one of the coords
In [25]: b.coords['lat'] = np.log(b.coords['lat'])
In [26]: b.plot()
Out[26]: <matplotlib.collections.QuadMesh at 0x7f23a39e9310>

Calling Matplotlib¶
Since this is a thin wrapper around matplotlib, all the functionality of matplotlib is available.
In [27]: air2d.plot(cmap=plt.cm.Blues)
Out[27]: <matplotlib.collections.QuadMesh at 0x7f23a3cc3c90>
In [28]: plt.title('These colors prove North America\nhas fallen in the ocean')
Out[28]: <matplotlib.text.Text at 0x7f23a3b773d0>
In [29]: plt.ylabel('latitude')
Out[29]: <matplotlib.text.Text at 0x7f23a3bab450>
In [30]: plt.xlabel('longitude')
Out[30]: <matplotlib.text.Text at 0x7f23a3cc3510>
In [31]: plt.tight_layout()
In [32]: plt.show()

Note
xarray methods update label information and generally play around with the axes. So any kind of updates to the plot should be done after the call to xarray’s plot. In the example below, plt.xlabel effectively does nothing, since air2d.plot() updates the xlabel.
In [33]: plt.xlabel('Never gonna see this.')
Out[33]: <matplotlib.text.Text at 0x7f23a2aecc10>
In [34]: air2d.plot()
Out[34]: <matplotlib.collections.QuadMesh at 0x7f23b5839750>
In [35]: plt.show()

Colormaps¶
xarray borrows logic from Seaborn to infer what kind of color map to use. For example, consider the original data in Kelvins rather than Celsius:
In [36]: airtemps.air.isel(time=0).plot()
Out[36]: <matplotlib.collections.QuadMesh at 0x7f23a2983450>

The Celsius data contain 0, so a diverging color map was used. The Kelvin data do not contain 0, so the default color map was used.
Robust¶
Outliers often have an extreme effect on the output of the plot. Here we add two bad data points. This affects the color scale, washing out the plot.
In [37]: air_outliers = airtemps.air.isel(time=0).copy()
In [38]: air_outliers[0, 0] = 100
In [39]: air_outliers[-1, -1] = 400
In [40]: air_outliers.plot()
Out[40]: <matplotlib.collections.QuadMesh at 0x7f23a287be50>

This plot shows that we have outliers. The easy way to visualize the data without the outliers is to pass the parameter robust=True. This will use the 2nd and 98th percentiles of the data to compute the color limits.
In [41]: air_outliers.plot(robust=True)
Out[41]: <matplotlib.collections.QuadMesh at 0x7f23a27b4290>

Observe that the ranges of the color bar have changed. The arrows on the color bar indicate that the colors include data points outside the bounds.
Discrete Colormaps¶
It is often useful, when visualizing 2d data, to use a discrete colormap, rather than the default continuous colormaps that matplotlib uses. The levels keyword argument can be used to generate plots with discrete colormaps. For example, to make a plot with 8 discrete color intervals:
In [42]: air2d.plot(levels=8)
Out[42]: <matplotlib.collections.QuadMesh at 0x7f23a3bd74d0>

It is also possible to use a list of levels to specify the boundaries of the discrete colormap:
In [43]: air2d.plot(levels=[0, 12, 18, 30])
Out[43]: <matplotlib.collections.QuadMesh at 0x7f23a264c390>

You can also specify a list of discrete colors through the colors argument:
In [44]: flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
In [45]: air2d.plot(levels=[0, 12, 18, 30], colors=flatui)
Out[45]: <matplotlib.collections.QuadMesh at 0x7f23a2522a10>

Finally, if you have Seaborn installed, you can also specify a seaborn color palette to the cmap argument. Note that levels must be specified with seaborn color palettes if using imshow or pcolormesh (but not with contour or contourf, since levels are chosen automatically).
In [46]: air2d.plot(levels=10, cmap='husl')
Out[46]: <matplotlib.collections.QuadMesh at 0x7f239e924690>

Faceting¶
Faceting here refers to splitting an array along one or two dimensions and plotting each group. xarray’s basic plotting is useful for plotting two dimensional arrays. What about three or four dimensional arrays? That’s where facets become helpful.
Consider the temperature data set. There are 4 observations per day for two years, which makes for 2920 values along the time dimension. One way to visualize this data is to make a separate plot for each time period.
The faceted dimension should not have too many values; faceting on the time dimension will produce 2920 plots. That’s too much to be helpful. To handle this situation try performing an operation that reduces the size of the data in some way. For example, we could compute the average air temperature for each month and reduce the size of this dimension from 2920 -> 12. A simpler way is to just take a slice on that dimension. So let’s use a slice to pick 6 times throughout the first year.
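As an aside, the monthly averaging mentioned above is a one-liner with the grouped operations covered earlier (a sketch using the 'time.month' virtual variable):
monthly_means = air.groupby('time.month').mean('time')  # time: 2920 -> month: 12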
In [47]: t = air.isel(time=slice(0, 365 * 4, 250))
In [48]: t.coords
Out[48]:
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
* time (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00 2013-05-06 ...
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
Simple Example¶
The easiest way to create faceted plots is to pass in row or col arguments to the xarray plotting methods/functions. This returns a xarray.plot.FacetGrid object.
In [49]: g_simple = t.plot(x='lon', y='lat', col='time', col_wrap=3)

4 dimensional¶
For 4 dimensional arrays we can use the rows and columns of the grids. Here we create a 4 dimensional array by taking the original data and adding a fixed amount. Now we can see how the temperature maps would compare if one were much hotter.
In [50]: t2 = t.isel(time=slice(0, 2))
In [51]: t4d = xr.concat([t2, t2 + 40], pd.Index(['normal', 'hot'], name='fourth_dim'))
# This is a 4d array
In [52]: t4d.coords
Out[52]:
Coordinates:
* lat (lat) float64 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 ...
* time (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00
* lon (lon) float64 200.0 202.5 205.0 207.5 210.0 212.5 215.0 ...
* fourth_dim (fourth_dim) object 'normal' 'hot'
In [53]: t4d.plot(x='lon', y='lat', col='time', row='fourth_dim')
Out[53]: <xarray.plot.facetgrid.FacetGrid at 0x7f239e489a10>

Other features¶
Faceted plotting supports other arguments common to xarray 2d plots.
In [54]: hasoutliers = t.isel(time=slice(0, 5)).copy()
In [55]: hasoutliers[0, 0, 0] = -100
In [56]: hasoutliers[-1, -1, -1] = 400
In [57]: g = hasoutliers.plot.pcolormesh('lon', 'lat', col='time', col_wrap=3,
....: robust=True, cmap='viridis')
....:

FacetGrid Objects¶
xarray.plot.FacetGrid is used to control the behavior of the multiple plots. It borrows an API and code from Seaborn. The structure is contained within the axes and name_dicts attributes, both 2d Numpy object arrays.
In [58]: g.axes
Out[58]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f239e2050d0>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f239e08b990>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f239e06aa10>],
[<matplotlib.axes._subplots.AxesSubplot object at 0x7f239dfedb90>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f239df55050>,
<matplotlib.axes._subplots.AxesSubplot object at 0x7f239ded7190>]], dtype=object)
In [59]: g.name_dicts
Out[59]:
array([[{'time': numpy.datetime64('2013-01-01T00:00:00.000000000+0000')},
{'time': numpy.datetime64('2013-03-04T12:00:00.000000000+0000')},
{'time': numpy.datetime64('2013-05-06T00:00:00.000000000+0000')}],
[{'time': numpy.datetime64('2013-07-07T12:00:00.000000000+0000')},
{'time': numpy.datetime64('2013-09-08T00:00:00.000000000+0000')}, None]], dtype=object)
It’s possible to select the xarray.DataArray or xarray.Dataset corresponding to the FacetGrid through the name_dicts.
In [60]: g.data.loc[g.name_dicts[0, 0]]
Out[60]:
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[-100. , -30.65, -29.65, ..., -40.35, -37.65, -34.55],
[ -29.35, -28.65, -28.45, ..., -40.35, -37.85, -33.85],
[ -23.15, -23.35, -24.26, ..., -39.95, -36.76, -31.45],
...,
[ 23.45, 23.05, 23.25, ..., 22.25, 21.95, 21.55],
[ 22.75, 23.05, 23.64, ..., 22.75, 22.75, 22.05],
[ 23.14, 23.64, 23.95, ..., 23.75, 23.64, 23.45]])
Coordinates:
* lat (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
time datetime64[ns] 2013-01-01
* lon (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
Here is an example of using the lower level API and then modifying the axes after they have been plotted.
In [61]: g = t.plot.imshow('lon', 'lat', col='time', col_wrap=3, robust=True)
In [62]: for i, ax in enumerate(g.axes.flat):
....: ax.set_title('Air Temperature %d' % i)
....:
In [63]: bottomright = g.axes[-1, -1]
In [64]: bottomright.annotate('bottom right', (240, 40))
Out[64]: <matplotlib.text.Annotation at 0x7f239e0fa550>
In [65]: plt.show()

TODO: add an example of using the map method to plot dataset variables (e.g., with plt.quiver).
Maps¶
To follow this section you’ll need to have Cartopy installed and working.
This script will plot the air temperature on a map.
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

air = (xr.tutorial
       .load_dataset('air_temperature')
       .air
       .isel(time=0))

ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ax.set_global()
air.plot.contourf(ax=ax, transform=ccrs.PlateCarree())
ax.coastlines()
plt.savefig('cartopy_example.png')
Here is the resulting image:

Details¶
Ways to Use¶
There are three ways to use the xarray plotting functionality:
- Use plot as a convenience method for a DataArray.
- Access a specific plotting method from the plot attribute of a DataArray.
- Directly from the xarray plot submodule.
These are provided for user convenience; they all call the same code.
In [66]: import xarray.plot as xplt
In [67]: da = xr.DataArray(range(5))
In [68]: fig, axes = plt.subplots(ncols=2, nrows=2)
In [69]: da.plot(ax=axes[0, 0])
Out[69]: [<matplotlib.lines.Line2D at 0x7f239da326d0>]
In [70]: da.plot.line(ax=axes[0, 1])
Out[70]: [<matplotlib.lines.Line2D at 0x7f239da32710>]
In [71]: xplt.plot(da, ax=axes[1, 0])
Out[71]: [<matplotlib.lines.Line2D at 0x7f239d8dd110>]
In [72]: xplt.line(da, ax=axes[1, 1])
Out[72]: [<matplotlib.lines.Line2D at 0x7f23ac2638d0>]
In [73]: plt.tight_layout()
In [74]: plt.show()

Here the output is the same. Since the data is 1 dimensional the line plot was used.
The convenience method xarray.DataArray.plot() dispatches to an appropriate plotting function based on the dimensions of the DataArray and whether the coordinates are sorted and uniformly spaced. This table describes what gets plotted:
Dimensions | Plotting function |
1 | xarray.plot.line() |
2 | xarray.plot.pcolormesh() |
Anything else | xarray.plot.hist() |
Coordinates¶
If you’d like to find out what’s really going on in the coordinate system, read on.
In [75]: a0 = xr.DataArray(np.zeros((4, 3, 2)), dims=('y', 'x', 'z'),
....: name='temperature')
....:
In [76]: a0[0, 0, 0] = 1
In [77]: a = a0.isel(z=0)
In [78]: a
Out[78]:
<xarray.DataArray 'temperature' (y: 4, x: 3)>
array([[ 1., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Coordinates:
* y (y) int64 0 1 2 3
* x (x) int64 0 1 2
z int64 0
The plot will produce an image corresponding to the values of the array. Hence the top left pixel will be a different color than the others. Before reading on, you may want to look at the coordinates and think carefully about what the limits, labels, and orientation for each of the axes should be.
In [79]: a.plot()
Out[79]: <matplotlib.collections.QuadMesh at 0x7f239d781510>

It may seem strange that the values on the y axis are decreasing with -0.5 on the top. This is because the pixels are centered over their coordinates, and the axis labels and ranges correspond to the values of the coordinates.
API reference¶
This page provides an auto-generated summary of xarray’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.
Top-level functions¶
align(*objects[, join, copy]) | Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes. |
broadcast(*args) | Explicitly broadcast any number of DataArray or Dataset objects against one another. |
concat(objs[, dim, data_vars, coords, ...]) | Concatenate xarray objects along a new or existing dimension. |
merge(objects[, compat, join]) | Merge any number of xarray objects into a single Dataset as variables. |
set_options(**kwargs) | Set global state within a controlled context |
Dataset¶
Creating a dataset¶
Dataset([data_vars, coords, attrs, compat]) | A multi-dimensional, in memory, array database. |
decode_cf(obj[, concat_characters, ...]) | Decode the given Dataset or Datastore according to CF conventions into a new Dataset. |
Attributes¶
Dataset.dims | Mapping from dimension names to lengths. |
Dataset.data_vars | Dictionary of xarray.DataArray objects corresponding to data variables |
Dataset.coords | Dictionary of xarray.DataArray objects corresponding to coordinate |
Dataset.attrs | Dictionary of global attributes on this dataset |
Dictionary interface¶
Datasets implement the mapping interface with keys given by variable names and values given by DataArray objects.
Dataset.__getitem__(key) | Access variables or coordinates this dataset as a DataArray. |
Dataset.__setitem__(key, value) | Add an array to this dataset. |
Dataset.__delitem__(key) | Remove a variable from this dataset. |
Dataset.update(other[, inplace]) | Update this dataset’s variables with those from another dataset. |
Dataset.iteritems(...) | |
Dataset.itervalues(...) | |
Dataset contents¶
Dataset.copy([deep]) | Returns a copy of this dataset. |
Dataset.assign(**kwargs) | Assign new data variables to a Dataset, returning a new object with all the original variables in addition to the new ones. |
Dataset.assign_coords(**kwargs) | Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates. |
Dataset.pipe(func, *args, **kwargs) | Apply func(self, *args, **kwargs) |
Dataset.merge(other[, inplace, ...]) | Merge the arrays of two datasets into a single dataset. |
Dataset.rename(name_dict[, inplace]) | Returns a new object with renamed variables and dimensions. |
Dataset.swap_dims(dims_dict[, inplace]) | Returns a new object with swapped dimensions. |
Dataset.drop(labels[, dim]) | Drop variables or index labels from this dataset. |
Dataset.set_coords(names[, inplace]) | Given names of one or more variables, set them as coordinates |
Dataset.reset_coords([names, drop, inplace]) | Given names of coordinates, reset them to become variables |
Comparisons¶
Dataset.equals(other) | Two Datasets are equal if they have matching variables and coordinates, all of which are equal. |
Dataset.identical(other) | Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates. |
Dataset.broadcast_equals(other) | Two Datasets are broadcast equal if they are equal after broadcasting all variables against each other. |
Indexing¶
Dataset.loc | Attribute for location based indexing. |
Dataset.isel(**indexers) | Returns a new dataset with each array indexed along the specified dimension(s). |
Dataset.sel([method, tolerance]) | Returns a new dataset with each array indexed by tick labels along the specified dimension(s). |
Dataset.isel_points([dim]) | Returns a new dataset with each array indexed pointwise along the specified dimension(s). |
Dataset.sel_points([dim, method, tolerance]) | Returns a new dataset with each array indexed pointwise by tick labels along the specified dimension(s). |
Dataset.squeeze([dim]) | Returns a new dataset with squeezed data. |
Dataset.reindex([indexers, method, ...]) | Conform this object onto a new set of indexes, filling in missing values with NaN. |
Dataset.reindex_like(other[, method, ...]) | Conform this object onto the indexes of another object, filling in missing values with NaN. |
Computation¶
Dataset.apply(func[, keep_attrs, args]) | Apply a function over the data variables in this dataset. |
Dataset.reduce(func[, dim, keep_attrs, ...]) | Reduce this dataset by applying func along some dimension(s). |
Dataset.groupby(group[, squeeze]) | Returns a GroupBy object for performing grouped operations. |
Dataset.groupby_bins(group, bins[, right, ...]) | Returns a GroupBy object for performing grouped operations. |
Dataset.resample(freq, dim[, how, skipna, ...]) | Resample this object to a new temporal resolution. |
Dataset.diff(dim[, n, label]) | Calculate the n-th order discrete difference along given axis. |
Aggregation: all any argmax argmin max mean median min prod sum std var
Missing values: isnull notnull count dropna fillna where
ndarray methods: argsort clip conj conjugate imag round real T
Grouped operations: assign assign_coords first last fillna where
Reshaping and reorganizing¶
Dataset.transpose(*dims) | Return a new Dataset object with all array dimensions transposed. |
Dataset.stack(**dimensions) | Stack any number of existing dimensions into a single new dimension. |
Dataset.unstack(dim) | Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions. |
Dataset.shift(**shifts) | Shift this dataset by an offset along one or more dimensions. |
Dataset.roll(**shifts) | Roll this dataset by an offset along one or more dimensions. |
DataArray¶
DataArray(data[, coords, dims, name, attrs, ...]) | N-dimensional array with labeled coordinates and dimensions. |
Attributes¶
DataArray.values | The array’s data as a numpy.ndarray |
DataArray.data | The array’s data as a dask or numpy array |
DataArray.coords | Dictionary-like container of coordinate arrays. |
DataArray.dims | Dimension names associated with this array. |
DataArray.name | The name of this array. |
DataArray.attrs | Dictionary storing arbitrary metadata with this array. |
DataArray.encoding | Dictionary of format-specific settings for how this array should be serialized. |
DataArray contents¶
DataArray.assign_coords(**kwargs) | Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates. |
DataArray.rename(new_name_or_name_dict) | Returns a new DataArray with renamed coordinates and/or a new name. |
DataArray.swap_dims(dims_dict) | Returns a new DataArray with swapped dimensions. |
DataArray.drop(labels[, dim]) | Drop coordinates or index labels from this DataArray. |
DataArray.reset_coords([names, drop, inplace]) | Given names of coordinates, reset them to become variables. |
DataArray.copy([deep]) | Returns a copy of this array. |
Indexing¶
DataArray.__getitem__(key) | |
DataArray.__setitem__(key, value) | |
DataArray.loc | Attribute for location based indexing like pandas. |
DataArray.isel(**indexers) | Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s). |
DataArray.sel([method, tolerance]) | Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s). |
DataArray.isel_points([dim]) | Return a new DataArray whose dataset is given by pointwise integer indexing along the specified dimension(s). |
DataArray.sel_points([dim, method, tolerance]) | Return a new DataArray whose dataset is given by pointwise selection of index labels along the specified dimension(s). |
DataArray.squeeze([dim]) | Return a new DataArray object with squeezed data. |
DataArray.reindex([method, tolerance, copy]) | Conform this object onto a new set of indexes, filling in missing values with NaN. |
DataArray.reindex_like(other[, method, ...]) | Conform this object onto the indexes of another object, filling in missing values with NaN. |
Comparisons¶
DataArray.equals(other) | True if two DataArrays have the same dimensions, coordinates and values; otherwise False. |
DataArray.identical(other) | Like equals, but also checks the array name and attributes, and attributes on all coordinates. |
DataArray.broadcast_equals(other) | Two DataArrays are broadcast equal if they are equal after broadcasting them against each other such that they have the same dimensions. |
Computation¶
DataArray.reduce(func[, dim, axis, keep_attrs]) | Reduce this array by applying func along some dimension(s). |
DataArray.groupby(group[, squeeze]) | Returns a GroupBy object for performing grouped operations. |
DataArray.groupby_bins(group, bins[, right, ...]) | Returns a GroupBy object for performing grouped operations. |
DataArray.rolling([min_periods, center]) | Rolling window object. |
DataArray.resample(freq, dim[, how, skipna, ...]) | Resample this object to a new temporal resolution. |
DataArray.get_axis_num(dim) | Return axis number(s) corresponding to dimension(s) in this array. |
DataArray.diff(dim[, n, label]) | Calculate the n-th order discrete difference along given axis. |
DataArray.dot(other) | Perform dot product of two DataArrays along their shared dims. |
Aggregation: all any argmax argmin max mean median min prod sum std var
Missing values: isnull notnull count dropna fillna where
ndarray methods: argsort clip conj conjugate imag searchsorted round real T
Grouped operations: assign_coords first last fillna where
Reshaping and reorganizing¶
DataArray.transpose(*dims) | Return a new DataArray object with transposed dimensions. |
DataArray.stack(**dimensions) | Stack any number of existing dimensions into a single new dimension. |
DataArray.unstack(dim) | Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions. |
DataArray.shift(**shifts) | Shift this array by an offset along one or more dimensions. |
DataArray.roll(**shifts) | Roll this array by an offset along one or more dimensions. |
Universal functions¶
These functions are copied from NumPy, but extended to work on NumPy arrays, dask arrays and all xarray objects. You can find them in the xarray.ufuncs module:
angle arccos arccosh arcsin arcsinh arctan arctan2 arctanh ceil conj copysign cos cosh deg2rad degrees exp expm1 fabs fix floor fmax fmin fmod frexp hypot imag iscomplex isfinite isinf isnan isreal ldexp log log10 log1p log2 logaddexp logaddexp2 logical_and logical_not logical_or logical_xor maximum minimum nextafter rad2deg radians real rint sign signbit sin sinh sqrt square tan tanh trunc
IO / Conversion¶
Dataset methods¶
open_dataset(filename_or_obj[, group, ...]) | Load and decode a dataset from a file or file-like object. |
open_mfdataset(paths[, chunks, concat_dim, ...]) | Open multiple files as a single dataset. |
Dataset.to_netcdf([path, mode, format, ...]) | Write dataset contents to a netCDF file. |
save_mfdataset(datasets, paths[, mode, ...]) | Write multiple datasets to disk as netCDF files simultaneously. |
Dataset.to_array([dim, name]) | Convert this dataset into an xarray.DataArray |
Dataset.to_dataframe() | Convert this dataset into a pandas.DataFrame. |
Dataset.from_dataframe(dataframe) | Convert a pandas.DataFrame into an xarray.Dataset |
Dataset.close() | Close any files linked to this dataset |
Dataset.load() | Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset. |
Dataset.chunk([chunks, name_prefix, token, lock]) | Coerce all arrays in this dataset into dask arrays with the given chunks. |
Dataset.filter_by_attrs(**kwargs) | Returns a Dataset with variables that match specific conditions. |
DataArray methods¶
DataArray.to_dataset([dim, name]) | Convert a DataArray to a Dataset. |
DataArray.to_pandas() | Convert this array into a pandas object with the same shape. |
DataArray.to_series() | Convert this array into a pandas.Series. |
DataArray.to_dataframe([name]) | Convert this array and its coordinates into a tidy pandas.DataFrame. |
DataArray.to_index() | Convert this variable to a pandas.Index. |
DataArray.to_masked_array([copy]) | Convert this array into a numpy.ma.MaskedArray |
DataArray.to_cdms2() | Convert this array into a cdms2.Variable |
DataArray.from_series(series) | Convert a pandas.Series into an xarray.DataArray. |
DataArray.from_cdms2(variable) | Convert a cdms2.Variable into an xarray.DataArray |
DataArray.load() | Manually trigger loading of this array’s data from disk or a remote source into memory and return this array. |
DataArray.chunk([chunks]) | Coerce this array’s data into a dask arrays with the given chunks. |
Plotting¶
plot.plot(darray[, row, col, col_wrap, ax, ...]) | Default plot of DataArray using matplotlib.pyplot. |
plot.contourf(darray[, x, y, ax, row, col, ...]) | Filled contour plot of 2d DataArray |
plot.contour(darray[, x, y, ax, row, col, ...]) | Contour plot of 2d DataArray |
plot.hist(darray[, ax]) | Histogram of DataArray |
plot.imshow(darray[, x, y, ax, row, col, ...]) | Image plot of 2d DataArray using matplotlib.pyplot |
plot.line(darray, *args, **kwargs) | Line plot of 1 dimensional DataArray index against values |
plot.pcolormesh(darray[, x, y, ax, row, ...]) | Pseudocolor plot of 2d DataArray |
plot.FacetGrid(data[, col, row, col_wrap, ...]) | Initialize the matplotlib figure and FacetGrid object. |
Advanced API¶
Variable(dims, data[, attrs, encoding, fastpath]) | A netcdf-like variable consisting of dimensions, data and attributes which describe a single Array. |
Coordinate(name, data[, attrs, encoding, ...]) | Wrapper around pandas.Index that adds xarray specific functionality. |
register_dataset_accessor(name) | Register a custom property on xarray.Dataset objects. |
register_dataarray_accessor(name) | Register a custom accessor on xarray.DataArray objects. |
These backends provide a low-level interface for lazily loading data from external file-formats or protocols, and can be manually invoked to create arguments for the from_store and dump_to_store Dataset methods:
backends.NetCDF4DataStore(filename[, mode, ...]) | Store for reading and writing data via the Python-NetCDF4 library. |
backends.H5NetCDFStore(filename[, mode, ...]) | Store for reading and writing data via h5netcdf |
backends.PydapDataStore(url) | Store for accessing OpenDAP datasets with pydap. |
backends.ScipyDataStore(filename_or_obj[, ...]) | Store for reading and writing data via scipy.io.netcdf. |
xarray Internals¶
xarray builds upon two of the foundational libraries of the scientific Python stack, NumPy and pandas. It is written in pure Python (no C or Cython extensions), which makes it easy to develop and extend. Instead, we push compiled code to optional dependencies.
Variable objects¶
The core internal data structure in xarray is the Variable, which is used as the basic building block behind xarray’s Dataset and DataArray types. A Variable consists of:
- dims: A tuple of dimension names.
- data: The N-dimensional array (typically, a NumPy or Dask array) storing the Variable’s data. It must have the same number of dimensions as the length of dims.
- attrs: An ordered dictionary of metadata associated with this array. By convention, xarray’s built-in operations never use this metadata.
- encoding: Another ordered dictionary used to store information about how this variable’s data is represented on disk. See Reading encoded data for more details.
Variable has an interface similar to NumPy arrays, but extended to make use of named dimensions. For example, it uses dim in preference to an axis argument for methods like mean, and supports Broadcasting by dimension name.
However, unlike Dataset and DataArray, the basic Variable does not include coordinate labels along each axis.
Variable is public API, but because of its incomplete support for labeled data, it is mostly intended for advanced uses, such as in xarray itself or for writing new backends. You can access the variable objects that correspond to xarray objects via the (readonly) Dataset.variables and DataArray.variable attributes.
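For example (a minimal sketch, with numpy and xarray imported as np and xr):
ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(2, 3))})
ds.variables['foo']   # the Variable underlying the 'foo' data variable
ds['foo'].variable    # the same object, reached through the DataArray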
Extending xarray¶
xarray is designed as a general purpose library, and hence tries to avoid including overly domain specific methods. But inevitably, the need for more domain specific logic arises.
One standard solution to this problem is to subclass Dataset and/or DataArray to add domain specific functionality. However, inheritance is not very robust. It’s easy to inadvertently use internal APIs when subclassing, which means that your code may break when xarray upgrades. Furthermore, many builtin methods will only return native xarray objects.
The standard advice is to use composition over inheritance, but reimplementing an API as large as xarray’s on your own objects can be an onerous task, even if most methods are only forwarding to xarray implementations.
To resolve this dilemma, xarray has the experimental register_dataset_accessor() and register_dataarray_accessor() decorators for adding custom “accessors” on xarray objects. Here’s how you might use these decorators to write a custom “geo” accessor implementing a geography specific extension to xarray:
import xarray as xr

@xr.register_dataset_accessor('geo')
class GeoAccessor(object):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        self._center = None

    @property
    def center(self):
        """Return the geographic center point of this dataset."""
        if self._center is None:
            # we can use a cache on our accessor objects, because accessors
            # themselves are cached on instances that access them.
            lon = self._obj.longitude
            lat = self._obj.latitude
            self._center = (float(lon.mean()), float(lat.mean()))
        return self._center

    def plot(self):
        """Plot data on a map."""
        return 'plotting!'
This achieves the same result as if the Dataset class had a cached property defined that returns an instance of your class:
class Dataset:
    ...

    @property
    def geo(self):
        return GeoAccessor(self)
However, using the register accessor decorators is preferable to simply adding your own ad-hoc property (i.e., Dataset.geo = property(...)), for two reasons:
- It ensures that the name of your property does not conflict with any other attributes or methods.
- Instances of accessor object will be cached on the xarray object that creates them. This means you can save state on them (e.g., to cache computed properties).
Back in an interactive IPython session, we can use these properties:
In [1]: ds = xr.Dataset({'longitude': np.linspace(0, 10),
...: 'latitude': np.linspace(0, 20)})
...:
In [2]: ds.geo.center
Out[2]: (5.0, 10.0)
In [3]: ds.geo.plot()
Out[3]: 'plotting!'
The intent here is that libraries that extend xarray could add such an accessor to implement subclass specific functionality rather than using actual subclasses or patching in a large number of domain specific methods.
To help users keep things straight, please let us know if you plan to write a new accessor for an open source library. In the future, we will maintain a list of accessors and the libraries that implement them on this page.
Here are several existing libraries that build functionality upon xarray. They may be useful points of reference for your work:
- xgcm: General Circulation Model Postprocessing. Uses subclassing and custom xarray backends.
- PyGDX: Python 3 package for accessing data stored in GAMS Data eXchange (GDX) files. Also uses a custom subclass.
- windspharm: Spherical harmonic wind analysis in Python.
- eofs: EOF analysis in Python.
See also¶
- Stephan Hoyer’s SciPy2015 talk introducing xarray to a general audience.
- Stephan Hoyer’s 2015 Unidata Users Workshop talk and tutorial (with answers) introducing xarray to users familiar with netCDF.
- Nicolas Fauchereau’s tutorial on xarray for netCDF users.
Get in touch¶
- Ask usage questions on StackOverflow.
- Report bugs, suggest features or view the source code on GitHub.
- For less well defined questions or ideas, use the mailing list.
- You can also try our chatroom on Gitter.
License¶
xarray is available under the open source Apache License.
History¶
xarray is an evolution of an internal tool developed at The Climate Corporation. It was originally written by Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo and was released as open source in May 2014. The project was renamed from “xray” in January 2016.