What’s New

v0.5 (1 June 2005)

Highlights

The headline feature in this release is experimental support for out-of-core computing (data that doesn’t fit into memory) with dask. This includes a new top-level function open_mfdataset() that makes it easy to open a collection of netCDF files (using dask) as a single xray.Dataset object. For more on dask, read the new documentation section Out of core computation with dask.

Dask makes it possible to harness parallelism and manipulate gigantic datasets with xray. It is currently an optional dependency, but it may become required in the future.
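
The core idea behind out-of-core computation is simple: process the data one chunk at a time and combine partial results, so that no more than one chunk needs to fit in memory at once. A minimal pure-numpy sketch of that idea (the function and chunk size here are illustrative, not part of the xray or dask API):

```python
import numpy as np

def chunked_mean(arrays, chunk_size=1000):
    """Mean of several 1-D arrays without concatenating them in memory:
    accumulate a running (sum, count) pair chunk by chunk."""
    total = 0.0
    count = 0
    for arr in arrays:
        for start in range(0, len(arr), chunk_size):
            chunk = arr[start:start + chunk_size]
            total += chunk.sum()
            count += chunk.size
    return total / count

# Each array stands in for the contents of one file in a collection:
files = [np.arange(5.0), np.arange(5.0, 10.0)]
print(chunked_mean(files, chunk_size=2))  # mean of 0..9 -> 4.5
```

Dask generalizes this pattern to arbitrary n-dimensional operations and can execute the chunks in parallel.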

Backwards incompatible changes

  • The logic used for choosing which variables are concatenated with concat() has changed. Previously, by default any variables which were equal across a dimension were not concatenated. This led to surprising results, where the outcome of groupby and concat operations could depend on runtime values (GH268). For example:

    In [1]: ds = xray.Dataset({'x': 0})
    
    In [2]: xray.concat([ds, ds], dim='y')
    Out[2]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        x        int64 0
    

    Now, the default always concatenates data variables:

    In [3]: xray.concat([ds, ds], dim='y')
    Out[3]: 
    <xray.Dataset>
    Dimensions:  (y: 2)
    Coordinates:
      * y        (y) int64 0 1
    Data variables:
        x        (y) int64 0 0
    

    To obtain the old behavior, supply the argument concat_over=[].

Enhancements

  • New to_array() and enhanced to_dataset() methods make it easy to switch back and forth between arrays and datasets:

    In [4]: ds = xray.Dataset({'a': 1, 'b': ('x', [1, 2, 3])},
       ...:                   coords={'c': 42}, attrs={'Conventions': 'None'})
       ...: 
    
    In [5]: ds.to_array()
    Out[5]: 
    <xray.DataArray (variable: 2, x: 3)>
    array([[1, 1, 1],
           [1, 2, 3]])
    Coordinates:
      * variable  (variable) |S1 'a' 'b'
        c         int64 42
      * x         (x) int64 0 1 2
    Attributes:
        Conventions: None
    
    In [6]: ds.to_array().to_dataset(dim='variable')
    Out[6]: 
    <xray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
        c        int64 42
      * x        (x) int64 0 1 2
    Data variables:
        a        (x) int64 1 1 1
        b        (x) int64 1 2 3
    Attributes:
        Conventions: None
    
  • New fillna() method to fill missing values, modeled off the pandas method of the same name:

    In [7]: array = xray.DataArray([np.nan, 1, np.nan, 3], dims='x')
    
    In [8]: array.fillna(0)
    Out[8]: 
    <xray.DataArray (x: 4)>
    array([ 0.,  1.,  0.,  3.])
    Coordinates:
      * x        (x) int64 0 1 2 3
    

    fillna works on both Dataset and DataArray objects, and uses index-based alignment and broadcasting like standard binary operations. It can also be applied by group, as illustrated in Fill missing values with climatology.

  • New assign() and assign_coords() methods patterned off the new DataFrame.assign method in pandas:

    In [9]: ds = xray.Dataset({'y': ('x', [1, 2, 3])})
    
    In [10]: ds.assign(z = lambda ds: ds.y ** 2)
    Out[10]: 
    <xray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
    Data variables:
        y        (x) int64 1 2 3
        z        (x) int64 1 4 9
    
    In [11]: ds.assign_coords(z = ('x', ['a', 'b', 'c']))
    Out[11]: 
    <xray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
        z        (x) |S1 'a' 'b' 'c'
    Data variables:
        y        (x) int64 1 2 3
    

    These methods return a new Dataset (or DataArray) with updated data or coordinate variables.

  • sel() now supports the method parameter, which works like the parameter of the same name on reindex(). It provides a simple interface for doing nearest-neighbor interpolation:

    In [12]: ds.sel(x=1.1, method='nearest')
    Out[12]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        x        int64 1
    Data variables:
        y        int64 2
    
    In [13]: ds.sel(x=[1.1, 2.1], method='pad')
    Out[13]: 
    <xray.Dataset>
    Dimensions:  (x: 2)
    Coordinates:
      * x        (x) int64 1 2
    Data variables:
        y        (x) int64 2 3
    

    See Nearest neighbor lookups for more details.

  • You can now control the underlying backend used for accessing remote datasets (via OPeNDAP) by specifying engine='netcdf4' or engine='pydap'.

  • xray now provides experimental support for reading and writing netCDF4 files directly via h5py with the h5netcdf package, avoiding the netCDF4-Python package. You will need to install h5netcdf and specify engine='h5netcdf' to try this feature.

  • Accessing data from remote datasets now has retrying logic (with exponential backoff) that should make it robust to occasional bad responses from DAP servers.
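
The retry logic is a standard pattern; a generic sketch of retrying with exponential backoff (the delays, attempt count, and exception type here are illustrative, not xray's actual internals):

```python
import time

def with_retries(func, max_tries=5, base_delay=0.1, exceptions=(OSError,)):
    """Call func(), retrying failed calls with exponentially growing delays."""
    for attempt in range(max_tries):
        try:
            return func()
        except exceptions:
            if attempt == max_tries - 1:
                raise  # out of retries: re-raise the last error
            time.sleep(base_delay * 2 ** attempt)

# A flaky "server" that fails twice before succeeding:
calls = {'n': 0}
def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise OSError('bad response')
    return 'data'

print(with_retries(flaky, base_delay=0.001))  # succeeds on the third try
```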

  • You can control the width of the Dataset repr with xray.set_options. It can be used either as a context manager, in which case the default is restored outside the context:

    In [14]: ds = xray.Dataset({'x': np.arange(1000)})
    
    In [15]: with xray.set_options(display_width=40):
       ....:     print(ds)
       ....: 
    <xray.Dataset>
    Dimensions:  (x: 1000)
    Coordinates:
      * x        (x) int64 0 1 2 3 4 5 6 ...
    Data variables:
        *empty*
    

    Or to set a global option:

    In [16]: xray.set_options(display_width=80)
    

    The default value for the display_width option is 80.

Deprecations

  • The method load_data() has been renamed to the more succinct load().

v0.4.1 (18 March 2015)

This release contains bug fixes and several new features. All changes should be fully backwards compatible.

Enhancements

  • New documentation sections on Time series data and Combining multiple files.

  • resample() lets you resample a dataset or data array to a new temporal resolution. The syntax is the same as pandas, except you need to supply the time dimension explicitly:

    In [17]: time = pd.date_range('2000-01-01', freq='6H', periods=10)
    
    In [18]: array = xray.DataArray(np.arange(10), [('time', time)])
    
    In [19]: array.resample('1D', dim='time')
    Out[19]: 
    <xray.DataArray (time: 3)>
    array([ 1.5,  5.5,  8.5])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    

    You can specify how to do the resampling with the how argument; other options such as closed and label let you control labeling:

    In [20]: array.resample('1D', dim='time', how='sum', label='right')
    Out[20]: 
    <xray.DataArray (time: 3)>
    array([ 6, 22, 17])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
    

    If the desired temporal resolution is higher than the original data (upsampling), xray will insert missing values:

    In [21]: array.resample('3H', 'time')
    Out[21]: 
    <xray.DataArray (time: 19)>
    array([  0.,  nan,   1., ...,   8.,  nan,   9.])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-01T03:00:00 ...
    
  • first and last methods on groupby objects let you take the first or last examples from each group along the grouped axis:

    In [22]: array.groupby('time.day').first()
    Out[22]: 
    <xray.DataArray (day: 3)>
    array([0, 4, 8])
    Coordinates:
      * day      (day) int64 1 2 3
    

    These methods combine well with resample:

    In [23]: array.resample('1D', dim='time', how='first')
    Out[23]: 
    <xray.DataArray (time: 3)>
    array([0, 4, 8])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    
  • swap_dims() allows for easily swapping one dimension out for another:

    In [24]: ds = xray.Dataset({'x': range(3), 'y': ('x', list('abc'))})
    
    In [25]: ds
    Out[25]: 
    <xray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
    Data variables:
        y        (x) |S1 'a' 'b' 'c'
    
    In [26]: ds.swap_dims({'x': 'y'})
    Out[26]: 
    <xray.Dataset>
    Dimensions:  (y: 3)
    Coordinates:
      * y        (y) |S1 'a' 'b' 'c'
        x        (y) int64 0 1 2
    Data variables:
        *empty*
    

    This was possible in earlier versions of xray, but required some contortions.

  • open_dataset() and to_netcdf() now accept an engine argument to explicitly select which underlying library (netcdf4 or scipy) is used for reading/writing a netCDF file.

Bug fixes

  • Fixed a bug where netCDF variables read from disk with engine='scipy' could still be associated with the file on disk, even after closing the file (GH341). This manifested itself in warnings about mmapped arrays and segmentation faults (if the data was accessed).
  • Silenced spurious warnings about all-NaN slices when using nan-aware aggregation methods (GH344).
  • Dataset aggregations with keep_attrs=True now preserve attributes on data variables, not just the dataset itself.
  • Tests for xray now pass when run on Windows (GH360).
  • Fixed a regression in v0.4 where saving to netCDF could fail with the error ValueError: could not automatically determine time units.

v0.4 (2 March, 2015)

This is one of the biggest releases yet for xray: it includes some major changes that may break existing code, along with the usual collection of minor enhancements and bug fixes. On the plus side, this release includes all hitherto planned breaking changes, so the upgrade path for xray should be smoother going forward.

Breaking changes

  • We now automatically align index labels in arithmetic, dataset construction, merging and updating. This means the need for manually invoking methods like align() and reindex_like() should be vastly reduced.

    For arithmetic, we align based on the intersection of labels:

    In [27]: lhs = xray.DataArray([1, 2, 3], [('x', [0, 1, 2])])
    
    In [28]: rhs = xray.DataArray([2, 3, 4], [('x', [1, 2, 3])])
    
    In [29]: lhs + rhs
    Out[29]: 
    <xray.DataArray (x: 2)>
    array([4, 6])
    Coordinates:
      * x        (x) int64 1 2
    

    For dataset construction and merging, we align based on the union of labels:

    In [30]: xray.Dataset({'foo': lhs, 'bar': rhs})
    Out[30]: 
    <xray.Dataset>
    Dimensions:  (x: 4)
    Coordinates:
      * x        (x) int64 0 1 2 3
    Data variables:
        foo      (x) float64 1.0 2.0 3.0 nan
        bar      (x) float64 nan 2.0 3.0 4.0
    

    For update and __setitem__, we align based on the original object:

    In [31]: lhs.coords['rhs'] = rhs
    
    In [32]: lhs
    Out[32]: 
    <xray.DataArray (x: 3)>
    array([1, 2, 3])
    Coordinates:
      * x        (x) int64 0 1 2
        rhs      (x) float64 nan 2.0 3.0
    
  • Aggregations like mean or median now skip missing values by default:

    In [33]: xray.DataArray([1, 2, np.nan, 3]).mean()
    Out[33]: 
    <xray.DataArray ()>
    array(2.0)
    

    You can turn this behavior off by supplying the keyword argument skipna=False.

    These operations are lightning fast thanks to integration with bottleneck, which is a new optional dependency for xray (numpy is used if bottleneck is not installed).
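
In practice the skipna logic amounts to choosing between the plain and NaN-aware reduction functions; a sketch of the equivalent numpy-only behavior:

```python
import numpy as np

def mean(values, skipna=True):
    """Reduce like xray: skip missing values by default."""
    return np.nanmean(values) if skipna else np.mean(values)

data = np.array([1.0, 2.0, np.nan, 3.0])
print(mean(data))                # 2.0, the NaN is skipped
print(mean(data, skipna=False))  # nan
```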

  • Scalar coordinates no longer conflict with constant arrays with the same value (e.g., in arithmetic, merging datasets and concat), even if they have different shapes (GH243). For example, the coordinate c here persists through arithmetic, even though it has different shapes on each DataArray:

    In [34]: a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')
    
    In [35]: b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')
    
    In [36]: (a + b).coords
    Out[36]: 
    Coordinates:
      * x        (x) int64 0 1
        c        (x) int64 0 0
    

    This functionality can be controlled through the compat option, which has also been added to the Dataset constructor.

  • Datetime shortcuts such as 'time.month' now return a DataArray with the name 'month', not 'time.month' (GH345). This makes it easier to index the resulting arrays when they are used with groupby:

    In [37]: time = xray.DataArray(pd.date_range('2000-01-01', periods=365),
       ....:                       dims='time', name='time')
       ....: 
    
    In [38]: counts = time.groupby('time.month').count()
    
    In [39]: counts.sel(month=2)
    Out[39]: 
    <xray.DataArray 'time' ()>
    array(29)
    Coordinates:
        month    int64 2
    

    Previously, you would need to use something like counts.sel(**{'time.month': 2}), which is much more awkward.

  • The season datetime shortcut now returns an array of string labels such as ‘DJF’:

    In [40]: ds = xray.Dataset({'t': pd.date_range('2000-01-01', periods=12, freq='M')})
    
    In [41]: ds['t.season']
    Out[41]: 
    <xray.DataArray 'season' (t: 12)>
    array(['DJF', 'DJF', 'MAM', ..., 'SON', 'SON', 'DJF'], 
          dtype='|S3')
    Coordinates:
      * t        (t) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 2000-04-30 ...
    

    Previously, it returned numbered seasons 1 through 4.

  • We have updated our use of the terms “coordinates” and “variables”. What were known in previous versions of xray as “coordinates” and “variables” are now referred to throughout the documentation as “coordinate variables” and “data variables”. This brings xray into closer alignment with CF Conventions. The only visible change besides the documentation is that Dataset.vars has been renamed Dataset.data_vars.

  • You will need to update your code if you have been ignoring deprecation warnings: methods and attributes that were deprecated in xray v0.3 or earlier (e.g., dimensions, attributes) have gone away.

Enhancements

  • Support for reindex() with a fill method. This provides a useful shortcut for upsampling:

    In [42]: data = xray.DataArray([1, 2, 3], dims='x')
    
    In [43]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
    Out[43]: 
    <xray.DataArray (x: 5)>
    array([1, 2, 2, 3, 3])
    Coordinates:
      * x        (x) float64 0.5 1.0 1.5 2.0 2.5
    

    This will be especially useful once pandas 0.16 is released, at which point xray will immediately support reindexing with method='nearest'.

  • Use functions that return generic ndarrays with DataArray.groupby.apply and Dataset.apply (GH327 and GH329). Thanks Jeff Gerard!

  • Consolidated the functionality of dumps (writing a dataset to a netCDF3 bytestring) into to_netcdf() (GH333).

  • to_netcdf() now supports writing to groups in netCDF4 files (GH333). It also finally has a full docstring – you should read it!

  • open_dataset() and to_netcdf() now work on netCDF3 files when netcdf4-python is not installed as long as scipy is available (GH333).

  • The new Dataset.drop and DataArray.drop methods make it easy to drop explicitly listed variables or index labels:

    # drop variables
    In [44]: ds = xray.Dataset({'x': 0, 'y': 1})
    
    In [45]: ds.drop('x')
    Out[45]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        y        int64 1
    
    # drop index labels
    In [46]: arr = xray.DataArray([1, 2, 3], coords=[('x', list('abc'))])
    
    In [47]: arr.drop(['a', 'c'], dim='x')
    Out[47]: 
    <xray.DataArray (x: 1)>
    array([2])
    Coordinates:
      * x        (x) |S1 'b'
    
  • broadcast_equals() has been added to correspond to the new compat option.

  • Long attributes are now truncated at 500 characters when printing a dataset (GH338). This should make things more convenient for working with datasets interactively.

  • Added a new documentation example, Calculating Seasonal Averages from Timeseries of Monthly Means. Thanks Joe Hamman!

Bug fixes

  • Several bug fixes related to decoding time units from netCDF files (GH316, GH330). Thanks Stefan Pfenninger!
  • xray no longer requires decode_coords=False when reading datasets with unparseable coordinate attributes (GH308).
  • Fixed DataArray.loc indexing with ... (GH318).
  • Fixed an edge case that resulted in an error when reindexing multi-dimensional variables (GH315).
  • Fixed slicing with negative step sizes (GH312).
  • Fixed invalid conversion of string arrays to numeric dtype (GH305).
  • Fixed repr() on dataset objects with non-standard dates (GH347).

Deprecations

  • dump and dumps have been deprecated in favor of to_netcdf().
  • drop_vars has been deprecated in favor of drop().

Future plans

The biggest feature I’m excited about working toward in the immediate future is supporting out-of-core operations in xray using Dask, a part of the Blaze project. For a preview of using Dask with weather data, read this blog post by Matthew Rocklin. See GH328 for more details.

v0.3.2 (23 December, 2014)

This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.

There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.

Warning

xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.

Backwards incompatible changes

  • Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:

    In [48]: from datetime import datetime
    
    In [49]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
    Out[49]: 
    <xray.Dataset>
    Dimensions:  (t: 1)
    Coordinates:
      * t        (t) datetime64[ns] 2000-01-01
    Data variables:
        *empty*
    
  • xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are accordingly cast to timedelta64[ns] objects when appropriate.
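
The timedelta coercion above mirrors how pandas itself handles datetime.timedelta objects; for example:

```python
from datetime import timedelta

import pandas as pd

# pandas coerces Python timedeltas to a TimedeltaIndex with ns precision:
idx = pd.to_timedelta([timedelta(days=1), timedelta(hours=6)])
print(idx.dtype)  # timedelta64[ns]
```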

  • Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).

Enhancements

  • Due to popular demand, we have added experimental attribute-style access as a shortcut for dataset variables, coordinates and attributes:

    In [50]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celcius'})})
    
    In [51]: ds.tmin.units
    Out[51]: 'celcius'
    

    Tab-completion for these variables should work in interactive environments such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).

  • You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:

    In [52]: array = xray.DataArray(np.zeros(5), dims=['x'])
    
    In [53]: array[dict(x=slice(3))] = 1
    
    In [54]: array
    Out[54]: 
    <xray.DataArray (x: 5)>
    array([ 1.,  1.,  1.,  0.,  0.])
    Coordinates:
      * x        (x) int64 0 1 2 3 4
    
  • Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.

  • Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.

  • We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30% to 50%. The full speedup requires cyordereddict to be installed.

Bug fixes

  • Fixed to_dataframe() with 0d string/object coordinates (GH287).
  • Fixed to_netcdf with a 0d string variable (GH284).
  • Fixed writing datetime64 arrays to netCDF when NaT is present (GH270).
  • Fixed a bug where align silently upcast data arrays when NaNs were inserted (GH264).

Future plans

  • I am contemplating switching to the terms “coordinate variables” and “data variables” instead of the (currently used) “coordinates” and “variables”, following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
  • I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
  • The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I’m pretty sick of writing ds.reduce(np.nanmean, 'time').
  • The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.

If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.

v0.3.1 (22 October, 2014)

This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).

We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.

Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.

New features

  • Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58).
  • Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
  • Support for reading gzipped netCDF3 files (GH239).
  • Reduced memory usage when writing netCDF files (GH251).
  • ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
  • Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
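
Because count() and dropna() were copied from pandas, their semantics match the pandas originals, for example:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])
print(s.count())            # 2: the number of non-missing values
print(s.dropna().tolist())  # [1.0, 3.0]
```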

Bug fixes

  • Compatibility fixes for pandas v0.15 (GH262).
  • Fixes for display and indexing of NaT (not-a-time) (GH238, GH240).
  • Fixed slicing by label when an argument is a data array (GH250).
  • Test data is now shipped with the source distribution (GH253).
  • Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
  • Order of dimensions preserved with DataArray.to_dataframe (GH260).

v0.3 (21 September 2014)

New features

  • Revamped coordinates: “coordinates” now refer to all arrays that are not used to index a dimension. Coordinates are intended to allow for keeping track of arrays of metadata that describe the grid on which the points in “variable” arrays lie. They are preserved (when unambiguous) even through mathematical operations.
  • Dataset math: Dataset objects now support all arithmetic operations directly. Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
  • GroupBy math: binary operations now work between grouped and ungrouped objects, which provides a convenient shortcut for normalizing by the average value of a group.
  • The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
  • You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
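
The dataset arithmetic rules above are easy to state in plain Python: a dataset-array operation applies to every variable, while a dataset-dataset operation pairs variables by name. A toy sketch using dicts of numpy arrays (purely illustrative, not xray's implementation):

```python
import numpy as np

def ds_add(ds, other):
    """Toy model of Dataset addition over plain dicts of arrays."""
    if isinstance(other, dict):
        # dataset-dataset: act on each pair of variables with the same name
        return {k: ds[k] + other[k] for k in ds.keys() & other.keys()}
    # dataset-array (or scalar): map the operation across all variables
    return {k: v + other for k, v in ds.items()}

ds = {'foo': np.array([1, 2]), 'bar': np.array([10, 20])}
print(ds_add(ds, 1))                          # adds 1 to every variable
print(ds_add(ds, {'foo': np.array([5, 5])}))  # only 'foo' matches by name
```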

Backwards incompatible changes

  • Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
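
The distinction is the same one numpy draws between element-wise comparison and a single aggregate answer:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 0, 3])

print(a == b)                # element-wise: [ True False  True]
print(np.array_equal(a, b))  # one boolean for the whole array, like equals()
```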

Deprecations

  • Dataset.noncoords is deprecated: use Dataset.vars instead.
  • Dataset.select_vars is deprecated: index a Dataset with a list of variable names instead.
  • DataArray.select_vars and DataArray.drop_vars are deprecated: use reset_coords() instead.

v0.2 (14 August 2014)

This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:

  • There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
  • You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
  • xray now works on Python 2.6, thanks to Anna Kuznetsova.
  • A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
  • New load_data() and close() methods for datasets facilitate lower-level control of data loaded from disk.

v0.1.1 (20 May 2014)

xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:

  • Python 3 support (GH53)
  • Required numpy version relaxed to 1.7 (GH129)
  • Return numpy.datetime64 arrays for non-standard calendars (GH126)
  • Support for opening datasets associated with NetCDF4 groups (GH127)
  • Bug-fixes for concatenating datetime arrays (GH134)

Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.

v0.1 (2 May 2014)

Initial release.