xray: N-D labeled arrays and datasets in Python

xray is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures, Series and DataFrame: the xray DataArray and Dataset.

Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data at which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences (e.g., netCDF and OPeNDAP): xray.Dataset is an in-memory representation of a netCDF file.

Documentation

Why xray?

Adding dimension names and coordinate indexes to numpy’s ndarray makes many powerful array operations possible (a short sketch follows this list):

  • Apply operations over dimensions by name: x.sum('time').
  • Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (known in numpy as “broadcasting”) based on dimension names, not array shape.
  • Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
  • Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xray.align(x, y, join='outer').
  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
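
The following minimal sketch exercises a few of these operations, using the same 2014-era xray API demonstrated throughout this tutorial (the array values here are arbitrary):

import numpy as np
import pandas as pd
import xray

# a small labeled array: 3 times x 2 locations
x = xray.DataArray(np.random.randn(3, 2),
                   coords=[pd.date_range('2014-01-01', periods=3), ['a', 'b']],
                   dims=['time', 'space'])

x.sum('time')                       # aggregate over a dimension by name
x.sel(time='2014-01-01')            # select values by label
x.groupby('time.dayofyear').mean()  # split-apply-combine by a date component
x.attrs['units'] = 'meters'         # attach arbitrary metadata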

pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xray comes in.

xray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).

Because xray implements the same data model as the netCDF file format, xray datasets have a natural and portable serialization format. But it is also easy to robustly convert an xray DataArray to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.

Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.

Warning

xray is a fairly new project and is still under heavy development. Although we will make a best effort to maintain compatibility with the current API, it is likely that the API will change in future versions as xray matures. Some changes are already anticipated, as called out in the Tutorial.

Installing xray

Required dependencies:

  • Python 2.6, 2.7 or 3.3
  • numpy (1.7 or later)
  • pandas (0.13.1 or later)

Optional dependencies (see Serialization and IO below):

  • netCDF4-python (for reading and writing netCDF files)
  • pydap (for accessing remote datasets over OPeNDAP)

The easiest way to get all these dependencies installed is to use the Anaconda python distribution.

To install xray, use pip:

pip install xray

Warning

If you don’t already have recent versions of numpy and pandas installed, installing xray will automatically update them.

Tutorial

To get started, we will import numpy, pandas and xray:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xray

DataArray

xray.DataArray is xray’s implementation of a labeled, multi-dimensional array. It has three key properties:

  • values: a numpy.ndarray holding the array’s values
  • dims: dimension names for each axis, e.g., ('x', 'y', 'z')
  • coords: tick labels along each dimension, e.g., 1-dimensional arrays of numbers, datetime objects or strings.

xray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, like the index found on a pandas DataFrame or Series.

DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xray makes no attempt to interpret them, and propagates them only in unambiguous cases.

Creating a DataArray

The DataArray constructor takes a multi-dimensional array of values (e.g., a numpy ndarray), a list of coordinate labels and a list of dimension names:

In [4]: data = np.random.rand(4, 3)

In [5]: locs = ['IA', 'IL', 'IN']

In [6]: times = pd.date_range('2000-01-01', periods=4)

In [7]: foo = xray.DataArray(data, coords=[times, locs], dims=['time', 'space'])

In [8]: foo
Out[8]: 
<xray.DataArray (time: 4, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

All of these arguments (except for data) are optional, and will be filled in with default values:

In [9]: xray.DataArray(data)
Out[9]: 
<xray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    dim_0: Int64Index([0, 1, 2, 3], dtype='int64')
    dim_1: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

You can also create a DataArray by supplying a pandas Series, DataFrame or Panel, in which case any non-specified arguments in the DataArray constructor will be filled in from the pandas object:

In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])

In [11]: df.index.name = 'abc'

In [12]: df.columns.name = 'xyz'

In [13]: df
Out[13]: 
xyz  x  y
abc      
a    0  2
b    1  3

In [14]: xray.DataArray(df)
Out[14]: 
<xray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
       [1, 3]])
Coordinates:
    abc: Index([u'a', u'b'], dtype='object')
    xyz: Index([u'x', u'y'], dtype='object')
Attributes:
    Empty

xray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see GH164). However, the alternate from_series constructor will automatically unpack any hierarchical indexes it encounters by expanding the series into a multi-dimensional array, as described in Working with pandas.

DataArray properties

Let’s take a look at the important properties on our array:

In [15]: foo.values
Out[15]: 
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])

In [16]: foo.dims
Out[16]: ('time', 'space')

In [17]: foo.coords
Out[17]: 
time: <class 'pandas.tseries.index.DatetimeIndex'>
      [2000-01-01, ..., 2000-01-04]
      Length: 4, Freq: D, Timezone: None
space: Index([u'IA', u'IL', u'IN'], dtype='object')

In [18]: foo.attrs
Out[18]: OrderedDict()

In [19]: print(foo.name)
None

Now fill in some of that missing metadata:

In [20]: foo.name = 'foo'

In [21]: foo.attrs['units'] = 'meters'

In [22]: foo
Out[22]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

The coords property is dict-like. Individual coordinates can be accessed by name:

In [23]: foo.coords['time']
Out[23]: 
<xray.Coordinate 'time' (time: 4)>
array(['1999-12-31T18:00:00.000000000-0600',
       '2000-01-01T18:00:00.000000000-0600',
       '2000-01-02T18:00:00.000000000-0600',
       '2000-01-03T18:00:00.000000000-0600'], dtype='datetime64[ns]')
Attributes:
    Empty

These are xray.Coordinate objects, which contain tick-labels for each dimension.

You can also access coordinates by indexing a DataArray directly by name, in which case it returns another DataArray:

In [24]: foo['time']
Out[24]: 
<xray.DataArray 'time' (time: 4)>
array(['1999-12-31T18:00:00.000000000-0600',
       '2000-01-01T18:00:00.000000000-0600',
       '2000-01-02T18:00:00.000000000-0600',
       '2000-01-03T18:00:00.000000000-0600'], dtype='datetime64[ns]')
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Linked dataset variables:
    space, foo
Attributes:
    Empty

Dataset

xray.Dataset is xray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.

Creating a Dataset

To make an xray.Dataset from scratch, pass in a dictionary with values in the form (dimensions, data[, attrs]):

In [25]: times
Out[25]: 
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01, ..., 2000-01-04]
Length: 4, Freq: D, Timezone: None

In [26]: locs
Out[26]: ['IA', 'IL', 'IN']

In [27]: data
Out[27]: 
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])

In [28]: ds = xray.Dataset({'time': ('time', times),
   ....:                    'space': ('space', locs),
   ....:                    'foo': (['time', 'space'], data)})
   ....: 

In [29]: ds
Out[29]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
Attributes:
    Empty

  • dimensions should be a sequence of strings.
  • data should be a numpy.ndarray (or array-like object) that has a dimensionality equal to the length of the dimensions list.

We can also use xray.Variable or xray.DataArray objects instead of tuples:

In [30]: xray.Dataset({'bar': foo})
Out[30]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    bar              1        0   
Attributes:
    Empty

You can also create a dataset from a pandas.DataFrame with Dataset.from_dataframe or from a netCDF file on disk with open_dataset(). See Working with pandas and Serialization and IO.

Dataset contents

Dataset implements the Python dictionary interface, with values given by xray.DataArray objects:

In [31]: 'foo' in ds
Out[31]: True

In [32]: ds.keys()
Out[32]: ['space', 'foo', 'time']

In [33]: ds['foo']
Out[33]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174],
       [ 0.45137647,  0.84025508,  0.12310214],
       [ 0.5430262 ,  0.37301223,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

The valid keys include all of the listed “coordinate” and “noncoordinate” variables. Coordinates are arrays that label values along a particular dimension, implemented as a thin wrapper around a pandas.Index object. They are created automatically from dataset arrays whose name matches their one and only dimension.

Noncoordinate variables include all arrays in a Dataset other than its coordinates. These arrays can exist along multiple dimensions. The numbers in the columns in the Dataset representation indicate the order in which dimensions appear for each array (on a Dataset, the dimensions are always listed in alphabetical order).

Because we supplied an array named “space” along the “space” dimension, it was automatically promoted to a coordinate holding our labels:

In [34]: ds['space']
Out[34]: 
<xray.DataArray 'space' (space: 3)>
array(['IA', 'IL', 'IN'], 
      dtype='|S2')
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    foo, time
Attributes:
    Empty

Noncoordinate and coordinate variables are listed explicitly by the noncoords and coords attributes.

There are also a few derived variables based on datetime coordinates that you can access from a dataset (e.g., “year”, “month” and “day”), even if you didn’t explicitly add them. These are known as “virtual_variables”:

In [35]: ds['time.dayofyear']
Out[35]: 
<xray.DataArray 'time.dayofyear' (time: 4)>
array([1, 2, 3, 4])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Linked dataset variables:
    space, foo
Attributes:
    Empty

Finally, datasets also store arbitrary metadata in the form of attributes:

In [36]: ds.attrs
Out[36]: OrderedDict()

In [37]: ds.attrs['title'] = 'example attribute'

In [38]: ds
Out[38]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
Attributes:
    title: example attribute

xray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you put in objects that are not strings, numbers or numpy.ndarray objects.

Modifying datasets

We can update a dataset in-place using Python’s standard dictionary syntax:

In [39]: ds['numbers'] = ('time', [10, 10, 20, 20])

In [40]: ds['abc'] = ('space', ['A', 'B', 'C'])

In [41]: ds
Out[41]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

It should be evident now how a Dataset lets you store many arrays along a (partially) shared set of common dimensions and coordinates.

To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update.
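
For example, a minimal sketch of the dictionary interface, reusing the ds dataset from above:

ds3 = ds.copy()
ds3.update({'bar': ('space', [1, 2, 3])})  # add or overwrite a variable
del ds3['bar']                             # remove a variable again
ds3.get('not-present')                     # returns None, like a dict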

You also can select and drop an explicit list of variables by using the select_vars() and drop_vars() methods to return a new Dataset. select_vars automatically includes the relevant coordinates:

In [42]: ds.select_vars('abc')
Out[42]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    abc              0   
Attributes:
    title: example attribute

If a dimension name is given as an argument to drop_vars, it also drops all variables that use that dimension:

In [43]: ds.drop_vars('time', 'space')
Out[43]: 
<xray.Dataset>
Dimensions:     ()
Coordinates:
    None
Noncoordinates:
    None
Attributes:
    title: example attribute

You can copy a Dataset by using the copy() method:

In [44]: ds2 = ds.copy()

In [45]: del ds2['time']

In [46]: ds2
Out[46]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    abc              0   
Attributes:
    title: example attribute

By default, the copy is shallow, so only the container will be copied: the contents of the Dataset will still be the same underlying xray.Variable objects. You can copy all data by supplying the argument deep=True.
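
A minimal sketch of the difference:

ds_shallow = ds.copy()        # new container; shares the same Variable objects
ds_deep = ds.copy(deep=True)  # also copies every array's values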

Indexing

Indexing a DataArray works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:

In [47]: foo[:2]
Out[47]: 
<xray.DataArray 'foo' (time: 2, space: 3)>
array([[ 0.12696983,  0.96671784,  0.26047601],
       [ 0.89723652,  0.37674972,  0.33622174]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

In [48]: foo[0, 0]
Out[48]: 
<xray.DataArray 'foo' ()>
array(0.12696983303810094)
Linked dataset variables:
    time, space
Attributes:
    units: meters

In [49]: foo[:, [2, 1]]
Out[49]: 
<xray.DataArray 'foo' (time: 4, space: 2)>
array([[ 0.26047601,  0.96671784],
       [ 0.33622174,  0.37674972],
       [ 0.12310214,  0.84025508],
       [ 0.44799682,  0.37301223]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IN', u'IL'], dtype='object')
Attributes:
    units: meters

xray also supports label-based indexing, just like pandas. Because Coordinate is a thin wrapper around a pandas.Index, label-based indexing is very fast. To do label-based indexing, use the loc attribute:

In [50]: foo.loc['2000-01-01':'2000-01-02', 'IA']
Out[50]: 
<xray.DataArray 'foo' (time: 2)>
array([ 0.12696983,  0.89723652])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
Linked dataset variables:
    space
Attributes:
    units: meters

You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label-based indexing in xray is inclusive of both the start and stop bounds.

Setting values with label based indexing is also supported:

In [51]: foo.loc['2000-01-01', ['IL', 'IN']] = -10

In [52]: foo
Out[52]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data with the sel() and isel() methods:

# index by integer array indices
In [53]: foo.isel(space=0, time=slice(None, 2))
Out[53]: 
<xray.DataArray 'foo' (time: 2)>
array([ 0.12696983,  0.89723652])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
Linked dataset variables:
    space
Attributes:
    units: meters

# index by coordinate labels
In [54]: foo.sel(time=slice('2000-01-01', '2000-01-02'))
Out[54]: 
<xray.DataArray 'foo' (time: 2, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

The arguments to these methods can be any objects that could index the array along that dimension, e.g., labels for an individual value, Python slice objects or 1-dimensional arrays.

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

In [55]: ds.isel(space=[0], time=[0])
Out[55]: 
<xray.Dataset>
Dimensions:     (space: 1, time: 1)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

In [56]: ds.sel(time='2000-01-01')
Out[56]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    foo              0   
    time                 
    numbers              
    abc              0   
Attributes:
    title: example attribute

Indexing with xray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s array broadcasting. This means you can do indexing like this, which wouldn’t work with numpy arrays:

In [57]: foo[foo['time.day'] > 1, foo['space'] != 'IL']
Out[57]: 
<xray.DataArray 'foo' (time: 3, space: 2)>
array([[ 0.89723652,  0.33622174],
       [ 0.45137647,  0.12310214],
       [ 0.5430262 ,  0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-02, ..., 2000-01-04]
          Length: 3, Freq: D, Timezone: None
    space: Index([u'IA', u'IN'], dtype='object')
Attributes:
    units: meters

This is a much simpler model than numpy’s advanced indexing, and is basically the only model that works for labeled arrays. If you would like to do advanced indexing, you can always index .values directly instead:

In [58]: foo.values[foo.values > 0.5]
Out[58]: array([ 0.89723652,  0.84025508,  0.5430262 ])

Computation

The metadata of DataArray objects enables particularly nice features for doing mathematical operations.

Basic math

Basic math with DataArray objects works just as you would expect:

In [59]: foo - 3
Out[59]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ -2.87303017, -13.        , -13.        ],
       [ -2.10276348,  -2.62325028,  -2.66377826],
       [ -2.54862353,  -2.15974492,  -2.87689786],
       [ -2.4569738 ,  -2.62698777,  -2.55200318]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

You can also use any of numpy’s or scipy’s many ufunc functions directly on a DataArray:

In [60]: np.sin(foo)
Out[60]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12662895,  0.54402111,  0.54402111],
       [ 0.78160612,  0.36790009,  0.32992275],
       [ 0.43620456,  0.74481335,  0.12279146],
       [ 0.51672923,  0.36442217,  0.43316091]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

DataArray also has metadata-aware versions of many numpy.ndarray methods:

In [61]: foo.T
Out[61]: 
<xray.DataArray 'foo' (space: 3, time: 4)>
array([[  0.12696983,   0.89723652,   0.45137647,   0.5430262 ],
       [-10.        ,   0.37674972,   0.84025508,   0.37301223],
       [-10.        ,   0.33622174,   0.12310214,   0.44799682]])
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Attributes:
    units: meters

In [62]: foo.round(2)
Out[62]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.13, -10.  , -10.  ],
       [  0.9 ,   0.38,   0.34],
       [  0.45,   0.84,   0.12],
       [  0.54,   0.37,   0.45]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

It also has the isnull and notnull methods from pandas:

In [63]: xray.DataArray([0, 1, np.nan, np.nan, 2]).isnull()
Out[63]: 
<xray.DataArray (dim_0: 5)>
array([False, False,  True,  True, False], dtype=bool)
Coordinates:
    dim_0: Int64Index([0, 1, 2, 3, 4], dtype='int64')
Attributes:
    Empty

You cannot directly do math with Dataset objects (yet!), but you can map an operation over any or all non-coordinates in a dataset by using apply():

In [64]: ds.drop_vars('abc').apply(lambda x: 2 * x)
Out[64]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
Attributes:
    Empty

Aggregation

Aggregation methods from ndarray have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):

In [65]: foo.sum('time')
Out[65]: 
<xray.DataArray 'foo' (space: 3)>
array([ 2.01860903, -8.40998298, -9.09267929])
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

In [66]: foo.std(['time', 'space'])
Out[66]: 
<xray.DataArray 'foo' ()>
array(3.901454019694515)
Attributes:
    Empty

In [67]: foo.min()
Out[67]: 
<xray.DataArray 'foo' ()>
array(-10.0)
Attributes:
    Empty

These operations also work on Dataset objects, by mapping over all non-coordinates:

In [68]: ds.mean('time')
Out[68]: 
<xray.Dataset>
Dimensions:     (space: 3)
Coordinates:
    space            X   
Noncoordinates:
    foo              0   
    numbers              
    abc              0   
Attributes:
    Empty

If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:

In [69]: foo.get_axis_num('space')
Out[69]: 1

To perform NA-skipping aggregations, pass NA-aware numpy functions directly to the reduce method:

In [70]: foo.reduce(np.nanmean, 'time')
Out[70]: 
<xray.DataArray 'foo' (space: 3)>
array([ 0.50465226, -2.10249574, -2.27316982])
Coordinates:
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Warning

Currently, xray uses the standard ndarray methods which do not automatically skip missing values, but we expect to switch the default to NA skipping versions (like pandas) in a future version (GH130).

Broadcasting

DataArray objects automatically align themselves (“broadcasting” in numpy parlance) by dimension name instead of axis order. With xray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as is commonly done in numpy with np.reshape() or np.newaxis.

This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:

In [71]: a = xray.DataArray([1, 2, 3, 4], [['a', 'b', 'c', 'd']], ['x'])

In [72]: a
Out[72]: 
<xray.DataArray (x: 4)>
array([1, 2, 3, 4])
Coordinates:
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
Attributes:
    Empty

In [73]: b = xray.DataArray([-1, -2, -3], dims=['y'])

In [74]: b
Out[74]: 
<xray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
    y: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

With xray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:

In [75]: a * b
Out[75]: 
<xray.DataArray (x: 4, y: 3)>
array([[ -1,  -2,  -3],
       [ -2,  -4,  -6],
       [ -3,  -6,  -9],
       [ -4,  -8, -12]])
Coordinates:
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
    y: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

Moreover, dimensions are always reordered to the order in which they first appeared:

In [76]: c = xray.DataArray(np.arange(12).reshape(3, 4), [b['y'], a['x']])

In [77]: c
Out[77]: 
<xray.DataArray (y: 3, x: 4)>
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
Coordinates:
    y: Int64Index([0, 1, 2], dtype='int64')
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
Attributes:
    Empty

In [78]: a + c
Out[78]: 
<xray.DataArray (x: 4, y: 3)>
array([[ 1,  5,  9],
       [ 3,  7, 11],
       [ 5,  9, 13],
       [ 7, 11, 15]])
Coordinates:
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
    y: Int64Index([0, 1, 2], dtype='int64')
Attributes:
    Empty

This means, for example, that you can always subtract an array from its transpose, and the result is all zeros:

In [79]: c - c.T
Out[79]: 
<xray.DataArray (y: 3, x: 4)>
array([[0, 0, 0, 0],
       [0, 0, 0, 0],
       [0, 0, 0, 0]])
Coordinates:
    y: Int64Index([0, 1, 2], dtype='int64')
    x: Index([u'a', u'b', u'c', u'd'], dtype='object')
Attributes:
    Empty

Alignment

Performing most binary operations on xray objects requires that all coordinate values are equal:

In [80]: a + a[:2]
ValueError: coordinate 'x' is not aligned

However, xray does have some methods (copied from pandas) that make it easy and fast to align DataArray and Dataset objects manually.

Warning

pandas does index based alignment automatically when doing math, using join='outer'. xray doesn’t have automatic alignment yet, but we do intend to enable it in a future version (GH186). Unlike pandas, we expect to default to join='inner'.

Reindexing returns modified arrays with new coordinates, filling in missing values with NaN. To reindex a particular dimension, use reindex():
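
For example, a minimal sketch (assuming reindex accepts dimension names as keyword arguments, like sel and isel above; the 'CA' label is hypothetical and not present in foo, so the corresponding values are filled with NaN):

foo.reindex(space=['IA', 'CA'])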

The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:

In [81]: baz = (10 * foo[:2, :2]).rename('baz')

In [82]: baz
Out[82]: 
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[   1.26969833, -100.        ],
       [   8.97236524,    3.76749716]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL'], dtype='object')
Attributes:
    Empty

Reindexing foo with baz selects out the first two values along each dimension:

In [83]: foo.reindex_like(baz)
Out[83]: 
<xray.DataArray 'foo' (time: 2, space: 2)>
array([[  0.12696983, -10.        ],
       [  0.89723652,   0.37674972]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, 2000-01-02]
          Length: 2, Freq: D, Timezone: None
    space: Index([u'IA', u'IL'], dtype='object')
Attributes:
    units: meters

The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:

In [84]: baz.reindex_like(foo)
Out[84]: 
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[   1.26969833, -100.        ,           nan],
       [   8.97236524,    3.76749716,           nan],
       [          nan,           nan,           nan],
       [          nan,           nan,           nan]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

The align() function lets us perform more flexible 'inner', 'outer', 'left' and 'right' joins:

In [85]: xray.align(foo, baz, join='inner')
Out[85]: 
(<xray.DataArray 'foo' (time: 2, space: 2)>
 array([[  0.12696983, -10.        ],
        [  0.89723652,   0.37674972]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, 2000-01-02]
           Length: 2, Freq: None, Timezone: None
     space: Index([u'IA', u'IL'], dtype='object')
 Attributes:
     units: meters, <xray.DataArray 'baz' (time: 2, space: 2)>
 array([[   1.26969833, -100.        ],
        [   8.97236524,    3.76749716]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, 2000-01-02]
           Length: 2, Freq: None, Timezone: None
     space: Index([u'IA', u'IL'], dtype='object')
 Attributes:
     Empty)

In [86]: xray.align(foo, baz, join='outer')
Out[86]: 
(<xray.DataArray 'foo' (time: 4, space: 3)>
 array([[  0.12696983, -10.        , -10.        ],
        [  0.89723652,   0.37674972,   0.33622174],
        [  0.45137647,   0.84025508,   0.12310214],
        [  0.5430262 ,   0.37301223,   0.44799682]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
     space: Index([u'IA', u'IL', u'IN'], dtype='object')
 Attributes:
     units: meters, <xray.DataArray 'baz' (time: 4, space: 3)>
 array([[   1.26969833, -100.        ,           nan],
        [   8.97236524,    3.76749716,           nan],
        [          nan,           nan,           nan],
        [          nan,           nan,           nan]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
     space: Index([u'IA', u'IL', u'IN'], dtype='object')
 Attributes:
     Empty)

Both reindex_like and align work interchangeably with DataArray and xray.Dataset objects with any number of overlapping dimensions:

In [87]: ds
Out[87]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

In [88]: ds.reindex_like(baz)
Out[88]: 
<xray.Dataset>
Dimensions:     (space: 2, time: 2)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

GroupBy: split-apply-combine

pandas has very convenient support for “group by” operations, which implement the split-apply-combine strategy for crunching data:

  • Split your data into multiple independent groups.
  • Apply some function to each group.
  • Combine your groups back into a single data object.

xray implements this same pattern using very similar syntax to pandas. Group by operations work on both Dataset and DataArray objects. Note that currently, you can only group by a single one-dimensional variable (eventually, we hope to remove this limitation).

Split

Recall the “numbers” variable in our dataset:

In [89]: ds['numbers']
Out[89]: 
<xray.DataArray 'numbers' (time: 4)>
array([10, 10, 20, 20])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Linked dataset variables:
    space, foo, abc
Attributes:
    Empty

If we groupby the name of a variable in a dataset (we can also use a DataArray directly), we get back an xray.GroupBy object:

In [90]: ds.groupby('numbers')
Out[90]: <xray.groupby.DatasetGroupBy at 0x7f411f94b550>

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

In [91]: ds.groupby('numbers').groups
Out[91]: {10: [0, 1], 20: [2, 3]}

You can also iterate over groups in (label, group) pairs:

In [92]: list(ds.groupby('numbers'))
Out[92]: 
[(10, <xray.Dataset>
  Dimensions:     (space: 3, time: 2)
  Coordinates:
      space            X            
      time                      X   
  Noncoordinates:
      foo              1        0   
      numbers                   0   
      abc              0            
  Attributes:
      title: example attribute), (20, <xray.Dataset>
  Dimensions:     (space: 3, time: 2)
  Coordinates:
      space            X            
      time                      X   
  Noncoordinates:
      foo              1        0   
      numbers                   0   
      abc              0            
  Attributes:
      title: example attribute)]

Just like in pandas, creating a GroupBy object doesn’t actually split the data until you want to access particular values.

Apply

To apply a function to each group, you can use the flexible xray.GroupBy.apply() method. The resulting objects are automatically concatenated back together along the group axis:

In [93]: def standardize(x):
   ....:     return (x - x.mean()) / x.std()
   ....: 

In [94]: ds['foo'].groupby('numbers').apply(standardize)
Out[94]: 
<xray.DataArray (time: 4, space: 3)>
array([[ 0.64391378, -1.41264919, -1.41264919],
       [ 0.80033786,  0.69463853,  0.6864082 ],
       [-0.05512164,  1.76892497, -1.59490205],
       [ 0.37476413, -0.42269144, -0.07097397]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: None, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    numbers
Attributes:
    Empty

GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:

In [95]: foo.groupby('time').mean()
Out[95]: 
<xray.DataArray 'foo' (time: 4)>
array([-6.62434339,  0.53673599,  0.4715779 ,  0.45467842])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
Attributes:
    units: meters

In [96]: ds.groupby('numbers').reduce(np.nanmean)
Out[96]: 
<xray.Dataset>
Dimensions:     (numbers: 2)
Coordinates:
    numbers           X    
Noncoordinates:
    foo               0    
Attributes:
    Empty

Squeezing

When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:

In [97]: next(iter(foo.groupby('space')))
Out[97]: 
('IA', <xray.DataArray 'foo' (time: 4)>
 array([ 0.12696983,  0.89723652,  0.45137647,  0.5430262 ])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
 Linked dataset variables:
     space
 Attributes:
     units: meters)

In [98]: next(iter(foo.groupby('space', squeeze=False)))
Out[98]: 
('IA', <xray.DataArray 'foo' (time: 4, space: 1)>
 array([[ 0.12696983],
        [ 0.89723652],
        [ 0.45137647],
        [ 0.5430262 ]])
 Coordinates:
     time: <class 'pandas.tseries.index.DatetimeIndex'>
           [2000-01-01, ..., 2000-01-04]
           Length: 4, Freq: D, Timezone: None
     space: Index([u'IA'], dtype='object')
 Attributes:
     units: meters)

Although xray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
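
A minimal sketch, reusing foo: indexing with a list keeps a length-one dimension, which squeeze then removes:

foo.isel(time=[0]).squeeze()  # drops the length-one 'time' dimension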

Combining data

Concatenate

To combine arrays along a dimension into a larger array, you can use the DataArray.concat and Dataset.concat class methods:

In [99]: xray.DataArray.concat([foo[0], foo[1]], 'new_dim')
Out[99]: 
<xray.DataArray 'foo' (new_dim: 2, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174]])
Coordinates:
    new_dim: Int64Index([0, 1], dtype='int64')
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    time
Attributes:
    units: meters

In [100]: xray.Dataset.concat([ds.sel(time='2000-01-01'), ds.sel(time='2000-01-03')],
   .....:                     'new_dim')
   .....: 
Out[100]: 
<xray.Dataset>
Dimensions:     (new_dim: 2, space: 3)
Coordinates:
    new_dim           X              
    space                        X   
Noncoordinates:
    abc                          0   
    foo               0          1   
    numbers           0              
    time              0              
Attributes:
    title: example attribute

The second argument to concat can be a Coordinate or DataArray object as well as a string, in which case it is used to label the values along the new dimension:

In [101]: xray.DataArray.concat([foo[0], foo[1]], xray.Coordinate('x', [-90, -100]))
Out[101]: 
<xray.DataArray 'foo' (x: 2, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174]])
Coordinates:
    x: Int64Index([-90, -100], dtype='int64')
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Linked dataset variables:
    time
Attributes:
    units: meters

Dataset.concat has a number of options which control how it combines data, and in particular, how it handles conflicting variables between datasets.

Merge and update

To combine multiple Datasets, you can use the merge() and update() methods. Merge checks for conflicting variables before merging and by default it returns a new Dataset:

In [102]: ds.merge({'hello': ('space', np.arange(3) + 10)})
Out[102]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
    hello            0            
Attributes:
    title: example attribute

In contrast, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:

In [103]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[103]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.

Equals and identical

xray objects can be compared by using the equals() and identical() methods.

equals checks dimension names, indexes and array values:

In [104]: foo.equals(foo.copy())
Out[104]: True

identical also checks attributes, and the name of each object:

In [105]: foo.identical(foo.rename('bar'))
Out[105]: False

In contrast, the == operator for DataArray objects performs element-wise comparison (like numpy):

In [106]: foo == foo.copy()
Out[106]: 
<xray.DataArray (time: 4, space: 3)>
array([[ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Like pandas objects, two xray objects are still equal or identical if they have missing values marked by NaN, as long as the missing values are in the same locations in both objects. This is not true for NaN in general, which usually compares False to everything, including itself:

In [107]: np.nan == np.nan
Out[107]: False

Working with pandas

One of the most important features of xray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built into pandas itself or provided by pandas-aware libraries such as Seaborn and ggplot.

Fortunately, there are straightforward representations of Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively. The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.

Note

If you want to convert a pandas data structure into a DataArray with the same number of dimensions, you can simply use the DataArray constructor directly.

pandas.DataFrame

To convert to a DataFrame, use the Dataset.to_dataframe() method:

In [108]: df = ds.to_dataframe()

In [109]: df
Out[109]: 
                        foo  numbers abc
space time                              
10.2  2000-01-01   0.126970       10   A
      2000-01-02   0.897237       10   A
      2000-01-03   0.451376       20   A
      2000-01-04   0.543026       20   A
9.4   2000-01-01 -10.000000       10   B
      2000-01-02   0.376750       10   B
      2000-01-03   0.840255       20   B
      2000-01-04   0.373012       20   B
3.9   2000-01-01 -10.000000       10   C
      2000-01-02   0.336222       10   C
      2000-01-03   0.123102       20   C
      2000-01-04   0.447997       20   C

We see that each noncoordinate variable in the Dataset is now a column in the DataFrame. The DataFrame representation is reminiscent of Hadley Wickham’s notion of tidy data. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
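
For example, two standard pandas reshaping operations on the round-tripped frame:

df.reset_index()            # move the MultiIndex levels into ordinary columns
df['foo'].unstack('space')  # pivot one variable into a 2D time-by-space table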

To create a Dataset from a DataFrame, use the from_dataframe() class method:

In [110]: xray.Dataset.from_dataframe(df)
Out[110]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              0        1   
    numbers          0        1   
    abc              0        1   
Attributes:
    Empty

Notice that the dimensions of the non-coordinates in the Dataset have all expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so the data of each array needed to be broadcast to the full size of the new MultiIndex.

pandas.Series

DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset-to-DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:

In [111]: s = foo.to_series()

In [112]: s
Out[112]: 
time        space
2000-01-01  IA        0.126970
            IL      -10.000000
            IN      -10.000000
2000-01-02  IA        0.897237
            IL        0.376750
            IN        0.336222
2000-01-03  IA        0.451376
            IL        0.840255
            IN        0.123102
2000-01-04  IA        0.543026
            IL        0.373012
            IN        0.447997
Name: foo, dtype: float64

In [113]: xray.DataArray.from_series(s)
Out[113]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: None, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:

In [114]: s[::2]
Out[114]: 
time        space
2000-01-01  IA        0.126970
            IN      -10.000000
2000-01-02  IL        0.376750
2000-01-03  IA        0.451376
            IN        0.123102
2000-01-04  IL        0.373012
Name: foo, dtype: float64

In [115]: xray.DataArray.from_series(s[::2])
Out[115]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983,          nan, -10.        ],
       [         nan,   0.37674972,          nan],
       [  0.45137647,          nan,   0.12310214],
       [         nan,   0.37301223,          nan]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: None, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    Empty

Serialization and IO

xray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.

Pickle

The simplest way to serialize an xray object is to use Python’s built-in pickle module:

In [116]: import cPickle as pickle

In [117]: pkl = pickle.dumps(ds)

In [118]: pickle.loads(pkl)
Out[118]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

Pickle support is important because it doesn’t require any external libraries and lets you use xray objects with Python modules like multiprocessing. However, there are two important caveats:

  1. To simplify serialization, xray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
  2. Pickle will only work as long as the internal data structure of xray objects remains unchanged. Because the internal design of xray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xray will work in future versions.

Reading and writing to disk (netCDF)

Currently, the only external serialization format that xray supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.

Reading and writing netCDF files with xray requires the netCDF4-python library.

We can save a Dataset to disk using the Dataset.to_netcdf method:

In [119]: ds.to_netcdf('saved_on_disk.nc')

By default, the file is saved as netCDF4.

We can load netCDF files to create a new Dataset using the open_dataset() function:

In [120]: ds_disk = xray.open_dataset('saved_on_disk.nc')

In [121]: ds_disk
Out[121]: 
<xray.Dataset>
Dimensions:     (space: 3, time: 4)
Coordinates:
    space            X            
    time                      X   
Noncoordinates:
    foo              1        0   
    numbers                   0   
    abc              0            
Attributes:
    title: example attribute

A dataset can also be loaded from a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument.
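
A minimal sketch (the file name and group path here are hypothetical):

ds_sub = xray.open_dataset('saved_groups.nc', group='/foo/bar')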

Data is loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until necessary. For an example of how these lazy arrays work, see the OPeNDAP section below.

Datasets have a close() method to close the associated netCDF file. The preferred way to handle this is to use a context manager:

In [122]: with xray.open_dataset('my_file.nc') as ds:
...           print(ds.keys())
Out[122]: ['space', 'foo', 'time', 'numbers', 'abc']

Note

Although xray provides reasonable support for incremental reads of files on disk, it does not yet support incremental writes, which is important for dealing with datasets that do not fit into memory. This is a significant shortcoming that we hope to resolve (GH199) by adding the ability to create Dataset objects directly linked to a netCDF file on disk.

NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “_FillValue” attributes). If the argument decode_cf=True (default) is given to open_dataset, xray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
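
For example, a minimal sketch of turning the decoding off, reusing the file written above:

ds_raw = xray.open_dataset('saved_on_disk.nc', decode_cf=False)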

You can view this encoding information and control the details of how xray serializes objects by viewing and manipulating the DataArray.encoding attribute:

In [123]: ds_disk['time'].encoding
Out[123]: 
{'calendar': u'proleptic_gregorian',
 'chunksizes': None,
 'complevel': 0,
 'contiguous': True,
 'dtype': dtype('float64'),
 'fletcher32': False,
 'least_significant_digit': None,
 'shuffle': False,
 'units': u'days since 2000-01-01 00:00:00',
 'zlib': False}

Working with remote datasets (OPeNDAP)

xray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.

For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by the International Research Institute for Climate and Society at Columbia:

In [124]: remote_data = xray.open_dataset(
    'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods')

In [125]: remote_data
Out[125]: 
<xray.Dataset>
Dimensions:     (T: 1432, X: 1405, Y: 621)
Coordinates:
    T               X
    X                        X
    Y                                 X
Noncoordinates:
    ppt             0        2        1
    tdmean          0        2        1
    tmax            0        2        1
    tmin            0        2        1
Attributes:
    Conventions: IRIDL
    expires: 1401580800

In [126]: remote_data['tmax']
Out[126]: 
<xray.DataArray 'tmax' (T: 1432, Y: 621, X: 1405)>
[1249427160 values with dtype=float64]
Attributes:
    pointwidth: 120
    units: Celsius_scale
    missing_value: -9999
    standard_name: air_temperature
    expires: 1401580800

We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:

In [127]: tmax = remote_data['tmax'][:500, ::3, ::3]

In [128]: tmax
Out[128]: 
<xray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Attributes:
    pointwidth: 120
    units: Celsius_scale
    missing_value: -9999
    standard_name: air_temperature
    expires: 1401580800

Now, let’s access and plot a small subset:

In [129]: tmax_ss = tmax[0]

For this dataset, we still need to manually fill in some of the values with NaN to indicate that they are missing. As soon as we access tmax_ss.values, the values are loaded over the network and cached on the DataArray so they can be manipulated:

In [130]: tmax_ss.values[tmax_ss.values < -99] = np.nan

Finally, we can plot the values with matplotlib:

In [131]: import matplotlib.pyplot as plt

In [132]: from matplotlib.cm import get_cmap

In [133]: plt.figure(figsize=(9, 5))

In [134]: plt.gca().patch.set_color('0')

In [135]: plt.contourf(tmax_ss['X'], tmax_ss['Y'], tmax_ss.values, 20,
     ...:     cmap=get_cmap('RdBu_r'))

In [136]: plt.colorbar()
[figure: opendap-prism-tmax.png — contour map of the tmax subset]

Loading into memory

xray’s lazy loading of remote or on-disk datasets is not always desirable. In such cases, you can use the load_data() method to force loading a Dataset or DataArray entirely into memory. In particular, this can lead to significant speedups if done before performing array-based indexing.
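
A minimal sketch, reusing ds_disk from above:

ds_disk.load_data()  # read every array in the dataset into memory at once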

Notes on xray’s internals

Warning

These implementation details may be useful for advanced users, but they will change in future versions.

DataArray

In the current version of xray, DataArrays are simply pointers to a dataset (the dataset attribute) and the name of a variable in the dataset (the name attribute), which indicates to which variable array operations should be applied. These variables are listed in the DataArray representation as “linked dataset variables”:

In [137]: foo
Out[137]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

Usually, xray automatically manages the Dataset objects that data arrays point to in a satisfactory fashion.

However, in some cases, particularly for performance reasons, you may want to explicitly ensure that the dataset only includes the variables you are interested in. For these cases, use the xray.DataArray.select_vars() method to select the names of variables you want to keep around; by default, only the DataArray itself is kept:

In [138]: foo2 = foo.select_vars()

In [139]: foo2
Out[139]: 
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[  0.12696983, -10.        , -10.        ],
       [  0.89723652,   0.37674972,   0.33622174],
       [  0.45137647,   0.84025508,   0.12310214],
       [  0.5430262 ,   0.37301223,   0.44799682]])
Coordinates:
    time: <class 'pandas.tseries.index.DatetimeIndex'>
          [2000-01-01, ..., 2000-01-04]
          Length: 4, Freq: D, Timezone: None
    space: Index([u'IA', u'IL', u'IN'], dtype='object')
Attributes:
    units: meters

foo2 is an equivalent labeled array to foo, with any no-longer-relevant dataset variables dropped. In this example, foo’s dataset already contained only foo and its coordinates, so the keys are unchanged:

In [140]: foo.dataset.keys()
Out[140]: ['time', 'space', 'foo']

In [141]: foo2.dataset.keys()
Out[141]: ['time', 'space', 'foo']

Note

This feature may change in a future version of xray, because we intend to support non-index coordinates (GH197), which should cover all the use cases for “linked dataset variables” in a much more obvious fashion.

Variable

Variable implements xray’s basic building block for Dataset and DataArray variables. It supports the numpy ndarray interface, extended with basic metadata (not including index values). It consists of:

  1. dims: A tuple of dimension names.
  2. values: The N-dimensional array (for example, of type numpy.ndarray) storing the array’s data. It must have the same number of dimensions as the length of dims.
  3. attrs: An ordered dictionary of additional metadata to associate with this array.

The main functional difference between Variables and numpy arrays is that numerical operations on Variables implement array broadcasting by dimension name. For example, adding a Variable with dimensions (‘time’,) to another Variable with dimensions (‘space’,) results in a new Variable with dimensions (‘time’, ‘space’). Furthermore, numpy reduce operations like mean or sum are overridden to take a “dimension” argument instead of an “axis”.
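
As a minimal sketch of both points (the variable names here are illustrative):

import numpy as np
import xray

a = xray.Variable(('time',), np.arange(4))
b = xray.Variable(('space',), np.arange(3))

c = a + b      # broadcasts by name: dimensions ('time', 'space'), shape (4, 3)
c.sum('time')  # reduce over a named dimension rather than a numbered axis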

Variables are light-weight objects used as the building blocks for datasets. Because they are more primitive, operations with them provide marginally higher performance than using DataArrays. However, manipulating data in the form of a Dataset or DataArray should almost always be preferred, because they can make use of more complete metadata in the context of coordinate labels.

You can find a read-only copy of the variables associated with a Dataset in its .variables attribute, or for a DataArray in its .variable attribute.
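
For example, a short sketch using foo from above:

foo.variable           # the Variable underlying the 'foo' DataArray
foo.dataset.variables  # read-only mapping of all Variables in foo's dataset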

Examples

Shared setup:

import xray
import numpy as np
import pandas as pd

np.random.seed(123)

def make_example_data():
    # two years of daily timestamps
    times = pd.date_range('2000-01-01', '2001-12-31', name='time')
    # seasonal cycle, phased so the minimum falls near the start of January
    annual_cycle = np.sin(2 * np.pi * (times.dayofyear / 365.25 - 0.28))

    # simulated min/max temperatures at 10 locations: cycle plus noise
    base = 10 + 15 * annual_cycle.reshape(-1, 1)
    tmin_values = base + 5 * np.random.randn(annual_cycle.size, 10)
    tmax_values = base + 10 + 5 * np.random.randn(annual_cycle.size, 10)

    ds = xray.Dataset({'tmin': (('time', 'x'), tmin_values),
                       'tmax': (('time', 'x'), tmax_values),
                       'time': ('time', times)})
    return ds

ds = make_example_data()

Monthly averaging

def year_month(xray_obj):
    """Given an xray object with a 'time' coordinate, return an DataArray
    with values given by the first date of the month in which each time
    falls.
    """
    time = xray_obj.coords['time']
    values = pd.Index(time).to_period('M').to_timestamp()
    return xray.DataArray(values, [time], name='year_month')

ds.mean('x').to_dataframe().plot()

monthly_avg = ds.groupby(year_month(ds)).mean('time')
monthly_avg.mean('x').to_dataframe().plot(style='s-')
[Figures: daily mean tmin/tmax (examples_tmin_tmax_plot.png) and their monthly averages (examples_tmin_tmax_plot_mean.png)]

Calculate monthly anomalies

def unique_item(items):
    """Return the single unique element of an iterable, or raise an error
    """
    items = set(items)
    assert len(items) == 1
    return items.pop()

def _anomaly_one_month(ds):
    # every time in this group falls within the same calendar month
    month = unique_item(ds['time.month'].values)
    # look up that month's climatology (defined below, before apply runs)
    rel_clim = climatology.sel(**{'time.month': month})
    # subtract the climatology from each variable in the group
    return ds.apply(lambda x: x - rel_clim[x.name])

climatology = ds.groupby('time.month').mean('time')
anomalies = ds.groupby('time.month').apply(_anomaly_one_month)
# in a future version of xray, this should be as easy as:
# anomalies = ds.groupby('time.month') - climatology

anomalies.mean('x').drop_vars('time.month').to_dataframe().plot()
[Figure: monthly anomalies of mean tmin/tmax (examples_anomalies_plot.png)]

Frequently Asked Questions

Why is pandas not enough?

pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?

Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.

pandas does support N-dimensional panels, but the implementation is very limited:

  • You need to create a new factory type for each dimensionality.
  • You can’t do math between NDPanels with different dimensionality.
  • Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.) but the dimension names refer to order, not meaning. You can’t specify an operation to be applied along the “time” axis.

Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.

When should I use xray instead of pandas?

It’s not an either/or choice! xray provides robust support for converting back and forth between the tabular data structures of pandas and its own multi-dimensional data structures.
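
For example, a minimal sketch using conversion methods listed in the API reference (ds here is any xray.Dataset, such as the one built in the Examples section):

df = ds.to_dataframe()                 # Dataset -> pandas.DataFrame
ds2 = xray.Dataset.from_dataframe(df)  # and back again
tmax_series = ds['tmax'].to_series()   # one DataArray -> pandas.Series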

That said, you should only bother with xray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.

How can I use xray with heterogeneous data?

All items in a DataArray must have a single (homogeneous) data type. To work with heterogeneous or structured data types in xray, put separate DataArray objects in a single Dataset.

The Dataset object allows for most of the flexibility of heterogeneous data without the complexity or performance cost, because its constituent arrays each have a single dtype.
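
A minimal sketch (the variable names are illustrative):

import numpy as np
import xray

# each array is internally homogeneous, but the dataset mixes dtypes freely
ds = xray.Dataset({'temperature': (('station',), np.array([11.2, 10.9, 12.1])),
                   'label': (('station',), np.array(['A', 'B', 'C']))})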

What is your approach to metadata?

We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).

Automatic interpretation of labels is powerful but also reduces flexibility. With xray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to netCDF with cf_conventions=True.)

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xray usually drops conflicting attrs when combining arrays and datasets instead of raising an exception, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
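
For example, a short sketch of the default behavior (keep_attrs appears on the reduce methods listed in the API reference):

import numpy as np
import xray

arr = xray.DataArray(np.arange(3), dims=['x'], attrs={'units': 'meters'})

arr.mean().attrs                 # empty -- attrs are dropped by default
arr.mean(keep_attrs=True).attrs  # {'units': 'meters'}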

API reference

Dataset

Creating a dataset
Dataset([variables, coords, attrs]) A netcdf-like data object consisting of variables and attributes which together form a self-describing dataset.
open_dataset(nc[, decode_cf, ...]) Load a dataset from a file or file-like object.
Attributes and underlying data
Dataset.coords Dictionary of xray.Coordinate objects used for label based indexing.
Dataset.noncoords Dictionary of DataArrays whose names do not match dimensions.
Dataset.dims Mapping from dimension names to lengths.
Dataset.attrs Dictionary of global attributes on this dataset
Dataset contents

Datasets implement the mapping interface, with keys given by variable names and values given by DataArray objects; a brief sketch follows the table below.

Dataset.__getitem__(key) Access the given variable name in this dataset as a DataArray.
Dataset.__setitem__(key, value) Add an array to this dataset.
Dataset.__delitem__(key) Remove a variable from this dataset.
Dataset.update(other[, inplace]) Update this dataset’s variables and attributes with those from another dataset.
Dataset.merge(other[, inplace, ...]) Merge the variables of two datasets into a single new dataset.
Dataset.concat(datasets[, dim, indexers, ...]) Concatenate datasets along a new or existing dimension.
Dataset.copy([deep]) Returns a copy of this dataset.
Dataset.load_data() Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset.
Dataset.iteritems(...)
Dataset.itervalues(...)
Dataset.virtual_variables A frozenset of variable names that don’t exist in this dataset but which could be created on demand.
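
As a minimal sketch of the mapping interface (ds is an example Dataset; the new variable name is illustrative):

tmax = ds['tmax']                       # __getitem__ returns a DataArray
ds['tmax_f'] = ds['tmax'] * 9 / 5 + 32  # __setitem__ adds a new array
del ds['tmax_f']                        # __delitem__ removes it again
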
Comparisons
Dataset.equals(other) Two Datasets are equal if they have the same variables and all variables are equal.
Dataset.identical(other) Two Datasets are identical if they have the same variables and all variables are identical (with the same attributes), and they also have the same global attributes.
Selecting
Dataset.isel(**indexers) Return a new dataset with each array indexed along the specified dimension(s).
Dataset.sel(**indexers) Return a new dataset with each variable indexed by tick labels along the specified dimension(s).
Dataset.reindex([copy]) Conform this object onto a new set of coordinates, filling in missing values with NaN.
Dataset.reindex_like(other[, copy]) Conform this object onto the coordinates of another object, filling in missing values with NaN.
Dataset.rename(name_dict[, inplace]) Returns a new object with renamed variables and dimensions.
Dataset.select_vars(*names) Returns a new dataset that contains only the named variables and their coordinates.
Dataset.drop_vars(*names) Returns a new dataset without the named variables.
Dataset.squeeze([dim]) Return a new dataset with squeezed data.
Dataset.groupby(group[, squeeze]) Group this dataset by unique values of the indicated group.
Computations
Dataset.apply(func[, keep_attrs]) Apply a function over noncoordinate variables in this dataset.
Dataset.reduce(func[, dim, keep_attrs]) Reduce this dataset by applying func along some dimension(s).
Dataset.all([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.all along some dimension(s).
Dataset.any([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.any along some dimension(s).
Dataset.argmax([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.argmax along some dimension(s).
Dataset.argmin([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.argmin along some dimension(s).
Dataset.max([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.max along some dimension(s).
Dataset.min([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.min along some dimension(s).
Dataset.mean([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.mean along some dimension(s).
Dataset.std([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.std along some dimension(s).
Dataset.sum([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.sum along some dimension(s).
Dataset.var([dim, keep_attrs]) Reduce this Dataset’s data by applying numpy.var along some dimension(s).
IO / Conversion
Dataset.to_netcdf(filepath, **kwdargs) Dump dataset contents to a location on disk using the netCDF4 package.
Dataset.to_dataframe() Convert this dataset into a pandas.DataFrame.
Dataset.from_dataframe(dataframe) Convert a pandas.DataFrame into an xray.Dataset
Dataset.close() Close any datastores linked to this dataset
Dataset internals

These attributes and classes provide a low-level interface for working with Dataset variables. In general, you should use the Dataset’s dictionary-like interface instead and work with DataArray objects:

Dataset.variables Dictionary of Variable objects contained in this dataset.
Variable(dims, data[, attrs, encoding]) A netcdf-like variable consisting of dimensions, data and attributes which describe a single Array.
Coordinate(name, data[, attrs, encoding]) Wrapper around pandas.Index that adds xray specific functionality.
Backends (experimental)

These backends provide a low-level interface for lazily loading data from external file formats or protocols, and can be manually invoked to create arguments for the from_store and dump_to_store Dataset methods.

backends.NetCDF4DataStore(filename[, mode, ...]) Store for reading and writing data via the Python-NetCDF4 library.
backends.PydapDataStore(url) Store for accessing OpenDAP datasets with pydap.
backends.ScipyDataStore(filename_or_obj[, ...]) Store for reading and writing data via scipy.io.netcdf.
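
As a rough sketch of manual store usage, following the from_store and dump_to_store names mentioned above (these interfaces are experimental, so the exact signatures may differ):

# hypothetical filename; a store opened for writing could similarly be
# passed to ds.dump_to_store
store = xray.backends.ScipyDataStore('example.nc')
ds = xray.Dataset.from_store(store)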

DataArray

DataArray([data, coords, dims, name, attrs, ...]) N-dimensional array with labeled coordinates and dimensions.
Attributes and underlying data
DataArray.values The variables’s data as a numpy.ndarray
DataArray.coords Dictionary-like container of xray.Coordinate objects used for label based indexing.
DataArray.dims
DataArray.name The name of the variable in the dataset to which array operations are applied.
DataArray.attrs Dictionary storing arbitrary metadata with this array.
DataArray.encoding Dictionary of format-specific settings for how this array should be serialized.
DataArray.variable
Selecting
DataArray.__getitem__(key)
DataArray.__setitem__(key, value)
DataArray.loc Attribute for location based indexing like pandas.
DataArray.isel(**indexers) Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s).
DataArray.sel(**indexers) Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s).
DataArray.reindex([copy]) Conform this object onto a new set of coordinates, filling in missing values with NaN.
DataArray.reindex_like(other[, copy]) Conform this object onto the coordinates of another object, filling in missing values with NaN.
DataArray.rename(new_name_or_name_dict) Returns a new DataArray with renamed variables.
DataArray.select_vars(*names) Returns a new DataArray with only the named variables, as well as this DataArray’s array variable (and all associated coordinates).
DataArray.drop_vars(*names) Returns a new DataArray without the named variables.
DataArray.squeeze([dim]) Return a new DataArray object with squeezed data.
Group operations
DataArray.groupby(group[, squeeze]) Group this dataset by unique values of the indicated group.
DataArray.concat(arrays[, dim, indexers, ...]) Stack arrays along a new or existing dimension to form a new DataArray.
Computations
DataArray.transpose(*dims) Return a new DataArray object with transposed dimensions.
DataArray.T
DataArray.reduce(func[, dim, axis, keep_attrs]) Reduce this array by applying func along some dimension(s).
DataArray.get_axis_num(dim) Return axis number(s) corresponding to dimension(s) in this array.
DataArray.all([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.all along some dimension(s).
DataArray.any([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.any along some dimension(s).
DataArray.argmax([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.argmax along some dimension(s).
DataArray.argmin([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.argmin along some dimension(s).
DataArray.max([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.max along some dimension(s).
DataArray.min([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.min along some dimension(s).
DataArray.mean([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.mean along some dimension(s).
DataArray.std([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.std along some dimension(s).
DataArray.sum([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.sum along some dimension(s).
DataArray.var([dim, axis, keep_attrs]) Reduce this DataArray’s data by applying numpy.var along some dimension(s).
DataArray.isnull(*args, **kwargs) Detect missing values (NaN in numeric arrays, None/NaN in object arrays)
DataArray.notnull(*args, **kwargs) Replacement for numpy.isfinite / ~numpy.isnan which is suitable for use on object arrays.
Comparisons
DataArray.equals(other) True if two DataArrays have the same dimensions, coordinates and values; otherwise False.
DataArray.identical(other) Like equals, but also checks DataArray names and attributes, and attributes on their coordinates.
IO / Conversion
DataArray.to_dataset([name]) Convert a DataArray to a Dataset
DataArray.to_dataframe() Convert this array into a pandas.DataFrame.
DataArray.to_series() Convert this array into a pandas.Series.
DataArray.to_index() Convert this variable to a pandas.Index.
DataArray.from_series(series) Convert a pandas.Series into an xray.DataArray
DataArray.copy([deep]) Returns a copy of this array.
DataArray.load_data() Manually trigger loading of this array’s data from disk or a remote source into memory and return this array.

Top-level functions

align(*objects[, join, copy]) Given any number of Dataset and/or DataArray objects, returns new objects with aligned coordinates.

What’s New

v0.2.0 (14 August 2014)

This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:

  • There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed Tutorial.
  • You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
  • xray now works on Python 2.6, thanks to Anna Kuznetsova.
  • A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
  • New load_data() and close() methods for datasets facilitate lower-level control over data loaded from disk.

v0.1.1 (20 May 2014)

xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:

  • Python 3 support (GH53)
  • Required numpy version relaxed to 1.7 (GH129)
  • Return numpy.datetime64 arrays for non-standard calendars (GH126)
  • Support for opening datasets associated with NetCDF4 groups (GH127)
  • Bug-fixes for concatenating datetime arrays (GH134)

Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.

v0.1 (2 May 2014)

Initial release.

Get in touch

xray is an ambitious project and we have a lot of work to do to make it as powerful as it should be. We would love to hear your thoughts!

License

xray is available under the open source Apache License.

History

xray is an evolution of an internal tool developed at The Climate Corporation, and was originally written by current and former Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo.