xray: N-D labeled arrays and datasets in Python¶
xray is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures, Series and DataFrame: the xray DataArray and Dataset.
Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences (e.g., netCDF and OPeNDAP): xray.Dataset is an in-memory representation of a netCDF file.
Documentation¶
Why xray?¶
Features¶
Adding dimension names and coordinate indexes to numpy’s ndarray makes many powerful array operations possible:
- Apply operations over dimensions by name: x.sum('time').
- Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
- Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
- Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
- Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xray.align(x, y, join='outer').
- Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
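Several of these ideas can be previewed with plain pandas, which xray builds on. As a minimal sketch of label-based outer alignment, here is the 1-D pandas analogue (pandas.Series.align rather than xray.align):

```python
import pandas as pd

# Two series with partially overlapping labels
x = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
y = pd.Series([10.0, 20.0], index=["b", "d"])

# Outer join: reindex both onto the union of labels, inserting NaN
# where a label is missing, analogous to xray.align(x, y, join='outer')
x2, y2 = x.align(y, join="outer")

print(list(x2.index))  # ['a', 'b', 'c', 'd']
```

xray extends this alignment behavior to labels along any number of named dimensions.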
pandas provides many of these features, but it does not make use of dimension names, and its core data structures are fixed-dimensional arrays.
The N-dimensional nature of xray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis numbers (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xray, you don’t need to keep track of the order of array dimensions or insert dummy dimensions (e.g., np.newaxis) to align arrays.
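To see the bookkeeping that named dimensions remove, here is the raw numpy version of combining two 1-D arrays along different dimensions, which requires inserting a dummy axis by hand (a plain-numpy sketch of the pain point described above):

```python
import numpy as np

time = np.array([0.0, 1.0, 2.0])   # conceptually a 'time' dimension
space = np.array([10.0, 20.0])     # conceptually a 'space' dimension

# With raw numpy you must remember the axis order and insert a dummy
# dimension so the shapes (3, 1) and (2,) broadcast to (3, 2)
combined = time[:, np.newaxis] + space

print(combined.shape)  # (3, 2)
```

With xray, the same operation broadcasts automatically because each array carries its dimension names.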
Core data structures¶
xray has two core data structures. Both are fundamentally N-dimensional:
- DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xray to the pandas.DataFrame.
The value of attaching labels to numpy’s numpy.ndarray may be fairly obvious, but the dataset may need more motivation.
The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.
This data model is borrowed from the netCDF file format, which also provides xray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.
xray distinguishes itself from most other tools for working with netCDF data insofar as it provides data structures for in-memory analytics that both utilize and preserve labels. You only need to do the tedious work of adding metadata once, not every time you save a file.
Goals and aspirations¶
pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xray comes in.
xray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).
Importantly, xray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.
Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.
Warning
xray is a relatively new project and is still under heavy development. Although we will make a best effort to maintain compatibility with the current API, inevitably the API will change in future versions as xray matures. Already anticipated changes are called out in the relevant section of the documentation.
Examples¶
Quick overview¶
Here are some quick examples of what you can do with xray’s DataArray object. Everything is explained in much more detail in the rest of the documentation.
To begin, import numpy, pandas and xray:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import xray
Create a DataArray¶
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
In [4]: xray.DataArray(np.random.randn(2, 3))
Out[4]:
<xray.DataArray (dim_0: 2, dim_1: 3)>
array([[-1.34431181, 0.84488514, 1.07576978],
[-0.10904998, 1.64356307, -1.46938796]])
Coordinates:
* dim_0 (dim_0) int64 0 1
* dim_1 (dim_1) int64 0 1 2
In [5]: xray.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])
Out[5]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.35702056, -0.6746001 , -1.77690372],
[-0.96891381, -1.29452359, 0.41373811]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
You can also pass in pandas data structures directly:
In [6]: df = pd.DataFrame(np.random.randn(2, 3), index=['a', 'b'], columns=[-2, 0, 2])
In [7]: df.index.name = 'x'
In [8]: df.columns.name = 'y'
In [9]: foo = xray.DataArray(df, name='foo')
In [10]: foo
Out[10]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.27666171, -0.47203451, -0.01395975],
[-0.36254299, -0.00615357, -0.92306065]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
Here are the key properties for a DataArray:
# like in pandas, values is a numpy array that you can modify in-place
In [11]: foo.values
Out[11]:
array([[ 0.27666171, -0.47203451, -0.01395975],
[-0.36254299, -0.00615357, -0.92306065]])
In [12]: foo.dims
Out[12]: ('x', 'y')
In [13]: foo.coords['y']
Out[13]:
<xray.DataArray 'y' (y: 3)>
array([-2, 0, 2])
Coordinates:
* y (y) int64 -2 0 2
# you can use this dictionary to store arbitrary metadata
In [14]: foo.attrs
Out[14]: OrderedDict()
Indexing¶
xray supports four kinds of indexing. These operations are just as fast as in pandas, because we borrow pandas’ indexing machinery.
# positional and by integer label, like numpy
In [15]: foo[[0, 1]]
Out[15]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.27666171, -0.47203451, -0.01395975],
[-0.36254299, -0.00615357, -0.92306065]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
# positional and by coordinate label, like pandas
In [16]: foo.loc['a':'b']
Out[16]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.27666171, -0.47203451, -0.01395975],
[-0.36254299, -0.00615357, -0.92306065]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and integer label
In [17]: foo.isel(x=slice(2))
Out[17]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.27666171, -0.47203451, -0.01395975],
[-0.36254299, -0.00615357, -0.92306065]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and coordinate label
In [18]: foo.sel(x=['a', 'b'])
Out[18]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.27666171, -0.47203451, -0.01395975],
[-0.36254299, -0.00615357, -0.92306065]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
Computation¶
Data arrays work very similarly to numpy ndarrays:
In [19]: foo + 10
Out[19]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 10.27666171, 9.52796549, 9.98604025],
[ 9.63745701, 9.99384643, 9.07693935]])
Coordinates:
* y (y) int64 -2 0 2
* x (x) object 'a' 'b'
In [20]: np.sin(foo)
Out[20]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.27314584, -0.45469925, -0.0139593 ],
[-0.35465307, -0.00615353, -0.7974521 ]])
Coordinates:
* y (y) int64 -2 0 2
* x (x) object 'a' 'b'
In [21]: foo.T
Out[21]:
<xray.DataArray 'foo' (y: 3, x: 2)>
array([[ 0.27666171, -0.36254299],
[-0.47203451, -0.00615357],
[-0.01395975, -0.92306065]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
In [22]: foo.sum()
Out[22]:
<xray.DataArray 'foo' ()>
array(-1.501089766692079)
However, aggregation operations can use dimension names instead of axis numbers:
In [23]: foo.mean(dim='x')
Out[23]:
<xray.DataArray 'foo' (y: 3)>
array([-0.04294064, -0.23909404, -0.4685102 ])
Coordinates:
* y (y) int64 -2 0 2
Arithmetic operations broadcast based on dimension name, so you don’t need to insert dummy dimensions for alignment:
In [24]: bar = xray.DataArray(np.random.randn(3), [foo.coords['y']])
In [25]: zzz = xray.DataArray(np.random.randn(4), dims='z')
In [26]: bar
Out[26]:
<xray.DataArray (y: 3)>
array([ 0.8957173 , 0.80524403, -1.20641178])
Coordinates:
* y (y) int64 -2 0 2
In [27]: zzz
Out[27]:
<xray.DataArray (z: 4)>
array([ 2.56564595, 1.43125599, 1.34030885, -1.1702988 ])
Coordinates:
* z (z) int64 0 1 2 3
In [28]: bar + zzz
Out[28]:
<xray.DataArray (y: 3, z: 4)>
array([[ 3.46136325, 2.32697329, 2.23602615, -0.27458149],
[ 3.37088997, 2.23650001, 2.14555288, -0.36505477],
[ 1.35923416, 0.2248442 , 0.13389707, -2.37671058]])
Coordinates:
* y (y) int64 -2 0 2
* z (z) int64 0 1 2 3
GroupBy¶
xray supports grouped operations using a very similar API to pandas:
In [29]: labels = xray.DataArray(['E', 'F', 'E'], [foo.coords['y']], name='labels')
In [30]: labels
Out[30]:
<xray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'],
dtype='|S1')
Coordinates:
* y (y) int64 -2 0 2
In [31]: foo.groupby(labels).mean('y')
Out[31]:
<xray.DataArray 'foo' (x: 2, labels: 2)>
array([[ 0.13135098, -0.47203451],
[-0.64280182, -0.00615357]])
Coordinates:
* x (x) object 'a' 'b'
* labels (labels) |S1 'E' 'F'
In [32]: foo.groupby(labels).apply(lambda x: x - x.min())
Out[32]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 1.19972237, 0. , 0.9091009 ],
[ 0.56051766, 0.46588094, 0. ]])
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 -2 0 2
labels (y) |S1 'E' 'F' 'E'
Convert to pandas¶
A key feature of xray is robust conversion to and from pandas objects:
In [33]: foo.to_series()
Out[33]:
x y
a -2 0.276662
0 -0.472035
2 -0.013960
b -2 -0.362543
0 -0.006154
2 -0.923061
Name: foo, dtype: float64
In [34]: foo.to_pandas()
Out[34]:
y -2 0 2
x
a 0.276662 -0.472035 -0.013960
b -0.362543 -0.006154 -0.923061
Working with weather data¶
Here is an example of how to easily manipulate a toy weather dataset using xray and other recommended Python libraries:
Shared setup:
import xray
import numpy as np
import pandas as pd
import seaborn as sns # pandas aware plotting library
np.random.seed(123)
times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 5 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 5 * np.random.randn(annual_cycle.size, 3)
ds = xray.Dataset({'tmin': (('time', 'location'), tmin_values),
                   'tmax': (('time', 'location'), tmax_values)},
                  {'time': times, 'location': ['IA', 'IN', 'IL']})
Examine a dataset with pandas and seaborn¶
In [1]: ds
Out[1]:
<xray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
Variables:
tmax (time, location) float64 18.15 2.038 7.818 -2.705 7.169 4.621 5.444 6.993 ...
tmin (time, location) float64 -10.21 0.2062 -3.366 -12.35 -7.715 3.435 -16.99 ...
In [2]: ds.to_dataframe().head()
Out[2]:
tmax tmin
location time
IA 2000-01-01 18.154567 -10.208631
2000-01-02 -2.705392 -12.353746
2000-01-03 5.444285 -16.993078
2000-01-04 -0.255829 -9.226395
2000-01-05 -2.067176 2.535651
In [3]: ds.to_dataframe().describe()
Out[3]:
tmax tmin
count 2193.000000 2193.000000
mean 20.187129 9.965787
std 11.693882 11.610836
min -9.250351 -19.214657
25% 10.294622 0.330388
50% 20.000428 9.655506
75% 30.017643 19.963175
max 48.870196 39.157475
In [4]: ds.mean(dim='location').to_dataframe().plot()
Out[4]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9d9ba0fad0>
In [5]: sns.pairplot(ds[['tmin', 'tmax', 'time.month']].to_dataframe(),
...: vars=ds.vars, hue='time.month')
...:
Out[5]: <seaborn.axisgrid.PairGrid at 0x7f9d9b432390>
Probability of freeze by calendar month¶
In [6]: freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')
In [7]: freeze
Out[7]:
<xray.DataArray 'tmin' (location: 3, time.month: 12)>
array([[ 0.83870968, 0.63157895, 0.27419355, ..., 0.09677419,
0.35 , 0.74193548],
[ 0.72580645, 0.70175439, 0.19354839, ..., 0.0483871 ,
0.4 , 0.72580645],
[ 0.82258065, 0.66666667, 0.20967742, ..., 0.01612903,
0.3 , 0.67741935]])
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* time.month (time.month) int32 1 2 3 4 5 6 7 8 9 10 11 12
In [8]: freeze.to_series().unstack('location').plot()
Out[8]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9d9ade2f50>
Monthly averaging¶
def year_month(xray_obj):
    """Given an xray object with a 'time' coordinate, return a DataArray
    with values given by the first date of the month in which each time
    falls.
    """
    time = xray_obj.coords['time']
    values = time.to_index().to_period('M').to_timestamp()
    return xray.DataArray(values, [time], name='year_month')
In [9]: monthly_avg = ds.groupby(year_month(ds)).mean()
In [10]: monthly_avg.to_dataframe().plot(style='s-')
Out[10]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9d9adafc50>
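The pandas machinery that year_month relies on, snapping each timestamp to the first day of its calendar month via to_period, can be checked in isolation:

```python
import pandas as pd

times = pd.to_datetime(["2000-01-15", "2000-01-31", "2000-02-01"])

# to_period('M') maps each timestamp to its month; to_timestamp()
# (default how='start') then gives the first date of that month
month_start = times.to_period("M").to_timestamp()

print(list(month_start.strftime("%Y-%m-%d")))
# ['2000-01-01', '2000-01-01', '2000-02-01']
```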
Calculate monthly anomalies¶
In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.
In [11]: climatology = ds.groupby('time.month').mean('time')
In [12]: anomalies = ds.groupby('time.month') - climatology
In [13]: anomalies.mean('location').reset_coords(drop=True).to_dataframe().plot()
Out[13]: <matplotlib.axes._subplots.AxesSubplot at 0x7f9d9afcf0d0>
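The same seasonal-cycle removal can be sketched for a single 1-D time series with plain pandas (a simplified analogue; xray’s groupby arithmetic applies it across all locations and dimensions at once):

```python
import numpy as np
import pandas as pd

np.random.seed(0)
times = pd.date_range("2000-01-01", "2001-12-31")
tmin = pd.Series(np.random.randn(times.size), index=times)

# Climatology: the mean for each calendar month across both years
# Anomalies: subtract each observation's monthly climatology
anomalies = tmin - tmin.groupby(tmin.index.month).transform("mean")

# By construction, anomalies average to zero within every calendar month
resid = anomalies.groupby(anomalies.index.month).mean()
print(np.allclose(resid, 0.0))  # True
```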
Installation¶
Required dependencies:
- numpy
- pandas
Optional dependencies:
- netCDF4: recommended if you want to use xray for reading or writing files
- scipy: used as a fallback for reading/writing netCDF3
- pydap: used as a fallback for accessing OPeNDAP
- cyordereddict: speeds up most internal operations
Before you install xray, be sure you have the required dependencies (numpy and pandas) installed. The easiest way to do so is to use the Anaconda python distribution.
To install xray, use pip:
pip install xray
To run the test suite after installing xray, install nose and run nosetests xray.
Data Structures¶
DataArray¶
xray.DataArray is xray’s implementation of a labeled, multi-dimensional array. It has several key properties:
- values: a numpy.ndarray holding the array’s values
- dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
- coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata (attributes)
xray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.
DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xray makes no attempt to interpret them, and propagates them only in unambiguous cases (see FAQ, What is your approach to metadata?).
Creating a DataArray¶
The DataArray constructor takes a multi-dimensional array of values (e.g., a numpy ndarray), a list or dictionary of coordinate labels, and a list of dimension names:
In [1]: data = np.random.rand(4, 3)
In [2]: locs = ['IA', 'IL', 'IN']
In [3]: times = pd.date_range('2000-01-01', periods=4)
In [4]: foo = xray.DataArray(data, coords=[times, locs], dims=['time', 'space'])
In [5]: foo
Out[5]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
All of these arguments (except for data) are optional, and will be filled in with default values:
In [6]: xray.DataArray(data)
Out[6]:
<xray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3
* dim_1 (dim_1) int64 0 1 2
As you can see, dimensions and coordinate arrays corresponding to each dimension are always present. This behavior is similar to pandas, which fills in index values in the same way.
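The pandas behavior being mirrored here is the default integer index that appears when you supply none (shown with a plain DataFrame for comparison):

```python
import numpy as np
import pandas as pd

# With no index or columns given, pandas fills in 0, 1, 2, ... labels,
# just as xray fills in default dim_0, dim_1 coordinates above
df = pd.DataFrame(np.zeros((4, 3)))

print(list(df.index))    # [0, 1, 2, 3]
print(list(df.columns))  # [0, 1, 2]
```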
The data array constructor also supports supplying coords as a list of (dim, ticks[, attrs]) pairs with length equal to the number of dimensions:
In [7]: xray.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Yet another option is to supply coords in the form of a dictionary where the values are scalar values, 1D arrays or tuples (in the same form as the dataarray constructor). This form lets you supply coordinates other than those corresponding to dimensions (more on these later):
In [8]: xray.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
...: 'ranking': ('space', [1, 2, 3])},
...: dims=['time', 'space'])
...:
Out[8]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
ranking (space) int64 1 2 3
* space (space) |S2 'IA' 'IL' 'IN'
const int64 42
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
You can also create a DataArray by supplying a pandas Series, DataFrame or Panel, in which case any non-specified arguments in the DataArray constructor will be filled in from the pandas object:
In [9]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])
In [10]: df.index.name = 'abc'
In [11]: df.columns.name = 'xyz'
In [12]: df
Out[12]:
xyz x y
abc
a 0 2
b 1 3
In [13]: xray.DataArray(df)
Out[13]:
<xray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
[1, 3]])
Coordinates:
* abc (abc) object 'a' 'b'
* xyz (xyz) object 'x' 'y'
xray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see GH164). However, the alternate from_series constructor will automatically unpack any hierarchical indexes it encounters by expanding the series into a multi-dimensional array, as described in Working with pandas.
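The unpacking that from_series performs corresponds to pandas’ own unstack: a Series with a two-level hierarchical index becomes a 2-D table (a plain-pandas sketch of the same expansion):

```python
import pandas as pd

# A series indexed by a two-level MultiIndex with levels 'x' and 'y'
index = pd.MultiIndex.from_product([["a", "b"], [0, 1, 2]], names=["x", "y"])
series = pd.Series(range(6), index=index)

# Unstacking the inner level yields a 2-D DataFrame; from_series expands
# the same hierarchy into the dimensions 'x' and 'y' of a DataArray
table = series.unstack("y")

print(table.shape)        # (2, 3)
print(table.loc["b", 2])  # 5
```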
DataArray properties¶
Let’s take a look at the important properties on our array:
In [14]: foo.values
Out[14]:
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
In [15]: foo.dims
Out[15]: ('time', 'space')
In [16]: foo.coords
Out[16]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
In [17]: foo.attrs
Out[17]: OrderedDict()
In [18]: print(foo.name)
None
You can even modify values in place:
In [19]: foo.values = 1.0 * foo.values
Note
The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xray, use coordinates, or put separate DataArray objects in a single Dataset.
Now fill in some of that missing metadata:
In [20]: foo.name = 'foo'
In [21]: foo.attrs['units'] = 'meters'
In [22]: foo
Out[22]:
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Attributes:
units: meters
The rename() method is another option, returning a new data array:
In [23]: foo.rename('bar')
Out[23]:
<xray.DataArray 'bar' (time: 4, space: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* space (space) |S2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
Attributes:
units: meters
DataArray Coordinates¶
The coords property is dict-like. Individual coordinates can be accessed by name, or even by indexing the data array itself:
In [24]: foo.coords['time']
Out[24]:
<xray.DataArray 'time' (time: 4)>
array(['1999-12-31T18:00:00.000000000-0600',
'2000-01-01T18:00:00.000000000-0600',
'2000-01-02T18:00:00.000000000-0600',
'2000-01-03T18:00:00.000000000-0600'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
In [25]: foo['time']
Out[25]:
<xray.DataArray 'time' (time: 4)>
array(['1999-12-31T18:00:00.000000000-0600',
'2000-01-01T18:00:00.000000000-0600',
'2000-01-02T18:00:00.000000000-0600',
'2000-01-03T18:00:00.000000000-0600'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
These are also DataArray objects, which contain tick-labels for each dimension.
Coordinates can also be set or removed by using dictionary-like syntax:
In [26]: foo['ranking'] = ('space', [1, 2, 3])
In [27]: foo.coords
Out[27]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
ranking (space) int64 1 2 3
In [28]: del foo['ranking']
In [29]: foo.coords
Out[29]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
xray also supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”:
In [30]: foo['time.month']
Out[30]:
<xray.DataArray 'time.month' (time: 4)>
array([1, 1, 1, 1], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
In [31]: foo['time.dayofyear']
Out[31]:
<xray.DataArray 'time.dayofyear' (time: 4)>
array([1, 2, 3, 4], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
Dataset¶
xray.Dataset is xray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
In addition to the dict-like interface of the dataset itself, which can be used to access any array in a dataset, datasets have four key properties:
- dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
- vars: a dict-like container of arrays (variables)
- coords: another dict-like container of arrays intended to label points used in vars (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata
The distinction between whether an array falls in variables or coordinates is mostly semantic: coordinates are intended for constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in variables. Dictionary-like access on a dataset will supply arrays found in either category. However, the distinction does have important implications for indexing and computation.
Here is an example of how we might structure a dataset for a weather forecast:
In this example, it would be natural to call temperature and precipitation “variables” and all the other arrays “coordinates” because they label the points along the dimensions. (see [1] for more background on this example).
Creating a Dataset¶
To make a Dataset from scratch, supply dictionaries for any variables, coordinates and attributes you would like to insert into the dataset.
For the vars and coords arguments, keys should be the name of the variable or coordinate, and values should be scalars, 1d arrays or tuples of the form (dims, data[, attrs]) sufficient to label each array:
- dims should be a sequence of strings.
- data should be a numpy.ndarray (or array-like object) with a number of dimensions equal to the length of dims.
- attrs is an arbitrary Python dictionary for storing metadata associated with a particular array.
Let’s create some fake data for the example we show above:
In [32]: temp = 15 + 8 * np.random.randn(2, 2, 3)
In [33]: precip = 10 * np.random.rand(2, 2, 3)
In [34]: lon = [[-99.83, -99.32], [-99.79, -99.23]]
In [35]: lat = [[42.25, 42.21], [42.63, 42.59]]
# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
In [36]: ds = xray.Dataset({'temperature': (['x', 'y', 'time'], temp),
....: 'precipitation': (['x', 'y', 'time'], precip)},
....: coords={'lon': (['x', 'y'], lon),
....: 'lat': (['x', 'y'], lat),
....: 'time': pd.date_range('2014-09-06', periods=3),
....: 'reference_time': pd.Timestamp('2014-09-05')})
....:
In [37]: ds
Out[37]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 8.615 7.536 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
Notice that we did not explicitly include coordinates for the “x” or “y” dimensions, so they were filled in with arrays of ascending integers of the proper length.
We can also pass xray.DataArray objects as values in the dictionary instead of tuples:
In [38]: xray.Dataset({'bar': foo})
Out[38]:
<xray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Variables:
bar (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
Or directly convert a data array into a dataset:
You can also create an dataset from a pandas.DataFrame with Dataset.from_dataframe or from a netCDF file on disk with open_dataset(). See Working with pandas and Serialization and IO.
Dataset contents¶
Dataset implements the Python dictionary interface, with values given by xray.DataArray objects:
In [39]: 'temperature' in ds
Out[39]: True
In [40]: ds.keys()
Out[40]:
['precipitation',
'temperature',
'lat',
'reference_time',
'lon',
'time',
'x',
'y']
In [41]: ds['temperature']
Out[41]:
<xray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.04056581, 23.57443046, 20.7724413 ],
[ 9.34583093, 6.68340012, 17.17487908]],
[[ 11.60022136, 19.5361628 , 17.20985615],
[ 6.30079447, 9.61048234, 15.90918728]]])
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
The valid keys include each listed coordinate and variable.
Variables and coordinates are also contained separately in the vars and coords dictionary-like attributes:
In [42]: ds.vars
Out[42]:
Variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 8.615 7.536 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
In [43]: ds.coords
Out[43]:
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Finally, like data arrays, datasets also store arbitrary metadata in the form of attributes:
In [44]: ds.attrs
Out[44]: OrderedDict()
In [45]: ds.attrs['title'] = 'example attribute'
In [46]: ds
Out[46]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 8.615 7.536 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
Attributes:
title: example attribute
xray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or numpy.ndarray objects.
Dictionary like methods¶
We can update a dataset in-place using Python’s standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:
In [47]: ds = xray.Dataset()
In [48]: ds['temperature'] = (('x', 'y', 'time'), temp)
In [49]: ds['precipitation'] = (('x', 'y', 'time'), precip)
In [50]: ds.coords['lat'] = (('x', 'y'), lat)
In [51]: ds.coords['lon'] = (('x', 'y'), lon)
In [52]: ds.coords['time'] = pd.date_range('2014-09-06', periods=3)
In [53]: ds.coords['reference_time'] = pd.Timestamp('2014-09-05')
To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update().
Creating modified datasets¶
You can copy a Dataset by using the copy() method:
In [54]: ds2 = ds.copy()
In [55]: del ds2['time']
In [56]: ds2
Out[56]:
<xray.Dataset>
Dimensions: (x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Variables:
*empty*
By default, the copy is shallow, so only the container will be copied: the arrays in the Dataset will still be stored in the same underlying numpy.ndarray objects. You can copy all data by supplying the argument deep=True.
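The shallow-copy semantics can be illustrated with a plain dictionary of numpy arrays, which is roughly what a shallow Dataset copy behaves like (a simplified sketch, not xray’s actual internals):

```python
import numpy as np

data = {"tmin": np.array([1.0, 2.0]), "tmax": np.array([3.0, 4.0])}

shallow = dict(data)                            # new container, same arrays
deep = {k: v.copy() for k, v in data.items()}   # new container, new arrays

# Mutating the original data shows through the shallow copy only
data["tmin"][0] = -99.0

print(shallow["tmin"][0])  # -99.0, the underlying array is shared
print(deep["tmin"][0])     # 1.0, the deep copy is unaffected
```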
You can also select an explicit list of variables by indexing with a list of names, or drop variables using the drop_vars() method, to return a new Dataset. These operations keep around coordinates:
In [57]: ds[['temperature']]
Out[57]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
Variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
In [58]: ds[['x']]
Out[58]:
<xray.Dataset>
Dimensions: (x: 2)
Coordinates:
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
Variables:
*empty*
If a dimension name is given as an argument to drop_vars, it also drops all variables that use that dimension:
In [59]: ds.drop_vars('time')
Out[59]:
<xray.Dataset>
Dimensions: (x: 2, y: 2)
Coordinates:
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Variables:
*empty*
Another useful option is the ability to rename the variables in a dataset:
In [60]: ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
Out[60]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Variables:
temp (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
precip (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 8.615 7.536 ...
Coordinates¶
Coordinates are ancillary arrays stored for DataArray and Dataset objects in the coords attribute:
In [61]: ds.coords
Out[61]:
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Unlike attributes, xray does interpret and persist coordinates in operations that transform xray objects.
One dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array) take on a special meaning in xray. They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these “dimension” coordinates use a pandas.Index internally to store their values.
Other than for indexing, xray does not make any direct use of the values associated with coordinates. Coordinates with names not matching a dimension are not used for alignment or indexing, nor are they required to match when doing arithmetic (see Alignment and coordinates).
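Because dimension coordinates are backed by a pandas.Index, label lookups reduce to fast index operations. A minimal sketch using pandas directly (the structure xray stores under the hood) shows the label-to-position mapping that powers .sel and .loc:

```python
import pandas as pd

# a dimension coordinate is backed by a pandas.Index
space = pd.Index(['IA', 'IL', 'IN'], name='space')

# single label -> integer position, as used by label-based indexing
print(space.get_loc('IL'))               # 1

# vectorized lookup of several labels at once
print(list(space.get_indexer(['IN', 'IA'])))  # [2, 0]
```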
Converting to pandas.Index¶
To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:
In [62]: ds['time'].to_index()
Out[62]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-06, ..., 2014-09-08]
Length: 3, Freq: D, Timezone: None
A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects:
In [63]: ds.indexes
Out[63]:
time: <class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-06, ..., 2014-09-08]
Length: 3, Freq: D, Timezone: None
x: Int64Index([0, 1], dtype='int64')
y: Int64Index([0, 1], dtype='int64')
Switching between coordinates and variables¶
To entirely add or remove coordinate arrays, you can use dictionary-like syntax, as shown above. To convert back and forth between coordinates and variables, use the set_coords() and reset_coords() methods:
In [64]: ds.reset_coords()
Out[64]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 8.615 7.536 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
In [65]: ds.set_coords(['temperature', 'precipitation'])
Out[65]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 11.6 19.54 ...
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 8.615 7.536 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Variables:
*empty*
In [66]: ds['temperature'].reset_coords(drop=True)
Out[66]:
<xray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.04056581, 23.57443046, 20.7724413 ],
[ 9.34583093, 6.68340012, 17.17487908]],
[[ 11.60022136, 19.5361628 , 17.20985615],
[ 6.30079447, 9.61048234, 15.90918728]]])
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xray cannot store a coordinate and a variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.
Converting into datasets¶
Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:
In [67]: ds.coords.to_dataset()
Out[67]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
Variables:
*empty*
The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):
In [68]: alt = xray.Dataset(coords={'z': [10], 'lat': 0, 'lon': 0})
In [69]: ds.coords.merge(alt.coords)
Out[69]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2, z: 1)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
* z (z) int64 10
Variables:
*empty*
The coords.merge method may be useful if you want to implement your own binary operations that act on xray objects. In the future, we hope to write more helper functions so that you can easily make your functions act like xray’s built-in arithmetic.
[1] Latitude and longitude are 2D arrays because the dataset uses projected coordinates. reference_time refers to the reference time at which the forecast was made, rather than time, which is the valid time for which the forecast applies.
Indexing and selecting data¶
Similarly to pandas objects, xray objects support both integer and label based lookups along each dimension. However, xray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.
In total, xray supports four different kinds of indexing, as described below and summarized in this table:
Dimension lookup | Index lookup | DataArray syntax | Dataset syntax |
---|---|---|---|
Positional | By integer | arr[:, 0] | not available |
Positional | By label | arr.loc[:, 'IA'] | not available |
By name | By integer | arr.isel(space=0) | ds.isel(space=0) |
By name | By label | arr.sel(space='IA') | ds.sel(space='IA') |
Positional indexing¶
Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:
In [1]: arr = xray.DataArray(np.random.rand(4, 3),
...: [('time', pd.date_range('2000-01-01', periods=4)),
...: ('space', ['IA', 'IL', 'IN'])])
...:
In [2]: arr[:2]
Out[2]:
<xray.DataArray (time: 2, space: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601],
[ 0.89723652, 0.37674972, 0.33622174]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
In [3]: arr[0, 0]
Out[3]:
<xray.DataArray ()>
array(0.12696983303810094)
Coordinates:
time datetime64[ns] 2000-01-01
space |S2 'IA'
In [4]: arr[:, [2, 1]]
Out[4]:
<xray.DataArray (time: 4, space: 2)>
array([[ 0.26047601, 0.96671784],
[ 0.33622174, 0.37674972],
[ 0.12310214, 0.84025508],
[ 0.44799682, 0.37301223]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IN' 'IL'
xray also supports label-based indexing, just like pandas. Because we use a pandas.Index under the hood, label based indexing is very fast. To do label based indexing, use the loc attribute:
In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]:
<xray.DataArray (time: 2)>
array([ 0.12696983, 0.89723652])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xray is inclusive of both the start and stop bounds.
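Since the slicing semantics come straight from pandas, the inclusive-stop behavior can be verified with a plain pandas Series:

```python
import pandas as pd

s = pd.Series([1, 2, 3, 4],
              index=pd.date_range('2000-01-01', periods=4))

# label-based slicing includes BOTH the start and the stop label
sub = s.loc['2000-01-01':'2000-01-02']
print(len(sub))        # 2 -- the stop label '2000-01-02' is included
print(sub.index[-1])   # 2000-01-02
```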
Setting values with label based indexing is also supported:
In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10
In [7]: arr
Out[7]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.12696983, -10. , -10. ],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Indexing with labeled dimensions¶
With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:
Use a dictionary as the argument for array positional or label based array indexing:
# index by integer array indices
In [8]: arr[dict(space=0, time=slice(None, 2))]
Out[8]:
<xray.DataArray (time: 2)>
array([ 0.12696983, 0.89723652])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
Out[9]:
<xray.DataArray (time: 2, space: 3)>
array([[ 0.12696983, -10. , -10. ],
[ 0.89723652, 0.37674972, 0.33622174]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
Use the sel() and isel() convenience methods:
# index by integer array indices
In [10]: arr.isel(space=0, time=slice(None, 2))
Out[10]:
<xray.DataArray (time: 2)>
array([ 0.12696983, 0.89723652])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
Out[11]:
<xray.DataArray (time: 2, space: 3)>
array([[ 0.12696983, -10. , -10. ],
[ 0.89723652, 0.37674972, 0.33622174]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python slice() objects or 1-dimensional arrays.
Note
We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, Python does not yet support indexing with keyword arguments like arr[space=0].
Warning
Do not try to assign values when using isel or sel:
# DO NOT do this
arr.isel(space=0) = 0
Depending on whether the underlying numpy indexing returns a copy or a view, such an assignment may silently fail to modify the original array. Instead, you should use normal index assignment:
# this is safe
arr[dict(space=0)] = 0
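The underlying hazard is numpy’s distinction between basic indexing (which returns a view) and advanced indexing (which returns a copy); assigning into a copy never reaches the original data. A plain numpy sketch of both cases:

```python
import numpy as np

a = np.zeros((2, 3))

# basic indexing returns a view: assignment propagates to `a`
a[0, :] = 1.0
print(a[0, 0])   # 1.0

# advanced (integer-array) indexing returns a copy:
# assignment into the result is silently lost
sel = a[[1], :]
sel[:] = 99.0
print(a[1, 0])   # still 0.0 -- the original never changed
```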
Dataset indexing¶
We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:
In [12]: ds = arr.to_dataset()
In [13]: ds.isel(space=[0], time=[0])
Out[13]:
<xray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Variables:
None (time, space) float64 0.127
In [14]: ds.sel(time='2000-01-01')
Out[14]:
<xray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Variables:
None (space) float64 0.127 -10.0 -10.0
Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with labeled dimensions:
In [15]: ds[dict(space=[0], time=[0])]
Out[15]:
<xray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Variables:
None (time, space) float64 0.127
In [16]: ds.loc[dict(time='2000-01-01')]
Out[16]:
<xray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Variables:
None (space) float64 0.127 -10.0 -10.0
Using indexing to assign values to a subset of dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.
Indexing details¶
Like pandas, whether array indexing returns a view or a copy of the underlying data depends entirely on numpy:
- Indexing with a single label or a slice returns a view.
- Indexing with a vector of array labels returns a copy.
Attributes are persisted in array indexing:
In [17]: arr2 = arr.copy()
In [18]: arr2.attrs['units'] = 'meters'
In [19]: arr2[0, 0].attrs
Out[19]: OrderedDict([('units', 'meters')])
Indexing with xray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s advanced broadcasting. This means you can do indexing like this, which would require slightly more awkward syntax with numpy arrays:
In [20]: arr[arr['time.day'] > 1, arr['space'] != 'IL']
Out[20]:
<xray.DataArray (time: 3, space: 2)>
array([[ 0.89723652, 0.33622174],
[ 0.45137647, 0.12310214],
[ 0.5430262 , 0.44799682]])
Coordinates:
* time (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IN'
This is a much simpler model than numpy’s advanced indexing, and is basically the only model that works for labeled arrays. If you would like to do array indexing, you can always index .values directly instead:
In [21]: arr.values[arr.values > 0.5]
Out[21]: array([ 0.89723652, 0.84025508, 0.5430262 ])
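In plain numpy, the orthogonal behavior that xray applies by default requires np.ix_; this sketch contrasts the two indexing models:

```python
import numpy as np

a = np.arange(12).reshape(3, 4)
rows = np.array([0, 2])
cols = np.array([1, 3])

# numpy's advanced indexing pairs the indexers elementwise: shape (2,)
paired = a[rows, cols]
print(paired)          # [ 1 11]

# orthogonal indexing (what xray does) takes the outer product of indexers
orth = a[np.ix_(rows, cols)]
print(orth.shape)      # (2, 2)
print(orth)            # [[ 1  3] [ 9 11]]
```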
Align and reindex¶
xray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinates corresponding to dimensions. Values are kept for labels found in both the original and the new index, and values corresponding to new labels not found in the original object are filled in with NaN.
To reindex a particular dimension, use reindex():
In [22]: arr.reindex(space=['IA', 'CA'])
Out[22]:
<xray.DataArray (time: 4, space: 2)>
array([[ 0.12696983, nan],
[ 0.89723652, nan],
[ 0.45137647, nan],
[ 0.5430262 , nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'CA'
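The semantics are the same as pandas reindexing, which xray builds on; a minimal pandas sketch of the same fill-with-NaN behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([0.13, 0.90, 0.45], index=['IA', 'IL', 'IN'])

# labels present in the original are kept; new labels get NaN
r = s.reindex(['IA', 'CA'])
print(r['IA'])             # 0.13
print(np.isnan(r['CA']))   # True
```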
The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:
In [23]: foo = arr.rename('foo')
In [24]: baz = (10 * arr[:2, :2]).rename('baz')
In [25]: baz
Out[25]:
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.26969833, -100. ],
[ 8.97236524, 3.76749716]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL'
Reindexing foo with baz selects out the first two values along each dimension:
In [26]: foo.reindex_like(baz)
Out[26]:
<xray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.12696983, -10. ],
[ 0.89723652, 0.37674972]])
Coordinates:
* space (space) object 'IA' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02
The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:
In [27]: baz.reindex_like(foo)
Out[27]:
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.26969833, -100. , nan],
[ 8.97236524, 3.76749716, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN'
The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:
In [28]: xray.align(foo, baz, join='inner')
Out[28]:
(<xray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.12696983, -10. ],
[ 0.89723652, 0.37674972]])
Coordinates:
* space (space) object 'IA' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02,
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.26969833, -100. ],
[ 8.97236524, 3.76749716]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL')
In [29]: xray.align(foo, baz, join='outer')
Out[29]:
(<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.12696983, -10. , -10. ],
[ 0.89723652, 0.37674972, 0.33622174],
[ 0.45137647, 0.84025508, 0.12310214],
[ 0.5430262 , 0.37301223, 0.44799682]])
Coordinates:
* space (space) object 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04,
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.26969833, -100. , nan],
[ 8.97236524, 3.76749716, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN')
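The join semantics match pandas align, from which this API is borrowed; a standalone pandas sketch of inner versus outer joins:

```python
import pandas as pd

foo = pd.Series([1.0, 2.0, 3.0], index=['IA', 'IL', 'IN'])
baz = pd.Series([10.0, 20.0], index=['IA', 'IL'])

# inner join keeps only the shared labels
fi, bi = foo.align(baz, join='inner')
print(list(fi.index))   # ['IA', 'IL']

# outer join keeps the union of labels, filling gaps with NaN
fo, bo = foo.align(baz, join='outer')
print(len(bo), int(bo.isnull().sum()))   # 3 1
```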
Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:
In [30]: ds
Out[30]:
<xray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Variables:
None (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 0.4514 0.8403 ...
In [31]: ds.reindex_like(baz)
Out[31]:
<xray.Dataset>
Dimensions: (space: 2, time: 2)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL'
Variables:
None (time, space) float64 0.127 -10.0 0.8972 0.3767
In [32]: other = xray.DataArray(['a', 'b', 'c'], dims='other')
# this is a no-op, because there are no shared dimension names
In [33]: ds.reindex_like(other)
Out[33]:
<xray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Variables:
None (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 0.4514 0.8403 ...
Computation¶
The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.
Basic array math¶
Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:
In [1]: arr = xray.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr - 3
Out[2]:
<xray.DataArray (x: 2, y: 3)>
array([[-2.5308877 , -3.28286334, -4.5090585 ],
[-4.13563237, -1.78788797, -3.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [3]: abs(arr)
Out[3]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , 0.28286334, 1.5090585 ],
[ 1.13563237, 1.21211203, 0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
You can also use any of numpy’s or scipy’s many ufunc functions directly on a DataArray:
In [4]: np.sin(arr)
Out[4]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.45209466, -0.27910634, -0.99809483],
[-0.90680094, 0.9363595 , -0.17234978]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data arrays also implement many numpy.ndarray methods:
In [5]: arr.round(2)
Out[5]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.47, -0.28, -1.51],
[-1.14, 1.21, -0.17]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [6]: arr.T
Out[6]:
<xray.DataArray (y: 3, x: 2)>
array([[ 0.4691123 , -1.13563237],
[-0.28286334, 1.21211203],
[-1.5090585 , -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Missing values¶
xray objects borrow the isnull(), notnull(), count() and dropna() methods for working with missing data from pandas:
In [7]: x = xray.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'])
In [8]: x.isnull()
Out[8]:
<xray.DataArray (x: 5)>
array([False, False, True, True, False], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [9]: x.notnull()
Out[9]:
<xray.DataArray (x: 5)>
array([ True, True, False, False, True], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [10]: x.count()
Out[10]:
<xray.DataArray ()>
array(3)
In [11]: x.dropna(dim='x')
Out[11]:
<xray.DataArray (x: 3)>
array([ 0., 1., 2.])
Coordinates:
* x (x) int64 0 1 4
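These methods are thin wrappers over numpy NaN checks; roughly equivalent operations in plain numpy look like:

```python
import numpy as np

x = np.array([0.0, 1.0, np.nan, np.nan, 2.0])

isnull = np.isnan(x)          # like x.isnull()
notnull = ~np.isnan(x)        # like x.notnull()
count = int(notnull.sum())    # like x.count()
dropped = x[notnull]          # like x.dropna(dim='x')

print(count)     # 3
print(dropped)   # [0. 1. 2.]
```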
Aggregation¶
Aggregation methods from ndarray have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):
In [12]: arr.sum(dim='x')
Out[12]:
<xray.DataArray (y: 3)>
array([-0.66652007, 0.92924868, -1.68227315])
Coordinates:
* y (y) int64 10 20 30
In [13]: arr.std(['x', 'y'])
Out[13]:
<xray.DataArray ()>
array(0.9156385956757354)
In [14]: arr.min()
Out[14]:
<xray.DataArray ()>
array(-1.5090585031735124)
If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:
In [15]: arr.get_axis_num('y')
Out[15]: 1
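Conceptually, get_axis_num is just a positional lookup of the dimension name in the array’s dims tuple. A rough standalone equivalent (get_axis_num here is a hypothetical helper, not the xray method itself):

```python
import numpy as np

# hypothetical helper mirroring what DataArray.get_axis_num does
def get_axis_num(dims, dim):
    """Return the axis number for a named dimension."""
    return dims.index(dim)

dims = ('x', 'y')
print(get_axis_num(dims, 'y'))   # 1

# handy when wrapping axis-based numpy code:
arr = np.zeros((2, 3))
print(arr.sum(axis=get_axis_num(dims, 'y')).shape)   # (2,)
```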
To perform NA-skipping aggregations, pass NA-aware numpy functions directly to the reduce method:
In [16]: arr.reduce(np.nanmean, dim='y')
Out[16]:
<xray.DataArray (x: 2)>
array([-0.44093652, -0.032245 ])
Coordinates:
* x (x) |S1 'a' 'b'
Warning
Currently, xray uses the standard ndarray methods which do not automatically skip missing values, but we expect to switch the default to NA skipping versions (like pandas) in a future version (GH130).
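The difference between the standard and NA-aware numpy reductions is easy to demonstrate directly:

```python
import numpy as np

a = np.array([[1.0, np.nan],
              [3.0, 4.0]])

# the standard reduction propagates NaN
print(a.mean(axis=1))        # [nan 3.5]

# the NA-aware version skips missing values
print(np.nanmean(a, axis=1))  # [1.  3.5]
```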
Broadcasting by dimension name¶
DataArray objects automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as is commonly done in numpy with np.reshape() or np.newaxis.
This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:
In [17]: a = xray.DataArray([1, 2], [('x', ['a', 'b'])])
In [18]: a
Out[18]:
<xray.DataArray (x: 2)>
array([1, 2])
Coordinates:
* x (x) |S1 'a' 'b'
In [19]: b = xray.DataArray([-1, -2, -3], [('y', [10, 20, 30])])
In [20]: b
Out[20]:
<xray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
* y (y) int64 10 20 30
With xray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:
In [21]: a * b
Out[21]:
<xray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
[-2, -4, -6]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
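In raw numpy, the same result requires manually inserting length-1 dimensions; this is exactly the bookkeeping that dimension names eliminate:

```python
import numpy as np

a = np.array([1, 2])        # conceptually dims ('x',)
b = np.array([-1, -2, -3])  # conceptually dims ('y',)

# expand each array to dims ('x', 'y') by hand, then multiply
result = a[:, np.newaxis] * b[np.newaxis, :]
print(result.shape)   # (2, 3)
print(result)
# [[-1 -2 -3]
#  [-2 -4 -6]]
```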
Moreover, dimensions are always reordered to the order in which they first appeared:
In [22]: c = xray.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])
In [23]: c
Out[23]:
<xray.DataArray (y: 3, x: 2)>
array([[0, 1],
[2, 3],
[4, 5]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [24]: a + c
Out[24]:
<xray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
[3, 5, 7]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
This means, for example, that subtracting an array from its own transpose always yields zero:
In [25]: c - c.T
Out[25]:
<xray.DataArray (y: 3, x: 2)>
array([[0, 0],
[0, 0],
[0, 0]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Alignment and coordinates¶
For now, performing most binary operations on xray objects requires that all index coordinates (that is, coordinates with the same name as a dimension, marked by *) have the same values:
In [26]: arr + arr[:1]
ValueError: coordinate 'x' is not aligned
However, xray does have shortcuts (copied from pandas) that make aligning DataArray and Dataset objects easy and fast.
In [27]: a, b = xray.align(arr, arr[:1])
In [28]: a + b
Out[28]:
<xray.DataArray (x: 1, y: 3)>
array([[ 0.9382246 , -0.56572669, -3.01811701]])
Coordinates:
* y (y) int64 10 20 30
* x (x) object 'a'
See Align and reindex for more details.
Warning
pandas does index based alignment automatically when doing math, using join='outer'. xray doesn’t have automatic alignment yet, but we do intend to enable it in a future version (GH186). Unlike pandas, we expect to default to join='inner'.
Although index coordinates are required to match exactly, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalars:
In [29]: arr[0]
Out[29]:
<xray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
In [30]: arr[1]
Out[30]:
<xray.DataArray (y: 3)>
array([-1.13563237, 1.21211203, -0.17321465])
Coordinates:
x |S1 'b'
* y (y) int64 10 20 30
# notice that the scalar coordinate 'x' is silently dropped
In [31]: arr[1] - arr[0]
Out[31]:
<xray.DataArray (y: 3)>
array([-1.60474467, 1.49497537, 1.33584385])
Coordinates:
* y (y) int64 10 20 30
Still, xray will persist other coordinates in arithmetic, as long as there are no conflicting values:
# only one argument has the 'x' coordinate
In [32]: arr[0] + 1
Out[32]:
<xray.DataArray (y: 3)>
array([ 1.4691123 , 0.71713666, -0.5090585 ])
Coordinates:
* y (y) int64 10 20 30
x |S1 'a'
# both arguments have the same 'x' coordinate
In [33]: arr[0] - arr[0]
Out[33]:
<xray.DataArray (y: 3)>
array([ 0., 0., 0.])
Coordinates:
* y (y) int64 10 20 30
x |S1 'a'
Math with Datasets¶
Datasets support arithmetic operations by automatically looping over all variables as well as dimensions:
In [34]: ds = xray.Dataset({'x_and_y': (('x', 'y'), np.random.randn(2, 3)),
....: 'x_only': ('x', np.random.randn(2))},
....: coords=arr.coords)
....:
In [35]: ds > 0
Out[35]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Variables:
x_only (x) bool True False
x_and_y (x, y) bool True False False False False True
In [36]: ds.mean(dim='x')
Out[36]:
<xray.Dataset>
Dimensions: (y: 3)
Coordinates:
* y (y) int64 10 20 30
Variables:
x_only float64 0.007392
x_and_y (y) float64 -0.9927 -0.7696 0.105
Datasets have most of the same ndarray methods found on data arrays. Again, these operations loop over all dataset variables:
In [37]: abs(ds)
Out[37]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Variables:
x_only (x) float64 0.7216 0.7068
x_and_y (x, y) float64 0.1192 1.044 0.8618 2.105 0.4949 1.072
transpose() can also be used to reorder dimensions on all variables:
In [38]: ds.transpose('y', 'x')
Out[38]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Variables:
x_only (x) float64 0.7216 -0.7068
x_and_y (y, x) float64 0.1192 -2.105 -1.044 -0.4949 -0.8618 1.072
Unfortunately, a limitation of the current version of numpy means that we cannot override ufuncs for datasets, because datasets cannot be written as a single array [1]. apply() works around this limitation, by applying the given function to each variable in the dataset:
In [39]: ds.apply(np.sin)
Out[39]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Variables:
x_only (x) float64 0.6606 -0.6494
x_and_y (x, y) float64 0.1189 -0.8645 -0.759 -0.8609 -0.475 0.8781
Datasets also use looping over variables for broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset as long as they have aligned indexes:
In [40]: ds + arr
Out[40]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Variables:
x_only (x, y) float64 1.191 0.4387 -0.7875 -1.842 0.5053 -0.88
x_and_y (x, y) float64 0.5883 -1.327 -2.371 -3.24 0.7172 0.8986
Arithmetic between two datasets requires that the datasets also have the same variables:
In [41]: ds2 = xray.Dataset({'x_and_y': 0, 'x_only': 100})
In [42]: ds - ds2
Out[42]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Variables:
x_only (x) float64 -99.28 -100.7
x_and_y (x, y) float64 0.1192 -1.044 -0.8618 -2.105 -0.4949 1.072
There is no shortcut similar to align for aligning variable names, but you may find rename() and drop_vars() useful.
Note
When we enable automatic alignment over indexes, we will probably enable automatic alignment between dataset variables as well.
[1] When numpy 1.10 is released, we should be able to override ufuncs for datasets by making use of __numpy_ufunc__.
GroupBy: split-apply-combine¶
xray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:
- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
Group by operations work on both Dataset and DataArray objects. Currently, you can only group by a single one-dimensional variable (eventually, we hope to remove this limitation).
Split¶
Let’s create a simple example dataset:
In [1]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
...: coords={'x': [10, 20, 30, 40],
...: 'letters': ('x', list('abba'))})
...:
In [2]: arr = ds['foo']
In [3]: ds
Out[3]:
<xray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
* y (y) int64 0 1 2
Variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231 ...
If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back an xray.GroupBy object:
In [4]: ds.groupby('letters')
Out[4]: <xray.core.groupby.DatasetGroupBy at 0x7f9d9a327710>
This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:
In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}
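Since xray copies the pandas API here, the same groups mapping can be reproduced with a plain pandas DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'letters': list('abba'),
                   'foo': [1.0, 2.0, 3.0, 4.0]})

grouped = df.groupby('letters')
groups = {k: list(v) for k, v in grouped.groups.items()}
print(groups)   # {'a': [0, 3], 'b': [1, 2]}
```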
You can also iterate over groups in (label, group) pairs:
In [6]: list(ds.groupby('letters'))
Out[6]:
[('a', <xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 40
letters (x) |S1 'a' 'a'
* y (y) int64 0 1 2
Variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
('b', <xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 20 30
letters (x) |S1 'b' 'b'
* y (y) int64 0 1 2
Variables:
foo (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]
Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.
Apply¶
To apply a function to each group, you can use the flexible xray.GroupBy.apply() method. The resulting objects are automatically concatenated back together along the group axis:
In [7]: def standardize(x):
...: return (x - x.mean()) / x.std()
...:
In [8]: arr.groupby('letters').apply(standardize)
Out[8]:
<xray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.22977845, 1.93741005, -0.72624738],
[ 1.41979574, -0.46019243, -0.60657867],
[-0.19064205, 1.21397989, -1.3763625 ],
[ 0.33941723, -0.30180645, -0.01899499]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:
In [9]: arr.groupby('letters').mean(dim='x')
Out[9]:
<xray.DataArray 'foo' (y: 3, letters: 2)>
array([[ 0.33499802, 0.6743065 ],
[ 0.66986503, 0.6085024 ],
[ 0.35423642, 0.22966194]])
Coordinates:
* letters (letters) |S1 'a' 'b'
* y (y) int64 0 1 2
Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:
In [10]: ds.groupby('x').reduce(np.nanmean)
Out[10]:
<xray.Dataset>
Dimensions: (x: 4)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Variables:
foo (x) float64 0.4514 0.5367 0.4716 0.4547
Grouped arithmetic¶
GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:
In [11]: alt = arr.groupby('letters').mean()
In [12]: alt
Out[12]:
<xray.DataArray 'foo' (letters: 2)>
array([ 0.45303315, 0.50415695])
Coordinates:
* letters (letters) |S1 'a' 'b'
In [13]: ds.groupby('letters') - alt
Out[13]:
<xray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Variables:
foo (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 -0.05278 ...
This last line is roughly equivalent to the following:
results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xray.concat(results, dim='letters')
Squeezing¶
When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:
In [14]: next(iter(arr.groupby('x')))
Out[14]:
(10, <xray.DataArray 'foo' (y: 3)>
array([ 0.12696983, 0.96671784, 0.26047601])
Coordinates:
* y (y) int64 0 1 2
x int64 10
letters |S1 'a')
In [15]: next(iter(arr.groupby('x', squeeze=False)))
Out[15]:
(10, <xray.DataArray 'foo' (x: 1, y: 3)>
array([[ 0.12696983, 0.96671784, 0.26047601]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10
letters (x) |S1 'a')
Although xray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.
You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
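As a point of reference, the squeeze() methods behave like numpy's np.squeeze (with labels preserved in the xray case). A minimal numpy sketch of dropping a length-one dimension:

```python
import numpy as np

# a length-one leading dimension, like a single group obtained
# with squeeze=False
a = np.zeros((1, 3))

# dropping the length-one axis, analogous to DataArray.squeeze('x')
b = np.squeeze(a, axis=0)
print(b.shape)  # (3,)
```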
Combining data¶
Concatenate¶
To combine arrays along an existing or new dimension into a larger array, you can use concat(). concat takes an iterable of DataArray or Dataset objects, as well as a dimension name, and concatenates along that dimension:
In [1]: arr = xray.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr[:, :1]
Out[2]:
<xray.DataArray (x: 2, y: 1)>
array([[ 0.4691123 ],
[-1.13563237]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10
# this resembles how you would use np.concatenate
In [3]: xray.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In addition to combining along an existing dimension, concat can create a new dimension by stacking lower dimensional arrays together:
In [4]: arr[0]
Out[4]:
<xray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xray.concat([arr[0], arr[1]], 'x')
Out[5]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) object 'a' 'b'
If the second argument to concat is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension:
In [6]: xray.concat([arr[0], arr[1]], 'new_dim')
Out[6]:
<xray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* new_dim (new_dim) int64 0 1
x (new_dim) object 'a' 'b'
This is actually the default behavior for concat:
In [7]: xray.concat([arr[0], arr[1]])
Out[7]:
<xray.DataArray (concat_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* concat_dim (concat_dim) int64 0 1
x (concat_dim) object 'a' 'b'
The second argument to concat can also be an Index or DataArray object rather than a string, in which case it is used to label the values along the new dimension:
In [8]: xray.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[8]:
<xray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* new_dim (new_dim) int64 -90 -100
x (new_dim) object 'a' 'b'
Of course, concat also works on Dataset objects:
In [9]: ds = arr.to_dataset(name='foo')
In [10]: xray.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[10]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) object 'a' 'b'
Variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
concat() has a number of options which provide deeper control over which variables and coordinates are concatenated and how it handles conflicting variables between datasets. However, these should rarely be necessary.
Merge and update¶
To combine variables and coordinates between multiple Datasets, you can use the merge() and update() methods. Merge checks for conflicting variables before merging and by default it returns a new Dataset:
In [11]: ds.merge({'hello': ('space', np.arange(3) + 10)})
Out[11]:
<xray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
* space (space) int64 0 1 2
Variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
hello (space) int64 10 11 12
In contrast, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:
In [12]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[12]:
<xray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
* space (space) float64 10.2 9.4 3.9
Variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.
Equals and identical¶
xray objects can be compared by using the equals() and identical() methods. These methods are used by the optional compat argument on concat and merge.
equals checks dimension names, indexes and array values:
In [13]: arr.equals(arr.copy())
Out[13]: True
identical also checks attributes, and the name of each object:
In [14]: arr.identical(arr.rename('bar'))
Out[14]: False
In contrast, the == operation performs element-wise comparison (like numpy):
In [15]: arr == arr.copy()
Out[15]:
<xray.DataArray (x: 2, y: 3)>
array([[ True, True, True],
[ True, True, True]], dtype=bool)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Like pandas objects, two xray objects are still equal or identical if they have missing values marked by NaN in the same locations.
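This NaN handling parallels pandas itself, where Series.equals treats NaN values in matching locations as equal even though element-wise comparison does not. A small pandas sketch of the distinction:

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, np.nan, 3.0])
b = pd.Series([1.0, np.nan, 3.0])

# element-wise comparison treats NaN as unequal (like numpy)
print((a == b).tolist())  # [True, False, True]

# but equals() considers NaN in the same locations to be equal
print(a.equals(b))  # True
```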
Working with pandas¶
One of the most important features of xray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built into pandas itself or provided by pandas-aware libraries such as Seaborn and ggplot.
Hierarchical and tidy data¶
Tabular data is easiest to work with when it meets the criteria for tidy data:
- Each column holds a different variable (coordinates and variables in xray’s terminology).
- Each row holds a different observation.
In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.
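The flattening can be sketched with plain pandas: a 2D array with labeled dimensions becomes a Series indexed by the tensor product of the two coordinate indexes. This is roughly what happens under the hood, shown here with hand-built data:

```python
import numpy as np
import pandas as pd

values = np.arange(6).reshape(2, 3)

# tensor product of the 'x' and 'y' coordinate indexes
index = pd.MultiIndex.from_product([[10, 20], ['a', 'b', 'c']],
                                   names=['x', 'y'])

# flatten the 2D values into a 1D tidy Series
s = pd.Series(values.ravel(), index=index)
print(s.loc[20, 'b'])  # 4
```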
Dataset and DataFrame¶
To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:
In [1]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
...: coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
...: 'along_x': ('x', np.random.randn(2)),
...: 'scalar': 123})
...:
In [2]: ds
Out[2]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) |S1 'a' 'b' 'c'
* x (x) int64 10 20
scalar int64 123
along_x (x) float64 0.1192 -1.044
Variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [3]: df = ds.to_dataframe()
In [4]: df
Out[4]:
foo scalar along_x
x y
10 a 0.469112 123 0.119209
b -0.282863 123 0.119209
c -1.509059 123 0.119209
20 a -1.135632 123 -1.044236
b 1.212112 123 -1.044236
c -0.173215 123 -1.044236
We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of the indexes, which appear in the DataFrame's index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
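For example, unstacking pivots one index level into columns, and stacking is its inverse (shown here on a small hand-built frame rather than the ds above):

```python
import pandas as pd

index = pd.MultiIndex.from_product([[10, 20], ['a', 'b']],
                                   names=['x', 'y'])
df = pd.DataFrame({'foo': [1.0, 2.0, 3.0, 4.0]}, index=index)

# pivot the 'y' index level into columns: one row per x, one column per y
wide = df['foo'].unstack('y')
print(wide.shape)  # (2, 2)

# stack() is the inverse, restoring the flat MultiIndexed form
flat = wide.stack()
print(flat.tolist())  # [1.0, 2.0, 3.0, 4.0]
```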
To create a Dataset from a DataFrame, use the from_dataframe() class method:
In [5]: xray.Dataset.from_dataframe(df)
Out[5]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
scalar (x, y) int64 123 123 123 123 123 123
along_x (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
Notice that that dimensions of variables in the Dataset have now expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.
Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.
DataArray and Series¶
DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:
In [6]: s = ds['foo'].to_series()
In [7]: s
Out[7]:
x y
10 a 0.469112
b -0.282863
c -1.509059
20 a -1.135632
b 1.212112
c -0.173215
Name: foo, dtype: float64
In [8]: xray.DataArray.from_series(s)
Out[8]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:
In [9]: s[::2]
Out[9]:
x y
10 a 0.469112
c -1.509059
20 b 1.212112
Name: foo, dtype: float64
In [10]: xray.DataArray.from_series(s[::2])
Out[10]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.4691123 , nan, -1.5090585 ],
[ nan, 1.21211203, nan]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
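The reindexing step can be sketched with pandas directly: roughly speaking, from_series builds the full tensor-product index from the labels on each level, reindexes (introducing NaN for missing combinations), and reshapes:

```python
import numpy as np
import pandas as pd

# a sparse MultiIndexed Series, like s[::2] above
s = pd.Series([1.0, 3.0, 5.0],
              index=pd.MultiIndex.from_tuples(
                  [(10, 'a'), (10, 'c'), (20, 'b')], names=['x', 'y']))

# full tensor product of the labels on each level
full = pd.MultiIndex.from_product([[10, 20], ['a', 'b', 'c']],
                                  names=['x', 'y'])

# reindexing fills missing combinations with NaN; then reshape to 2D
values = s.reindex(full).values.reshape(2, 3)
print(values)
```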
Multi-dimensional data¶
DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality (i.e., a 1D array is converted to a Series, 2D to DataFrame and 3D to Panel):
In [11]: arr = xray.DataArray(np.random.randn(2, 3),
....: coords=[('x', [10, 20]), ('y', ['a', 'b', 'c'])])
....:
In [12]: df = arr.to_pandas()
In [13]: df
Out[13]:
y a b c
x
10 -0.861849 -2.104569 -0.494929
20 1.071804 0.721555 -0.706771
To perform the inverse operation of converting any pandas objects into a data array with the same shape, simply use the DataArray constructor:
In [14]: xray.DataArray(df)
Out[14]:
<xray.DataArray (x: 2, y: 3)>
array([[-0.86184896, -2.10456922, -0.49492927],
[ 1.07180381, 0.72155516, -0.70677113]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
xray objects do not yet support hierarchical indexes, so if your data has a hierarchical index, you will either need to unstack it first or use the from_series() or from_dataframe() constructors described above.
Serialization and IO¶
xray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.
Pickle¶
The simplest way to serialize an xray object is to use Python’s built-in pickle module:
In [1]: import cPickle as pickle
In [2]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
...: coords={'x': [10, 20, 30, 40],
...: 'y': pd.date_range('2000-01-01', periods=5),
...: 'z': ('x', list('abcd'))})
...:
# use the highest protocol (-1) because it is way faster than the default
# text based pickle format
In [3]: pkl = pickle.dumps(ds, protocol=-1)
In [4]: pickle.loads(pkl)
Out[4]:
<xray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05
* x (x) int64 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231 ...
Pickle support is important because it doesn’t require any external libraries and lets you use xray objects with Python modules like multiprocessing. However, there are two important caveats:
- To simplify serialization, xray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
- Pickle will only work as long as the internal data structure of xray objects remains unchanged. Because the internal design of xray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xray will work in future versions.
netCDF¶
Currently, the only disk based serialization format that xray directly supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.
Reading and writing netCDF files with xray requires the Python-netCDF4 library.
We can save a Dataset to disk using the Dataset.to_netcdf method:
In [5]: ds.to_netcdf('saved_on_disk.nc')
By default, the file is saved as netCDF4.
We can load netCDF files to create a new Dataset using the open_dataset() function:
In [6]: ds_disk = xray.open_dataset('saved_on_disk.nc')
In [7]: ds_disk
Out[7]:
<xray.Dataset>
Dimensions: (x: 4, y: 5)
Index Coordinates:
x (x) int64 10 20 30 40
y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 2000-01-05
Variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231 ...
z (x) |S1 'a' 'b' 'c' 'd'
A dataset can also be loaded from a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument.
Data is loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until necessary. For an example of how these lazy arrays work, see the OPeNDAP section below.
Tip
xray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intensive operations, it is usually a good idea to load a dataset entirely into memory by using the load_data() method:
In [8]: ds_disk.load_data()
Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:
# this automatically closes the dataset after use
In [9]: with xray.open_dataset('saved_on_disk.nc') as ds:
... print(ds.keys())
Out[9]: [u'foo', u'y', u'x', u'z']
Note
Although xray provides reasonable support for incremental reads of files on disk, it does not yet support incremental writes, which is important for dealing with datasets that do not fit into memory. This is a significant shortcoming that we hope to resolve (GH199) by adding the ability to create Dataset objects directly linked to a netCDF file on disk.
NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “_FillValue” attributes). If the argument decode_cf=True (default) is given to open_dataset, xray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
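For intuition, decoding a CF "units" attribute amounts to adding the stored numbers, as timedeltas, to the reference date. The following is a simplified pandas sketch of the datetime part of CF decoding; the real decoding also handles other step sizes, calendars, and packing attributes:

```python
import pandas as pd

raw = [0, 1, 2]                      # integers as stored in the netCDF file
units = 'days since 2000-01-01'      # the CF "units" attribute

# split the units string into the step and the reference date
step, _, origin = units.partition(' since ')
assert step == 'days'  # this sketch only handles a daily step

times = pd.Timestamp(origin) + pd.to_timedelta(raw, unit='D')
print(times)  # a DatetimeIndex starting at 2000-01-01
```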
You can view this encoding information and control the details of how xray serializes objects, by viewing and manipulating the DataArray.encoding attribute:
In [10]: ds_disk['y'].encoding
Out[10]:
{'calendar': u'proleptic_gregorian',
'chunksizes': None,
'complevel': 0,
'contiguous': True,
'dtype': dtype('float64'),
'fletcher32': False,
'least_significant_digit': None,
'shuffle': False,
'source': 'saved_on_disk.nc',
'units': u'days since 2000-01-01 00:00:00',
'zlib': False}
OPeNDAP¶
xray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.
For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by the International Research Institute for Climate and Society at Columbia:
In [11]: remote_data = xray.open_dataset(
....: 'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
....: decode_times=False)
....:
In [12]: remote_data
Out[12]:
<xray.Dataset>
Dimensions: (T: 1422, X: 1405, Y: 621)
Coordinates:
* X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
* Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
Variables:
ppt (T, Y, X) float64 ...
tdmean (T, Y, X) float64 ...
tmax (T, Y, X) float64 ...
tmin (T, Y, X) float64 ...
Attributes:
Conventions: IRIDL
expires: 1375315200
Note
Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xray’s automatic decoding to fail. The way to work around this is either to set decode_cf=False in open_dataset to turn off all use of CF conventions, or to disable only the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xray does not expect (the integer 360 instead of a string like '360_day').
We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:
In [13]: tmax = remote_data['tmax'][:500, ::3, ::3]
In [14]: tmax
Out[14]:
<xray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Coordinates:
* Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 49.1667 ...
* X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 -124.25 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
Attributes:
pointwidth: 120
standard_name: air_temperature
units: Celsius_scale
expires: 1375315200
Finally, let’s plot a small subset with matplotlib:
In [15]: tmax_ss = tmax[0]
In [16]: import matplotlib.pyplot as plt
In [17]: plt.figure(figsize=(9, 5))
In [18]: plt.gca().patch.set_color('0')
In [19]: plt.contourf(tmax_ss['X'], tmax_ss['Y'], tmax_ss.values, 20,
....: cmap='RdBu_r')
....:
In [20]: plt.colorbar(label='tmax (deg C)')
Note
We do hope to eventually add plotting methods to xray to make this easier (GH185).
API reference¶
This page provides an auto-generated summary of xray’s API. For more details and examples, refer to the relevant chapter in the main part of the documentation.
Top-level functions¶
align(*objects[, join, copy]) | Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes. |
concat(objs[, dim, indexers, mode, ...]) | Concatenate xray objects along a new or existing dimension. |
Dataset¶
Creating a dataset¶
Dataset([variables, coords, attrs]) | A multi-dimensional, in memory, array database. |
open_dataset(filename_or_obj[, decode_cf, ...]) | Load and decode a dataset from a file or file-like object. |
decode_cf(obj[, concat_characters, ...]) | Decode the given Dataset or Datastore according to CF conventions into a new Dataset. |
Attributes¶
Dataset.dims | Mapping from dimension names to lengths. |
Dataset.vars | |
Dataset.coords | Dictionary of xray.Coordinate objects used for label based indexing. |
Dataset.attrs | Dictionary of global attributes on this dataset |
Dictionary interface¶
Datasets implement the mapping interface with keys given by variable names and values given by DataArray objects.
Dataset.__getitem__(key) | Access variables or coordinates this dataset as a DataArray. |
Dataset.__setitem__(key, value) | Add an array to this dataset. |
Dataset.__delitem__(key) | Remove a variable from this dataset. |
Dataset.update(other[, inplace]) | Update this dataset’s variables and attributes with those from another dataset. |
Dataset.iteritems(...) | |
Dataset.itervalues(...) |
Dataset contents¶
Dataset.copy([deep]) | Returns a copy of this dataset. |
Dataset.merge(other[, inplace, ...]) | Merge the arrays of two datasets into a single dataset. |
Dataset.rename(name_dict[, inplace]) | Returns a new object with renamed variables and dimensions. |
Dataset.drop_vars(*names) | Returns a new dataset without the named variables. |
Dataset.set_coords(names[, inplace]) | Given names of one or more variables, set them as coordinates |
Dataset.reset_coords([names, drop, inplace]) | Given names of coordinates, reset them to become variables |
Comparisons¶
Dataset.equals(other) | Two Datasets are equal if they have matching variables and coordinates, all of which are equal. |
Dataset.identical(other) | Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates. |
Indexing¶
Dataset.loc | Attribute for location based indexing. |
Dataset.isel(**indexers) | Returns a new dataset with each array indexed along the specified dimension(s). |
Dataset.sel(**indexers) | Returns a new dataset with each array indexed by tick labels along the specified dimension(s). |
Dataset.squeeze([dim]) | Returns a new dataset with squeezed data. |
Dataset.reindex([indexers, copy]) | Conform this object onto a new set of indexes, filling in missing values with NaN. |
Dataset.reindex_like(other[, copy]) | Conform this object onto the indexes of another object, filling in missing values with NaN. |
Computation¶
Dataset.apply(func[, keep_attrs, args]) | Apply a function over the variables in this dataset. |
Dataset.reduce(func[, dim, keep_attrs]) | Reduce this dataset by applying func along some dimension(s). |
Dataset.groupby(group[, squeeze]) | Returns a GroupBy object for performing grouped operations. |
Dataset.transpose(*dims) | Return a new Dataset object with all array dimensions transposed. |
Aggregation: all any argmax argmin max mean min prod sum std var
IO / Conversion¶
Dataset.to_netcdf(filepath, **kwdargs) | Dump dataset contents to a location on disk using the netCDF4 package. |
Dataset.to_dataframe() | Convert this dataset into a pandas.DataFrame. |
Dataset.from_dataframe(dataframe) | Convert a pandas.DataFrame into an xray.Dataset |
Dataset.close() | Close any files linked to this dataset |
Dataset.load_data() | Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset. |
Backends (experimental)¶
These backends provide a low-level interface for lazily loading data from external file-formats or protocols, and can be manually invoked to create arguments for the from_store and dump_to_store Dataset methods.
backends.NetCDF4DataStore(filename[, mode, ...]) | Store for reading and writing data via the Python-NetCDF4 library. |
backends.PydapDataStore(url) | Store for accessing OpenDAP datasets with pydap. |
backends.ScipyDataStore(filename_or_obj[, ...]) | Store for reading and writing data via scipy.io.netcdf. |
DataArray¶
DataArray(data[, coords, dims, name, attrs, ...]) | N-dimensional array with labeled coordinates and dimensions. |
Attributes¶
DataArray.values | The array’s data as a numpy.ndarray |
DataArray.coords | Dictionary-like container of coordinate arrays. |
DataArray.dims | Dimension names associated with this array. |
DataArray.name | The name of this array. |
DataArray.attrs | Dictionary storing arbitrary metadata with this array. |
DataArray.encoding | Dictionary of format-specific settings for how this array should be serialized. |
DataArray contents¶
DataArray.rename(new_name_or_name_dict) | Returns a new DataArray with renamed coordinates and/or a new name. |
DataArray.reset_coords([names, drop, inplace]) | Given names of coordinates, reset them to become variables. |
DataArray.copy([deep]) | Returns a copy of this array. |
Indexing¶
DataArray.__getitem__(key) | |
DataArray.__setitem__(key, value) | |
DataArray.loc | Attribute for location based indexing like pandas. |
DataArray.isel(**indexers) | Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s). |
DataArray.sel(**indexers) | Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s). |
DataArray.squeeze([dim]) | Return a new DataArray object with squeezed data. |
DataArray.reindex([copy]) | Conform this object onto a new set of indexes, filling in missing values with NaN. |
DataArray.reindex_like(other[, copy]) | Conform this object onto the indexes of another object, filling in missing values with NaN. |
Computation¶
DataArray.reduce(func[, dim, axis, keep_attrs]) | Reduce this array by applying func along some dimension(s). |
DataArray.groupby(group[, squeeze]) | Returns a GroupBy object for performing grouped operations. |
DataArray.transpose(*dims) | Return a new DataArray object with transposed dimensions. |
DataArray.get_axis_num(dim) | Return axis number(s) corresponding to dimension(s) in this array. |
Aggregation: all any argmax argmin max mean min prod sum std var
Missing values: isnull notnull count dropna
ndarray methods: argsort clip conj conjugate searchsorted round T
Comparisons¶
DataArray.equals(other) | True if two DataArrays have the same dimensions, coordinates and values; otherwise False. |
DataArray.identical(other) | Like equals, but also checks the array name and attributes, and attributes on all coordinates. |
IO / Conversion¶
DataArray.to_dataset([name]) | Convert a DataArray to a Dataset |
DataArray.to_pandas() | Convert this array into a pandas object with the same shape. |
DataArray.to_series() | Convert this array into a pandas.Series. |
DataArray.to_dataframe() | Convert this array and its coordinates into a tidy pandas.DataFrame. |
DataArray.to_index() | Convert this variable to a pandas.Index. |
DataArray.to_cdms2() | Convert this array into a cdms2.Variable |
DataArray.from_series(series) | Convert a pandas.Series into an xray.DataArray. |
DataArray.from_cdms2(variable) | Convert a cdms2.Variable into an xray.DataArray |
DataArray.load_data() | Manually trigger loading of this array’s data from disk or a remote source into memory and return this array. |
Frequently Asked Questions¶
Why is pandas not enough?¶
pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?
Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.
Pandas does support N-dimensional panels, but the implementation is very limited:
- You need to create a new factory type for each dimensionality.
- You can’t do math between NDPanels with different dimensionality.
- Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.) but the dimension names refer to order, not meaning: you can’t specify an operation to be applied along the “time” axis.
Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. In my experience, it is usually easier to work with a DataFrame with a hierarchical index than to use higher dimensional (N > 3) data structures in pandas.
Another use case is handling collections of arrays with different numbers of dimensions. For example, suppose you have a 2D array and a handful of associated 1D arrays that share one of the same axes. Storing these in one pandas object is possible but awkward – you can either upcast all the 1D arrays to 2D and store everything in a Panel, or put everything in a DataFrame, where the first few columns have a different meaning than the other columns. In contrast, this sort of data structure fits very naturally in an xray Dataset.
Pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.
How do xray data structures differ from those found in pandas?¶
The main distinguishing feature of xray’s DataArray over labeled arrays in pandas is that dimensions can have names (e.g., “time”, “latitude”, “longitude”). Names are much easier to keep track of than axis numbers, and xray uses dimension names for indexing, aggregation and broadcasting. Not only can you write x.sel(time='2000-01-01') and x.mean(dim='time'), but operations like x - x.mean(dim='time') always work, no matter the order of the “time” dimension. You never need to reshape arrays (e.g., with np.newaxis) to align them for arithmetic operations in xray.
Should I use xray instead of pandas?¶
It’s not an either/or choice! xray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.
That said, you should only bother with xray if some aspect of data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.
What is your approach to metadata?¶
We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).
Automatic interpretation of labels is powerful but also reduces flexibility. With xray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to netCDF with cf_conventions=True.)
An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
Does xray support out-of-core computation?¶
Not yet! Distributed and out-of-memory computation is certainly something we’re excited about, but for now we have focused on making xray a full-featured tool for in-memory analytics (like pandas).
We have some ideas for what out-of-core support could look like (probably through a library like biggus or Blaze), but we’re not there yet. An intermediate step would be supporting incremental writes to a Dataset linked to a netCDF file on disk.
What’s New¶
v0.3.2 (in development)¶
This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.
There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.
Warning
xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.
Backwards incompatible changes¶
Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:
In [1]: from datetime import datetime

In [2]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
Out[2]:
<xray.Dataset>
Dimensions:     (t: 1)
Coordinates:
  * t           (t) datetime64[ns] 2000-01-01
Variables:
    *empty*
xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are accordingly cast to timedelta64[ns] objects when appropriate.
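The casting machinery borrowed from pandas can be seen directly; a minimal sketch using pandas alone (not xray itself):

```python
from datetime import datetime, timedelta

import pandas as pd

# pandas coerces datetime.datetime objects to a DatetimeIndex
# backed by datetime64[ns] values...
time_index = pd.Index([datetime(2000, 1, 1)])
print(time_index.dtype)  # datetime64[ns]

# ...and datetime.timedelta objects to a TimedeltaIndex
# backed by timedelta64[ns] values.
delta_index = pd.Index([timedelta(days=1)])
print(delta_index.dtype)  # timedelta64[ns]
```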
Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).
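The coercion can be illustrated with numpy alone; a minimal sketch of the resulting behavior (not xray’s internal code):

```python
import numpy as np

# A masked array with one masked element.
masked = np.ma.MaskedArray([1.0, 2.0, 3.0], mask=[False, True, False])

# Coerce to a plain ndarray, using NaN as the sentinel for masked values.
coerced = np.ma.filled(masked, fill_value=np.nan)
print(coerced)  # [ 1. nan  3.]
```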
Enhancements¶
Due to popular demand, we have added experimental attribute style access as a shortcut for dataset variables, coordinates and attributes:
In [3]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celsius'})})

In [4]: ds.tmin.units
Out[4]: 'celsius'
Tab-completion for these variables should work in interactive interpreters such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).
You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:
In [5]: array = xray.DataArray(np.zeros(5), dims=['x'])

In [6]: array[dict(x=slice(3))] = 1

In [7]: array
Out[7]:
<xray.DataArray (x: 5)>
array([ 1.,  1.,  1.,  0.,  0.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.
Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.
We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30 to 50%. The full speed up requires cyordereddict to be installed.
Bug fixes¶
Future plans¶
- I am contemplating switching to the terms “coordinate variables” and “data variables” instead of the (currently used) “coordinates” and “variables”, following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
- I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
- The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I’m pretty sick of writing ds.reduce(np.nanmean, 'time').
- The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.
If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.
v0.3.1 (22 October, 2014)¶
This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).
We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.
Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.
New features¶
- Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58).
- Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
- Support for reading gzipped netCDF3 files (GH239).
- Reduced memory usage when writing netCDF files (GH251).
- ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
- Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
Bug fixes¶
- Compatibility fixes for pandas v0.15 (GH262).
- Fixes for display and indexing of NaT (not-a-time) (GH238, GH240).
- Fix slicing by label when an argument is a data array (GH250).
- Test data is now shipped with the source distribution (GH253).
- Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
- Order of dimensions preserved with DataArray.to_dataframe (GH260).
v0.3.0 (21 September 2014)¶
New features¶
- Revamped coordinates: “coordinates” now refer to all arrays that are not used to index a dimension. Coordinates are intended to allow for keeping track of arrays of metadata that describe the grid on which the points in “variable” arrays lie. They are preserved (when unambiguous) even through mathematical operations.
- Dataset math Dataset objects now support all arithmetic operations directly. Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
- GroupBy math: arithmetic between a grouped object and an array now maps over each group. This provides a convenient shortcut for normalizing by the average value of a group.
- The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
- You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
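The normalization shortcut mentioned under GroupBy math has a direct pandas analogue; a minimal sketch using pandas (not xray itself):

```python
import pandas as pd

# Values grouped under two keys.
s = pd.Series([1.0, 2.0, 4.0, 8.0],
              index=pd.Index(['a', 'a', 'b', 'b'], name='key'))

# Normalize each value by the mean of its group:
# group 'a' has mean 1.5, group 'b' has mean 6.0.
normalized = s / s.groupby(level='key').transform('mean')
print(normalized.tolist())
```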
Backwards incompatible changes¶
- Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
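The distinction mirrors numpy semantics; a minimal numpy sketch of element-wise comparison versus a single aggregate boolean:

```python
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 4])

# Element-wise comparison returns a boolean array (like the new __eq__):
elementwise = a == b
print(elementwise)  # [ True  True False]

# A single aggregate boolean (like the equals() method):
identical = bool(np.array_equal(a, b))
print(identical)  # False
```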
Deprecations¶
- Dataset.noncoords is deprecated: use Dataset.vars instead.
- Dataset.select_vars deprecated: index a Dataset with a list of variable names instead.
- DataArray.select_vars and DataArray.drop_vars deprecated: use reset_coords() instead.
v0.2.0 (14 August 2014)¶
This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:
- There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
- You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
- xray now works on Python 2.6, thanks to Anna Kuznetsova.
- A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
- New load_data() and close() methods for datasets facilitate lower-level control of data loaded from disk.
v0.1.1 (20 May 2014)¶
xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:
- Python 3 support (GH53)
- Required numpy version relaxed to 1.7 (GH129)
- Return numpy.datetime64 arrays for non-standard calendars (GH126)
- Support for opening datasets associated with NetCDF4 groups (GH127)
- Bug-fixes for concatenating datetime arrays (GH134)
Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.
v0.1 (2 May 2014)¶
Initial release.
Important links¶
- HTML documentation: http://xray.readthedocs.org
- Issue tracker: http://github.com/xray/xray/issues
- Source code: http://github.com/xray/xray
- PyData talk: https://www.youtube.com/watch?v=T5CZyNwBa9c
Get in touch¶
- Mailing list: https://groups.google.com/forum/#!forum/xray-dev
- Twitter: http://twitter.com/shoyer
xray is an ambitious project and we have a lot of work to do to make it as powerful as it should be. We would love to hear your thoughts!
License¶
xray is available under the open source Apache License.
History¶
xray is an evolution of an internal tool developed at The Climate Corporation, and was originally written by current and former Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo.