N-D labeled arrays and datasets in Python¶
xray is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures.
Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data at which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xray.Dataset is an in-memory representation of a netCDF file.
Documentation¶
Overview: Why xray?¶
Features¶
Adding dimension names and coordinate indexes to numpy’s ndarray makes many powerful array operations possible:
- Apply operations over dimensions by name: x.sum('time').
- Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
- Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
- Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
- Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xray.align(x, y, join='outer').
- Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
pandas provides many of these features, but it does not make use of dimension names, and its core data structures are fixed-dimensional arrays.
The N-dimensional nature of xray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions (e.g., np.newaxis) to align arrays.
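To make the contrast concrete, here is a small sketch (toy values, plain numpy) of the axis bookkeeping that dimension names remove:

```python
import numpy as np

# a toy (time, space) field and a per-space offset
field = np.arange(6.0).reshape(2, 3)    # axis 0 is time, axis 1 is space
offset = np.array([10.0, 20.0, 30.0])   # one value per space point

# plain numpy broadcasting works here only because space is the last axis
shifted = field + offset

# if the array were (space, time) instead, a dummy dimension is required
field_t = field.T
shifted_t = field_t + offset[:, np.newaxis]
```

With xray, both cases are simply `field + offset`, because alignment happens by dimension name rather than by axis position.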
Core data structures¶
xray has two core data structures. Both are fundamentally N-dimensional:
- DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
- Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xray to the pandas.DataFrame.
The value of attaching labels to numpy’s ndarray may be fairly obvious, but the dataset may need more motivation.
The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.
This data model is borrowed from the netCDF file format, which also provides xray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.
xray distinguishes itself from many tools for working with netCDF data insofar as it provides data structures for in-memory analytics that both utilize and preserve labels. You only need to do the tedious work of adding metadata once, not every time you save a file.
Goals and aspirations¶
pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xray comes in.
xray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).
Importantly, xray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.
Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.
Examples¶
Quick overview¶
Here are some quick examples of what you can do with xray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.
To begin, import numpy, pandas and xray:
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import xray
Create a DataArray¶
You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:
In [4]: xray.DataArray(np.random.randn(2, 3))
Out[4]:
<xray.DataArray (dim_0: 2, dim_1: 3)>
array([[-1.344, 0.845, 1.076],
[-0.109, 1.644, -1.469]])
Coordinates:
* dim_0 (dim_0) int64 0 1
* dim_1 (dim_1) int64 0 1 2
In [5]: data = xray.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])
In [6]: data
Out[6]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
If you supply a pandas Series or DataFrame, metadata is copied directly:
In [7]: xray.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
Out[7]:
<xray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
* dim_0 (dim_0) object 'a' 'b' 'c'
Here are the key properties for a DataArray:
# like in pandas, values is a numpy array that you can modify in-place
In [8]: data.values
Out[8]:
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
In [9]: data.dims
Out[9]: ('x', 'y')
In [10]: data.coords
Out[10]:
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# you can use this dictionary to store arbitrary metadata
In [11]: data.attrs
Out[11]: OrderedDict()
Indexing¶
xray supports four kinds of indexing. These operations are just as fast as in pandas, because we borrow pandas’ indexing machinery.
# positional and by integer label, like numpy
In [12]: data[[0, 1]]
Out[12]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# positional and by coordinate label, like pandas
In [13]: data.loc['a':'b']
Out[13]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and integer label
In [14]: data.isel(x=slice(2))
Out[14]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
# by dimension name and coordinate label
In [15]: data.sel(x=['a', 'b'])
Out[15]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
[-0.969, -1.295, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Computation¶
Data arrays work very similarly to numpy ndarrays:
In [16]: data + 10
Out[16]:
<xray.DataArray (x: 2, y: 3)>
array([[ 10.357, 9.325, 8.223],
[ 9.031, 8.705, 10.414]])
Coordinates:
* y (y) int64 -2 0 2
* x (x) |S1 'a' 'b'
In [17]: np.sin(data)
Out[17]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.349, -0.625, -0.979],
[-0.824, -0.962, 0.402]])
Coordinates:
* y (y) int64 -2 0 2
* x (x) |S1 'a' 'b'
In [18]: data.T
Out[18]:
<xray.DataArray (y: 3, x: 2)>
array([[ 0.357, -0.969],
[-0.675, -1.295],
[-1.777, 0.414]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
In [19]: data.sum()
Out[19]:
<xray.DataArray ()>
array(-3.9441825539138033)
However, aggregation operations can use dimension names instead of axis numbers:
In [20]: data.mean(dim='x')
Out[20]:
<xray.DataArray (y: 3)>
array([-0.306, -0.985, -0.682])
Coordinates:
* y (y) int64 -2 0 2
Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:
In [21]: a = xray.DataArray(np.random.randn(3), [data.coords['y']])
In [22]: b = xray.DataArray(np.random.randn(4), dims='z')
In [23]: a
Out[23]:
<xray.DataArray (y: 3)>
array([ 0.277, -0.472, -0.014])
Coordinates:
* y (y) int64 -2 0 2
In [24]: b
Out[24]:
<xray.DataArray (z: 4)>
array([-0.363, -0.006, -0.923, 0.896])
Coordinates:
* z (z) int64 0 1 2 3
In [25]: a + b
Out[25]:
<xray.DataArray (y: 3, z: 4)>
array([[-0.086, 0.271, -0.646, 1.172],
[-0.835, -0.478, -1.395, 0.424],
[-0.377, -0.02 , -0.937, 0.882]])
Coordinates:
* y (y) int64 -2 0 2
* z (z) int64 0 1 2 3
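The plain-numpy equivalent of this outer sum requires inserting the dummy axes by hand; a sketch with the same shapes:

```python
import numpy as np

a = np.random.randn(3)   # conceptually carries dims ('y',)
b = np.random.randn(4)   # conceptually carries dims ('z',)

# numpy knows nothing about names, so the (y, z) result needs
# explicit dummy axes before broadcasting kicks in
outer = a[:, np.newaxis] + b[np.newaxis, :]
```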
It also means that in most cases you do not need to worry about the order of dimensions:
In [26]: data - data.T
Out[26]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0., 0., 0.],
[ 0., 0., 0.]])
Coordinates:
* y (y) int64 -2 0 2
* x (x) |S1 'a' 'b'
Operations also align based on index labels:
In [27]: data[:-1] - data[:1]
Out[27]:
<xray.DataArray (x: 1, y: 3)>
array([[ 0., 0., 0.]])
Coordinates:
* y (y) int64 -2 0 2
* x (x) |S1 'a'
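This mirrors how pandas aligns Series on their index labels; a toy illustration of the same behavior:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0], index=['a', 'b'])
t = pd.Series([10.0], index=['a'])

# arithmetic aligns on labels; 'b' has no partner in t, so it becomes NaN
result = s + t
```

xray applies the same rule along every shared dimension, and xray.align gives explicit control over the join.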
GroupBy¶
xray supports grouped operations using a very similar API to pandas:
In [28]: labels = xray.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')
In [29]: labels
Out[29]:
<xray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'],
dtype='|S1')
Coordinates:
* y (y) int64 -2 0 2
In [30]: data.groupby(labels).mean('y')
Out[30]:
<xray.DataArray (x: 2, labels: 2)>
array([[-0.71 , -0.675],
[-0.278, -1.295]])
Coordinates:
* x (x) |S1 'a' 'b'
* labels (labels) object 'E' 'F'
In [31]: data.groupby(labels).apply(lambda x: x - x.min())
Out[31]:
<xray.DataArray (x: 2, y: 3)>
array([[ 2.134, 0.62 , 0. ],
[ 0.808, 0. , 2.191]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Convert to pandas¶
A key feature of xray is robust conversion to and from pandas objects:
In [32]: data.to_series()
Out[32]:
x y
a -2 0.357021
0 -0.674600
2 -1.776904
b -2 -0.968914
0 -1.294524
2 0.413738
dtype: float64
In [33]: data.to_pandas()
Out[33]:
y -2 0 2
x
a 0.357021 -0.674600 -1.776904
b -0.968914 -1.294524 0.413738
[2 rows x 3 columns]
Datasets and NetCDF¶
xray.Dataset is a dict-like container of DataArray objects that share index labels and dimensions. It looks a lot like a netCDF file:
In [34]: ds = data.to_dataset()
In [35]: ds
Out[35]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 -2 0 2
Data variables:
None (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137
You can do almost everything you can do with DataArray objects with Dataset objects if you prefer to work with multiple variables at once.
Datasets also let you easily read and write netCDF files:
In [36]: ds.to_netcdf('example.nc')
In [37]: xray.open_dataset('example.nc')
Out[37]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int32 -2 0 2
* x (x) |S1 'a' 'b'
Data variables:
None (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137
Toy weather data¶
Here is an example of how to easily manipulate a toy weather dataset using xray and other recommended Python libraries:
Shared setup:
import xray
import numpy as np
import pandas as pd
import seaborn as sns # pandas aware plotting library
np.random.seed(123)
times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear / 365.25 - 0.28))
base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)
ds = xray.Dataset({'tmin': (('time', 'location'), tmin_values),
                   'tmax': (('time', 'location'), tmax_values)},
                  {'time': times, 'location': ['IA', 'IN', 'IL']})
Examine a dataset with pandas and seaborn¶
In [1]: ds
Out[1]:
<xray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) |S2 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
tmax (time, location) float64 12.98 3.31 6.779 0.4479 6.373 4.843 ...
tmin (time, location) float64 -8.037 -1.788 -3.932 -9.341 -6.558 ...
In [2]: df = ds.to_dataframe()
In [3]: df.head()
Out[3]:
tmax tmin
location time
IA 2000-01-01 12.980549 -8.037369
2000-01-01 0.447856 -9.341157
2000-01-01 5.322699 -12.139719
2000-01-02 1.889425 -7.492914
2000-01-02 0.791176 -0.447129
[5 rows x 2 columns]
In [4]: df.describe()
Out[4]:
tmax tmin
count 2193.000000 2193.000000
mean 20.108232 9.975426
std 11.010569 10.963228
min -3.506234 -13.395763
25% 9.853905 -0.040347
50% 19.967409 10.060403
75% 30.045588 20.083590
max 43.271148 33.456060
[8 rows x 2 columns]
In [5]: ds.mean(dim='location').to_dataframe().plot()
Out[5]: <matplotlib.axes.AxesSubplot at 0x7f524d92fe50>
In [6]: sns.pairplot(df.reset_index(), vars=ds.data_vars)
Out[6]: <seaborn.axisgrid.PairGrid at 0x7f0fd2368a10>
Probability of freeze by calendar month¶
In [7]: freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')
In [8]: freeze
Out[8]:
<xray.DataArray 'tmin' (month: 12, location: 3)>
array([[ 0.952, 0.887, 0.935],
[ 0.842, 0.719, 0.772],
[ 0.242, 0.129, 0.161],
...,
[ 0. , 0.016, 0. ],
[ 0.333, 0.35 , 0.233],
[ 0.935, 0.855, 0.823]])
Coordinates:
* month (month) int64 1 2 3 4 5 6 7 8 9 10 11 12
* location (location) |S2 'IA' 'IN' 'IL'
In [9]: freeze.to_pandas().plot()
Out[9]: <matplotlib.axes.AxesSubplot at 0x7f524ff3c450>
Monthly averaging¶
In [10]: monthly_avg = ds.resample('1MS', dim='time', how='mean')
In [11]: monthly_avg.sel(location='IA').to_dataframe().plot(style='s-')
Out[11]: <matplotlib.axes.AxesSubplot at 0x7f524d8fe290>
Note that MS here refers to Month-Start; M labels Month-End (the last day of the month).
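These frequency strings are the standard pandas offset aliases; for example, 'MS' timestamps always land on the first of the month:

```python
import pandas as pd

# three consecutive month-start labels
starts = pd.date_range('2000-01-01', periods=3, freq='MS')
```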
Calculate monthly anomalies¶
In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.
In [12]: climatology = ds.groupby('time.month').mean('time')
In [13]: anomalies = ds.groupby('time.month') - climatology
In [14]: anomalies.mean('location').to_dataframe()[['tmin', 'tmax']].plot()
Out[14]: <matplotlib.axes.AxesSubplot at 0x7f525c6f65d0>
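The same subtract-the-group-mean pattern can be checked in plain pandas with transform; a toy series containing only a seasonal cycle should give anomalies that average to zero within each month:

```python
import numpy as np
import pandas as pd

# hypothetical daily series with a pure seasonal cycle
times = pd.date_range('2000-01-01', '2001-12-31', freq='D')
values = pd.Series(np.sin(2 * np.pi * times.dayofyear / 365.25), index=times)

# subtract each calendar month's mean from its members
anomalies = values - values.groupby(values.index.month).transform('mean')
```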
Fill missing values with climatology¶
The fillna() method on grouped objects lets you easily fill missing values by group:
# throw away the first half of every month
In [15]: some_missing = ds.tmin.sel(time=ds['time.day'] > 15).reindex_like(ds)
In [16]: filled = some_missing.groupby('time.month').fillna(climatology.tmin)
In [17]: both = xray.Dataset({'some_missing': some_missing, 'filled': filled})
In [18]: both
Out[18]:
<xray.Dataset>
Dimensions: (location: 3, time: 731)
Coordinates:
* location (location) object 'IA' 'IN' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
month (time) int32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Data variables:
some_missing (time, location) float64 nan nan nan nan nan nan nan nan ...
filled (time, location) float64 -5.163 -4.216 -4.681 -5.163 ...
In [19]: df = both.sel(time='2000').mean('location').reset_coords(drop=True).to_dataframe()
In [20]: df[['filled', 'some_missing']].plot()
Out[20]: <matplotlib.axes.AxesSubplot at 0x7f524fe35a10>
Calculating Seasonal Averages from Timeseries of Monthly Means¶
Author: Joe Hamman
The data for this example can be found in the xray-data repository. This example is also available as an IPython Notebook.
Suppose we have a netCDF or xray Dataset of monthly mean data and we want to calculate the seasonal average. To do this properly, we need to calculate the weighted average considering that each month has a different number of days.
%matplotlib inline
import numpy as np
import pandas as pd
import xray
from netCDF4 import num2date
import matplotlib.pyplot as plt
print("numpy version : ", np.__version__)
print("pandas version : ", pd.version.version)
print("xray version : ", xray.version.version)
numpy version : 1.9.2
pandas version : 0.16.2
xray version : 0.5.1
Some calendar information so we can support any netCDF calendar.¶
dpm = {'noleap': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'365_day': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'standard': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'proleptic_gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'all_leap': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'366_day': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
'360_day': [0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]}
A few calendar functions to determine the number of days in each month¶
If you were just using the standard calendar, it would be easy to use the calendar.monthrange function.
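For reference, the stdlib call is calendar.monthrange, which returns the weekday of the first day and the number of days in the month:

```python
import calendar

# monthrange(year, month) -> (weekday of the 1st, number of days)
first_weekday, ndays = calendar.monthrange(2000, 2)   # February 2000
```

It handles Gregorian leap years for you, but knows nothing about the other netCDF calendars, hence the hand-rolled tables above.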
def leap_year(year, calendar='standard'):
    """Determine if year is a leap year"""
    leap = False
    if ((calendar in ['standard', 'gregorian',
                      'proleptic_gregorian', 'julian']) and
            (year % 4 == 0)):
        leap = True
        if ((calendar == 'proleptic_gregorian') and
                (year % 100 == 0) and
                (year % 400 != 0)):
            leap = False
        elif ((calendar in ['standard', 'gregorian']) and
                (year % 100 == 0) and (year % 400 != 0) and
                (year < 1583)):
            leap = False
    return leap
def get_dpm(time, calendar='standard'):
    """
    Return an array of days per month corresponding to the months in `time`
    """
    month_length = np.zeros(len(time), dtype=int)
    cal_days = dpm[calendar]
    for i, (month, year) in enumerate(zip(time.month, time.year)):
        month_length[i] = cal_days[month]
        # add the leap day only to February
        if leap_year(year, calendar=calendar) and month == 2:
            month_length[i] += 1
    return month_length
Open the Dataset¶
monthly_mean_file = 'RASM_example_data.nc'
ds = xray.open_dataset(monthly_mean_file, decode_coords=False)
print(ds)
<xray.Dataset>
Dimensions: (time: 36, x: 275, y: 205)
Coordinates:
* time (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
Tair (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
institution: U.W.
source: RACM R1002RBRxaaa01a
output_frequency: daily
output_mode: averaged
convention: CF-1.4
references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
comment: Output from the Variable Infiltration Capacity (VIC) model.
nco_openmp_thread_number: 1
NCO: 4.3.7
history: history deleted for brevity
Now for the heavy lifting:¶
We first have to come up with the weights:
- calculate the month lengths for each monthly data record
- calculate weights using groupby('time.season')
Finally, we just need to multiply our weights by the Dataset and sum along the time dimension.
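Numerically, each season’s value is just a day-weighted mean; a minimal numpy sketch with made-up DJF temperatures shows how it differs from equal weighting:

```python
import numpy as np

# hypothetical monthly-mean temperatures for Dec, Jan, Feb (noleap calendar)
temps = np.array([-5.0, -10.0, -2.0])
days = np.array([31.0, 31.0, 28.0])

weights = days / days.sum()            # per-season weights sum to 1
weighted = (temps * weights).sum()     # day-weighted seasonal mean
unweighted = temps.mean()              # what a plain .mean('time') gives
```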
# Make a DataArray with the number of days in each month, size = len(time)
month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar='noleap'),
                              coords=[ds.time], name='month_length')
# Calculate the weights by grouping by 'time.season'.
# Conversion to float type ('astype(float)') only necessary for Python 2.x
weights = month_length.groupby('time.season') / month_length.astype(float).groupby('time.season').sum()
# Test that the sum of the weights for each season is 1.0
np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))
# Calculate the weighted average
ds_weighted = (ds * weights).groupby('time.season').sum(dim='time')
print(ds_weighted)
print(ds_weighted)
<xray.Dataset>
Dimensions: (season: 4, x: 275, y: 205)
Coordinates:
* y (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* x (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
* season (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
Tair (season, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...
# only used for comparisons
ds_unweighted = ds.groupby('time.season').mean('time')
ds_diff = ds_weighted - ds_unweighted
# Quick plot to show the results
is_null = np.isnan(ds_unweighted['Tair'][0].values)
fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(14,12))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
    plt.sca(axes[i, 0])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_weighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')
    plt.sca(axes[i, 1])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_unweighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')
    plt.sca(axes[i, 2])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_diff['Tair'].sel(season=season).values),
                   vmin=-0.1, vmax=.1, cmap='RdBu_r')
    plt.colorbar(extend='both')
    for j in range(3):
        axes[i, j].axes.get_xaxis().set_ticklabels([])
        axes[i, j].axes.get_yaxis().set_ticklabels([])
        axes[i, j].axes.axis('tight')
    axes[i, 0].set_ylabel(season)
axes[0, 0].set_title('Weighted by DPM')
axes[0, 1].set_title('Equal Weighting')
axes[0, 2].set_title('Difference')
plt.tight_layout()
fig.suptitle('Seasonal Surface Air Temperature', fontsize=16, y=1.02)
# Wrap it into a simple function
def season_mean(ds, calendar='standard'):
    # Make a DataArray of season/year groups
    year_season = xray.DataArray(ds.time.to_index().to_period(freq='Q-NOV').to_timestamp(how='E'),
                                 coords=[ds.time], name='year_season')
    # Make a DataArray with the number of days in each month, size = len(time)
    month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar=calendar),
                                  coords=[ds.time], name='month_length')
    # Calculate the weights by grouping by 'time.season'
    weights = month_length.groupby('time.season') / month_length.groupby('time.season').sum()
    # Test that the sum of the weights for each season is 1.0
    np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))
    # Calculate the weighted average
    return (ds * weights).groupby('time.season').sum(dim='time')
Installation¶
Optional dependencies¶
For netCDF and IO¶
For accelerating xray¶
- bottleneck: speeds up NaN-skipping aggregations by a large factor
- cyordereddict: speeds up most internal operations with xray data structures
For parallel computing¶
- dask.array: required for Out of core computation with dask.
For plotting¶
- matplotlib: required for Plotting.
Instructions¶
xray itself is a pure Python package, but its dependencies are not. The easiest way to get them installed is to use conda. You can then install xray with its recommended dependencies with the conda command line tool:
$ conda install xray dask netCDF4 bottleneck
If you don’t use conda, be sure you have the required dependencies (numpy and pandas) installed first. Then, install xray with pip:
$ pip install xray
To run the test suite after installing xray, install nose and run nosetests xray.
Data Structures¶
DataArray¶
xray.DataArray is xray’s implementation of a labeled, multi-dimensional array. It has several key properties:
- values: a numpy.ndarray holding the array’s values
- dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
- coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata (attributes)
xray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.
DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xray makes no attempt to interpret them, and propagates them only in unambiguous cases (see FAQ, What is your approach to metadata?).
Creating a DataArray¶
The DataArray constructor takes a multi-dimensional array of values (e.g., a numpy ndarray), a list or dictionary of coordinate labels and a list of dimension names:
In [1]: data = np.random.rand(4, 3)
In [2]: locs = ['IA', 'IL', 'IN']
In [3]: times = pd.date_range('2000-01-01', periods=4)
In [4]: foo = xray.DataArray(data, coords=[times, locs], dims=['time', 'space'])
In [5]: foo
Out[5]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
All of these arguments (except for data) are optional, and will be filled in with default values:
In [6]: xray.DataArray(data)
Out[6]:
<xray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* dim_0 (dim_0) int64 0 1 2 3
* dim_1 (dim_1) int64 0 1 2
As you can see, dimensions and coordinate arrays corresponding to each dimension are always present. This behavior is similar to pandas, which fills in index values in the same way.
The data array constructor also supports supplying coords as a list of (dim, ticks[, attrs]) pairs with length equal to the number of dimensions:
In [7]: xray.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Yet another option is to supply coords in the form of a dictionary where the values are scalar values, 1D arrays or tuples (in the same form as the DataArray constructor). This form lets you supply coordinates other than those corresponding to dimensions (more on these later):
In [8]: xray.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
...: 'ranking': ('space', [1, 2, 3])},
...: dims=['time', 'space'])
...:
Out[8]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
ranking (space) int64 1 2 3
* space (space) |S2 'IA' 'IL' 'IN'
const int64 42
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
You can also create a DataArray by supplying a pandas Series, DataFrame or Panel, in which case any non-specified arguments in the DataArray constructor will be filled in from the pandas object:
In [9]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])
In [10]: df.index.name = 'abc'
In [11]: df.columns.name = 'xyz'
In [12]: df
Out[12]:
xyz x y
abc
a 0 2
b 1 3
[2 rows x 2 columns]
In [13]: xray.DataArray(df)
Out[13]:
<xray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
[1, 3]])
Coordinates:
* abc (abc) object 'a' 'b'
* xyz (xyz) object 'x' 'y'
xray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see GH164). However, the alternate from_series constructor will automatically unpack any hierarchical indexes it encounters by expanding the series into a multi-dimensional array, as described in Working with pandas.
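The unpacking that from_series performs is essentially pandas’ unstack; a toy illustration of the idea:

```python
import pandas as pd

# a series with a two-level (hierarchical) index
s = pd.Series([1, 2, 3, 4],
              index=pd.MultiIndex.from_product([['a', 'b'], [0, 1]],
                                               names=['x', 'y']))

# unstacking one level yields a 2D table; from_series generalizes
# this expansion to build an N-dimensional array
table = s.unstack('y')
```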
DataArray properties¶
Let’s take a look at the important properties on our array:
In [14]: foo.values
Out[14]:
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
In [15]: foo.dims
Out[15]: ('time', 'space')
In [16]: foo.coords
Out[16]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
In [17]: foo.attrs
Out[17]: OrderedDict()
In [18]: print(foo.name)
None
You can even modify values in-place:
In [19]: foo.values = 1.0 * foo.values
Note
The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xray, use coordinates, or put separate DataArray objects in a single Dataset (see below).
Now fill in some of that missing metadata:
In [20]: foo.name = 'foo'
In [21]: foo.attrs['units'] = 'meters'
In [22]: foo
Out[22]:
<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Attributes:
units: meters
The rename() method is another option, returning a new data array:
In [23]: foo.rename('bar')
Out[23]:
<xray.DataArray 'bar' (time: 4, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* space (space) |S2 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
Attributes:
units: meters
DataArray Coordinates¶
The coords property is dict-like. Individual coordinates can be accessed by name, or even by indexing the data array itself:
In [24]: foo.coords['time']
Out[24]:
<xray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000+0000', '2000-01-02T00:00:00.000000000+0000',
'2000-01-03T00:00:00.000000000+0000', '2000-01-04T00:00:00.000000000+0000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
In [25]: foo['time']
Out[25]:
<xray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000+0000', '2000-01-02T00:00:00.000000000+0000',
'2000-01-03T00:00:00.000000000+0000', '2000-01-04T00:00:00.000000000+0000'], dtype='datetime64[ns]')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
These are also DataArray objects, which contain tick-labels for each dimension.
Coordinates can also be set or removed by using dictionary-like syntax:
In [26]: foo['ranking'] = ('space', [1, 2, 3])
In [27]: foo.coords
Out[27]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
ranking (space) int64 1 2 3
In [28]: del foo['ranking']
In [29]: foo.coords
Out[29]:
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Dataset¶
xray.Dataset is xray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.
In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties:
- dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
- data_vars: a dict-like container of DataArrays corresponding to variables
- coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
- attrs: an OrderedDict to hold arbitrary metadata
The distinction between whether a variable falls in data or coordinates (borrowed from CF conventions) is mostly semantic, and you can probably get away with ignoring it if you like: dictionary-like access on a dataset will supply variables found in either category. However, xray does make use of the distinction for indexing and computations. Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data.
Here is an example of how we might structure a dataset for a weather forecast: it would be natural to call temperature and precipitation “data variables”, and arrays like latitude, longitude and time “coordinate variables”, because they label the points along the dimensions (see [1] for more background on this example).
Creating a Dataset¶
To make a Dataset from scratch, supply dictionaries for any variables, coordinates and attributes you would like to insert into the dataset.
For the vars and coords arguments, keys should be the name of the variable and values should be scalars, 1D arrays or tuples of the form (dims, data[, attrs]) sufficient to label each array:
- dims should be a sequence of strings.
- data should be a numpy.ndarray (or array-like object) that has a dimensionality equal to the length of dims.
- attrs is an arbitrary Python dictionary for storing metadata associated with a particular array.
Let’s create some fake data for the example we show above:
In [30]: temp = 15 + 8 * np.random.randn(2, 2, 3)
In [31]: precip = 10 * np.random.rand(2, 2, 3)
In [32]: lon = [[-99.83, -99.32], [-99.79, -99.23]]
In [33]: lat = [[42.25, 42.21], [42.63, 42.59]]
# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
In [34]: ds = xray.Dataset({'temperature': (['x', 'y', 'time'], temp),
....: 'precipitation': (['x', 'y', 'time'], precip)},
....: coords={'lon': (['x', 'y'], lon),
....: 'lat': (['x', 'y'], lat),
....: 'time': pd.date_range('2014-09-06', periods=3),
....: 'reference_time': pd.Timestamp('2014-09-05')})
....:
In [35]: ds
Out[35]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Notice that we did not explicitly include coordinates for the “x” or “y” dimensions, so they were filled in with arrays of ascending integers of the proper length.
We can also pass xray.DataArray objects as values in the dictionary instead of tuples:
In [36]: xray.Dataset({'bar': foo})
Out[36]:
<xray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
bar (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...
You can also create a dataset from a pandas.DataFrame with Dataset.from_dataframe or from a netCDF file on disk with open_dataset(). See Working with pandas and Serialization and IO.
Dataset contents¶
Dataset implements the Python dictionary interface, with values given by xray.DataArray objects:
In [37]: 'temperature' in ds
Out[37]: True
In [38]: ds.keys()
Out[38]:
['precipitation',
'temperature',
'lat',
'reference_time',
'lon',
'time',
'x',
'y']
In [39]: ds['temperature']
Out[39]:
<xray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
The valid keys include each listed coordinate and data variable.
Data and coordinate variables are also contained separately in the data_vars and coords dictionary-like attributes:
In [40]: ds.data_vars
Out[40]:
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
In [41]: ds.coords
Out[41]:
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Finally, like data arrays, datasets also store arbitrary metadata in the form of attributes:
In [42]: ds.attrs
Out[42]: OrderedDict()
In [43]: ds.attrs['title'] = 'example attribute'
In [44]: ds
Out[44]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Attributes:
title: example attribute
xray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or numpy.ndarray objects.
As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:
In [45]: ds.temperature
Out[45]:
<xray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
* y (y) int64 0 1
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
This is particularly useful in an exploratory context, because you can tab-complete these variable names with tools like IPython.
Dictionary like methods¶
We can update a dataset in-place using Python’s standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:
In [46]: ds = xray.Dataset()
In [47]: ds['temperature'] = (('x', 'y', 'time'), temp)
In [48]: ds['precipitation'] = (('x', 'y', 'time'), precip)
In [49]: ds.coords['lat'] = (('x', 'y'), lat)
In [50]: ds.coords['lon'] = (('x', 'y'), lon)
In [51]: ds.coords['time'] = pd.date_range('2014-09-06', periods=3)
In [52]: ds.coords['reference_time'] = pd.Timestamp('2014-09-05')
To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update(). Note that assigning a DataArray object to a Dataset variable using __setitem__ or update will automatically align the array(s) to the original dataset’s indexes.
You can copy a Dataset by calling the copy() method. By default, the copy is shallow, so only the container will be copied: the arrays in the Dataset will still be stored in the same underlying numpy.ndarray objects. You can copy all data by calling ds.copy(deep=True).
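The shallow-copy behavior can be sketched with a plain dict of NumPy arrays, as a toy stand-in for a Dataset (illustrative only, not xray's actual implementation):

```python
import numpy as np

# toy stand-in for a Dataset: a dict mapping variable names to arrays
data = {'temperature': np.zeros(3)}

shallow = dict(data)          # new container, but the same underlying arrays
shallow['temperature'][0] = 99.0

# the original sees the change, because the ndarray object is shared
print(data['temperature'][0])

# a deep copy duplicates the arrays as well, so they become independent
deep = {k: v.copy() for k, v in data.items()}
```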
Transforming datasets¶
In addition to dictionary-like methods (described above), xray has additional methods (like pandas) for transforming datasets into new objects.
For removing variables, you can select and drop an explicit list of variables by indexing with a list of names or by using the drop() method to return a new Dataset. These operations keep around coordinates:
In [53]: list(ds[['temperature']])
Out[53]: ['temperature', 'lat', 'time', 'y', 'x', 'reference_time', 'lon']
In [54]: list(ds[['x']])
Out[54]: ['x', 'reference_time']
In [55]: list(ds.drop('temperature'))
Out[55]: ['time', 'x', 'y', 'precipitation', 'lat', 'lon', 'reference_time']
If a dimension name is given as an argument to drop, it also drops all variables that use that dimension:
In [56]: list(ds.drop('time'))
Out[56]: ['x', 'y', 'lat', 'lon', 'reference_time']
As an alternative to dictionary-like modifications, you can use assign() and assign_coords(). These methods return a new dataset with additional (or replaced) values:
In [57]: ds.assign(temperature2 = 2 * ds.temperature)
Out[57]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
temperature2 (x, y, time) float64 22.08 47.15 41.54 18.69 13.37 34.35 ...
There is also the pipe() method, which allows you to use a method call with an external function (e.g., ds.pipe(func)) instead of simply calling it (e.g., func(ds)). This allows you to write pipelines for transforming your data (using “method chaining”) instead of writing hard-to-follow nested function calls:
# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
In [58]: plt.plot((2 * ds.temperature.sel(x=0)).mean('y'))
Out[58]: [<matplotlib.lines.Line2D at 0x7f524fa8e2d0>]
In [59]: (ds.temperature
....: .sel(x=0)
....: .pipe(lambda x: 2 * x)
....: .mean('y')
....: .pipe(plt.plot))
....:
Out[59]: [<matplotlib.lines.Line2D at 0x7f524fa8f490>]
Both pipe and assign replicate the pandas methods of the same names (DataFrame.pipe and DataFrame.assign).
With xray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier to understand code, so we encourage using this approach.
Renaming variables¶
Another useful option is the rename() method to rename dataset variables:
In [60]: ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
Out[60]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
Data variables:
temp (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precip (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
Finally, you can use swap_dims() to swap dimension and non-dimension variables:
In [61]: ds.coords['day'] = ('time', [6, 7, 8])
In [62]: ds.swap_dims({'time': 'day'})
Out[62]:
<xray.Dataset>
Dimensions: (day: 3, x: 2, y: 2)
Coordinates:
time (day) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
* day (day) int64 6 7 8
Data variables:
temperature (x, y, day) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, day) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
Coordinates¶
Coordinates are ancillary variables stored for DataArray and Dataset objects in the coords attribute:
In [63]: ds.coords
Out[63]:
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Unlike attributes, xray does interpret and persist coordinates in operations that transform xray objects.
One dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array) take on a special meaning in xray. They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these “dimension” coordinates use a pandas.Index internally to store their values.
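Because dimension coordinates wrap a pandas.Index, a label lookup is just an Index operation; a minimal sketch using pandas directly (the variable names here are illustrative):

```python
import pandas as pd

# the kind of index a 'time' dimension coordinate stores internally
idx = pd.DatetimeIndex(pd.date_range('2014-09-06', periods=3))

# label -> integer position: the machinery behind .sel() and .loc
pos = idx.get_loc('2014-09-07')   # position 1
```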
Other than for indexing, xray does not make any direct use of the values associated with coordinates. Coordinates with names not matching a dimension are not used for alignment or indexing, nor are they required to match when doing arithmetic (see Coordinates).
Modifying coordinates¶
To entirely add or remove coordinate arrays, you can use dictionary-like syntax, as shown above.
To convert back and forth between data and coordinates, you can use the set_coords() and reset_coords() methods:
In [64]: ds.reset_coords()
Out[64]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
In [65]: ds.set_coords(['temperature', 'precipitation'])
Out[65]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* x (x) int64 0 1
* y (y) int64 0 1
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
lat (x, y) float64 42.25 42.21 42.63 42.59
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
Data variables:
*empty*
In [66]: ds['temperature'].reset_coords(drop=True)
Out[66]:
<xray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]])
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xray cannot store a coordinate and a variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.
Coordinates methods¶
Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:
In [67]: ds.coords.to_dataset()
Out[67]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
lat (x, y) float64 42.25 42.21 42.63 42.59
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
day (time) int64 6 7 8
Data variables:
*empty*
The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):
In [68]: alt = xray.Dataset(coords={'z': [10], 'lat': 0, 'lon': 0})
In [69]: ds.coords.merge(alt.coords)
Out[69]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2, z: 1)
Coordinates:
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
* y (y) int64 0 1
* x (x) int64 0 1
reference_time datetime64[ns] 2014-09-05
day (time) int64 6 7 8
* z (z) int64 10
Data variables:
*empty*
The coords.merge method may be useful if you want to implement your own binary operations that act on xray objects. In the future, we hope to write more helper functions so that you can easily make your functions act like xray’s built-in arithmetic.
Indexes¶
To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:
In [70]: ds['time'].to_index()
Out[70]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-06, ..., 2014-09-08]
Length: 3, Freq: D, Timezone: None
A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects:
In [71]: ds.indexes
Out[71]:
time: <class 'pandas.tseries.index.DatetimeIndex'>
[2014-09-06, ..., 2014-09-08]
Length: 3, Freq: D, Timezone: None
x: Int64Index([0, 1], dtype='int64')
y: Int64Index([0, 1], dtype='int64')
Converting datasets and arrays¶
To convert from a Dataset to a DataArray, use to_array():
In [72]: arr = ds.to_array()
In [73]: arr
Out[73]:
<xray.DataArray (variable: 2, x: 2, y: 2, time: 3)>
array([[[[ 11.041, 23.574, 20.772],
[ 9.346, 6.683, 17.175]],
[[ 11.6 , 19.536, 17.21 ],
[ 6.301, 9.61 , 15.909]]],
[[[ 5.904, 2.453, 3.404],
[ 9.847, 9.195, 0.378]],
[[ 8.615, 7.536, 4.052],
[ 3.435, 1.709, 3.947]]]])
Coordinates:
reference_time datetime64[ns] 2014-09-05
lat (x, y) float64 42.25 42.21 42.63 42.59
* y (y) int64 0 1
* x (x) int64 0 1
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* variable (variable) |S13 'temperature' 'precipitation'
day (time) int64 6 7 8
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.
To convert back from a DataArray to a Dataset, use to_dataset():
In [74]: arr.to_dataset(dim='variable')
Out[74]:
<xray.Dataset>
Dimensions: (time: 3, x: 2, y: 2)
Coordinates:
reference_time datetime64[ns] 2014-09-05
lat (x, y) float64 42.25 42.21 42.63 42.59
* y (y) int64 0 1
* x (x) int64 0 1
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
day (time) int64 6 7 8
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
Data variables:
temperature (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
precipitation (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:
In [75]: ds2 = xray.Dataset({'a': 0, 'b': ('x', [3, 4, 5])})
# the input dataset has 4 elements
In [76]: ds2
Out[76]:
<xray.Dataset>
Dimensions: (x: 3)
Coordinates:
* x (x) int64 0 1 2
Data variables:
a int64 0
b (x) int64 3 4 5
# the resulting array has 6 elements
In [77]: ds2.to_array()
Out[77]:
<xray.DataArray (variable: 2, x: 3)>
array([[0, 0, 0],
[3, 4, 5]])
Coordinates:
* variable (variable) |S1 'a' 'b'
* x (x) int64 0 1 2
Otherwise, the result could not be represented as an orthogonal array.
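The broadcast-then-stack behavior in this last example can be reproduced with plain NumPy (a sketch of the idea, not to_array's actual implementation):

```python
import numpy as np

a = np.broadcast_to(np.array(0), (3,))   # scalar variable 'a' broadcast along x
b = np.array([3, 4, 5])                  # variable 'b' already has dimension x

# stack along a new leading 'variable' dimension: shape (2, 3), 6 elements
stacked = np.stack([a, b])
```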
If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:
In [78]: arr.to_dataset(name='combined')
Out[78]:
<xray.Dataset>
Dimensions: (time: 3, variable: 2, x: 2, y: 2)
Coordinates:
reference_time datetime64[ns] 2014-09-05
lon (x, y) float64 -99.83 -99.32 -99.79 -99.23
* y (y) int64 0 1
* variable (variable) |S13 'temperature' 'precipitation'
* time (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
lat (x, y) float64 42.25 42.21 42.63 42.59
* x (x) int64 0 1
day (time) int64 6 7 8
Data variables:
combined (variable, x, y, time) float64 11.04 23.57 20.77 9.346 ...
[1] Latitude and longitude are 2D arrays because the dataset uses projected coordinates. reference_time refers to the reference time at which the forecast was made, rather than time, which is the valid time for which the forecast applies.
Indexing and selecting data¶
Similarly to pandas objects, xray objects support both integer and label based lookups along each dimension. However, xray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.
Thus in total, xray supports four different kinds of indexing, as described below and summarized in this table:
Dimension lookup | Index lookup | DataArray syntax | Dataset syntax |
---|---|---|---|
Positional | By integer | arr[:, 0] | not available |
Positional | By label | arr.loc[:, 'IA'] | not available |
By name | By integer | arr.isel(space=0) or arr[dict(space=0)] | ds.isel(space=0) or ds[dict(space=0)] |
By name | By label | arr.sel(space='IA') or arr.loc[dict(space='IA')] | ds.sel(space='IA') or ds.loc[dict(space='IA')] |
Positional indexing¶
Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:
In [1]: arr = xray.DataArray(np.random.rand(4, 3),
...: [('time', pd.date_range('2000-01-01', periods=4)),
...: ('space', ['IA', 'IL', 'IN'])])
...:
In [2]: arr[:2]
Out[2]:
<xray.DataArray (time: 2, space: 3)>
array([[ 0.127, 0.967, 0.26 ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
In [3]: arr[0, 0]
Out[3]:
<xray.DataArray ()>
array(0.12696983303810094)
Coordinates:
time datetime64[ns] 2000-01-01
space |S2 'IA'
In [4]: arr[:, [2, 1]]
Out[4]:
<xray.DataArray (time: 4, space: 2)>
array([[ 0.26 , 0.967],
[ 0.336, 0.377],
[ 0.123, 0.84 ],
[ 0.448, 0.373]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IN' 'IL'
Attributes are persisted in all indexing operations.
Warning
Positional indexing deviates from NumPy when indexing with multiple arrays like arr[[0, 1], [0, 1]], as described in Orthogonal (outer) vs. vectorized indexing. See Pointwise indexing for how to achieve this functionality in xray.
xray also supports label-based indexing, just like pandas. Because we use a pandas.Index under the hood, label based indexing is very fast. To do label based indexing, use the loc attribute:
In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]:
<xray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label-based indexing in xray is inclusive of both the start and stop bounds.
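The inclusive-endpoint rule comes from pandas; a sketch with a bare pandas.Index:

```python
import pandas as pd

idx = pd.DatetimeIndex(pd.date_range('2000-01-01', periods=4))

# label slices include both endpoints, unlike integer slices:
# this selects positions 0 and 1 (stop bound of 2 is exclusive positionally)
start, stop = idx.slice_locs('2000-01-01', '2000-01-02')
```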
Setting values with label based indexing is also supported:
In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10
In [7]: arr
Out[7]:
<xray.DataArray (time: 4, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Indexing with labeled dimensions¶
With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:
Use a dictionary as the argument for array positional or label based array indexing:
# index by integer array indices
In [8]: arr[dict(space=0, time=slice(None, 2))]
Out[8]:
<xray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
Out[9]:
<xray.DataArray (time: 2, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
Use the sel() and isel() convenience methods:
# index by integer array indices
In [10]: arr.isel(space=0, time=slice(None, 2))
Out[10]:
<xray.DataArray (time: 2)>
array([ 0.127, 0.897])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
space |S2 'IA'
# index by dimension coordinate labels
In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
Out[11]:
<xray.DataArray (time: 2, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL' 'IN'
The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python slice() objects or 1-dimensional arrays.
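A toy sketch of how name-based indexing maps down to positional indexing (illustrative only, not xray's implementation):

```python
import numpy as np

def isel_sketch(data, dims, **indexers):
    """Map dimension names to positional indexers, defaulting to ':'."""
    key = tuple(indexers.get(dim, slice(None)) for dim in dims)
    return data[key]

data = np.arange(12).reshape(4, 3)                    # dims ('time', 'space')
column = isel_sketch(data, ('time', 'space'), space=0)  # like isel(space=0)
```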
Note
We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, Python does not yet support indexing with keyword arguments like arr[space=0].
Warning
Do not try to assign values when using any of the indexing methods isel, isel_points, sel or sel_points:
# DO NOT do this
arr.isel(space=0) = 0
Depending on whether the underlying numpy indexing returns a copy or a view, the method will fail, and when it fails, it will fail silently. Instead, you should use normal index assignment:
# this is safe
arr[dict(space=0)] = 0
Pointwise indexing¶
xray pointwise indexing supports indexing along multiple labeled dimensions using list-like objects. While isel() performs orthogonal indexing, the isel_points() method provides indexing behavior similar to numpy, as if you were using multiple lists to index an array (e.g., arr[[0, 1], [0, 1]]):
# index by integer array indices
In [12]: da = xray.DataArray(np.arange(56).reshape((7, 8)), dims=['x', 'y'])
In [13]: da
Out[13]:
<xray.DataArray (x: 7, y: 8)>
array([[ 0, 1, 2, ..., 5, 6, 7],
[ 8, 9, 10, ..., 13, 14, 15],
[16, 17, 18, ..., 21, 22, 23],
...,
[32, 33, 34, ..., 37, 38, 39],
[40, 41, 42, ..., 45, 46, 47],
[48, 49, 50, ..., 53, 54, 55]])
Coordinates:
* x (x) int64 0 1 2 3 4 5 6
* y (y) int64 0 1 2 3 4 5 6 7
In [14]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0])
Out[14]:
<xray.DataArray (points: 3)>
array([ 0, 9, 48])
Coordinates:
y (points) int64 0 1 0
x (points) int64 0 1 6
* points (points) int64 0 1 2
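The difference between orthogonal and pointwise indexing can be reproduced with plain NumPy, using the same array and indices as above:

```python
import numpy as np

a = np.arange(56).reshape(7, 8)

# pointwise (NumPy's default with two integer arrays):
# picks the elements at (0, 0), (1, 1) and (6, 0)
pointwise = a[[0, 1, 6], [0, 1, 0]]            # -> [0, 9, 48]

# orthogonal (what isel does): the outer product of the index lists
orthogonal = a[np.ix_([0, 1, 6], [0, 1, 0])]   # shape (3, 3)
```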
There is also sel_points(), which analogously allows you to do point-wise indexing by label:
In [15]: times = pd.to_datetime(['2000-01-03', '2000-01-02', '2000-01-01'])
In [16]: arr.sel_points(space=['IA', 'IL', 'IN'], time=times)
Out[16]:
<xray.DataArray (points: 3)>
array([ 0.451, 0.377, -10. ])
Coordinates:
space (points) |S2 'IA' 'IL' 'IN'
time (points) datetime64[ns] 2000-01-03 2000-01-02 2000-01-01
* points (points) int64 0 1 2
The equivalent pandas method to sel_points is lookup().
Dataset indexing¶
We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:
In [17]: ds = arr.to_dataset()
In [18]: ds.isel(space=[0], time=[0])
Out[18]:
<xray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Data variables:
None (time, space) float64 0.127
In [19]: ds.sel(time='2000-01-01')
Out[19]:
<xray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
None (space) float64 0.127 -10.0 -10.0
In [20]: ds2 = da.to_dataset()
In [21]: ds2.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[21]:
<xray.Dataset>
Dimensions: (points: 3)
Coordinates:
y (points) int64 0 1 0
x (points) int64 0 1 6
* points (points) int64 0 1 2
Data variables:
None (points) int64 0 9 48
Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with labeled dimensions:
In [22]: ds[dict(space=[0], time=[0])]
Out[22]:
<xray.Dataset>
Dimensions: (space: 1, time: 1)
Coordinates:
* time (time) datetime64[ns] 2000-01-01
* space (space) |S2 'IA'
Data variables:
None (time, space) float64 0.127
In [23]: ds.loc[dict(time='2000-01-01')]
Out[23]:
<xray.Dataset>
Dimensions: (space: 3)
Coordinates:
time datetime64[ns] 2000-01-01
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
None (space) float64 0.127 -10.0 -10.0
Using indexing to assign values to a subset of dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.
Dropping labels¶
The drop() method returns a new object with the listed index labels along a dimension dropped:
In [24]: ds.drop(['IN', 'IL'], dim='space')
Out[24]:
<xray.Dataset>
Dimensions: (space: 1, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA'
Data variables:
None (time, space) float64 0.127 0.8972 0.4514 0.543
drop is both a Dataset and DataArray method.
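Dropping index labels reduces to an operation on the dimension coordinate's pandas.Index; sketched with pandas directly:

```python
import pandas as pd

space = pd.Index(['IA', 'IL', 'IN'])

# the labels that survive after dropping 'IN' and 'IL'
remaining = space.drop(['IN', 'IL'])
```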
Nearest neighbor lookups¶
The label based selection methods sel(), reindex() and reindex_like() all support method and tolerance keyword arguments. The method parameter enables nearest neighbor (inexact) lookups by use of the methods 'pad', 'backfill' or 'nearest':
In [25]: data = xray.DataArray([1, 2, 3], dims='x')
In [26]: data.sel(x=[1.1, 1.9], method='nearest')
Out[26]:
<xray.DataArray (x: 2)>
array([2, 3])
Coordinates:
* x (x) int64 1 2
In [27]: data.sel(x=0.1, method='backfill')
Out[27]:
<xray.DataArray ()>
array(2)
Coordinates:
x int64 1
In [28]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[28]:
<xray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
* x (x) float64 0.5 1.0 1.5 2.0 2.5
Tolerance limits the maximum distance for valid matches with an inexact lookup:
In [29]: data.reindex(x=[1.1, 1.5], method='nearest', tolerance=0.2)
Out[29]:
<xray.DataArray (x: 2)>
array([ 2., nan])
Coordinates:
* x (x) float64 1.1 1.5
Using method='nearest' or a scalar argument with .sel() requires pandas version 0.16 or newer. Using tolerance requires pandas version 0.17 or newer.
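Version gates like the pandas requirements above are safest with a numeric comparison, since naive string comparison of version strings mis-orders multi-digit components. A minimal sketch (version_tuple is an illustrative helper, not part of xray or pandas):

```python
def version_tuple(version):
    """Parse a dotted version string into a tuple of ints so comparisons
    are numeric rather than lexicographic (illustrative helper)."""
    return tuple(int(part) for part in version.split('.'))

# String comparison mis-orders versions once a component reaches two digits:
'0.7' < '0.17'                                    # False as strings
version_tuple('0.7') < version_tuple('0.17')      # True: 7 < 17
version_tuple('0.16.2') < version_tuple('0.17')   # True: tolerance unsupported
```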
The method parameter is not yet supported if any of the arguments to .sel() is a slice object:
In [30]: data.sel(x=slice(1, 3), method='nearest')
NotImplementedError
However, you don’t need to use method to do inexact slicing. Slicing already returns all values inside the range (inclusive), as long as the index labels are monotonic increasing:
In [31]: data.sel(x=slice(0.9, 3.1))
Out[31]:
<xray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
* x (x) int64 0 1 2
Indexing axes with monotonic decreasing labels also works, as long as the slice or .loc arguments are also decreasing:
In [32]: reversed_data = data[::-1]
In [33]: reversed_data.loc[3.1:0.9]
KeyError: 3
Masking with where¶
Indexing methods on xray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xray, use where():
In [34]: arr2 = xray.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])
In [35]: arr2.where(arr2.x + arr2.y < 4)
Out[35]:
<xray.DataArray (x: 4, y: 4)>
array([[ 0., 1., 2., 3.],
[ 4., 5., 6., nan],
[ 8., 9., nan, nan],
[ 12., nan, nan, nan]])
Coordinates:
* y (y) int64 0 1 2 3
* x (x) int64 0 1 2 3
This is particularly useful for ragged indexing of multi-dimensional data, e.g., to apply a 2D mask to an image. Note that where follows all the usual xray broadcasting and alignment rules for binary operations (e.g., +) between the object being indexed and the condition, as described in Computation:
In [36]: arr2.where(arr2.y < 2)
Out[36]:
<xray.DataArray (x: 4, y: 4)>
array([[ 0., 1., nan, nan],
[ 4., 5., nan, nan],
[ 8., 9., nan, nan],
[ 12., 13., nan, nan]])
Coordinates:
* y (y) int64 0 1 2 3
* x (x) int64 0 1 2 3
Multi-dimensional indexing¶
Xray does not yet support efficient routines for generalized multi-dimensional indexing or regridding. However, we are definitely interested in adding support for this in the future (see GH475 for the ongoing discussion).
Copies vs. views¶
Whether array indexing returns a view or a copy of the underlying data depends on the nature of the labels. For positional (integer) indexing, xray follows the same rules as NumPy:
- Positional indexing with only integers and slices returns a view.
- Positional indexing with arrays or lists returns a copy.
The rules for label based indexing are more complex:
- Label-based indexing with only slices returns a view.
- Label-based indexing with arrays returns a copy.
- Label-based indexing with scalars returns a view or a copy, depending on whether the corresponding positional indexer can be represented as an integer or a slice object. The exact rules are determined by pandas.
Whether data is a copy or a view is more predictable in xray than in pandas, so unlike pandas, xray does not produce SettingWithCopy warnings. However, you should still avoid assignment with chained indexing.
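The positional rules are inherited directly from NumPy, so they can be checked on a plain ndarray (a small illustration, independent of xray):

```python
import numpy as np

base = np.arange(6)

view = base[1:4]        # slicing returns a view...
view[0] = 100           # ...so writing to it mutates base

copy = base[[1, 2, 3]]  # indexing with a list returns a copy...
copy[0] = -1            # ...so this write leaves base untouched
```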
Orthogonal (outer) vs. vectorized indexing¶
Indexing with xray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s broadcasting rules to vectorize indexers. This means you can do indexing like this, which would require slightly more awkward syntax with numpy arrays:
In [37]: arr[arr['time.day'] > 1, arr['space'] != 'IL']
Out[37]:
<xray.DataArray (time: 3, space: 2)>
array([[ 0.897, 0.336],
[ 0.451, 0.123],
[ 0.543, 0.448]])
Coordinates:
* time (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IN'
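For comparison, reproducing this kind of orthogonal selection on a raw NumPy array requires np.ix_, since passing two boolean masks directly would trigger NumPy's pointwise advanced indexing (a generic sketch with a small integer array, not the arr from the example above):

```python
import numpy as np

a = np.arange(12).reshape(4, 3)
row_mask = np.array([False, True, True, True])  # like arr['time.day'] > 1
col_mask = np.array([True, False, True])        # like arr['space'] != 'IL'

# np.ix_ forms the outer product of the selected indices, applying each
# mask along its own axis -- the behavior xray gives you by default
selected = a[np.ix_(row_mask, col_mask)]
# selected has shape (3, 2)
```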
This is a much simpler model than numpy’s advanced indexing. If you would like to do advanced-style array indexing in xray, you have several options:
- Pointwise indexing
- Masking with where
- Index the underlying NumPy array directly using .values, e.g.,
In [38]: arr.values[arr.values > 0.5]
Out[38]: array([ 0.897, 0.84 , 0.543])
Align and reindex¶
xray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinate labels along one or more dimensions. Values are kept where the original index labels appear among the new labels, and positions corresponding to new labels not found in the original object are filled with NaN.
Xray operations that combine multiple objects generally automatically align their arguments to share the same indexes. However, manual alignment can be useful for greater control and for increased performance.
To reindex a particular dimension, use reindex():
In [39]: arr.reindex(space=['IA', 'CA'])
Out[39]:
<xray.DataArray (time: 4, space: 2)>
array([[ 0.127, nan],
[ 0.897, nan],
[ 0.451, nan],
[ 0.543, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'CA'
The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:
In [40]: foo = arr.rename('foo')
In [41]: baz = (10 * arr[:2, :2]).rename('baz')
In [42]: baz
Out[42]:
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.27 , -100. ],
[ 8.972, 3.767]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) |S2 'IA' 'IL'
Reindexing foo with baz selects out the first two values along each dimension:
In [43]: foo.reindex_like(baz)
Out[43]:
<xray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.127, -10. ],
[ 0.897, 0.377]])
Coordinates:
* space (space) object 'IA' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02
The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:
In [44]: baz.reindex_like(foo)
Out[44]:
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.27 , -100. , nan],
[ 8.972, 3.767, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN'
The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:
In [45]: xray.align(foo, baz, join='inner')
Out[45]:
(<xray.DataArray 'foo' (time: 2, space: 2)>
array([[ 0.127, -10. ],
[ 0.897, 0.377]])
Coordinates:
* space (space) object 'IA' 'IL'
* time (time) datetime64[ns] 2000-01-01 2000-01-02,
<xray.DataArray 'baz' (time: 2, space: 2)>
array([[ 1.27 , -100. ],
[ 8.972, 3.767]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL')
In [46]: xray.align(foo, baz, join='outer')
Out[46]:
(<xray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127, -10. , -10. ],
[ 0.897, 0.377, 0.336],
[ 0.451, 0.84 , 0.123],
[ 0.543, 0.373, 0.448]])
Coordinates:
* space (space) object 'IA' 'IL' 'IN'
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04,
<xray.DataArray 'baz' (time: 4, space: 3)>
array([[ 1.27 , -100. , nan],
[ 8.972, 3.767, nan],
[ nan, nan, nan],
[ nan, nan, nan]])
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) object 'IA' 'IL' 'IN')
Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:
In [47]: ds
Out[47]:
<xray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
None (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...
In [48]: ds.reindex_like(baz)
Out[48]:
<xray.Dataset>
Dimensions: (space: 2, time: 2)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02
* space (space) object 'IA' 'IL'
Data variables:
None (time, space) float64 0.127 -10.0 0.8972 0.3767
In [49]: other = xray.DataArray(['a', 'b', 'c'], dims='other')
# this is a no-op, because there are no shared dimension names
In [50]: ds.reindex_like(other)
Out[50]:
<xray.Dataset>
Dimensions: (space: 3, time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
* space (space) |S2 'IA' 'IL' 'IN'
Data variables:
None (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...
Computation¶
The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.
Basic array math¶
Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:
In [1]: arr = xray.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr - 3
Out[2]:
<xray.DataArray (x: 2, y: 3)>
array([[-2.5308877 , -3.28286334, -4.5090585 ],
[-4.13563237, -1.78788797, -3.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [3]: abs(arr)
Out[3]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , 0.28286334, 1.5090585 ],
[ 1.13563237, 1.21211203, 0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
You can also use any of numpy’s or scipy’s many ufunc functions directly on a DataArray:
In [4]: np.sin(arr)
Out[4]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.45209466, -0.27910634, -0.99809483],
[-0.90680094, 0.9363595 , -0.17234978]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data arrays also implement many numpy.ndarray methods:
In [5]: arr.round(2)
Out[5]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.47, -0.28, -1.51],
[-1.14, 1.21, -0.17]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [6]: arr.T
Out[6]:
<xray.DataArray (y: 3, x: 2)>
array([[ 0.4691123 , -1.13563237],
[-0.28286334, 1.21211203],
[-1.5090585 , -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Missing values¶
xray objects borrow from pandas the isnull(), notnull(), count(), dropna() and fillna() methods for working with missing data:
In [7]: x = xray.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'])
In [8]: x.isnull()
Out[8]:
<xray.DataArray (x: 5)>
array([False, False, True, True, False], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [9]: x.notnull()
Out[9]:
<xray.DataArray (x: 5)>
array([ True, True, False, False, True], dtype=bool)
Coordinates:
* x (x) int64 0 1 2 3 4
In [10]: x.count()
Out[10]:
<xray.DataArray ()>
array(3)
In [11]: x.dropna(dim='x')
Out[11]:
<xray.DataArray (x: 3)>
array([ 0., 1., 2.])
Coordinates:
* x (x) int64 0 1 4
In [12]: x.fillna(-1)
Out[12]:
<xray.DataArray (x: 5)>
array([ 0., 1., -1., -1., 2.])
Coordinates:
* x (x) int64 0 1 2 3 4
Like pandas, xray uses the float value np.nan (not-a-number) to represent missing values.
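Because np.nan is a floating-point value, introducing it into integer data promotes the array to float64, which is why the outputs above are floats. The same effect in plain NumPy:

```python
import numpy as np

ints = np.array([0, 1, 2])                 # integer dtype
masked = np.where(ints > 0, ints, np.nan)  # replace 0 with a missing value

# np.nan cannot be stored in an integer array, so the result is float64:
# masked -> array([nan, 1., 2.])
```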
Aggregation¶
Aggregation methods have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):
In [13]: arr.sum(dim='x')
Out[13]:
<xray.DataArray (y: 3)>
array([-0.66652007, 0.92924868, -1.68227315])
Coordinates:
* y (y) int64 10 20 30
In [14]: arr.std(['x', 'y'])
Out[14]:
<xray.DataArray ()>
array(0.9156385956757354)
In [15]: arr.min()
Out[15]:
<xray.DataArray ()>
array(-1.5090585031735124)
If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:
In [16]: arr.get_axis_num('y')
Out[16]: 1
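Conceptually, the lookup is just a positional index into the tuple of dimension names stored in array order; a simplified sketch (not xray's actual implementation) of how the resulting axis number plugs into axis-based NumPy code:

```python
import numpy as np

dims = ('x', 'y')                    # dimension names, in array order
values = np.arange(6).reshape(2, 3)  # the underlying ndarray

def get_axis_num(dims, dim):
    """Map a dimension name to its positional axis (simplified sketch)."""
    return dims.index(dim)

# Use the axis number with any axis-based NumPy routine:
col_sums = values.sum(axis=get_axis_num(dims, 'x'))  # sum over 'x'
```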
These operations automatically skip missing values, like in pandas:
In [17]: xray.DataArray([1, 2, np.nan, 3]).mean()
Out[17]:
<xray.DataArray ()>
array(2.0)
If desired, you can disable this behavior by invoking the aggregation method with skipna=False.
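The two behaviors correspond to NumPy's nan-aware and plain aggregations; a quick illustration without xray:

```python
import numpy as np

values = np.array([1.0, 2.0, np.nan, 3.0])

with_skip = np.nanmean(values)  # ignores the NaN, like skipna=True (the default)
without_skip = np.mean(values)  # propagates the NaN, like skipna=False
```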
Broadcasting by dimension name¶
DataArray objects automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with np.reshape() or np.newaxis.
This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:
In [18]: a = xray.DataArray([1, 2], [('x', ['a', 'b'])])
In [19]: a
Out[19]:
<xray.DataArray (x: 2)>
array([1, 2])
Coordinates:
* x (x) |S1 'a' 'b'
In [20]: b = xray.DataArray([-1, -2, -3], [('y', [10, 20, 30])])
In [21]: b
Out[21]:
<xray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
* y (y) int64 10 20 30
With xray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:
In [22]: a * b
Out[22]:
<xray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
[-2, -4, -6]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
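The equivalent outer product with raw NumPy arrays needs a manually inserted dummy dimension, which is exactly the bookkeeping that dimension names remove:

```python
import numpy as np

a = np.array([1, 2])        # conceptually dimension 'x'
b = np.array([-1, -2, -3])  # conceptually dimension 'y'

# Positional broadcasting: give a a trailing length-1 axis so the shapes
# (2, 1) and (3,) broadcast to (2, 3)
product = a[:, np.newaxis] * b
```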
Moreover, dimensions are always reordered to the order in which they first appeared:
In [23]: c = xray.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])
In [24]: c
Out[24]:
<xray.DataArray (y: 3, x: 2)>
array([[0, 1],
[2, 3],
[4, 5]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
In [25]: a + c
Out[25]:
<xray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
[3, 5, 7]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
This means, for example, that subtracting an array from its transpose always yields zero:
In [26]: c - c.T
Out[26]:
<xray.DataArray (y: 3, x: 2)>
array([[0, 0],
[0, 0],
[0, 0]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Automatic alignment¶
xray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations.
Similarly to pandas, this alignment happens automatically for binary operations. Note that unlike pandas, the result of a binary operation is indexed by the intersection (not the union) of the coordinate labels:
In [27]: arr + arr[:1]
Out[27]:
<xray.DataArray (x: 1, y: 3)>
array([[ 0.9382246 , -0.56572669, -3.01811701]])
Coordinates:
* y (y) int64 10 20 30
* x (x) object 'a'
If the result would be empty, an error is raised instead:
In [28]: arr[:2] + arr[2:]
ValueError: no overlapping labels for some dimensions: ['x']
Before loops or performance critical code, it’s a good idea to align arrays explicitly (e.g., by putting them in the same Dataset or using align()) to avoid the overhead of repeated alignment with each operation. See Align and reindex for more details.
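The same idea in pandas terms (whose alignment machinery xray builds on): align once before the loop instead of paying for automatic alignment on every operation. A sketch:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s2 = pd.Series([10, 20], index=['b', 'c'])

# Hoist alignment out of any hot loop: one explicit align call, after which
# all subsequent arithmetic works on identically indexed objects
s1_aligned, s2_aligned = s1.align(s2, join='inner')
total = s1_aligned + s2_aligned
```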
Note
There is no automatic alignment between arguments when performing in-place arithmetic operations such as +=. You will need to use manual alignment. This ensures in-place arithmetic never needs to modify data types.
Coordinates¶
Although index coordinates are aligned, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalar coordinates:
In [29]: arr[0]
Out[29]:
<xray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
In [30]: arr[1]
Out[30]:
<xray.DataArray (y: 3)>
array([-1.13563237, 1.21211203, -0.17321465])
Coordinates:
x |S1 'b'
* y (y) int64 10 20 30
# notice that the scalar coordinate 'x' is silently dropped
In [31]: arr[1] - arr[0]
Out[31]:
<xray.DataArray (y: 3)>
array([-1.60474467, 1.49497537, 1.33584385])
Coordinates:
* y (y) int64 10 20 30
Still, xray will persist other coordinates in arithmetic, as long as there are no conflicting values:
# only one argument has the 'x' coordinate
In [32]: arr[0] + 1
Out[32]:
<xray.DataArray (y: 3)>
array([ 1.4691123 , 0.71713666, -0.5090585 ])
Coordinates:
* y (y) int64 10 20 30
x |S1 'a'
# both arguments have the same 'x' coordinate
In [33]: arr[0] - arr[0]
Out[33]:
<xray.DataArray (y: 3)>
array([ 0., 0., 0.])
Coordinates:
* y (y) int64 10 20 30
x |S1 'a'
Math with datasets¶
Datasets support arithmetic operations by automatically looping over all data variables:
In [34]: ds = xray.Dataset({'x_and_y': (('x', 'y'), np.random.randn(2, 3)),
....: 'x_only': ('x', np.random.randn(2))},
....: coords=arr.coords)
....:
In [35]: ds > 0
Out[35]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
x_only (x) bool True False
x_and_y (x, y) bool True False False False False True
Datasets support most of the same methods found on data arrays:
In [36]: ds.mean(dim='x')
Out[36]:
<xray.Dataset>
Dimensions: (y: 3)
Coordinates:
* y (y) int64 10 20 30
Data variables:
x_only float64 0.007392
x_and_y (y) float64 -0.9927 -0.7696 0.105
In [37]: abs(ds)
Out[37]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
x_only (x) float64 0.7216 0.7068
x_and_y (x, y) float64 0.1192 1.044 0.8618 2.105 0.4949 1.072
transpose() can also be used to reorder dimensions on all variables:
In [38]: ds.transpose('y', 'x')
Out[38]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data variables:
x_only (x) float64 0.7216 -0.7068
x_and_y (y, x) float64 0.1192 -2.105 -1.044 -0.4949 -0.8618 1.072
Unfortunately, a limitation of the current version of numpy means that we cannot override ufuncs for datasets, because datasets cannot be written as a single array [1]. apply() works around this limitation, by applying the given function to each variable in the dataset:
In [39]: ds.apply(np.sin)
Out[39]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
Data variables:
x_only (x) float64 0.6606 -0.6494
x_and_y (x, y) float64 0.1189 -0.8645 -0.759 -0.8609 -0.475 0.8781
Datasets also use looping over variables for broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset:
In [40]: ds + arr
Out[40]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
x_only (x, y) float64 1.191 0.4387 -0.7875 -1.842 0.5053 -0.88
x_and_y (x, y) float64 0.5883 -1.327 -2.371 -3.24 0.7172 0.8986
Arithmetic between two datasets matches data variables of the same name:
In [41]: ds2 = xray.Dataset({'x_and_y': 0, 'x_only': 100})
In [42]: ds - ds2
Out[42]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
x_only (x) float64 -99.28 -100.7
x_and_y (x, y) float64 0.1192 -1.044 -0.8618 -2.105 -0.4949 1.072
Similarly to index based alignment, the result has the intersection of all matching variables, and ValueError is raised if the result would be empty.
[1] When numpy 1.10 is released, we should be able to override ufuncs for datasets by making use of __numpy_ufunc__.
GroupBy: split-apply-combine¶
xray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:
- Split your data into multiple independent groups.
- Apply some function to each group.
- Combine your groups back into a single data object.
Group by operations work on both Dataset and DataArray objects. Currently, you can only group by a single one-dimensional variable (eventually, we hope to remove this limitation).
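The strategy itself is easy to spell out by hand; a minimal pandas sketch of the three steps (illustrative data and names only):

```python
import pandas as pd

df = pd.DataFrame({'letters': list('abba'),
                   'value': [1.0, 2.0, 3.0, 5.0]})

# 1. split: partition rows by group label
pieces = {label: grp['value'] for label, grp in df.groupby('letters')}

# 2. apply: reduce each group independently
means = {label: values.mean() for label, values in pieces.items()}

# 3. combine: reassemble the per-group results into one object
combined = pd.Series(means)
```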
Split¶
Let’s create a simple example dataset:
In [1]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
...: coords={'x': [10, 20, 30, 40],
...: 'letters': ('x', list('abba'))})
...:
In [2]: arr = ds['foo']
In [3]: ds
Out[3]:
<xray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
If we group by the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back an xray.GroupBy object:
In [4]: ds.groupby('letters')
Out[4]: <xray.core.groupby.DatasetGroupBy at 0x7f524cbb1290>
This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:
In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}
You can also iterate over groups in (label, group) pairs:
In [6]: list(ds.groupby('letters'))
Out[6]:
[('a', <xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 40
letters (x) |S1 'a' 'a'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
('b', <xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 20 30
letters (x) |S1 'b' 'b'
* y (y) int64 0 1 2
Data variables:
foo (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]
Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.
Apply¶
To apply a function to each group, you can use the flexible xray.GroupBy.apply() method. The resulting objects are automatically concatenated back together along the group axis:
In [7]: def standardize(x):
...: return (x - x.mean()) / x.std()
...:
In [8]: arr.groupby('letters').apply(standardize)
Out[8]:
<xray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.23 , 1.937, -0.726],
[ 1.42 , -0.46 , -0.607],
[-0.191, 1.214, -1.376],
[ 0.339, -0.302, -0.019]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:
In [9]: arr.groupby('letters').mean(dim='x')
Out[9]:
<xray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.335, 0.67 , 0.354],
[ 0.674, 0.609, 0.23 ]])
Coordinates:
* y (y) int64 0 1 2
* letters (letters) object 'a' 'b'
Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:
In [10]: ds.groupby('x').std()
Out[10]:
<xray.Dataset>
Dimensions: (x: 4)
Coordinates:
letters (x) |S1 'a' 'b' 'b' 'a'
* x (x) int64 10 20 30 40
Data variables:
foo (x) float64 0.3684 0.2554 0.2931 0.06957
First and last¶
There are two special aggregation operations that are currently only found on groupby objects: first and last. These return the first or last value in each group along the grouped dimension:
In [11]: ds.groupby('letters').first()
Out[11]:
<xray.Dataset>
Dimensions: (letters: 2, y: 3)
Coordinates:
* y (y) int64 0 1 2
* letters (letters) object 'a' 'b'
Data variables:
foo (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362
By default, they skip missing values (control this with skipna).
Grouped arithmetic¶
GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:
In [12]: alt = arr.groupby('letters').mean()
In [13]: alt
Out[13]:
<xray.DataArray 'foo' (letters: 2)>
array([ 0.453, 0.504])
Coordinates:
* letters (letters) object 'a' 'b'
In [14]: ds.groupby('letters') - alt
Out[14]:
<xray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10 20 30 40
letters (x) |S1 'a' 'b' 'b' 'a'
Data variables:
foo (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 ...
This last line is roughly equivalent to the following:
results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xray.concat(results, dim='x')
Squeezing¶
When grouping over a dimension, you can control whether the dimension is squeezed out or whether it should remain with length one in each group by using the squeeze parameter:
In [15]: next(iter(arr.groupby('x')))
Out[15]:
(10, <xray.DataArray 'foo' (y: 3)>
array([ 0.127, 0.967, 0.26 ])
Coordinates:
* y (y) int64 0 1 2
x int64 10
letters |S1 'a')
In [16]: next(iter(arr.groupby('x', squeeze=False)))
Out[16]:
(10, <xray.DataArray 'foo' (x: 1, y: 3)>
array([[ 0.127, 0.967, 0.26 ]])
Coordinates:
* y (y) int64 0 1 2
* x (x) int64 10
letters (x) |S1 'a')
Although xray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.
You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
Combining data¶
- For combining datasets or data arrays along a dimension, see concatenate.
- For combining datasets with different variables, see merge.
Concatenate¶
To combine arrays along an existing or new dimension into a larger array, you can use concat(). concat takes an iterable of DataArray or Dataset objects, as well as a dimension name, and concatenates along that dimension:
In [1]: arr = xray.DataArray(np.random.randn(2, 3),
...: [('x', ['a', 'b']), ('y', [10, 20, 30])])
...:
In [2]: arr[:, :1]
Out[2]:
<xray.DataArray (x: 2, y: 1)>
array([[ 0.4691123 ],
[-1.13563237]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10
# this resembles how you would use np.concatenate
In [3]: xray.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* x (x) |S1 'a' 'b'
* y (y) int64 10 20 30
In addition to combining along an existing dimension, concat can create a new dimension by stacking lower dimensional arrays together:
In [4]: arr[0]
Out[4]:
<xray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
x |S1 'a'
* y (y) int64 10 20 30
# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xray.concat([arr[0], arr[1]], 'x')
Out[5]:
<xray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
If the second argument to concat is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension:
In [6]: xray.concat([arr[0], arr[1]], 'new_dim')
Out[6]:
<xray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 0 1
The second argument to concat can also be an Index or DataArray object rather than a string, in which case it is used to label the values along the new dimension:
In [7]: xray.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[7]:
<xray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
[-1.13563237, 1.21211203, -0.17321465]])
Coordinates:
* y (y) int64 10 20 30
x (new_dim) |S1 'a' 'b'
* new_dim (new_dim) int64 -90 -100
Of course, concat also works on Dataset objects:
In [8]: ds = arr.to_dataset(name='foo')
In [9]: xray.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[9]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
concat() has a number of options which provide deeper control over which variables are concatenated and how it handles conflicting variables between datasets. With the default parameters, xray will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily using Out of core computation with dask.
Merge¶
To combine variables and coordinates between multiple Datasets, you can use the merge() and update() methods. Merge checks for conflicting variables before merging and by default it returns a new Dataset:
In [10]: ds.merge({'hello': ('space', np.arange(3) + 10)})
Out[10]:
<xray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
* space (space) int64 0 1 2
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
hello (space) int64 10 11 12
If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the union of all index coordinates:
In [11]: other = xray.Dataset({'bar': ('x', [1, 2, 3, 4]), 'x': list('abcd')})
In [12]: ds.merge(other)
Out[12]:
<xray.Dataset>
Dimensions: (x: 4, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) object 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 nan ...
bar (x) int64 1 2 3 4
This ensures that the merge is non-destructive.
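This outer-join behavior mirrors pandas' own label-based alignment; a minimal sketch of the same idea with plain pandas Series:

```python
import pandas as pd

# two series with partially overlapping labels
x = pd.Series([1, 2], index=['a', 'b'])
y = pd.Series([3, 4], index=['b', 'c'])

# align on the union of labels; gaps are filled with NaN,
# so neither input loses any values
x2, y2 = x.align(y, join='outer')
print(list(x2.index))  # ['a', 'b', 'c']
```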
The same non-destructive merging between DataArray index coordinates is used in the Dataset constructor:
In [13]: xray.Dataset({'a': arr[:-1], 'b': arr[1:]})
Out[13]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) object 'a' 'b'
* y (y) int64 10 20 30
Data variables:
a (x, y) float64 0.4691 -0.2829 -1.509 nan nan nan
b (x, y) float64 nan nan nan -1.136 1.212 -0.1732
Update¶
In contrast to merge, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:
In [14]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[14]:
<xray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.
update also performs automatic alignment if necessary. Unlike merge, it maintains the alignment of the original array instead of merging indexes:
In [15]: ds.update(other)
Out[15]:
<xray.Dataset>
Dimensions: (space: 3, x: 2, y: 3)
Coordinates:
* y (y) int64 10 20 30
* x (x) object 'a' 'b'
* space (space) float64 10.2 9.4 3.9
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
bar (x) int64 1 2
The exact same alignment logic is used when setting a variable with __setitem__ syntax:
In [16]: ds['baz'] = xray.DataArray([9, 9, 9, 9, 9], coords=[('x', list('abcde'))])
In [17]: ds.baz
Out[17]:
<xray.DataArray 'baz' (x: 2)>
array([9, 9])
Coordinates:
* x (x) object 'a' 'b'
Equals and identical¶
xray objects can be compared by using the equals(), identical() and broadcast_equals() methods. These methods are used by the optional compat argument on concat and merge.
equals checks dimension names, indexes and array values:
In [18]: arr.equals(arr.copy())
Out[18]: True
identical also checks attributes, and the name of each object:
In [19]: arr.identical(arr.rename('bar'))
Out[19]: False
broadcast_equals does a more relaxed form of equality check that allows variables to have different dimensions, as long as values are constant along those new dimensions:
In [20]: left = xray.Dataset(coords={'x': 0})
In [21]: right = xray.Dataset({'x': [0, 0, 0]})
In [22]: left.broadcast_equals(right)
Out[22]: True
Like pandas objects, two xray objects are still equal or identical if they have missing values marked by NaN in the same locations.
In contrast, the == operation performs element-wise comparison (like numpy):
In [23]: arr == arr.copy()
Out[23]:
<xray.DataArray (x: 2, y: 3)>
array([[ True, True, True],
[ True, True, True]], dtype=bool)
Coordinates:
* y (y) int64 10 20 30
* x (x) |S1 'a' 'b'
Note that NaN does not compare equal to NaN in element-wise comparison; you may need to deal with missing values explicitly.
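This is standard IEEE floating point behavior, and you can see it with plain numpy arrays:

```python
import numpy as np

a = np.array([1.0, np.nan])
b = np.array([1.0, np.nan])

print(a == b)  # [ True False] -- NaN is not equal even to itself

# one way to treat NaNs in matching positions as equal:
same = (a == b) | (np.isnan(a) & np.isnan(b))
print(same.all())  # True
```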
Time series data¶
A major use case for xray is multi-dimensional time-series data. Accordingly, we've brought many of the features that make working with time-series data in pandas such a joy over to xray. In most cases, we rely on pandas for the core functionality.
Creating datetime64 data¶
xray uses the numpy dtypes datetime64[ns] and timedelta64[ns] to represent datetime data, which offer vectorized (if sometimes buggy) operations with numpy and smooth integration with pandas.
To convert to or create regular arrays of datetime64 data, we recommend using pandas.to_datetime() and pandas.date_range():
In [1]: pd.to_datetime(['2000-01-01', '2000-02-02'])
Out[1]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01, 2000-02-02]
Length: 2, Freq: None, Timezone: None
In [2]: pd.date_range('2000-01-01', periods=365)
Out[2]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2000-01-01, ..., 2000-12-30]
Length: 365, Freq: D, Timezone: None
Alternatively, you can supply arrays of Python datetime objects. These get converted automatically when used as arguments in xray objects:
In [3]: import datetime
In [4]: xray.Dataset({'time': datetime.datetime(2000, 1, 1)})
Out[4]:
<xray.Dataset>
Dimensions: ()
Coordinates:
*empty*
Data variables:
time datetime64[ns] 2000-01-01
When reading or writing netCDF files, xray automatically decodes datetime and timedelta arrays using CF conventions (that is, by using a units attribute like 'days since 2000-01-01').
You can manually decode arrays in this form by passing a dataset to decode_cf():
In [5]: attrs = {'units': 'hours since 2000-01-01'}
In [6]: ds = xray.Dataset({'time': ('time', [0, 1, 2, 3], attrs)})
In [7]: xray.decode_cf(ds)
Out[7]:
<xray.Dataset>
Dimensions: (time: 4)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
*empty*
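Under the hood, CF time decoding amounts to adding numeric offsets to a reference date. A simplified sketch using pandas alone (decode_cf_times is a hypothetical helper, not part of xray, and handles only a few unit names):

```python
import numpy as np
import pandas as pd

def decode_cf_times(nums, units):
    # assumes CF units of the form '<unit> since <reference date>'
    unit, _, ref = units.partition(' since ')
    pandas_unit = {'seconds': 's', 'hours': 'h', 'days': 'D'}[unit]
    return pd.Timestamp(ref) + pd.to_timedelta(np.asarray(nums), unit=pandas_unit)

times = decode_cf_times([0, 1, 2, 3], 'hours since 2000-01-01')
print(times[3])  # 2000-01-01 03:00:00
```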
One unfortunate limitation of using datetime64[ns] is that it limits the native representation of dates to those that fall between the years 1678 and 2262. When a netCDF file contains dates outside of these bounds, dates will be returned as arrays of netcdftime.datetime objects.
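These bounds follow from the 64-bit nanosecond representation, and pandas exposes them directly:

```python
import pandas as pd

# the representable range of datetime64[ns]
print(pd.Timestamp.min)  # a date in September 1677
print(pd.Timestamp.max)  # a date in April 2262
```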
Datetime indexing¶
xray borrows powerful indexing machinery from pandas (see Indexing and selecting data).
This allows for several useful and succinct forms of indexing, particularly for datetime64 data. For example, we support indexing with strings for single items and with the slice object:
In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24)
In [9]: ds = xray.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})
In [10]: ds.sel(time='2000-01')
Out[10]:
<xray.Dataset>
Dimensions: (time: 744)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
foo (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10'))
Out[11]:
<xray.Dataset>
Dimensions: (time: 240)
Coordinates:
* time (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ...
Data variables:
foo (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...
You can also select a particular time by indexing with a datetime.time object:
In [12]: ds.sel(time=datetime.time(12))
Out[12]:
<xray.Dataset>
Dimensions: (time: 365)
Coordinates:
* time (time) datetime64[ns] 2000-01-01T12:00:00 2000-01-02T12:00:00 ...
Data variables:
foo (time) int64 12 36 60 84 108 132 156 180 204 228 252 276 300 ...
For more details, read the pandas documentation.
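Because xray defers to pandas here, the same partial-string indexing works on a bare pandas Series:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(48),
              index=pd.date_range('2000-01-01', periods=48, freq=pd.offsets.Hour()))

# a partial date string selects every timestamp within that day
jan1 = s.loc['2000-01-01']
print(len(jan1))  # 24

# string slices are label-based and include both endpoints
noon_onward = s.loc['2000-01-01 12:00':'2000-01-02 23:00']
print(len(noon_onward))  # 36
```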
Datetime components¶
xray supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”:
In [13]: ds['time.month']
Out[13]:
<xray.DataArray 'month' (time: 8760)>
array([ 1, 1, 1, ..., 12, 12, 12], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
In [14]: ds['time.dayofyear']
Out[14]:
<xray.DataArray 'dayofyear' (time: 8760)>
array([ 1, 1, 1, ..., 365, 365, 365], dtype=int32)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
xray adds 'season' to the list of datetime components supported by pandas:
In [15]: ds['time.season']
Out[15]:
<xray.DataArray 'season' (time: 8760)>
array(['DJF', 'DJF', 'DJF', ..., 'DJF', 'DJF', 'DJF'],
dtype='|S3')
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
The set of valid seasons consists of ‘DJF’, ‘MAM’, ‘JJA’ and ‘SON’, labeled by the first letters of the corresponding months.
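The 'season' shortcut is just a mapping from month number to a three-letter label; equivalent logic with plain pandas (the month_to_season dictionary is illustrative, not xray's internal code):

```python
import pandas as pd

# month -> season, matching the DJF/MAM/JJA/SON convention
month_to_season = {12: 'DJF', 1: 'DJF', 2: 'DJF',
                   3: 'MAM', 4: 'MAM', 5: 'MAM',
                   6: 'JJA', 7: 'JJA', 8: 'JJA',
                   9: 'SON', 10: 'SON', 11: 'SON'}

times = pd.DatetimeIndex(['2000-01-15', '2000-04-15', '2000-07-15', '2000-10-15'])
seasons = [month_to_season[m] for m in times.month]
print(seasons)  # ['DJF', 'MAM', 'JJA', 'SON']
```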
You can use these shortcuts with both Datasets and DataArray coordinates.
Resampling and grouped operations¶
Datetime components couple particularly well with grouped operations (see GroupBy: split-apply-combine) for analyzing features that repeat over time. Here’s how to calculate the mean by time of day:
In [16]: ds.groupby('time.hour').mean()
Out[16]:
<xray.Dataset>
Dimensions: (hour: 24)
Coordinates:
* hour (hour) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
foo (hour) float64 4.368e+03 4.369e+03 4.37e+03 4.371e+03 4.372e+03 ...
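The underlying split-apply-combine is the same as grouping a pandas Series by a derived key:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(48.0),
              index=pd.date_range('2000-01-01', periods=48, freq=pd.offsets.Hour()))

# group by hour of day, averaging over both days
by_hour = s.groupby(s.index.hour).mean()
print(by_hour.loc[0])  # 12.0 -- the mean of values 0 and 24
```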
For upsampling or downsampling temporal resolutions, xray offers a resample() method building on the core functionality of the pandas method of the same name. Resample uses essentially the same API as resample in pandas.
For example, we can downsample our dataset from hourly to 6-hourly:
In [17]: ds.resample('6H', dim='time', how='mean')
Out[17]:
<xray.Dataset>
Dimensions: (time: 1460)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
foo (time) float64 2.5 8.5 14.5 20.5 26.5 32.5 38.5 44.5 50.5 56.5 ...
Resample also works for upsampling, in which case intervals without any values are marked by NaN:
In [18]: ds.resample('30Min', 'time')
Out[18]:
<xray.Dataset>
Dimensions: (time: 17519)
Coordinates:
* time (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ...
Data variables:
foo (time) float64 0.0 nan 1.0 nan 2.0 nan 3.0 nan 4.0 nan 5.0 nan ...
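Since resample builds on pandas, the same down- and upsampling can be sketched with a bare Series using the current pandas resample API:

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(24.0),
              index=pd.date_range('2000-01-01', periods=24, freq=pd.offsets.Hour()))

# downsample hourly -> 6-hourly means
down = s.resample(pd.Timedelta(hours=6)).mean()
print(list(down))  # [2.5, 8.5, 14.5, 20.5]

# upsample hourly -> half-hourly; the new intermediate points are NaN
up = s.resample(pd.Timedelta(minutes=30)).asfreq()
```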
Of course, all of these resampling and groupby operations work on both Dataset and DataArray objects with any number of additional dimensions.
For more examples of using grouped operations on a time dimension, see Toy weather data.
Working with pandas¶
One of the most important features of xray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built into pandas itself or provided by pandas-aware libraries such as Seaborn.
Hierarchical and tidy data¶
Tabular data is easiest to work with when it meets the criteria for tidy data:
- Each column holds a different variable.
- Each row holds a different observation.
In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.
Dataset and DataFrame¶
To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:
In [1]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
...: coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
...: 'along_x': ('x', np.random.randn(2)),
...: 'scalar': 123})
...:
In [2]: ds
Out[2]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* y (y) |S1 'a' 'b' 'c'
* x (x) int64 10 20
scalar int64 123
along_x (x) float64 0.1192 -1.044
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
In [3]: df = ds.to_dataframe()
In [4]: df
Out[4]:
foo scalar along_x
x y
10 a 0.469112 123 0.119209
b -0.282863 123 0.119209
c -1.509059 123 0.119209
20 a -1.135632 123 -1.044236
b 1.212112 123 -1.044236
c -0.173215 123 -1.044236
[6 rows x 3 columns]
We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of indexes which are in the index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
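For example, unstack pivots one index level into columns (using a small standalone frame rather than the ds above):

```python
import pandas as pd

idx = pd.MultiIndex.from_product([[10, 20], ['a', 'b']], names=['x', 'y'])
df = pd.DataFrame({'foo': [1.0, 2.0, 3.0, 4.0]}, index=idx)

# pivot the 'y' level into columns, leaving 'x' as the row index
wide = df['foo'].unstack('y')
print(wide.loc[10, 'b'])  # 2.0
```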
To create a Dataset from a DataFrame, use the from_dataframe() class method:
In [5]: xray.Dataset.from_dataframe(df)
Out[5]:
<xray.Dataset>
Dimensions: (x: 2, y: 3)
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Data variables:
foo (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
scalar (x, y) int64 123 123 123 123 123 123
along_x (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044
Notice that the dimensions of the variables in the Dataset have expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.
Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.
DataArray and Series¶
DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:
In [6]: s = ds['foo'].to_series()
In [7]: s
Out[7]:
x y
10 a 0.469112
b -0.282863
c -1.509059
20 a -1.135632
b 1.212112
c -0.173215
Name: foo, dtype: float64
In [8]: xray.DataArray.from_series(s)
Out[8]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
[-1.136, 1.212, -0.173]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:
In [9]: s[::2]
Out[9]:
x y
10 a 0.469112
c -1.509059
20 b 1.212112
Name: foo, dtype: float64
In [10]: xray.DataArray.from_series(s[::2])
Out[10]:
<xray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, nan, -1.509],
[ nan, 1.212, nan]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
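The reindexing that from_series performs is essentially an unstack onto the full product of index levels, which you can see with pandas alone:

```python
import numpy as np
import pandas as pd

idx = pd.MultiIndex.from_tuples([(10, 'a'), (10, 'c'), (20, 'b')], names=['x', 'y'])
s = pd.Series([0.5, -1.5, 1.2], index=idx, name='foo')

# unstacking expands to the full 2x3 grid, with NaN for missing cells
full = s.unstack('y')
print(full.shape)  # (2, 3)
```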
Multi-dimensional data¶
DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality (i.e., a 1D array is converted to a Series, 2D to DataFrame and 3D to Panel):
In [11]: arr = xray.DataArray(np.random.randn(2, 3),
....: coords=[('x', [10, 20]), ('y', ['a', 'b', 'c'])])
....:
In [12]: df = arr.to_pandas()
In [13]: df
Out[13]:
y a b c
x
10 -0.861849 -2.104569 -0.494929
20 1.071804 0.721555 -0.706771
[2 rows x 3 columns]
To perform the inverse operation of converting any pandas objects into a data array with the same shape, simply use the DataArray constructor:
In [14]: xray.DataArray(df)
Out[14]:
<xray.DataArray (x: 2, y: 3)>
array([[-0.862, -2.105, -0.495],
[ 1.072, 0.722, -0.707]])
Coordinates:
* x (x) int64 10 20
* y (y) object 'a' 'b' 'c'
xray objects do not yet support hierarchical indexes, so if your data has a hierarchical index, you will either need to unstack it first or use the from_series() or from_dataframe() constructors described above.
Serialization and IO¶
xray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.
Pickle¶
The simplest way to serialize an xray object is to use Python’s built-in pickle module:
In [1]: import cPickle as pickle
In [2]: ds = xray.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
...: coords={'x': [10, 20, 30, 40],
...: 'y': pd.date_range('2000-01-01', periods=5),
...: 'z': ('x', list('abcd'))})
...:
# use the highest protocol (-1) because it is way faster than the default
# text based pickle format
In [3]: pkl = pickle.dumps(ds, protocol=-1)
In [4]: pickle.loads(pkl)
Out[4]:
<xray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int64 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
Pickle support is important because it doesn’t require any external libraries and lets you use xray objects with Python modules like multiprocessing. However, there are two important caveats:
- To simplify serialization, xray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
- Pickle will only work as long as the internal data structure of xray objects remains unchanged. Because the internal design of xray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xray will work in future versions.
netCDF¶
Currently, the only disk based serialization format that xray directly supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.
Reading and writing netCDF files with xray requires the netCDF4-Python library or scipy to be installed.
We can save a Dataset to disk using the Dataset.to_netcdf method:
In [5]: ds.to_netcdf('saved_on_disk.nc')
By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the format and engine arguments.
We can load netCDF files to create a new Dataset using open_dataset():
In [6]: ds_disk = xray.open_dataset('saved_on_disk.nc')
In [7]: ds_disk
Out[7]:
<xray.Dataset>
Dimensions: (x: 4, y: 5)
Coordinates:
* y (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
* x (x) int64 10 20 30 40
z (x) |S1 'a' 'b' 'c' 'd'
Data variables:
foo (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...
A dataset can also be loaded or written to a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument. When writing multiple groups in one file, pass mode='a' to to_netcdf to ensure that each call does not delete the file.
Data is loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.
It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xray is modified: the original file on disk is never touched.
Tip
xray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a dataset entirely into memory by invoking the load() method.
Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:
# this automatically closes the dataset after use
In [8]: with xray.open_dataset('saved_on_disk.nc') as ds:
...: print(ds.keys())
...:
Reading encoded data¶
NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “add_offset” attributes). If the argument decode_cf=True (default) is given to open_dataset, xray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
You can view this encoding information (among others) in the DataArray.encoding attribute:
In [9]: ds_disk['y'].encoding
Out[9]:
{'calendar': u'proleptic_gregorian',
'chunksizes': None,
'complevel': 0,
'contiguous': True,
'dtype': dtype('float64'),
'fletcher32': False,
'least_significant_digit': None,
'shuffle': False,
'source': 'saved_on_disk.nc',
'units': u'days since 2000-01-01 00:00:00',
'zlib': False}
Note that all operations that manipulate variables other than indexing will remove encoding information.
Writing encoded data¶
Conversely, you can customize how xray writes netCDF files on disk by providing explicit encodings for each dataset variable. The encoding argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xray to faithfully read encoded data back into memory.
It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xray will write data to disk using a default encoding, or the options in the encoding attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.
In the file on disk, these encodings are saved as attributes on each variable, which allows xray and other CF-compliant tools for working with netCDF files to correctly read the data.
Scaling and type conversions¶
These encoding options work on any version of the netCDF file format:
- dtype: Any valid NumPy dtype or string convertible to a dtype, e.g., 'int16' or 'float32'. This controls the type of the data written on disk.
- _FillValue: Values of NaN in xray variables are remapped to this value when saved on disk. This is important when converting floating point data with missing values to integers on disk, because NaN is not a valid value for integer dtypes.
- scale_factor and add_offset: Used to convert from encoded data on disk to the decoded data in memory, according to the formula decoded = scale_factor * encoded + add_offset.
These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable foo with a precision of 0.1 in 16-bit integers while converting NaN to -9999, we would use encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}. Compression and decompression with such discretization is extremely fast.
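As a concrete illustration of the formula above, here is a minimal numpy sketch (independent of xray itself; the variable names and sample values are ours) of how such a discretized round trip behaves:

```python
import numpy as np

# Round-trip sketch for encoding={'dtype': 'int16', 'scale_factor': 0.1,
# '_FillValue': -9999}: pack floats at 0.1 precision, mapping NaN to -9999.
scale_factor, fill_value = 0.1, -9999
decoded = np.array([21.5, 0.3, np.nan])

# encode: divide by scale_factor, round, and substitute the fill value
encoded = np.where(np.isnan(decoded), fill_value,
                   np.round(decoded / scale_factor)).astype('int16')

# decode: apply decoded = scale_factor * encoded, masking fill values as NaN
restored = np.where(encoded == fill_value, np.nan, scale_factor * encoded)
```

Here encoded holds the int16 values 215, 3 and -9999, and restored recovers the original data to within the 0.1 precision, with NaN restored from the fill value.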
Chunk based compression¶
zlib, complevel, fletcher32, contiguous and chunksizes can be used for enabling netCDF4/HDF5’s chunk based compression, as described in the documentation for createVariable for netCDF4-Python. This only works for netCDF4 files and thus requires using format='netCDF4' and either engine='netcdf4' or engine='h5netcdf'.
Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5’s compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.
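For reference, here is a hedged sketch of what such a chunk-compression encoding might look like; the variable name 'foo' and the specific settings are our own illustrative choices, not from the source:

```python
# Hypothetical encoding for a variable 'foo': gzip compression at level 4
# with explicit HDF5 chunk shapes. This dict would be passed as, e.g.,
#   ds.to_netcdf('out.nc', format='netCDF4', engine='netcdf4', encoding=enc)
enc = {'foo': {'zlib': True,          # enable gzip ("deflate") compression
               'complevel': 4,        # compression level, 1 (fast) to 9 (small)
               'shuffle': True,       # byte-shuffle filter often helps gzip
               'fletcher32': False,   # optional checksum filter
               'chunksizes': (100, 100)}}  # HDF5 chunk shape on disk
```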
Time units¶
The units and calendar attributes control how xray serializes datetime64 and timedelta64 arrays to datasets on disk as numeric values. The units encoding should be a string like 'days since 1900-01-01' for datetime64 data or a string like 'days' for timedelta64 data. calendar should be one of the calendar types supported by netCDF4-python: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.
By default, xray uses the ‘proleptic_gregorian’ calendar and units of the smallest time difference between values, with a reference time of the first time value.
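The serialization arithmetic can be illustrated with plain numpy (a sketch, independent of xray): the datetimes become numeric offsets from the reference time, in the given units.

```python
import numpy as np

# Encode datetimes as 'days since 1900-01-01', as a units string like this
# would instruct xray to do: subtract the reference time and count days.
times = np.array(['2000-01-01', '2000-01-02', '2000-01-03'],
                 dtype='datetime64[D]')
ref = np.datetime64('1900-01-01')
encoded = (times - ref).astype('float64')  # 36524.0 for 2000-01-01
```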
OPeNDAP¶
xray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.
For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by IRI at Columbia:
In [10]: remote_data = xray.open_dataset(
....: 'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
....: decode_times=False)
....:
In [11]: remote_data
Out[11]:
<xray.Dataset>
Dimensions: (T: 1422, X: 1405, Y: 621)
Coordinates:
* X (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
* Y (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
Data variables:
ppt (T, Y, X) float64 ...
tdmean (T, Y, X) float64 ...
tmax (T, Y, X) float64 ...
tmin (T, Y, X) float64 ...
Attributes:
Conventions: IRIDL
expires: 1375315200
Note
Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xray’s automatic decoding to fail. The way to work around this is either to set decode_cf=False in open_dataset to turn off all use of CF conventions, or to disable only the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xray does not expect (the integer 360 instead of a string like '360_day').
We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:
In [12]: tmax = remote_data['tmax'][:500, ::3, ::3]
In [13]: tmax
Out[13]:
<xray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Coordinates:
* Y (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ...
* X (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ...
* T (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ...
Attributes:
pointwidth: 120
standard_name: air_temperature
units: Celsius_scale
expires: 1443657600
# the data is downloaded automatically when we make the plot
In [14]: tmax[0].plot()
Combining multiple files¶
NetCDF files are often encountered in collections, e.g., with different files corresponding to different model runs. xray can straightforwardly combine such files into a single Dataset by making use of concat().
Note
Version 0.5 includes experimental support for manipulating datasets that don’t fit into memory with dask. If you have dask installed, you can open multiple files simultaneously using open_mfdataset():
xray.open_mfdataset('my/files/*.nc')
This function automatically concatenates and merges multiple files into a single xray dataset. For more details, see Reading and writing data.
For example, here’s how we could approximate MFDataset from the netCDF4 library:
from glob import glob
import xray
def read_netcdfs(files, dim):
    # glob expands paths with * to a list of files, like the unix shell
    paths = sorted(glob(files))
    datasets = [xray.open_dataset(p) for p in paths]
    combined = xray.concat(datasets, dim)
    return combined

read_netcdfs('/all/my/files/*.nc', dim='time')
This function will work in many cases, but it’s not very robust. First, it never closes files, which means it will fail once you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.
Here’s a slightly more sophisticated example of how to remedy these deficiencies:
def read_netcdfs(files, dim, transform_func=None):
    def process_one_path(path):
        # use a context manager, to ensure the file gets closed after use
        with xray.open_dataset(path) as ds:
            # transform_func should do some sort of selection or
            # aggregation
            if transform_func is not None:
                ds = transform_func(ds)
            # load all data from the transformed dataset, to ensure we can
            # use it after closing each original file
            ds.load()
            return ds

    paths = sorted(glob(files))
    datasets = [process_one_path(p) for p in paths]
    combined = xray.concat(datasets, dim)
    return combined

# here we suppose we only care about the combined mean of each file;
# you might also use indexing operations like .sel to subset datasets
read_netcdfs('/all/my/files/*.nc', dim='time',
             transform_func=lambda ds: ds.mean())
This pattern works well and is very robust. We’ve used similar code to process tens of thousands of files constituting 100s of GB of data.
Out of core computation with dask¶
xray v0.5 includes experimental integration with dask to support streaming computation on datasets that don’t fit into memory.
Currently, dask is an entirely optional feature for xray. However, the benefits of using dask are sufficiently strong that we expect that dask may become a requirement for a future version of xray.
For a full example of how to use xray’s dask integration, read the blog post introducing xray and dask.
What is a dask array?¶
Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.
Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask for values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.
The actual computation is controlled by a multi-processing or thread pool, which allows dask to take full advantage of the multiple processors available on most modern computers.
For more details on dask, read its documentation.
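To make the lazy, block-wise model concrete, here is a toy illustration in plain Python. This is not dask’s implementation, just a sketch of the idea: operations queue up tasks, and compute() streams over chunks applying them one block at a time.

```python
import numpy as np

class LazyChunks:
    """A toy stand-in for a chunked, lazily evaluated array."""
    def __init__(self, chunks, ops=()):
        self.chunks, self.ops = chunks, ops

    def map(self, fn):
        # queue the task; no computation happens here
        return LazyChunks(self.chunks, self.ops + (fn,))

    def compute(self):
        # stream through chunks, applying the queued ops block-by-block
        out = []
        for chunk in self.chunks:
            for fn in self.ops:
                chunk = fn(chunk)
            out.append(chunk)
        return np.concatenate(out)

arr = LazyChunks([np.arange(3), np.arange(3, 6)])
doubled = arr.map(lambda x: x * 2)  # lazy: builds a task graph, does no work
result = doubled.compute()          # now data flows chunk-by-chunk
# result -> array([ 0,  2,  4,  6,  8, 10])
```

Real dask arrays add a full task scheduler, rechunking, and parallel execution on top of this basic idea.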
Reading and writing data¶
The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function.
In [1]: ds = xray.open_dataset('example-data.nc', chunks={'time': 10})
In [2]: ds
Out[2]:
<xray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...
In this example latitude and longitude do not appear in the chunks dict, so only one chunk will be used along those dimensions. It is also entirely equivalent to open a dataset using open_dataset and then chunk the data using the chunk method, e.g., xray.open_dataset('example-data.nc').chunk({'time': 10}).
To open multiple files simultaneously, use open_mfdataset():
xray.open_mfdataset('my/files/*.nc')
This function will automatically concatenate and merge datasets into one in the simple cases that it understands (see auto_combine() for the full disclaimer). By default, open_mfdataset will chunk each netCDF file into a single dask array; again, supply the chunks argument to control the size of the resulting dask arrays. In more complex cases, you can open each file individually using open_dataset and merge the result, as described in Combining data.
You’ll notice that printing a dataset still shows a preview of array values, even if they are actually dask arrays. We can do this quickly with dask because we only need to compute the first few values (typically from the first block). To reveal the true nature of an array, print a DataArray:
In [3]: ds.temperature
Out[3]:
<xray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
array([[[ 4.691e-01, -2.829e-01, -1.509e+00, ..., -7.983e-01, -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, 1.331e+00, ..., -3.498e-01, 8.726e-01, -1.538e+00],
[ 1.534e+00, -1.374e+00, -3.675e-01, ..., -7.090e-03, -1.671e+00, -4.820e-01],
...,
[ -8.335e-01, -1.468e+00, 1.252e+00, ..., 5.223e-01, 6.066e-01, 1.400e+00],
[ -8.951e-01, 2.964e-01, 2.394e+00, ..., -2.464e-01, -8.723e-01, -5.857e-01],
[ 1.871e-01, 8.154e-02, 2.765e-01, ..., -9.998e-01, 2.341e+00, 1.089e+00]],
[[ -9.274e-01, 1.177e+00, 5.248e-01, ..., -3.762e-01, 4.622e-01, 1.767e+00],
[ 9.798e-04, -1.524e-01, -9.568e-02, ..., 4.321e-01, -2.027e+00, 1.637e+00],
[ -8.874e-01, -4.562e-01, 9.195e-03, ..., 8.658e-01, 2.357e-01, 4.760e-01],
...,
[ -1.188e+00, -1.731e-01, 8.464e-02, ..., 4.657e-01, 2.534e+00, -3.622e-01],
[ 6.568e-01, 4.050e-01, 9.899e-01, ..., -7.719e-02, 3.289e-02, 8.704e-01],
[ 6.669e-02, 5.777e-01, -1.326e+00, ..., 1.851e+00, -1.590e+00, 2.664e-01]],
[[ 7.385e-01, 2.831e-01, -1.554e+00, ..., -4.722e-01, -1.188e+00, 3.242e+00],
[ 4.861e-01, 1.378e+00, -8.714e-01, ..., 2.499e-01, -1.463e+00, 5.002e-01],
[ -2.681e-01, -8.848e-01, -3.622e-01, ..., 9.173e-01, 1.669e-01, 5.821e-01],
...,
[ 1.239e+00, 4.044e-01, -1.164e+00, ..., 8.767e-01, 4.514e-01, -2.349e-01],
[ -1.387e+00, 4.685e-01, 3.975e-01, ..., 2.811e+00, 4.281e-01, 4.622e-01],
[ 2.226e-01, 1.766e+00, 8.652e-01, ..., 1.105e+00, 5.020e-01, -5.369e-01]],
...,
[[ 8.536e-01, -3.190e-01, -3.013e-02, ..., 3.982e-01, 8.874e-01, 8.194e-01],
[ 1.910e+00, 3.917e-01, -8.773e-01, ..., 4.058e-02, -1.372e+00, -3.196e-01],
[ -9.902e-01, 4.949e-01, -8.501e-01, ..., 4.036e-01, 1.277e+00, 1.113e+00],
...,
[ 7.072e-01, -1.462e+00, -5.395e-01, ..., 1.777e+00, 1.616e-03, -1.849e-01],
[ -1.094e+00, 1.534e+00, -1.482e+00, ..., -1.576e+00, 9.929e-01, 1.324e+00],
[ 1.328e+00, 2.287e-01, 2.520e+00, ..., -5.201e-01, 9.267e-01, -1.858e+00]],
[[ 2.174e+00, 2.591e+00, 1.733e+00, ..., 1.482e+00, -4.827e-02, -5.046e-01],
[ -5.751e-01, 1.112e-01, 1.492e+00, ..., 6.413e-01, -9.258e-01, -1.764e+00],
[ 8.866e-01, -6.739e-01, 6.512e-01, ..., 3.825e-01, 8.816e-01, 5.391e-02],
...,
[ -1.336e+00, 8.586e-02, 5.064e-01, ..., 3.651e-02, 3.836e-01, -5.352e-02],
[ 7.773e-01, -1.166e-01, -7.917e-01, ..., -1.048e+00, 2.931e-02, -5.819e-01],
[ -8.479e-01, 1.080e+00, 1.250e+00, ..., -5.437e-01, -2.411e-01, 6.152e-01]],
[[ 9.506e-01, -9.923e-01, 1.658e-01, ..., 1.600e+00, 1.716e+00, 5.525e-01],
[ 3.008e-01, 1.308e+00, -1.618e-01, ..., 6.986e-01, -5.435e-01, 1.096e+00],
[ 9.535e-01, -1.143e+00, 6.928e-01, ..., -5.598e-02, -5.274e-01, 1.118e+00],
...,
[ 8.138e-01, 5.869e-01, 1.243e+00, ..., 1.474e+00, -8.015e-01, -6.233e-01],
[ -1.965e-02, 7.169e-01, 7.959e-02, ..., -8.761e-01, 1.205e+00, 1.607e+00],
[ -5.341e-01, -7.622e-01, -8.142e-01, ..., -1.022e+00, 1.733e+00, 5.886e-03]]])
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Once you’ve manipulated a dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way.
Using dask with xray¶
Nearly all existing xray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with dask arrays. When you load data as a dask array in an xray data structure, almost all xray operations will keep it as a dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Loading a dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xray will automatically load coordinate labels into memory.
The easiest way to convert an xray data structure from lazy dask arrays into eager, in-memory numpy arrays is to use the load() method:
In [4]: ds.load()
Out[4]:
<xray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...
You can also access values, which will always be a numpy array:
In [5]: ds.temperature.values
Out[5]:
array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, ..., 8.726e-01, -1.538e+00],
...
# truncated for brevity
Explicit conversion by wrapping a DataArray with np.asarray also works:
In [6]: np.asarray(ds.temperature)
Out[6]:
array([[[ 4.691e-01, -2.829e-01, ..., -5.577e-01, 3.814e-01],
[ 1.337e+00, -1.531e+00, ..., 8.726e-01, -1.538e+00],
...
With the current version of dask, there is no automatic alignment of chunks when performing operations between dask arrays with different chunk sizes. If your computation involves multiple dask arrays with different chunks, you may need to explicitly rechunk each array to ensure compatibility. With xray, both converting data to dask arrays and changing the chunk sizes of dask arrays are done with the chunk() method:
In [7]: rechunked = ds.chunk({'latitude': 100, 'longitude': 100})
You can view the size of existing chunks on an array by viewing the chunks attribute:
In [8]: rechunked.chunks
Out[8]: Frozen(SortedKeysDict({'latitude': (100, 80), 'longitude': (100, 100, 100, 60), 'time': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5)}))
If the chunksizes are not consistent across all the arrays in a dataset along a particular dimension, an exception is raised when you try to access .chunks.
Note
In the future, we would like to enable automatic alignment of dask chunksizes (but not the other way around). We might also require that all arrays in a dataset share the same chunking alignment. Neither of these are currently done.
NumPy ufuncs like np.sin currently only work on eagerly evaluated arrays (this will change with the next major NumPy release). We have provided replacements that also work on all xray objects, including those that store lazy dask arrays, in the xray.ufuncs module:
In [9]: import xray.ufuncs as xu
In [10]: xu.sin(rechunked)
Out[10]:
<xray.Dataset>
Dimensions: (latitude: 180, longitude: 360, time: 365)
Coordinates:
* latitude (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
* longitude (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
* time (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Data variables:
temperature (time, latitude, longitude) float64 0.4521 -0.2791 -0.9981 ...
To access dask arrays directly, use the new DataArray.data attribute. This attribute exposes array data either as a dask array or as a numpy array, depending on whether it has been loaded into dask or not:
In [11]: ds.temperature.data
Out[11]: dask.array<xray-te..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Note
In the future, we may extend .data to support other “computable” array backends beyond dask and numpy (e.g., to support sparse arrays).
Chunking and performance¶
The chunks parameter has critical performance implications when using dask arrays. If your chunks are too small, queueing up operations will be extremely slow, because dask will translate each operation into a huge number of operations mapped across chunks. Computation on dask arrays with small chunks can also be slow, because each operation on a chunk has some fixed overhead from the Python interpreter and the dask task executor.
Conversely, if your chunks are too big, some of your computation may be wasted, because dask only computes results one chunk at a time.
A good rule of thumb is to create arrays with a minimum chunksize of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunksizes.
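The arithmetic behind that rule of thumb is simple: one million float64 elements is only about 8 MB of memory per chunk, which keeps per-chunk interpreter and scheduler overhead small relative to the work done per chunk.

```python
# Memory footprint of a 1000 x 1000 chunk of float64 values.
elements = 1000 * 1000           # one million elements per chunk
bytes_per_element = 8            # float64 is 8 bytes per element
chunk_bytes = elements * bytes_per_element
chunk_mb = chunk_bytes / 10**6   # 8.0 MB per chunk
```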
Plotting¶
Introduction¶
Labeled data enables expressive computations. These same labels can also be used to easily create informative plots.
Xray’s plotting capabilities are centered around xray.DataArray objects. To plot xray.Dataset objects, simply access the relevant DataArrays, e.g., dset['var1']. Here we focus mostly on arrays that are 2d or larger. If your data fits nicely into a pandas DataFrame then you’re better off using one of the more developed tools there.
Xray plotting functionality is a thin wrapper around the popular matplotlib library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two. Matplotlib must be installed before xray can plot.
For more extensive plotting applications consider the following projects:
- Seaborn: “provides a high-level interface for drawing attractive statistical graphics.” Integrates well with pandas.
- Holoviews: “Composable, declarative data structures for building even complex visualizations easily.” Works for 2d datasets.
- Cartopy: Provides cartographic tools.
Imports¶
The following imports are necessary for all of the examples.
In [1]: import numpy as np
In [2]: import pandas as pd
In [3]: import matplotlib.pyplot as plt
In [4]: import xray
For these examples we’ll use the North American air temperature dataset.
In [5]: airtemps = xray.tutorial.load_dataset('air_temperature')
In [6]: airtemps
# Convert to celsius
In [7]: air = airtemps.air - 273.15
One Dimension¶
Simple Example¶
Xray uses the coordinate name to label the x axis.
In [8]: air1d = air.isel(lat=10, lon=10)
In [9]: air1d.plot()
Additional Arguments¶
Additional arguments are passed directly to the matplotlib function which does the work. For example, xray.plot.line() calls matplotlib.pyplot.plot, passing in the index and the array values as x and y, respectively. So to make a line plot with blue triangles, a matplotlib format string can be used:
In [10]: air1d[:200].plot.line('b-^')
Note
Not all xray plotting methods support passing positional arguments to the wrapped matplotlib functions, but they do all support keyword arguments.
Keyword arguments work the same way, and are more explicit.
In [11]: air1d[:200].plot.line(color='purple', marker='o')
Adding to Existing Axis¶
To add the plot to an existing axis pass in the axis as a keyword argument ax. This works for all xray plotting methods. In this example axes is an array consisting of the left and right axes created by plt.subplots.
In [12]: fig, axes = plt.subplots(ncols=2)
In [13]: axes
Out[13]:
array([<matplotlib.axes.AxesSubplot object at 0x7f524d380e10>,
<matplotlib.axes.AxesSubplot object at 0x7f524d104b50>], dtype=object)
In [14]: air1d.plot(ax=axes[0])
In [15]: air1d.plot.hist(ax=axes[1])
In [16]: plt.tight_layout()
In [17]: plt.show()
On the right is a histogram created by xray.plot.hist().
Two Dimensions¶
Simple Example¶
The default method xray.DataArray.plot() sees that the data is 2 dimensional and calls xray.plot.pcolormesh().
In [18]: air2d = air.isel(time=500)
In [19]: air2d.plot()
All 2d plots in xray allow the use of the keyword arguments yincrease and xincrease.
In [20]: air2d.plot(yincrease=False)
Note
We use xray.plot.pcolormesh() as the default two-dimensional plot method because it is more flexible than xray.plot.imshow(). However, for large arrays, imshow can be much faster than pcolormesh. If speed is important to you and you are plotting a regular mesh, consider using imshow.
Missing Values¶
xray plots data with missing values.
In [21]: bad_air2d = air2d.copy()
In [22]: bad_air2d[dict(lat=slice(0, 10), lon=slice(0, 25))] = np.nan
In [23]: bad_air2d.plot()
Nonuniform Coordinates¶
It’s not necessary for the coordinates to be evenly spaced. Both xray.plot.pcolormesh() (default) and xray.plot.contourf() can produce plots with nonuniform coordinates.
In [24]: b = air2d.copy()
# Apply a nonlinear transformation to one of the coords
In [25]: b.coords['lat'] = np.log(b.coords['lat'])
In [26]: b.plot()
Calling Matplotlib¶
Since this is a thin wrapper around matplotlib, all the functionality of matplotlib is available.
In [27]: air2d.plot(cmap=plt.cm.Blues)
In [28]: plt.title('These colors prove North America\nhas fallen in the ocean')
Out[28]: <matplotlib.text.Text at 0x7f525c71db50>
In [29]: plt.ylabel('latitude')
Out[29]: <matplotlib.text.Text at 0x7f525c179190>
In [30]: plt.xlabel('longitude')
Out[30]: <matplotlib.text.Text at 0x7f524fec0e10>
In [31]: plt.tight_layout()
In [32]: plt.show()
Note
xray's plotting methods update label information and otherwise adjust the axes, so any further customization of the plot should be done after the call to xray's plot. In the example below, plt.xlabel effectively does nothing, since air2d.plot() updates the xlabel.
In [33]: plt.xlabel('Never gonna see this.')
Out[33]: <matplotlib.text.Text at 0x7f525c19d710>
In [34]: air2d.plot()
In [35]: plt.show()
Colormaps¶
Xray borrows logic from Seaborn to infer what kind of color map to use. For example, consider the original data in Kelvins rather than Celsius:
In [36]: airtemps.air.isel(time=0).plot()
The Celsius data contain 0, so a diverging color map is used. The Kelvin data do not contain 0, so the default sequential color map is used.
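The inference can be sketched as a simple rule: if the data span zero, choose a diverging colormap; otherwise choose a sequential one. The helper and colormap names below are purely illustrative assumptions, not xray's actual implementation.

```python
import numpy as np

def infer_cmap(data):
    """Illustrative sketch (hypothetical helper, not xray's code):
    pick a diverging map when the data straddle zero, else a
    sequential default."""
    data = np.asarray(data)
    if data.min() < 0 < data.max():
        return 'RdBu_r'   # diverging: neutral color centered on zero
    return 'viridis'      # sequential default

celsius = np.array([-10.0, 0.5, 15.0])
kelvin = celsius + 273.15
```

With this rule the Celsius field picks the diverging map and the Kelvin field the sequential one, matching the behavior described above.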
Robust¶
Outliers often have an extreme effect on the output of the plot. Here we add two bad data points. This affects the color scale, washing out the plot.
In [37]: air_outliers = airtemps.air.isel(time=0).copy()
In [38]: air_outliers[0, 0] = 100
In [39]: air_outliers[-1, -1] = 400
In [40]: air_outliers.plot()
This plot shows that we have outliers. The easy way to visualize the data without the outliers is to pass the parameter robust=True. This will use the 2nd and 98th percentiles of the data to compute the color limits.
In [41]: air_outliers.plot(robust=True)
Observe that the ranges of the color bar have changed. The arrows on the color bar indicate that the colors include data points outside the bounds.
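The effect of robust=True can be sketched with plain NumPy: the color limits come from the 2nd and 98th percentiles rather than the minimum and maximum, so a couple of outliers no longer stretch the scale. This is only a sketch of the idea, not xray's implementation.

```python
import numpy as np

# A smooth field plus two extreme outliers, as in the example above.
data = np.linspace(0.0, 30.0, 25 * 53).reshape(25, 53)
data[0, 0] = 100.0
data[-1, -1] = 400.0

vmin, vmax = data.min(), data.max()              # default limits: washed out
rvmin, rvmax = np.percentile(data, [2.0, 98.0])  # robust limits
```

With the robust limits the color scale again spans roughly 0 to 30, so the bulk of the field is visible and the two outliers simply saturate at the ends of the colorbar.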
Discrete Colormaps¶
It is often useful, when visualizing 2d data, to use a discrete colormap, rather than the default continuous colormaps that matplotlib uses. The levels keyword argument can be used to generate plots with discrete colormaps. For example, to make a plot with 8 discrete color intervals:
In [42]: air2d.plot(levels=8)
It is also possible to use a list of levels to specify the boundaries of the discrete colormap:
In [43]: air2d.plot(levels=[0, 12, 18, 30])
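Under the hood, a list of levels amounts to binning the data into intervals, one color per interval. A minimal sketch with np.digitize (illustrative only, not xray's code):

```python
import numpy as np

levels = [0, 12, 18, 30]
values = np.array([-3.0, 5.0, 15.0, 25.0, 40.0])

# Each value maps to the interval containing it; len(levels) - 1 colors
# cover the interior bins, with over/under colors at either end.
bins = np.digitize(values, levels)
```

Here -3.0 falls below the lowest level and 40.0 above the highest, which is exactly where the over/under colors of the discrete colormap apply.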
You can also specify a list of discrete colors through the colors argument:
In [44]: flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]
In [45]: air2d.plot(levels=[0, 12, 18, 30], colors=flatui)
Finally, if you have Seaborn installed, you can also specify a seaborn color palette to the cmap argument. Note that levels must be specified with seaborn color palettes if using imshow or pcolormesh (but not with contour or contourf, since levels are chosen automatically).
In [46]: air2d.plot(levels=10, cmap='husl')
Faceting¶
Faceting here refers to splitting an array along one or two dimensions and plotting each group. Xray’s basic plotting is useful for plotting two dimensional arrays. What about three or four dimensional arrays? That’s where facets become helpful.
Consider the temperature data set. There are 4 observations per day for two years, which makes for 2920 values along the time dimension. One way to visualize this data is to make a separate plot for each time period.
The faceted dimension should not have too many values; faceting on the time dimension would produce 2920 plots, which is too many to be helpful. To handle this situation, try performing an operation that reduces the size of the data in some way. For example, we could compute the average air temperature for each month and reduce the size of the time dimension from 2920 to 12. A simpler alternative is to take a slice along that dimension, so let's use a slice to pick 6 times throughout the first year.
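The monthly-average reduction mentioned above (2920 values down to 12) can be sketched with pandas standing in for xray's groupby machinery; the data and variable names here are made up for illustration.

```python
import numpy as np
import pandas as pd

# Four observations per day for two years, like the air temperature data.
times = pd.date_range('2013-01-01', periods=2920,
                      freq=pd.Timedelta(hours=6))
series = pd.Series(np.random.rand(2920), index=times)

# Analogous to air.groupby('time.month').mean() in xray:
# 2920 time steps collapse to 12 monthly averages.
monthly = series.groupby(series.index.month).mean()
```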
In [47]: t = air.isel(time=slice(0, 365 * 4, 250))
In [48]: t.coords
Simple Example¶
The easiest way to create faceted plots is to pass the row or col arguments to the xray plotting methods/functions. This returns an xray.plot.FacetGrid object.
In [49]: g_simple = t.plot(x='lon', y='lat', col='time', col_wrap=3)
4 dimensional¶
For 4 dimensional arrays we can use the rows and columns of the grids. Here we create a 4 dimensional array by taking the original data and adding a fixed amount. Now we can see how the temperature maps would compare if one were much hotter.
In [50]: t2 = t.isel(time=slice(0, 2))
In [51]: t4d = xray.concat([t2, t2 + 40], pd.Index(['normal', 'hot'], name='fourth_dim'))
# This is a 4d array
In [52]: t4d.coords
In [53]: t4d.plot(x='lon', y='lat', col='time', row='fourth_dim')
Other features¶
Faceted plotting supports other arguments common to xray 2d plots.
In [54]: hasoutliers = t.isel(time=slice(0, 5)).copy()
In [55]: hasoutliers[0, 0, 0] = -100
In [56]: hasoutliers[-1, -1, -1] = 400
In [57]: g = hasoutliers.plot.pcolormesh('lon', 'lat', col='time', col_wrap=3,
....: robust=True, cmap='viridis')
....:
FacetGrid Objects¶
xray.plot.FacetGrid is used to control the behavior of the multiple plots. It borrows an API and code from Seaborn. The structure is contained within the axes and name_dicts attributes, both 2d NumPy object arrays.
In [58]: g.axes
In [59]: g.name_dicts
It’s possible to select the xray.DataArray corresponding to each facet of the FacetGrid through the name_dicts attribute.
In [60]: g.data.loc[g.name_dicts[0, 0]]
Here is an example of using the lower level API and then modifying the axes after they have been plotted.
In [61]: g = t.plot.imshow('lon', 'lat', col='time', col_wrap=3, robust=True)
In [62]: for i, ax in enumerate(g.axes.flat):
....: ax.set_title('Air Temperature %d' % i)
....:
In [63]: bottomright = g.axes[-1, -1]
In [64]: bottomright.annotate('bottom right', (240, 40))
In [65]: plt.show()
Maps¶
To follow this section you’ll need to have Cartopy installed and working.
This script will plot the air temperature on a map.
import xray
import matplotlib.pyplot as plt
import cartopy.crs as ccrs

air = (xray.tutorial
       .load_dataset('air_temperature')
       .air
       .isel(time=0))

ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ax.set_global()
air.plot.contourf(ax=ax, transform=ccrs.PlateCarree())
ax.coastlines()
plt.savefig('cartopy_example.png')
Here is the resulting image:
Details¶
Ways to Use¶
There are three ways to use the xray plotting functionality:
- Use plot as a convenience method for a DataArray.
- Access a specific plotting method from the plot attribute of a DataArray.
- Directly from the xray plot submodule.
These are provided for user convenience; they all call the same code.
In [66]: import xray.plot as xplt
In [67]: da = xray.DataArray(range(5))
In [68]: fig, axes = plt.subplots(ncols=2, nrows=2)
In [69]: da.plot(ax=axes[0, 0])
Out[69]: [<matplotlib.lines.Line2D at 0x7f524fcc2e90>]
In [70]: da.plot.line(ax=axes[0, 1])
Out[70]: [<matplotlib.lines.Line2D at 0x7f524fcc2790>]
In [71]: xplt.plot(da, ax=axes[1, 0])
Out[71]: [<matplotlib.lines.Line2D at 0x7f524d058bd0>]
In [72]: xplt.line(da, ax=axes[1, 1])
Out[72]: [<matplotlib.lines.Line2D at 0x7f524e2b5390>]
In [73]: plt.tight_layout()
In [74]: plt.show()
Here the output is the same in all four cases. Since the data is one dimensional, the line plot was used.
The convenience method xray.DataArray.plot() dispatches to an appropriate plotting function based on the dimensions of the DataArray and whether the coordinates are sorted and uniformly spaced. This table describes what gets plotted:
Dimensions | Plotting function |
1 | xray.plot.line() |
2 | xray.plot.pcolormesh() |
Anything else | xray.plot.hist() |
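The dispatch in this table can be sketched as a choice on the number of dimensions (ignoring the additional check on coordinate spacing that the text mentions); the helper below is a hypothetical illustration, not xray's actual code.

```python
import numpy as np

def choose_plotfunc(array):
    # Pick a plotting function name based on dimensionality,
    # mirroring the dispatch table above.
    ndim = np.asarray(array).ndim
    if ndim == 1:
        return 'line'
    if ndim == 2:
        return 'pcolormesh'
    return 'hist'
```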
Coordinates¶
If you’d like to find out what’s really going on in the coordinate system, read on.
In [75]: a0 = xray.DataArray(np.zeros((4, 3, 2)), dims=('y', 'x', 'z'),
....: name='temperature')
....:
In [76]: a0[0, 0, 0] = 1
In [77]: a = a0.isel(z=0)
In [78]: a
Out[78]:
<xray.DataArray 'temperature' (y: 4, x: 3)>
array([[ 1., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.],
[ 0., 0., 0.]])
Coordinates:
* x (x) int64 0 1 2
* y (y) int64 0 1 2 3
z int64 0
The plot will produce an image corresponding to the values of the array. Hence the top left pixel will be a different color than the others. Before reading on, you may want to look at the coordinates and think carefully about what the limits, labels, and orientation for each of the axes should be.
In [79]: a.plot()
Out[79]: <matplotlib.collections.QuadMesh at 0x7f524d373a10>
It may seem strange that the values on the y axis are decreasing with -0.5 on the top. This is because the pixels are centered over their coordinates, and the axis labels and ranges correspond to the values of the coordinates.
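The half-cell offset can be computed directly: pcolormesh draws cell edges halfway between neighbouring coordinate values, extended by half a step at each end. A sketch for the y coordinate of the example above:

```python
import numpy as np

centers = np.array([0.0, 1.0, 2.0, 3.0])  # the y coordinate of the example
step = np.diff(centers)

# Edges sit halfway between neighbouring centers, extrapolated
# by half a step beyond the first and last center.
edges = np.concatenate(([centers[0] - step[0] / 2],
                        centers[:-1] + step / 2,
                        [centers[-1] + step[-1] / 2]))
```

This is why the y axis runs from -0.5 to 3.5 even though the coordinate values are the integers 0 through 3.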
API reference¶
This page provides an auto-generated summary of xray’s API. For more details and examples, refer to the relevant chapter in the main part of the documentation.
Top-level functions¶
align(*objects[, join, copy]) | Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes. |
concat(objs[, dim, data_vars, coords, ...]) | Concatenate xray objects along a new or existing dimension. |
set_options(**kwargs) | Set global state within a controlled context |
Dataset¶
Creating a dataset¶
Dataset([variables, coords, attrs, compat]) | A multi-dimensional, in memory, array database. |
decode_cf(obj[, concat_characters, ...]) | Decode the given Dataset or Datastore according to CF conventions into a new Dataset. |
Attributes¶
Dataset.dims | Mapping from dimension names to lengths. |
Dataset.data_vars | Dictionary of xray.DataArray objects corresponding to data variables |
Dataset.coords | Dictionary of xray.DataArray objects corresponding to coordinate |
Dataset.attrs | Dictionary of global attributes on this dataset |
Dictionary interface¶
Datasets implement the mapping interface with keys given by variable names and values given by DataArray objects.
Dataset.__getitem__(key) | Access variables or coordinates this dataset as a DataArray. |
Dataset.__setitem__(key, value) | Add an array to this dataset. |
Dataset.__delitem__(key) | Remove a variable from this dataset. |
Dataset.update(other[, inplace]) | Update this dataset’s variables with those from another dataset. |
Dataset.iteritems(...) | |
Dataset.itervalues(...) |
Dataset contents¶
Dataset.copy([deep]) | Returns a copy of this dataset. |
Dataset.assign(**kwargs) | Assign new data variables to a Dataset, returning a new object with all the original variables in addition to the new ones. |
Dataset.assign_coords(**kwargs) | Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates. |
Dataset.pipe(func, *args, **kwargs) | Apply func(self, *args, **kwargs) |
Dataset.merge(other[, inplace, ...]) | Merge the arrays of two datasets into a single dataset. |
Dataset.rename(name_dict[, inplace]) | Returns a new object with renamed variables and dimensions. |
Dataset.swap_dims(dims_dict[, inplace]) | Returns a new object with swapped dimensions. |
Dataset.drop(labels[, dim]) | Drop variables or index labels from this dataset. |
Dataset.set_coords(names[, inplace]) | Given names of one or more variables, set them as coordinates |
Dataset.reset_coords([names, drop, inplace]) | Given names of coordinates, reset them to become variables |
Comparisons¶
Dataset.equals(other) | Two Datasets are equal if they have matching variables and coordinates, all of which are equal. |
Dataset.identical(other) | Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates. |
Dataset.broadcast_equals(other) | Two Datasets are broadcast equal if they are equal after broadcasting all variables against each other. |
Indexing¶
Dataset.loc | Attribute for location based indexing. |
Dataset.isel(**indexers) | Returns a new dataset with each array indexed along the specified dimension(s). |
Dataset.sel([method, tolerance]) | Returns a new dataset with each array indexed by tick labels along the specified dimension(s). |
Dataset.isel_points([dim]) | Returns a new dataset with each array indexed pointwise along the specified dimension(s). |
Dataset.sel_points([dim, method, tolerance]) | Returns a new dataset with each array indexed pointwise by tick labels along the specified dimension(s). |
Dataset.squeeze([dim]) | Returns a new dataset with squeezed data. |
Dataset.reindex([indexers, method, ...]) | Conform this object onto a new set of indexes, filling in missing values with NaN. |
Dataset.reindex_like(other[, method, ...]) | Conform this object onto the indexes of another object, filling in missing values with NaN. |
Computation¶
Dataset.apply(func[, keep_attrs, args]) | Apply a function over the data variables in this dataset. |
Dataset.reduce(func[, dim, keep_attrs, ...]) | Reduce this dataset by applying func along some dimension(s). |
Dataset.groupby(group[, squeeze]) | Returns a GroupBy object for performing grouped operations. |
Dataset.resample(freq, dim[, how, skipna, ...]) | Resample this object to a new temporal resolution. |
Dataset.transpose(*dims) | Return a new Dataset object with all array dimensions transposed. |
Dataset.diff(dim[, n, label]) | Calculate the n-th order discrete difference along given axis. |
Aggregation: all any argmax argmin max mean median min prod sum std var
Missing values: isnull notnull count dropna fillna where
ndarray methods: argsort clip conj conjugate imag round real T
Grouped operations: assign assign_coords first last fillna where
DataArray¶
DataArray(data[, coords, dims, name, attrs, ...]) | N-dimensional array with labeled coordinates and dimensions. |
Attributes¶
DataArray.values | The array’s data as a numpy.ndarray |
DataArray.data | The array’s data as a dask or numpy array |
DataArray.coords | Dictionary-like container of coordinate arrays. |
DataArray.dims | Dimension names associated with this array. |
DataArray.name | The name of this array. |
DataArray.attrs | Dictionary storing arbitrary metadata with this array. |
DataArray.encoding | Dictionary of format-specific settings for how this array should be serialized. |
DataArray contents¶
DataArray.assign_coords(**kwargs) | Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates. |
DataArray.rename(new_name_or_name_dict) | Returns a new DataArray with renamed coordinates and/or a new name. |
DataArray.swap_dims(dims_dict) | Returns a new DataArray with swapped dimensions. |
DataArray.drop(labels[, dim]) | Drop coordinates or index labels from this DataArray. |
DataArray.reset_coords([names, drop, inplace]) | Given names of coordinates, reset them to become variables. |
DataArray.copy([deep]) | Returns a copy of this array. |
Indexing¶
DataArray.__getitem__(key) | |
DataArray.__setitem__(key, value) | |
DataArray.loc | Attribute for location based indexing like pandas. |
DataArray.isel(**indexers) | Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s). |
DataArray.sel([method, tolerance]) | Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s). |
DataArray.isel_points([dim]) | Return a new DataArray whose dataset is given by pointwise integer indexing along the specified dimension(s). |
DataArray.sel_points([dim, method, tolerance]) | Return a new DataArray whose dataset is given by pointwise selection of index labels along the specified dimension(s). |
DataArray.squeeze([dim]) | Return a new DataArray object with squeezed data. |
DataArray.reindex([method, tolerance, copy]) | Conform this object onto a new set of indexes, filling in missing values with NaN. |
DataArray.reindex_like(other[, method, ...]) | Conform this object onto the indexes of another object, filling in missing values with NaN. |
Computation¶
DataArray.reduce(func[, dim, axis, keep_attrs]) | Reduce this array by applying func along some dimension(s). |
DataArray.groupby(group[, squeeze]) | Returns a GroupBy object for performing grouped operations. |
DataArray.resample(freq, dim[, how, skipna, ...]) | Resample this object to a new temporal resolution. |
DataArray.transpose(*dims) | Return a new DataArray object with transposed dimensions. |
DataArray.get_axis_num(dim) | Return axis number(s) corresponding to dimension(s) in this array. |
DataArray.diff(dim[, n, label]) | Calculate the n-th order discrete difference along given axis. |
Aggregation: all any argmax argmin max mean median min prod sum std var
Missing values: isnull notnull count dropna fillna where
ndarray methods: argsort clip conj conjugate imag searchsorted round real T
Grouped operations: assign_coords first last fillna where
Comparisons¶
DataArray.equals(other) | True if two DataArrays have the same dimensions, coordinates and values; otherwise False. |
DataArray.identical(other) | Like equals, but also checks the array name and attributes, and attributes on all coordinates. |
DataArray.broadcast_equals(other) | Two DataArrays are broadcast equal if they are equal after broadcasting them against each other such that they have the same dimensions. |
Universal functions¶
These functions are copied from NumPy, but extended to work on NumPy arrays, dask arrays and all xray objects. You can find them in the xray.ufuncs module:
angle arccos arccosh arcsin arcsinh arctan arctan2 arctanh ceil conj copysign cos cosh deg2rad degrees exp expm1 fabs fix floor fmax fmin fmod frexp hypot imag iscomplex isfinite isinf isnan isreal ldexp log log10 log1p log2 logaddexp logaddexp2 logical_and logical_not logical_or logical_xor maximum minimum nextafter rad2deg radians real rint sign signbit sin sinh sqrt square tan tanh trunc
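The idea behind these wrappers can be sketched with a toy labeled-array class: apply the NumPy function to the underlying values and rebuild the labeled container around the result. Everything below is an illustrative assumption; it is not how xray.ufuncs is actually implemented.

```python
import numpy as np

class Labeled:
    """Toy stand-in for a labeled array (illustrative only)."""
    def __init__(self, values, dims):
        self.values = np.asarray(values)
        self.dims = dims

def wrap_ufunc(func):
    # Dispatch on type: unwrap labeled inputs, apply the NumPy
    # ufunc, and re-attach the dimension labels to the result.
    def wrapped(x):
        if isinstance(x, Labeled):
            return Labeled(func(x.values), x.dims)
        return func(x)
    return wrapped

sqrt = wrap_ufunc(np.sqrt)
out = sqrt(Labeled([1.0, 4.0, 9.0], ('x',)))
```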
IO / Conversion¶
Dataset methods¶
open_dataset(filename_or_obj[, group, ...]) | Load and decode a dataset from a file or file-like object. |
open_mfdataset(paths[, chunks, concat_dim, ...]) | Open multiple files as a single dataset. |
Dataset.to_netcdf([path, mode, format, ...]) | Write dataset contents to a netCDF file. |
save_mfdataset(datasets, paths[, mode, ...]) | Write multiple datasets to disk as netCDF files simultaneously. |
Dataset.to_array([dim, name]) | Convert this dataset into an xray.DataArray |
Dataset.to_dataframe() | Convert this dataset into a pandas.DataFrame. |
Dataset.from_dataframe(dataframe) | Convert a pandas.DataFrame into an xray.Dataset |
Dataset.close() | Close any files linked to this dataset |
Dataset.load() | Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset. |
Dataset.chunk([chunks, name_prefix, token, lock]) | Coerce all arrays in this dataset into dask arrays with the given chunks. |
DataArray methods¶
DataArray.to_dataset([dim, name]) | Convert a DataArray to a Dataset. |
DataArray.to_pandas() | Convert this array into a pandas object with the same shape. |
DataArray.to_series() | Convert this array into a pandas.Series. |
DataArray.to_dataframe() | Convert this array and its coordinates into a tidy pandas.DataFrame. |
DataArray.to_index() | Convert this variable to a pandas.Index. |
DataArray.to_masked_array([copy]) | Convert this array into a numpy.ma.MaskedArray |
DataArray.to_cdms2() | Convert this array into a cdms2.Variable |
DataArray.from_series(series) | Convert a pandas.Series into an xray.DataArray. |
DataArray.from_cdms2(variable) | Convert a cdms2.Variable into an xray.DataArray |
DataArray.load() | Manually trigger loading of this array’s data from disk or a remote source into memory and return this array. |
DataArray.chunk([chunks]) | Coerce this array’s data into a dask arrays with the given chunks. |
Backends (experimental)¶
These backends provide a low-level interface for lazily loading data from external file-formats or protocols, and can be manually invoked to create arguments for the from_store and dump_to_store Dataset methods.
backends.NetCDF4DataStore(filename[, mode, ...]) | Store for reading and writing data via the Python-NetCDF4 library. |
backends.H5NetCDFStore(filename[, mode, ...]) | Store for reading and writing data via h5netcdf |
backends.PydapDataStore(url) | Store for accessing OpenDAP datasets with pydap. |
backends.ScipyDataStore(filename_or_obj[, ...]) | Store for reading and writing data via scipy.io.netcdf. |
Plotting¶
plot.plot(darray[, row, col, col_wrap, ax, ...]) | Default plot of DataArray using matplotlib.pyplot. |
plot.contourf(darray[, x, y, ax, row, col, ...]) | Filled contour plot of 2d DataArray |
plot.contour(darray[, x, y, ax, row, col, ...]) | Contour plot of 2d DataArray |
plot.hist(darray[, ax]) | Histogram of DataArray |
plot.imshow(darray[, x, y, ax, row, col, ...]) | Image plot of 2d DataArray using matplotlib.pyplot |
plot.line(darray, *args, **kwargs) | Line plot of 1 dimensional DataArray index against values |
plot.pcolormesh(darray[, x, y, ax, row, ...]) | Pseudocolor plot of 2d DataArray |
plot.FacetGrid(data[, col, row, col_wrap, ...]) | Initialize the matplotlib figure and FacetGrid object. |
Frequently Asked Questions¶
Why is pandas not enough?¶
pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?
Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.
Pandas does support N-dimensional panels, but the implementation is very limited:
- You need to create a new factory type for each dimensionality.
- You can’t do math between NDPanels with different dimensionality.
- Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.), but the dimension names refer to order, not meaning. You can’t specify an operation to be applied along the “time” axis.
Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. In my experience, it is usually easier to work with a DataFrame with a hierarchical index than to use higher dimensional (N > 3) data structures in pandas.
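The hierarchical-index alternative mentioned above can be sketched with plain pandas (the dimension names 'time' and 'row' here are illustrative, not anything xray mandates):

```python
import numpy as np
import pandas as pd

# A 3-D block of data: 2 time steps x 3 rows x 4 columns.
data = np.arange(24).reshape(2, 3, 4)

# Instead of a Panel, fold the first two axes into a MultiIndex.
index = pd.MultiIndex.from_product([range(2), range(3)],
                                   names=['time', 'row'])
df = pd.DataFrame(data.reshape(6, 4), index=index)

# Reducing over "time" becomes a groupby on the other index level.
mean_over_time = df.groupby(level='row').mean()
```

This works, but note how the reduction is expressed indirectly (group by everything *except* time) — exactly the kind of contortion that named dimensions avoid.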
Another use case is handling collections of arrays with different numbers of dimensions. For example, suppose you have a 2D array and a handful of associated 1D arrays that share one of the same axes. Storing these in one pandas object is possible but awkward – you can either upcast all the 1D arrays to 2D and store everything in a Panel, or put everything in a DataFrame, where the first few columns have a different meaning than the other columns. In contrast, this sort of data structure fits very naturally in an xray Dataset.
Pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.
How do xray data structures differ from those found in pandas?¶
The main distinguishing feature of xray’s DataArray over labeled arrays in pandas is that dimensions can have names (e.g., “time”, “latitude”, “longitude”). Names are much easier to keep track of than axis numbers, and xray uses dimension names for indexing, aggregation and broadcasting. Not only can you write x.sel(time='2000-01-01') and x.mean(dim='time'), but operations like x - x.mean(dim='time') always work, no matter the order of the “time” dimension. You never need to reshape arrays (e.g., with np.newaxis) to align them for arithmetic operations in xray.
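To make the contrast concrete, here is the bookkeeping that raw numpy requires for the same anomaly computation (a sketch; the axis layout below is an assumption of this example):

```python
import numpy as np

x = np.arange(12, dtype=float).reshape(3, 4)  # axes: (time, space)

# With raw numpy you must remember that "space" is axis 1, and insert a
# dummy dimension so the reduced array broadcasts back against x:
anomaly = x - x.mean(axis=1)[:, np.newaxis]

# With named dimensions, the equivalent x - x.mean(dim='space') works
# regardless of where the 'space' dimension sits in the array.
```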
Should I use xray instead of pandas?¶
It’s not an either/or choice! xray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.
That said, you should only bother with xray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.
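The conversion between the two worlds boils down to pandas’ MultiIndex: an N-D array corresponds to a Series indexed by the product of its dimensions, which is roughly what to_series() and from_series() do. A sketch with plain pandas (names here are illustrative):

```python
import numpy as np
import pandas as pd

# An N-D array maps to a Series with a MultiIndex over its dimensions.
index = pd.MultiIndex.from_product([['a', 'b'], [0, 1, 2]],
                                   names=['x', 'y'])
series = pd.Series(np.arange(6.0), index=index)

# Back to 2-D: unstack one level to recover an (x, y) table.
table = series.unstack('y')
```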
What is your approach to metadata?¶
We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).
Automatic interpretation of labels is powerful but also reduces flexibility. With xray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to and from netCDF files.)
An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
What’s New¶
v0.6.1 (21 October 2015)¶
This release contains a number of bug and compatibility fixes, as well as enhancements to plotting, indexing and writing files to disk.
Note that the minimum required version of dask for use with xray is now version 0.6.
API Changes¶
- The handling of colormaps and discrete color lists for 2D plots in plot() was changed to provide more compatibility with matplotlib’s contour and contourf functions (GH538). Discrete lists of colors should now be specified using the colors keyword, rather than cmap.
Enhancements¶
Faceted plotting through FacetGrid and the plot() method. See Faceting for more details and examples.
sel() and reindex() now support the tolerance argument for controlling nearest-neighbor selection (GH629):
In [1]: array = xray.DataArray([1, 2, 3], dims='x')

In [2]: array.reindex(x=[0.9, 1.5], method='nearest', tolerance=0.2)
Out[2]:
<xray.DataArray (x: 2)>
array([  2.,  nan])
Coordinates:
  * x        (x) float64 0.9 1.5
This feature requires pandas v0.17 or newer.
New encoding argument in to_netcdf() for writing netCDF files with compression, as described in the new documentation section on Writing encoded data.
Add real and imag attributes to Dataset and DataArray (GH553).
More informative error message with from_dataframe() if the frame has duplicate columns.
xray now uses deterministic names for dask arrays it creates or opens from disk. This allows xray users to take advantage of dask’s nascent support for caching intermediate computation results. See GH555 for an example.
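Deterministic naming can be sketched as hashing the inputs that identify a piece of data, so re-opening the same file yields the same dask key (the helper below is illustrative, not xray’s actual implementation):

```python
import hashlib

def deterministic_name(path, variable):
    """Return a stable token for (file, variable), so the same inputs
    always produce the same dask array name (illustrative sketch)."""
    token = hashlib.md5(('%s:%s' % (path, variable)).encode()).hexdigest()
    return 'open_dataset-%s' % token
```

Because the name depends only on the inputs, a caching scheduler can recognize that two graphs refer to the same on-disk data.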
Bug fixes¶
- Forwards compatibility with the latest pandas release (v0.17.0). We were using some internal pandas routines for datetime conversion, which unfortunately have now changed upstream (GH569).
- Aggregation functions now correctly skip NaN values for data with complex128 dtype (GH554).
- Fixed indexing 0d arrays with unicode dtype (GH568).
- DataArray.name and Dataset keys must be a string or None to be written to netCDF (GH533).
- where() now uses dask instead of numpy if either the array or other is a dask array. Previously, if other was a numpy array the method was evaluated eagerly.
- Global attributes are now handled more consistently when loading remote datasets using engine='pydap' (GH574).
- It is now possible to assign to the .data attribute of DataArray objects.
- coordinates attribute is now kept in the encoding dictionary after decoding (GH610).
- Compatibility with numpy 1.10 (GH617).
Acknowledgments¶
The following individuals contributed to this release:
- Ryan Abernathey
- Pete Cable
- Clark Fitzgerald
- Joe Hamman
- Stephan Hoyer
- Scott Sinclair
v0.6.0 (21 August 2015)¶
This release includes numerous bug fixes and enhancements. Highlights include the introduction of a plotting module and the new Dataset and DataArray methods isel_points(), sel_points(), where() and diff(). There are no breaking changes from v0.5.2.
Enhancements¶
Plotting methods have been implemented on DataArray objects plot() through integration with matplotlib (GH185). For an introduction, see Plotting.
Variables in netCDF files with multiple missing values are now decoded as NaN after issuing a warning if open_dataset is called with mask_and_scale=True.
We clarified our rules for when the result from an xray operation is a copy vs. a view (see Copies vs. views for more details).
Dataset variables are now written to netCDF files in order of appearance when using the netcdf4 backend (GH479).
Added isel_points() and sel_points() to support pointwise indexing of Datasets and DataArrays (GH475).
In [3]: da = xray.DataArray(np.arange(56).reshape((7, 8)),
   ...:                     coords={'x': list('abcdefg'),
   ...:                             'y': 10 * np.arange(8)},
   ...:                     dims=['x', 'y'])
   ...:

In [4]: da
Out[4]:
<xray.DataArray (x: 7, y: 8)>
array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15],
       [16, 17, 18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29, 30, 31],
       [32, 33, 34, 35, 36, 37, 38, 39],
       [40, 41, 42, 43, 44, 45, 46, 47],
       [48, 49, 50, 51, 52, 53, 54, 55]])
Coordinates:
  * y        (y) int64 0 10 20 30 40 50 60 70
  * x        (x) |S1 'a' 'b' 'c' 'd' 'e' 'f' 'g'

# we can index by position along each dimension
In [5]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[5]:
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 10 0
    x        (points) |S1 'a' 'b' 'g'
  * points   (points) int64 0 1 2

# or equivalently by label
In [6]: da.sel_points(x=['a', 'b', 'g'], y=[0, 10, 0], dim='points')
Out[6]:
<xray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 10 0
    x        (points) |S1 'a' 'b' 'g'
  * points   (points) int64 0 1 2
New where() method for masking xray objects according to some criteria. This works particularly well with multi-dimensional data:
In [7]: ds = xray.Dataset(coords={'x': range(100), 'y': range(100)})

In [8]: ds['distance'] = np.sqrt(ds.x ** 2 + ds.y ** 2)

In [9]: ds.distance.where(ds.distance < 100).plot()
Out[9]: <matplotlib.collections.QuadMesh at 0x7f524c3f7e10>
Added new methods DataArray.diff and Dataset.diff for finite difference calculations along a given axis.
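For a 1-D array, the computation matches numpy’s finite difference (dimension handling aside); a minimal sketch with illustrative data:

```python
import numpy as np

temps = np.array([10.0, 12.0, 11.5, 15.0])

# First-order forward difference, as DataArray.diff computes along a
# dimension; the result is one element shorter along that axis.
d1 = np.diff(temps, n=1)
```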
New to_masked_array() convenience method for returning a numpy.ma.MaskedArray.
In [10]: da = xray.DataArray(np.random.random_sample(size=(5, 4)))

In [11]: da.where(da < 0.5)
Out[11]:
<xray.DataArray (dim_0: 5, dim_1: 4)>
array([[ 0.127,    nan,  0.26 ,    nan],
       [ 0.377,  0.336,  0.451,    nan],
       [ 0.123,    nan,  0.373,  0.448],
       [ 0.129,    nan,    nan,  0.352],
       [ 0.229,    nan,    nan,  0.138]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3 4
  * dim_1    (dim_1) int64 0 1 2 3

In [12]: da.where(da < 0.5).to_masked_array(copy=True)
Out[12]:
masked_array(data =
 [[0.12696983303810094 -- 0.26047600586578334 --]
  [0.37674971618967135 0.33622174433445307 0.45137647047539964 --]
  [0.12310214428849964 -- 0.37301222522143085 0.4479968246859435]
  [0.12944067971751294 -- -- 0.35205353914802473]
  [0.2288873043216132 -- -- 0.1375535565632705]],
             mask =
 [[False  True False  True]
  [False False False  True]
  [False  True False False]
  [False  True  True False]
  [False  True  True False]],
       fill_value = 1e+20)
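Under the hood this is essentially numpy.ma’s invalid-value masking (a sketch of the idea, not xray’s exact code):

```python
import numpy as np
import numpy.ma as ma

data = np.array([0.2, np.nan, 0.7, np.nan])

# NaNs become the mask, mirroring what to_masked_array() returns.
masked = ma.masked_invalid(data)
```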
Added new flag “drop_variables” to open_dataset() for excluding variables from being parsed. This may be useful to drop variables with problems or inconsistent values.
Bug fixes¶
- Fixed aggregation functions (e.g., sum and mean) on big-endian arrays when bottleneck is installed (GH489).
- Dataset aggregation functions dropped variables with unsigned integer dtype (GH505).
- .any() and .all() were not lazy when used on xray objects containing dask arrays.
- Fixed an error when attempting to save datetime64 variables to netCDF files when the first element is NaT (GH528).
- Fix pickle on DataArray objects (GH515).
- Fixed unnecessary coercion of float64 to float32 when using netcdf3 and netcdf4_classic formats (GH526).
v0.5.2 (16 July 2015)¶
This release contains bug fixes, several additional options for opening and saving netCDF files, and a backwards incompatible rewrite of the advanced options for xray.concat.
Backwards incompatible changes¶
- The optional arguments concat_over and mode in concat() have been removed and replaced by data_vars and coords. The new arguments are both more easily understood and more robustly implemented, and allowed us to fix a bug where concat accidentally loaded data into memory. If you set values for these optional arguments manually, you will need to update your code. The default behavior should be unchanged.
Enhancements¶
open_mfdataset() now supports a preprocess argument for preprocessing datasets prior to concatenation. This is useful if datasets cannot be otherwise merged automatically, e.g., if the original datasets have conflicting index coordinates (GH443).
open_dataset() and open_mfdataset() now use a global thread lock by default for reading from netCDF files with dask. This avoids possible segmentation faults for reading from netCDF4 files when HDF5 is not configured properly for concurrent access (GH444).
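The locking pattern is roughly the following (names are illustrative; xray’s internals may differ):

```python
import threading

# A single module-level lock serializes all low-level file reads, so the
# HDF5 library is never entered concurrently even when dask uses many
# worker threads.
_GLOBAL_LOCK = threading.Lock()

def read_chunk(read_fn, *args):
    """Call a (non-thread-safe) low-level reader under the global lock."""
    with _GLOBAL_LOCK:
        return read_fn(*args)
```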
Added support for serializing arrays of complex numbers with engine='h5netcdf'.
The new save_mfdataset() function allows for saving multiple datasets to disk simultaneously. This is useful when processing large datasets with dask.array. For example, to save a dataset too big to fit into memory to one file per year, we could write:
In [13]: years, datasets = zip(*ds.groupby('time.year'))

In [14]: paths = ['%s.nc' % y for y in years]

In [15]: xray.save_mfdataset(datasets, paths)
Bug fixes¶
- Fixed min, max, argmin and argmax for arrays with string or unicode types (GH453).
- open_dataset() and open_mfdataset() support supplying chunks as a single integer.
- Fixed a bug in serializing scalar datetime variable to netCDF.
- Fixed a bug that could occur in serialization of 0-dimensional integer arrays.
- Fixed a bug where concatenating DataArrays was not always lazy (GH464).
- When reading datasets with h5netcdf, bytes attributes are decoded to strings. This allows conventions decoding to work properly on Python 3 (GH451).
v0.5.1 (15 June 2015)¶
This minor release fixes a few bugs and an inconsistency with pandas. It also adds the pipe method, copied from pandas.
Enhancements¶
- Added pipe(), replicating the new pandas method in version 0.16.2. See Transforming datasets for more details.
- assign() and assign_coords() now assign new variables in sorted (alphabetical) order, mirroring the behavior in pandas. Previously, the order was arbitrary.
v0.5 (1 June 2015)¶
Highlights¶
The headline feature in this release is experimental support for out-of-core computing (data that doesn’t fit into memory) with dask. This includes a new top-level function open_mfdataset() that makes it easy to open a collection of netCDF (using dask) as a single xray.Dataset object. For more on dask, read the blog post introducing xray + dask and the new documentation section Out of core computation with dask.
Dask makes it possible to harness parallelism and manipulate gigantic datasets with xray. It is currently an optional dependency, but it may become required in the future.
Backwards incompatible changes¶
The logic used for choosing which variables are concatenated with concat() has changed. Previously, by default any variables which were equal across a dimension were not concatenated. This led to some surprising behavior, where the behavior of groupby and concat operations could depend on runtime values (GH268). For example:
In [16]: ds = xray.Dataset({'x': 0})

In [17]: xray.concat([ds, ds], dim='y')
Out[17]:
<xray.Dataset>
Dimensions: ()
Coordinates:
    *empty*
Data variables:
    x        int64 0
Now, the default always concatenates data variables:
In [18]: xray.concat([ds, ds], dim='y')
Out[18]:
<xray.Dataset>
Dimensions: (y: 2)
Coordinates:
  * y        (y) int64 0 1
Data variables:
    x        (y) int64 0 0
To obtain the old behavior, supply the argument concat_over=[].
Enhancements¶
New to_array() and enhanced to_dataset() methods make it easy to switch back and forth between arrays and datasets:
In [19]: ds = xray.Dataset({'a': 1, 'b': ('x', [1, 2, 3])},
   ....:                   coords={'c': 42}, attrs={'Conventions': 'None'})
   ....:

In [20]: ds.to_array()
Out[20]:
<xray.DataArray (variable: 2, x: 3)>
array([[1, 1, 1],
       [1, 2, 3]])
Coordinates:
  * variable  (variable) |S1 'a' 'b'
    c         int64 42
  * x         (x) int64 0 1 2
Attributes:
    Conventions: None

In [21]: ds.to_array().to_dataset(dim='variable')
Out[21]:
<xray.Dataset>
Dimensions: (x: 3)
Coordinates:
    c        int64 42
  * x        (x) int64 0 1 2
Data variables:
    a        (x) int64 1 1 1
    b        (x) int64 1 2 3
Attributes:
    Conventions: None
New fillna() method to fill missing values, modeled off the pandas method of the same name:
In [22]: array = xray.DataArray([np.nan, 1, np.nan, 3], dims='x')

In [23]: array.fillna(0)
Out[23]:
<xray.DataArray (x: 4)>
array([ 0.,  1.,  0.,  3.])
Coordinates:
  * x        (x) int64 0 1 2 3
fillna works on both Dataset and DataArray objects, and uses index based alignment and broadcasting like standard binary operations. It also can be applied by group, as illustrated in Fill missing values with climatology.
New assign() and assign_coords() methods patterned off the new DataFrame.assign method in pandas:
In [24]: ds = xray.Dataset({'y': ('x', [1, 2, 3])})

In [25]: ds.assign(z = lambda ds: ds.y ** 2)
Out[25]:
<xray.Dataset>
Dimensions: (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    y        (x) int64 1 2 3
    z        (x) int64 1 4 9

In [26]: ds.assign_coords(z = ('x', ['a', 'b', 'c']))
Out[26]:
<xray.Dataset>
Dimensions: (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
    z        (x) |S1 'a' 'b' 'c'
Data variables:
    y        (x) int64 1 2 3
These methods return a new Dataset (or DataArray) with updated data or coordinate variables.
sel() now supports the method parameter, which works like the parameter of the same name on reindex(). It provides a simple interface for doing nearest-neighbor interpolation:
In [27]: ds.sel(x=1.1, method='nearest')
Out[27]:
<xray.Dataset>
Dimensions: ()
Coordinates:
    x        int64 1
Data variables:
    y        int64 2

In [28]: ds.sel(x=[1.1, 2.1], method='pad')
Out[28]:
<xray.Dataset>
Dimensions: (x: 2)
Coordinates:
  * x        (x) int64 1 2
Data variables:
    y        (x) int64 2 3
See Nearest neighbor lookups for more details.
You can now control the underlying backend used for accessing remote datasets (via OPeNDAP) by specifying engine='netcdf4' or engine='pydap'.
xray now provides experimental support for reading and writing netCDF4 files directly via h5py with the h5netcdf package, avoiding the netCDF4-Python package. You will need to install h5netcdf and specify engine='h5netcdf' to try this feature.
Accessing data from remote datasets now has retrying logic (with exponential backoff) that should make it robust to occasional bad responses from DAP servers.
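Exponential backoff follows the usual pattern (an illustrative sketch; xray’s actual retry counts and delays may differ):

```python
import time

def fetch_with_retries(fetch, max_retries=5, base_delay=0.1):
    """Retry fetch() on IOError, doubling the delay after each failure.

    This is a generic sketch of retrying with exponential backoff, not
    xray's exact implementation.
    """
    for attempt in range(max_retries):
        try:
            return fetch()
        except IOError:
            if attempt == max_retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)
```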
You can control the width of the Dataset repr with xray.set_options. It can be used either as a context manager, in which case the default is restored outside the context:
In [29]: ds = xray.Dataset({'x': np.arange(1000)})

In [30]: with xray.set_options(display_width=40):
   ....:     print(ds)
   ....:
<xray.Dataset>
Dimensions: (x: 1000)
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6 ...
Data variables:
    *empty*
Or to set a global option:
In [31]: xray.set_options(display_width=80)
The default value for the display_width option is 80.
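A set_options-style object can serve both as a global setter and a context manager; a minimal sketch of that dual behavior (not xray’s actual code):

```python
# Global options dictionary; set_options mutates it, and when used as a
# context manager it restores the previous values on exit.
OPTIONS = {'display_width': 80}

class set_options(object):
    def __init__(self, **kwargs):
        self.old = {k: OPTIONS[k] for k in kwargs}
        OPTIONS.update(kwargs)

    def __enter__(self):
        return self

    def __exit__(self, *args):
        OPTIONS.update(self.old)
```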
v0.4.1 (18 March 2015)¶
This release contains bug fixes and several new features. All changes should be fully backwards compatible.
Enhancements¶
New documentation sections on Time series data and Combining multiple files.
resample() lets you resample a dataset or data array to a new temporal resolution. The syntax is the same as pandas, except you need to supply the time dimension explicitly:
In [32]: time = pd.date_range('2000-01-01', freq='6H', periods=10)

In [33]: array = xray.DataArray(np.arange(10), [('time', time)])

In [34]: array.resample('1D', dim='time')
Out[34]:
<xray.DataArray (time: 3)>
array([ 1.5,  5.5,  8.5])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
You can specify how to do the resampling with the how argument and other options such as closed and label let you control labeling:
In [35]: array.resample('1D', dim='time', how='sum', label='right')
Out[35]:
<xray.DataArray (time: 3)>
array([ 6, 22, 17])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
If the desired temporal resolution is higher than the original data (upsampling), xray will insert missing values:
In [36]: array.resample('3H', 'time')
Out[36]:
<xray.DataArray (time: 19)>
array([  0.,  nan,   1., ...,   8.,  nan,   9.])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T03:00:00 ...
first and last methods on groupby objects let you take the first or last examples from each group along the grouped axis:
In [37]: array.groupby('time.day').first()
Out[37]:
<xray.DataArray (day: 3)>
array([0, 4, 8])
Coordinates:
  * day      (day) int64 1 2 3
These methods combine well with resample:
In [38]: array.resample('1D', dim='time', how='first')
Out[38]:
<xray.DataArray (time: 3)>
array([0, 4, 8])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
swap_dims() allows for easily swapping one dimension out for another:
In [39]: ds = xray.Dataset({'x': range(3), 'y': ('x', list('abc'))})

In [40]: ds
Out[40]:
<xray.Dataset>
Dimensions: (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    y        (x) |S1 'a' 'b' 'c'

In [41]: ds.swap_dims({'x': 'y'})
Out[41]:
<xray.Dataset>
Dimensions: (y: 3)
Coordinates:
  * y        (y) |S1 'a' 'b' 'c'
    x        (y) int64 0 1 2
Data variables:
    *empty*
This was possible in earlier versions of xray, but required some contortions.
open_dataset() and to_netcdf() now accept an engine argument to explicitly select which underlying library (netcdf4 or scipy) is used for reading/writing a netCDF file.
Bug fixes¶
- Fixed a bug where data netCDF variables read from disk with engine='scipy' could still be associated with the file on disk, even after closing the file (GH341). This manifested itself in warnings about mmapped arrays and segmentation faults (if the data was accessed).
- Silenced spurious warnings about all-NaN slices when using nan-aware aggregation methods (GH344).
- Dataset aggregations with keep_attrs=True now preserve attributes on data variables, not just the dataset itself.
- Tests for xray now pass when run on Windows (GH360).
- Fixed a regression in v0.4 where saving to netCDF could fail with the error ValueError: could not automatically determine time units.
v0.4 (2 March, 2015)¶
This is one of the biggest releases yet for xray: it includes some major changes that may break existing code, along with the usual collection of minor enhancements and bug fixes. On the plus side, this release includes all hitherto planned breaking changes, so the upgrade path for xray should be smoother going forward.
Breaking changes¶
We now automatically align index labels in arithmetic, dataset construction, merging and updating. This means the need for manually invoking methods like align() and reindex_like() should be vastly reduced.
For arithmetic, we align based on the intersection of labels:
In [42]: lhs = xray.DataArray([1, 2, 3], [('x', [0, 1, 2])])

In [43]: rhs = xray.DataArray([2, 3, 4], [('x', [1, 2, 3])])

In [44]: lhs + rhs
Out[44]:
<xray.DataArray (x: 2)>
array([4, 6])
Coordinates:
  * x        (x) int64 1 2
For dataset construction and merging, we align based on the union of labels:
In [45]: xray.Dataset({'foo': lhs, 'bar': rhs})
Out[45]:
<xray.Dataset>
Dimensions: (x: 4)
Coordinates:
  * x        (x) int64 0 1 2 3
Data variables:
    foo      (x) float64 1.0 2.0 3.0 nan
    bar      (x) float64 nan 2.0 3.0 4.0
For update and __setitem__, we align based on the original object:
In [46]: lhs.coords['rhs'] = rhs

In [47]: lhs
Out[47]:
<xray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) int64 0 1 2
    rhs      (x) float64 nan 2.0 3.0
Aggregations like mean or median now skip missing values by default:
In [48]: xray.DataArray([1, 2, np.nan, 3]).mean()
Out[48]:
<xray.DataArray ()>
array(2.0)
You can turn this behavior off by supplying the keyword argument skipna=False.
These operations are lightning fast thanks to integration with bottleneck, which is a new optional dependency for xray (numpy is used if bottleneck is not installed).
Scalar coordinates no longer conflict with constant arrays with the same value (e.g., in arithmetic, merging datasets and concat), even if they have different shape (GH243). For example, the coordinate c here persists through arithmetic, even though it has different shapes on each DataArray:
In [49]: a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')

In [50]: b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')

In [51]: (a + b).coords
Out[51]:
Coordinates:
  * x        (x) int64 0 1
    c        (x) int64 0 0
This functionality can be controlled through the compat option, which has also been added to the Dataset constructor.
Datetime shortcuts such as 'time.month' now return a DataArray with the name 'month', not 'time.month' (GH345). This makes it easier to index the resulting arrays when they are used with groupby:
In [52]: time = xray.DataArray(pd.date_range('2000-01-01', periods=365),
   ....:                       dims='time', name='time')
   ....:

In [53]: counts = time.groupby('time.month').count()

In [54]: counts.sel(month=2)
Out[54]:
<xray.DataArray 'time' ()>
array(29)
Coordinates:
    month    int64 2
Previously, you would need to use something like counts.sel(**{'time.month': 2}), which is much more awkward.
The season datetime shortcut now returns an array of string labels such as ‘DJF’:
In [55]: ds = xray.Dataset({'t': pd.date_range('2000-01-01', periods=12, freq='M')})

In [56]: ds['t.season']
Out[56]:
<xray.DataArray 'season' (t: 12)>
array(['DJF', 'DJF', 'MAM', ..., 'SON', 'SON', 'DJF'], dtype='|S3')
Coordinates:
  * t        (t) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 2000-04-30 ...
Previously, it returned numbered seasons 1 through 4.
We have updated our use of the terms “coordinates” and “variables”. What were known in previous versions of xray as “coordinates” and “variables” are now referred to throughout the documentation as “coordinate variables” and “data variables”. This brings xray in closer alignment to CF Conventions. The only visible change besides the documentation is that Dataset.vars has been renamed Dataset.data_vars.
You will need to update your code if you have been ignoring deprecation warnings: methods and attributes that were deprecated in xray v0.3 or earlier (e.g., dimensions, attributes) have gone away.
Enhancements¶
Support for reindex() with a fill method. This provides a useful shortcut for upsampling:
In [57]: data = xray.DataArray([1, 2, 3], dims='x')

In [58]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[58]:
<xray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
  * x        (x) float64 0.5 1.0 1.5 2.0 2.5
This will be especially useful once pandas 0.16 is released, at which point xray will immediately support reindexing with method='nearest'.
Use functions that return generic ndarrays with DataArray.groupby.apply and Dataset.apply (GH327 and GH329). Thanks Jeff Gerard!
Consolidated the functionality of dumps (writing a dataset to a netCDF3 bytestring) into to_netcdf() (GH333).
to_netcdf() now supports writing to groups in netCDF4 files (GH333). It also finally has a full docstring – you should read it!
open_dataset() and to_netcdf() now work on netCDF3 files when netcdf4-python is not installed as long as scipy is available (GH333).
The new Dataset.drop and DataArray.drop methods make it easy to drop explicitly listed variables or index labels:
# drop variables
In [59]: ds = xray.Dataset({'x': 0, 'y': 1})

In [60]: ds.drop('x')
Out[60]:
<xray.Dataset>
Dimensions: ()
Coordinates:
    *empty*
Data variables:
    y        int64 1

# drop index labels
In [61]: arr = xray.DataArray([1, 2, 3], coords=[('x', list('abc'))])

In [62]: arr.drop(['a', 'c'], dim='x')
Out[62]:
<xray.DataArray (x: 1)>
array([2])
Coordinates:
  * x        (x) |S1 'b'
broadcast_equals() has been added to correspond to the new compat option.
Long attributes are now truncated at 500 characters when printing a dataset (GH338). This should make things more convenient for working with datasets interactively.
Added a new documentation example, Calculating Seasonal Averages from Timeseries of Monthly Means. Thanks Joe Hamman!
Bug fixes¶
- Several bug fixes related to decoding time units from netCDF files (GH316, GH330). Thanks Stefan Pfenninger!
- xray no longer requires decode_coords=False when reading datasets with unparseable coordinate attributes (GH308).
- Fixed DataArray.loc indexing with ... (GH318).
- Fixed an edge case that resulted in an error when reindexing multi-dimensional variables (GH315).
- Fixed slicing with negative step sizes (GH312).
- Fixed invalid conversion of string arrays to numeric dtype (GH305).
- Fixed repr() on dataset objects with non-standard dates (GH347).
Deprecations¶
- dump and dumps have been deprecated in favor of to_netcdf().
- drop_vars has been deprecated in favor of drop().
Future plans¶
The biggest feature I’m excited about working toward in the immediate future is supporting out-of-core operations in xray using Dask, a part of the Blaze project. For a preview of using Dask with weather data, read this blog post by Matthew Rocklin. See GH328 for more details.
v0.3.2 (23 December, 2014)¶
This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.
There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.
Warning
xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.
Backwards incompatible changes¶
Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:
In [63]: from datetime import datetime

In [64]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
Out[64]:
<xray.Dataset>
Dimensions: (t: 1)
Coordinates:
  * t        (t) datetime64[ns] 2000-01-01
Data variables:
    *empty*
xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are thus accordingly cast to timedelta64[ns] objects when appropriate.
Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).
Enhancements¶
Due to popular demand, we have added experimental attribute style access as a shortcut for dataset variables, coordinates and attributes:
In [65]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celsius'})})

In [66]: ds.tmin.units
Out[66]: 'celsius'
Tab-completion for these variables should work in editors such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).
You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:
In [67]: array = xray.DataArray(np.zeros(5), dims=['x'])

In [68]: array[dict(x=slice(3))] = 1

In [69]: array
Out[69]:
<xray.DataArray (x: 5)>
array([ 1.,  1.,  1.,  0.,  0.])
Coordinates:
  * x        (x) int64 0 1 2 3 4
Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.
Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.
We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30 to 50%. The full speedup requires cyordereddict to be installed.
Bug fixes¶
Future plans¶
- I am contemplating switching to the terms “coordinate variables” and “data variables” instead of the (currently used) “coordinates” and “variables”, following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
- I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
- The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I’m pretty sick of writing ds.reduce(np.nanmean, 'time').
- The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.
If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.
v0.3.1 (22 October, 2014)¶
This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).
We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.
Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.
New features¶
- Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58).
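A minimal sketch of the two methods, written against the modern package name xarray (the project xray later became); the xray-era spelling was simply `import xray`:

```python
import numpy as np
import xarray as xr

# an array containing one missing value
arr = xr.DataArray(np.array([1.0, np.nan, 3.0]), dims='x')

n_valid = arr.count()        # counts non-missing values, here 2
clean = arr.dropna(dim='x')  # drops positions along 'x' that contain NaN
```

As in pandas, count() reduces over all dimensions by default, while dropna() requires naming the dimension along which to drop.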
- Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
- Support for reading gzipped netCDF3 files (GH239).
- Reduced memory usage when writing netCDF files (GH251).
- ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
- Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
Bug fixes¶
- Compatibility fixes for pandas v0.15 (GH262).
- Fixes for display and indexing of NaT (not-a-time) (GH238, GH240).
- Fixed slicing by label when the argument is a data array (GH250).
- Test data is now shipped with the source distribution (GH253).
- Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
- Order of dimensions preserved with DataArray.to_dataframe (GH260).
v0.3 (21 September 2014)¶
New features¶
- Revamped coordinates: “coordinates” now refer to all arrays that are not used to index a dimension. Coordinates are intended for keeping track of arrays of metadata that describe the grid on which the points in “variable” arrays lie. They are preserved (when unambiguous) even through mathematical operations.
- Dataset math: Dataset objects now support all arithmetic operations directly. Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
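Both cases can be sketched like this (using the modern xarray import; variable names are illustrative):

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({'a': ('x', np.array([1.0, 2.0])),
                 'b': ('x', np.array([10.0, 20.0]))})
arr = xr.DataArray(np.array([1.0, 1.0]), dims='x')

# dataset-array: the operation maps across every data variable
shifted = ds + arr

# dataset-dataset: variables are matched up by name
zeros = ds - ds
```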
- GroupBy math: you can now perform arithmetic operations (e.g., +, -) between a GroupBy object and a Dataset or DataArray. This provides a convenient shortcut for normalizing by the average value of a group.
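For example, a sketch of the normalization shortcut, assuming a small array with a grouping coordinate (modern xarray import; the names here are illustrative):

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(np.array([1.0, 2.0, 10.0, 20.0]), dims='t',
                   coords={'label': ('t', ['a', 'a', 'b', 'b'])})

# subtracting the per-group mean centers each group around zero
centered = arr.groupby('label') - arr.groupby('label').mean()
```

Group 'a' has mean 1.5 and group 'b' has mean 15, so each value ends up expressed as a deviation from its own group's average.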
- The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
- You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
Backwards incompatible changes¶
- Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
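The distinction can be sketched as follows (modern xarray import used as a stand-in for xray):

```python
import numpy as np
import xarray as xr

ds1 = xr.Dataset({'v': ('x', np.array([1.0, 2.0]))})
ds2 = ds1.copy(deep=True)

elementwise = ds1 == ds2   # now a Dataset of booleans, element by element
same = ds1.equals(ds2)     # the old behavior: a single True/False
```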
Deprecations¶
- Dataset.noncoords is deprecated: use Dataset.vars instead.
- Dataset.select_vars is deprecated: index a Dataset with a list of variable names instead.
- DataArray.select_vars and DataArray.drop_vars are deprecated: use reset_coords() instead.
v0.2 (14 August 2014)¶
This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:
- There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
- You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
- xray now works on Python 2.6, thanks to Anna Kuznetsova.
- A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
- New load_data() and close() methods for datasets give lower-level control over data loaded from disk.
v0.1.1 (20 May 2014)¶
xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:
- Python 3 support (GH53)
- Required numpy version relaxed to 1.7 (GH129)
- Return numpy.datetime64 arrays for non-standard calendars (GH126)
- Support for opening datasets associated with NetCDF4 groups (GH127)
- Bug-fixes for concatenating datetime arrays (GH134)
Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.
v0.1 (2 May 2014)¶
Initial release.
See also¶
- Stephan Hoyer’s SciPy2015 talk introducing xray to a general audience.
- Stephan Hoyer’s 2015 Unidata Users Workshop talk and tutorial (with answers) introducing xray to users familiar with netCDF.
- Nicolas Fauchereau’s tutorial on xray for netCDF users.
Get in touch¶
- To ask questions or discuss xray, use the mailing list.
- Report bugs, suggest feature ideas or view the source code on GitHub.
- For interactive discussion, we have a chatroom on Gitter.
- You can also get in touch on Twitter.
License¶
xray is available under the open source Apache License.
History¶
xray is an evolution of an internal tool developed at The Climate Corporation, and was originally written by current and former Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo.