
N-D labeled arrays and datasets in Python

xarray (formerly xray) is an open source project and Python package that aims to bring the labeled data power of pandas to the physical sciences, by providing N-dimensional variants of the core pandas data structures.

Our goal is to provide a pandas-like and pandas-compatible toolkit for analytics on multi-dimensional arrays, rather than the tabular data for which pandas excels. Our approach adopts the Common Data Model for self-describing scientific data in widespread use in the Earth sciences: xarray.Dataset is an in-memory representation of a netCDF file.

Note

xray is now xarray! See the v0.7.0 release notes for more details. The preferred URL for these docs is now http://xarray.pydata.org.

Documentation

Overview: Why xarray?

Features

Adding dimension names and coordinate indexes to numpy’s ndarray makes many powerful array operations possible (a brief sketch follows this list):

  • Apply operations over dimensions by name: x.sum('time').
  • Select values by label instead of integer location: x.loc['2014-01-01'] or x.sel(time='2014-01-01').
  • Mathematical operations (e.g., x - y) vectorize across multiple dimensions (array broadcasting) based on dimension names, not shape.
  • Flexible split-apply-combine operations with groupby: x.groupby('time.dayofyear').mean().
  • Database-like alignment based on coordinate labels that smoothly handles missing values: x, y = xr.align(x, y, join='outer').
  • Keep track of arbitrary metadata in the form of a Python dictionary: x.attrs.
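
A minimal sketch of a few of these operations on a toy one-dimensional array (the names here are purely illustrative):

import numpy as np
import pandas as pd
import xarray as xr

x = xr.DataArray(np.random.randn(4),
                 coords=[('time', pd.date_range('2014-01-01', periods=4))])

x.sum('time')             # aggregate over a dimension by name
x.sel(time='2014-01-01')  # select a value by coordinate label
x.attrs['units'] = 'K'    # attach arbitrary metadata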

pandas provides many of these features, but it does not make use of dimension names, and its core data structures are fixed dimensional arrays.

The N-dimensional nature of xarray’s data structures makes it suitable for dealing with multi-dimensional scientific data, and its use of dimension names instead of axis labels (dim='time' instead of axis=0) makes such arrays much more manageable than the raw numpy ndarray: with xarray, you don’t need to keep track of the order of an array’s dimensions or insert dummy dimensions (e.g., np.newaxis) to align arrays.
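
To make the contrast concrete, here is a small sketch with made-up dimension names:

import numpy as np
import xarray as xr

a = xr.DataArray(np.zeros((3, 4)), dims=['time', 'space'])
b = xr.DataArray(np.ones(4), dims=['space'])

a.mean(dim='time')  # no need to remember that 'time' happens to be axis 0
a + b               # b broadcasts against a by dimension name; no np.newaxis needed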

Core data structures

xarray has two core data structures. Both are fundamentally N-dimensional:

  • DataArray is our implementation of a labeled, N-dimensional array. It is an N-D generalization of a pandas.Series. The name DataArray itself is borrowed from Fernando Perez’s datarray project, which prototyped a similar data structure.
  • Dataset is a multi-dimensional, in-memory array database. It is a dict-like container of DataArray objects aligned along any number of shared dimensions, and serves a similar purpose in xarray to the pandas.DataFrame.

The value of attaching labels to numpy’s ndarray may be fairly obvious, but the dataset may need more motivation.

The power of the dataset over a plain dictionary is that, in addition to pulling out arrays by name, it is possible to select or combine data along a dimension across all arrays simultaneously. Like a DataFrame, datasets facilitate array operations with heterogeneous data – the difference is that the arrays in a dataset can not only have different data types, but can also have different numbers of dimensions.
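
As a rough sketch of this idea, consider two illustrative variables that share only some dimensions:

import numpy as np
import xarray as xr

ds = xr.Dataset({'temperature': (('time', 'site'), np.random.randn(3, 2)),
                 'elevation': ('site', [100.0, 250.0])})

ds.isel(time=0)  # slices every array that has a 'time' dimension; 'elevation' passes through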

This data model is borrowed from the netCDF file format, which also provides xarray with a natural and portable serialization format. NetCDF is very popular in the geosciences, and there are existing libraries for reading and writing netCDF in many programming languages, including Python.

xarray distinguishes itself from many tools for working with netCDF data insofar as it provides data structures for in-memory analytics that both utilize and preserve labels. You only need to do the tedious work of adding metadata once, not every time you save a file.

Goals and aspirations

pandas excels at working with tabular data. That suffices for many statistical analyses, but physical scientists rely on N-dimensional arrays – which is where xarray comes in.

xarray aims to provide a data analysis toolkit as powerful as pandas but designed for working with homogeneous N-dimensional arrays instead of tabular data. When possible, we copy the pandas API and rely on pandas’s highly optimized internals (in particular, for fast indexing).

Importantly, xarray has robust support for converting its objects to and from a numpy ndarray or a pandas DataFrame or Series, providing compatibility with the full PyData ecosystem.

Our target audience is anyone who needs N-dimensional labeled arrays, but we are particularly focused on the data analysis needs of physical scientists – especially geoscientists who already know and love netCDF.

Examples

Quick overview

Here are some quick examples of what you can do with xarray.DataArray objects. Everything is explained in much more detail in the rest of the documentation.

To begin, import numpy, pandas and xarray using their customary abbreviations:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import xarray as xr

Create a DataArray

You can make a DataArray from scratch by supplying data in the form of a numpy array or list, with optional dimensions and coordinates:

In [4]: xr.DataArray(np.random.randn(2, 3))
Out[4]: 
<xarray.DataArray (dim_0: 2, dim_1: 3)>
array([[-1.344,  0.845,  1.076],
       [-0.109,  1.644, -1.469]])
Coordinates:
  * dim_0    (dim_0) int64 0 1
  * dim_1    (dim_1) int64 0 1 2

In [5]: data = xr.DataArray(np.random.randn(2, 3), [('x', ['a', 'b']), ('y', [-2, 0, 2])])

In [6]: data
Out[6]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
       [-0.969, -1.295,  0.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

If you supply a pandas Series or DataFrame, metadata is copied directly:

In [7]: xr.DataArray(pd.Series(range(3), index=list('abc'), name='foo'))
Out[7]: 
<xarray.DataArray 'foo' (dim_0: 3)>
array([0, 1, 2])
Coordinates:
  * dim_0    (dim_0) object 'a' 'b' 'c'

Here are the key properties for a DataArray:

# like in pandas, values is a numpy array that you can modify in-place
In [8]: data.values
Out[8]: 
array([[ 0.357, -0.675, -1.777],
       [-0.969, -1.295,  0.414]])

In [9]: data.dims
Out[9]: ('x', 'y')

In [10]: data.coords
Out[10]: 
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

# you can use this dictionary to store arbitrary metadata
In [11]: data.attrs
Out[11]: OrderedDict()

Indexing

xarray supports four kinds of indexing. These operations are just as fast as in pandas, because we borrow pandas’ indexing machinery.

# positional and by integer label, like numpy
In [12]: data[[0, 1]]
Out[12]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
       [-0.969, -1.295,  0.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

# positional and by coordinate label, like pandas
In [13]: data.loc['a':'b']
Out[13]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
       [-0.969, -1.295,  0.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

# by dimension name and integer label
In [14]: data.isel(x=slice(2))
Out[14]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
       [-0.969, -1.295,  0.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

# by dimension name and coordinate label
In [15]: data.sel(x=['a', 'b'])
Out[15]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.357, -0.675, -1.777],
       [-0.969, -1.295,  0.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

Computation

Data arrays work very similarly to numpy ndarrays:

In [16]: data + 10
Out[16]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 10.357,   9.325,   8.223],
       [  9.031,   8.705,  10.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

In [17]: np.sin(data)
Out[17]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.349, -0.625, -0.979],
       [-0.824, -0.962,  0.402]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

In [18]: data.T
Out[18]: 
<xarray.DataArray (y: 3, x: 2)>
array([[ 0.357, -0.969],
       [-0.675, -1.295],
       [-1.777,  0.414]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

In [19]: data.sum()
Out[19]: 
<xarray.DataArray ()>
array(-3.9441825539138033)

However, aggregation operations can use dimension names instead of axis numbers:

In [20]: data.mean(dim='x')
Out[20]: 
<xarray.DataArray (y: 3)>
array([-0.306, -0.985, -0.682])
Coordinates:
  * y        (y) int64 -2 0 2

Arithmetic operations broadcast based on dimension name. This means you don’t need to insert dummy dimensions for alignment:

In [21]: a = xr.DataArray(np.random.randn(3), [data.coords['y']])

In [22]: b = xr.DataArray(np.random.randn(4), dims='z')

In [23]: a
Out[23]: 
<xarray.DataArray (y: 3)>
array([ 0.277, -0.472, -0.014])
Coordinates:
  * y        (y) int64 -2 0 2

In [24]: b
Out[24]: 
<xarray.DataArray (z: 4)>
array([-0.363, -0.006, -0.923,  0.896])
Coordinates:
  * z        (z) int64 0 1 2 3

In [25]: a + b
Out[25]: 
<xarray.DataArray (y: 3, z: 4)>
array([[-0.086,  0.271, -0.646,  1.172],
       [-0.835, -0.478, -1.395,  0.424],
       [-0.377, -0.02 , -0.937,  0.882]])
Coordinates:
  * y        (y) int64 -2 0 2
  * z        (z) int64 0 1 2 3

It also means that in most cases you do not need to worry about the order of dimensions:

In [26]: data - data.T
Out[26]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2

Operations also align based on index labels:

In [27]: data[:-1] - data[:1]
Out[27]: 
<xarray.DataArray (x: 1, y: 3)>
array([[ 0.,  0.,  0.]])
Coordinates:
  * x        (x) |S1 'a'
  * y        (y) int64 -2 0 2

GroupBy

xarray supports grouped operations using a very similar API to pandas:

In [28]: labels = xr.DataArray(['E', 'F', 'E'], [data.coords['y']], name='labels')

In [29]: labels
Out[29]: 
<xarray.DataArray 'labels' (y: 3)>
array(['E', 'F', 'E'], 
      dtype='|S1')
Coordinates:
  * y        (y) int64 -2 0 2

In [30]: data.groupby(labels).mean('y')
Out[30]: 
<xarray.DataArray (x: 2, labels: 2)>
array([[-0.71 , -0.675],
       [-0.278, -1.295]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * labels   (labels) object 'E' 'F'

In [31]: data.groupby(labels).apply(lambda x: x - x.min())
Out[31]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 2.134,  0.62 ,  0.   ],
       [ 0.808,  0.   ,  2.191]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2
    labels   (y) |S1 'E' 'F' 'E'

Convert to pandas

A key feature of xarray is robust conversion to and from pandas objects:

In [32]: data.to_series()
Out[32]: 
x  y 
a  -2    0.357021
    0   -0.674600
    2   -1.776904
b  -2   -0.968914
    0   -1.294524
    2    0.413738
dtype: float64

In [33]: data.to_pandas()
Out[33]: 
y        -2         0         2
x                              
a  0.357021 -0.674600 -1.776904
b -0.968914 -1.294524  0.413738

Datasets and NetCDF

xarray.Dataset is a dict-like container of DataArray objects that share index labels and dimensions. It looks a lot like a netCDF file:

In [34]: ds = data.to_dataset(name='foo')

In [35]: ds
Out[35]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 -2 0 2
Data variables:
    foo      (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137

You can do almost everything you can do with DataArray objects with Dataset objects if you prefer to work with multiple variables at once.

Datasets also let you easily read and write netCDF files:

In [36]: ds.to_netcdf('example.nc')

In [37]: xr.open_dataset('example.nc')
Out[37]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int32 -2 0 2
  * x        (x) |S1 'a' 'b'
Data variables:
    foo      (x, y) float64 0.357 -0.6746 -1.777 -0.9689 -1.295 0.4137

Toy weather data

Here is an example of how to easily manipulate a toy weather dataset using xarray and other recommended Python libraries:

Shared setup:

import xarray as xr
import numpy as np
import pandas as pd
import seaborn as sns # pandas aware plotting library

np.random.seed(123)

times = pd.date_range('2000-01-01', '2001-12-31', name='time')
annual_cycle = np.sin(2 * np.pi * (times.dayofyear / 365.25 - 0.28))

base = 10 + 15 * annual_cycle.reshape(-1, 1)
tmin_values = base + 3 * np.random.randn(annual_cycle.size, 3)
tmax_values = base + 10 + 3 * np.random.randn(annual_cycle.size, 3)

ds = xr.Dataset({'tmin': (('time', 'location'), tmin_values),
                 'tmax': (('time', 'location'), tmax_values)},
                {'time': times, 'location': ['IA', 'IN', 'IL']})

Examine a dataset with pandas and seaborn

In [1]: ds
Out[1]: 
<xarray.Dataset>
Dimensions:   (location: 3, time: 731)
Coordinates:
  * location  (location) |S2 'IA' 'IN' 'IL'
  * time      (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
Data variables:
    tmax      (time, location) float64 12.98 3.31 6.779 0.4479 6.373 4.843 ...
    tmin      (time, location) float64 -8.037 -1.788 -3.932 -9.341 -6.558 ...

In [2]: df = ds.to_dataframe()

In [3]: df.head()
Out[3]: 
                          tmax       tmin
location time                            
IA       2000-01-01  12.980549  -8.037369
         2000-01-02   0.447856  -9.341157
         2000-01-03   5.322699 -12.139719
         2000-01-04   1.889425  -7.492914
         2000-01-05   0.791176  -0.447129

In [4]: df.describe()
Out[4]: 
              tmax         tmin
count  2193.000000  2193.000000
mean     20.108232     9.975426
std      11.010569    10.963228
min      -3.506234   -13.395763
25%       9.853905    -0.040347
50%      19.967409    10.060403
75%      30.045588    20.083590
max      43.271148    33.456060

In [5]: ds.mean(dim='location').to_dataframe().plot()
Out[5]: <matplotlib.axes._subplots.AxesSubplot at 0x7f0c29cc2bd0>
_images/examples_tmin_tmax_plot.png
In [6]: sns.pairplot(df.reset_index(), vars=ds.data_vars)
Out[6]: <seaborn.axisgrid.PairGrid at 0x7f0fd2368a10>
_images/examples_pairplot.png

Probability of freeze by calendar month

In [7]: freeze = (ds['tmin'] <= 0).groupby('time.month').mean('time')

In [8]: freeze
Out[8]: 
<xarray.DataArray 'tmin' (month: 12, location: 3)>
array([[ 0.952,  0.887,  0.935],
       [ 0.842,  0.719,  0.772],
       [ 0.242,  0.129,  0.161],
       ..., 
       [ 0.   ,  0.016,  0.   ],
       [ 0.333,  0.35 ,  0.233],
       [ 0.935,  0.855,  0.823]])
Coordinates:
  * location  (location) |S2 'IA' 'IN' 'IL'
  * month     (month) int64 1 2 3 4 5 6 7 8 9 10 11 12

In [9]: freeze.to_pandas().plot()
Out[9]: <matplotlib.axes._subplots.AxesSubplot at 0x7f0c3c0edc50>
_images/examples_freeze_prob.png

Monthly averaging

In [10]: monthly_avg = ds.resample('1MS', dim='time', how='mean')

In [11]: monthly_avg.sel(location='IA').to_dataframe().plot(style='s-')
Out[11]: <matplotlib.axes._subplots.AxesSubplot at 0x7f0c299de650>
_images/examples_tmin_tmax_plot_mean.png

Note that MS here refers to Month-Start; M labels Month-End (the last day of the month).
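
If the distinction is ever unclear, a quick check in pandas shows the difference (dates chosen arbitrarily):

import pandas as pd

pd.date_range('2000-01-01', periods=3, freq='MS')  # 2000-01-01, 2000-02-01, 2000-03-01
pd.date_range('2000-01-01', periods=3, freq='M')   # 2000-01-31, 2000-02-29, 2000-03-31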

Calculate monthly anomalies

In climatology, “anomalies” refer to the difference between observations and typical weather for a particular season. Unlike observations, anomalies should not show any seasonal cycle.

In [12]: climatology = ds.groupby('time.month').mean('time')

In [13]: anomalies = ds.groupby('time.month') - climatology

In [14]: anomalies.mean('location').to_dataframe()[['tmin', 'tmax']].plot()
Out[14]: <matplotlib.axes._subplots.AxesSubplot at 0x7f0c299af3d0>
_images/examples_anomalies_plot.png

Fill missing values with climatology

The fillna() method on grouped objects lets you easily fill missing values by group:

# throw away the first half of every month
In [15]: some_missing = ds.tmin.sel(time=ds['time.day'] > 15).reindex_like(ds)

In [16]: filled = some_missing.groupby('time.month').fillna(climatology.tmin)

In [17]: both = xr.Dataset({'some_missing': some_missing, 'filled': filled})

In [18]: both
Out[18]: 
<xarray.Dataset>
Dimensions:       (location: 3, time: 731)
Coordinates:
  * location      (location) object 'IA' 'IN' 'IL'
  * time          (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 ...
    month         (time) int32 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ...
Data variables:
    some_missing  (time, location) float64 nan nan nan nan nan nan nan nan ...
    filled        (time, location) float64 -5.163 -4.216 -4.681 -5.163 ...

In [19]: df = both.sel(time='2000').mean('location').reset_coords(drop=True).to_dataframe()

In [20]: df[['filled', 'some_missing']].plot()
Out[20]: <matplotlib.axes._subplots.AxesSubplot at 0x7f0c298dff10>
_images/examples_filled.png

Calculating Seasonal Averages from Timeseries of Monthly Means

Author: Joe Hamman

The data for this example can be found in the xray-data repository. This example is also available as an IPython Notebook here.

Suppose we have a netCDF or xray Dataset of monthly mean data and we want to calculate the seasonal average. To do this properly, we need to calculate the weighted average considering that each month has a different number of days.

%matplotlib inline
import numpy as np
import pandas as pd
import xray
from netCDF4 import num2date
import matplotlib.pyplot as plt

print("numpy version  : ", np.__version__)
print("pandas version : ", pd.version.version)
print("xray version   : ", xray.version.version)

numpy version  :  1.9.2
pandas version :  0.16.2
xray version   :  0.5.1

Some calendar information so we can support any netCDF calendar:

dpm = {'noleap': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '365_day': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'standard': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'proleptic_gregorian': [0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       'all_leap': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '366_day': [0, 31, 29, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31],
       '360_day': [0, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30, 30]}

A few calendar functions to determine the number of days in each month

If you were just using the standard calendar, it would be easy to use the calendar.monthrange function.

def leap_year(year, calendar='standard'):
    """Determine if year is a leap year"""
    leap = False
    if ((calendar in ['standard', 'gregorian',
        'proleptic_gregorian', 'julian']) and
        (year % 4 == 0)):
        leap = True
        if ((calendar == 'proleptic_gregorian') and
            (year % 100 == 0) and
            (year % 400 != 0)):
            leap = False
        elif ((calendar in ['standard', 'gregorian']) and
              (year % 100 == 0) and (year % 400 != 0) and
              (year > 1582)):
            # the Gregorian century rule only applies after the 1582 reform;
            # earlier years follow the Julian rule (every 4th year is a leap year)
            leap = False
    return leap

def get_dpm(time, calendar='standard'):
    """
    return an array of days per month corresponding to the months provided in `time`
    """
    month_length = np.zeros(len(time), dtype=np.int)

    cal_days = dpm[calendar]

    for i, (month, year) in enumerate(zip(time.month, time.year)):
        month_length[i] = cal_days[month]
        if leap_year(year, calendar=calendar):
            month_length[i] += 1
    return month_length

Open the Dataset

monthly_mean_file = 'RASM_example_data.nc'
ds = xray.open_dataset(monthly_mean_file, decode_coords=False)
print(ds)
<xray.Dataset>
Dimensions:  (time: 36, x: 275, y: 205)
Coordinates:
  * time     (time) datetime64[ns] 1980-09-16T12:00:00 1980-10-17 ...
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
Data variables:
    Tair     (time, y, x) float64 nan nan nan nan nan nan nan nan nan nan ...
Attributes:
    title: /workspace/jhamman/processed/R1002RBRxaaa01a/lnd/temp/R1002RBRxaaa01a.vic.ha.1979-09-01.nc
    institution: U.W.
    source: RACM R1002RBRxaaa01a
    output_frequency: daily
    output_mode: averaged
    convention: CF-1.4
    references: Based on the initial model of Liang et al., 1994, JGR, 99, 14,415- 14,429.
    comment: Output from the Variable Infiltration Capacity (VIC) model.
    nco_openmp_thread_number: 1
    NCO: 4.3.7
    history: history deleted for brevity

Now for the heavy lifting:

We first have to come up with the weights:

  • calculate the month lengths for each monthly data record
  • calculate weights using groupby('time.season')

Finally, we just need to multiply our weights by the Dataset and sum along the time dimension.

# Make a DataArray with the number of days in each month, size = len(time)
month_length = xray.DataArray(get_dpm(ds.time.to_index(),
                                      calendar='noleap'),
                              coords=[ds.time], name='month_length')

# Calculate the weights by grouping by 'time.season'.
# Conversion to float type ('astype(float)') only necessary for Python 2.x
weights = month_length.groupby('time.season') / month_length.astype(float).groupby('time.season').sum()

# Test that the sum of the weights for each season is 1.0
np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))

# Calculate the weighted average
ds_weighted = (ds * weights).groupby('time.season').sum(dim='time')
print(ds_weighted)
<xray.Dataset>
Dimensions:  (season: 4, x: 275, y: 205)
Coordinates:
  * y        (y) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * x        (x) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 ...
  * season   (season) object 'DJF' 'JJA' 'MAM' 'SON'
Data variables:
    Tair     (season, y, x) float64 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ...

# only used for comparisons
ds_unweighted = ds.groupby('time.season').mean('time')
ds_diff = ds_weighted - ds_unweighted

# Quick plot to show the results
is_null = np.isnan(ds_unweighted['Tair'][0].values)

fig, axes = plt.subplots(nrows=4, ncols=3, figsize=(14,12))
for i, season in enumerate(('DJF', 'MAM', 'JJA', 'SON')):
    plt.sca(axes[i, 0])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_weighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')

    plt.sca(axes[i, 1])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_unweighted['Tair'].sel(season=season).values),
                   vmin=-30, vmax=30, cmap='Spectral_r')
    plt.colorbar(extend='both')

    plt.sca(axes[i, 2])
    plt.pcolormesh(np.ma.masked_where(is_null, ds_diff['Tair'].sel(season=season).values),
                   vmin=-0.1, vmax=.1, cmap='RdBu_r')
    plt.colorbar(extend='both')
    for j in range(3):
        axes[i, j].axes.get_xaxis().set_ticklabels([])
        axes[i, j].axes.get_yaxis().set_ticklabels([])
        axes[i, j].axes.axis('tight')

    axes[i, 0].set_ylabel(season)

axes[0, 0].set_title('Weighted by DPM')
axes[0, 1].set_title('Equal Weighting')
axes[0, 2].set_title('Difference')

plt.tight_layout()

fig.suptitle('Seasonal Surface Air Temperature', fontsize=16, y=1.02)
_images/monthly_means_output.png

# Wrap it into a simple function
def season_mean(ds, calendar='standard'):
    # Make a DataArray of season/year groups
    year_season = xray.DataArray(ds.time.to_index().to_period(freq='Q-NOV').to_timestamp(how='E'),
                                 coords=[ds.time], name='year_season')

    # Make a DataArray with the number of days in each month, size = len(time)
    month_length = xray.DataArray(get_dpm(ds.time.to_index(), calendar=calendar),
                                  coords=[ds.time], name='month_length')
    # Calculate the weights by grouping by 'time.season'
    weights = month_length.groupby('time.season') / month_length.groupby('time.season').sum()

    # Test that the sum of the weights for each season is 1.0
    np.testing.assert_allclose(weights.groupby('time.season').sum().values, np.ones(4))

    # Calculate the weighted average
    return (ds * weights).groupby('time.season').sum(dim='time')

Installation

Required dependencies

  • Python 2.6, 2.7, 3.3, 3.4 or 3.5.
  • numpy (1.7 or later)
  • pandas (0.15.0 or later)

Optional dependencies

For netCDF and IO
  • netCDF4: recommended if you want to use xarray for reading or writing netCDF files
  • scipy: used as a fallback for reading/writing netCDF3
  • pydap: used as a fallback for accessing OPeNDAP
  • h5netcdf: an alternative library for reading and writing netCDF4 files that does not use the netCDF-C libraries
For accelerating xarray
  • bottleneck: speeds up NaN-skipping and rolling window aggregations by a large factor
  • cyordereddict: speeds up most internal operations with xarray data structures
For parallel computing
  • dask.array: required for out-of-core computation or parallel processing with dask
For plotting
  • matplotlib: required for xarray’s built-in plotting functions

Instructions

xarray itself is a pure Python package, but its dependencies are not. The easiest way to get them installed is to use conda. You can then install xarray with its recommended dependencies with the conda command line tool:

$ conda install xarray dask netCDF4 bottleneck

If you don’t use conda, be sure you have the required dependencies (numpy and pandas) installed first. Then, install xarray with pip:

$ pip install xarray

To run the test suite after installing xarray, install py.test and run py.test xarray.

Data Structures

DataArray

xarray.DataArray is xarray’s implementation of a labeled, multi-dimensional array. It has several key properties:

  • values: a numpy.ndarray holding the array’s values
  • dims: dimension names for each axis (e.g., ('x', 'y', 'z'))
  • coords: a dict-like container of arrays (coordinates) that label each point (e.g., 1-dimensional arrays of numbers, datetime objects or strings)
  • attrs: an OrderedDict to hold arbitrary metadata (attributes)

xarray uses dims and coords to enable its core metadata-aware operations. Dimensions provide names that xarray uses instead of the axis argument found in many numpy functions. Coordinates enable fast label-based indexing and alignment, building on the functionality of the index found on a pandas DataFrame or Series.

DataArray objects also can have a name and can hold arbitrary metadata in the form of their attrs property (an ordered dictionary). Names and attributes are strictly for users and user-written code: xarray makes no attempt to interpret them, and propagates them only in unambiguous cases (see FAQ, What is your approach to metadata?).

Creating a DataArray

The DataArray constructor takes:

  • data: a multi-dimensional array of values (e.g., a numpy ndarray, Series, DataFrame or Panel)
  • coords: a list or dictionary of coordinates
  • dims: a list of dimension names. If omitted, dimension names are taken from coords if possible
  • attrs: a dictionary of attributes to add to the instance
  • name: a string that names the instance

In [1]: data = np.random.rand(4, 3)

In [2]: locs = ['IA', 'IL', 'IN']

In [3]: times = pd.date_range('2000-01-01', periods=4)

In [4]: foo = xr.DataArray(data, coords=[times, locs], dims=['time', 'space'])

In [5]: foo
Out[5]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'

Only data is required; all of the other arguments will be filled in with default values:

In [6]: xr.DataArray(data)
Out[6]: 
<xarray.DataArray (dim_0: 4, dim_1: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
  * dim_0    (dim_0) int64 0 1 2 3
  * dim_1    (dim_1) int64 0 1 2

As you can see, dimensions and coordinate arrays corresponding to each dimension are always present. This behavior is similar to pandas, which fills in index values in the same way.

Coordinates can take the following forms:

  • A list of (dim, ticks[, attrs]) pairs with length equal to the number of dimensions
  • A dictionary of {coord_name: coord} where the values are each a scalar value, a 1D array or a tuple. Tuples should be in the same form as the above, and multiple dimensions can be supplied with the form (dims, data[, attrs]). Supplying a tuple allows coordinates other than those corresponding to dimensions (more on these later).

As a list of tuples:

In [7]: xr.DataArray(data, coords=[('time', times), ('space', locs)])
Out[7]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'

As a dictionary:

In [8]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': ('space', [1, 2, 3])},
   ...:              dims=['time', 'space'])
   ...: 
Out[8]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
    ranking  (space) int64 1 2 3
  * space    (space) |S2 'IA' 'IL' 'IN'
    const    int64 42
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

As a dictionary with coords across multiple dimensions:

In [9]: xr.DataArray(data, coords={'time': times, 'space': locs, 'const': 42,
   ...:                            'ranking': (('time', 'space'), np.arange(12).reshape(4,3))},
   ...:              dims=['time', 'space'])
   ...: 
Out[9]: 
<xarray.DataArray (time: 4, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
    ranking  (time, space) int64 0 1 2 3 4 5 6 7 8 9 10 11
  * space    (space) |S2 'IA' 'IL' 'IN'
    const    int64 42
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

If you create a DataArray by supplying a pandas Series, DataFrame or Panel, any non-specified arguments in the DataArray constructor will be filled in from the pandas object:

In [10]: df = pd.DataFrame({'x': [0, 1], 'y': [2, 3]}, index=['a', 'b'])

In [11]: df.index.name = 'abc'

In [12]: df.columns.name = 'xyz'

In [13]: df
Out[13]: 
xyz  x  y
abc      
a    0  2
b    1  3

In [14]: xr.DataArray(df)
Out[14]: 
<xarray.DataArray (abc: 2, xyz: 2)>
array([[0, 2],
       [1, 3]])
Coordinates:
  * abc      (abc) object 'a' 'b'
  * xyz      (xyz) object 'x' 'y'

xarray does not (yet!) support labeling coordinate values with a pandas.MultiIndex (see GH164). However, the alternate from_series constructor will automatically unpack any hierarchical indexes it encounters by expanding the series into a multi-dimensional array, as described in Working with pandas.
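
For instance, here is a sketch of how from_series expands a MultiIndex into dimensions (the index names are hypothetical):

import pandas as pd
import xarray as xr

index = pd.MultiIndex.from_product([['a', 'b'], [0, 1]], names=['abc', 'num'])
series = pd.Series([0, 1, 2, 3], index=index)

xr.DataArray.from_series(series)  # a 2D DataArray with dims ('abc', 'num')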

DataArray properties

Let’s take a look at the important properties on our array:

In [15]: foo.values
Out[15]: 
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])

In [16]: foo.dims
Out[16]: ('time', 'space')

In [17]: foo.coords
Out[17]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'

In [18]: foo.attrs
Out[18]: OrderedDict()

In [19]: print(foo.name)
None

You can even modify values in-place:

In [20]: foo.values = 1.0 * foo.values

Note

The array values in a DataArray have a single (homogeneous) data type. To work with heterogeneous or structured data types in xarray, use coordinates, or put separate DataArray objects in a single Dataset (see below).

Now fill in some of that missing metadata:

In [21]: foo.name = 'foo'

In [22]: foo.attrs['units'] = 'meters'

In [23]: foo
Out[23]: 
<xarray.DataArray 'foo' (time: 4, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Attributes:
    units: meters

The rename() method is another option, returning a new data array:

In [24]: foo.rename('bar')
Out[24]: 
<xarray.DataArray 'bar' (time: 4, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336],
       [ 0.451,  0.84 ,  0.123],
       [ 0.543,  0.373,  0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Attributes:
    units: meters

DataArray Coordinates

The coords property is dict-like. Individual coordinates can be accessed from the coordinates by name, or even by indexing the data array itself:

In [25]: foo.coords['time']
Out[25]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000+0000', '2000-01-02T00:00:00.000000000+0000',
       '2000-01-03T00:00:00.000000000+0000', '2000-01-04T00:00:00.000000000+0000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

In [26]: foo['time']
Out[26]: 
<xarray.DataArray 'time' (time: 4)>
array(['2000-01-01T00:00:00.000000000+0000', '2000-01-02T00:00:00.000000000+0000',
       '2000-01-03T00:00:00.000000000+0000', '2000-01-04T00:00:00.000000000+0000'], dtype='datetime64[ns]')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04

These are also DataArray objects, which contain tick-labels for each dimension.

Coordinates can also be set or removed by using dictionary-like syntax:

In [27]: foo['ranking'] = ('space', [1, 2, 3])

In [28]: foo.coords
Out[28]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
    ranking  (space) int64 1 2 3

In [29]: del foo['ranking']

In [30]: foo.coords
Out[30]: 
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'

Dataset

xarray.Dataset is xarray’s multi-dimensional equivalent of a DataFrame. It is a dict-like container of labeled arrays (DataArray objects) with aligned dimensions. It is designed as an in-memory representation of the data model from the netCDF file format.

In addition to the dict-like interface of the dataset itself, which can be used to access any variable in a dataset, datasets have four key properties:

  • dims: a dictionary mapping from dimension names to the fixed length of each dimension (e.g., {'x': 6, 'y': 6, 'time': 8})
  • data_vars: a dict-like container of DataArrays corresponding to variables
  • coords: another dict-like container of DataArrays intended to label points used in data_vars (e.g., arrays of numbers, datetime objects or strings)
  • attrs: an OrderedDict to hold arbitrary metadata

The distinction between whether a variable falls in data or coordinates (borrowed from CF conventions) is mostly semantic, and you can probably get away with ignoring it if you like: dictionary-like access on a dataset will supply variables found in either category. However, xarray does make use of the distinction for indexing and computations. Coordinates indicate constant/fixed/independent quantities, unlike the varying/measured/dependent quantities that belong in data.

Here is an example of how we might structure a dataset for a weather forecast:

_images/dataset-diagram.png

In this example, it would be natural to call temperature and precipitation “data variables” and all the other arrays “coordinate variables” because they label the points along the dimensions. (see [1] for more background on this example).

Creating a Dataset

To make a Dataset from scratch, supply dictionaries for any variables (data_vars), coordinates (coords) and attributes (attrs).

data_vars are supplied as a dictionary with each key as the name of the variable and each value as one of:

  • A DataArray
  • A tuple of the form (dims, data[, attrs])
  • A pandas object

coords are supplied as dictionary of {coord_name: coord} where the values are scalar values, arrays or tuples in the form of (dims, data[, attrs]).

Let’s create some fake data for the example we show above:

In [31]: temp = 15 + 8 * np.random.randn(2, 2, 3)

In [32]: precip = 10 * np.random.rand(2, 2, 3)

In [33]: lon = [[-99.83, -99.32], [-99.79, -99.23]]

In [34]: lat = [[42.25, 42.21], [42.63, 42.59]]

# for real use cases, it's good practice to supply array attributes such as
# units, but we won't bother here for the sake of brevity
In [35]: ds = xr.Dataset({'temperature': (['x', 'y', 'time'],  temp),
   ....:                  'precipitation': (['x', 'y', 'time'], precip)},
   ....:                 coords={'lon': (['x', 'y'], lon),
   ....:                         'lat': (['x', 'y'], lat),
   ....:                         'time': pd.date_range('2014-09-06', periods=3),
   ....:                         'reference_time': pd.Timestamp('2014-09-05')})
   ....: 

In [36]: ds
Out[36]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    reference_time  datetime64[ns] 2014-09-05
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
Data variables:
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...

Notice that we did not explicitly include coordinates for the “x” or “y” dimensions, so they were filled in with arrays of ascending integers of the proper length.

Here we pass xarray.DataArray objects or a pandas object as values in the dictionary:

In [37]: xr.Dataset({'bar': foo})
Out[37]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    bar      (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...

In [38]: xr.Dataset({'bar': foo.to_pandas()})
Out[38]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) object 'IA' 'IL' 'IN'
Data variables:
    bar      (time, space) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 ...

Where a pandas object is supplied as a value, the names of its indexes are used as dimension names, and its data is aligned to any existing dimensions.

You can also create a dataset from a pandas.DataFrame or pandas.Panel (along its columns and items, respectively) by passing it directly to the xarray.Dataset constructor.

Dataset contents

Dataset implements the Python dictionary interface, with values given by xarray.DataArray objects:

In [39]: 'temperature' in ds
Out[39]: True

In [40]: ds.keys()
Out[40]: 
['precipitation',
 'temperature',
 'lat',
 'reference_time',
 'lon',
 'time',
 'x',
 'y']

In [41]: ds['temperature']
Out[41]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041,  23.574,  20.772],
        [  9.346,   6.683,  17.175]],

       [[ 11.6  ,  19.536,  17.21 ],
        [  6.301,   9.61 ,  15.909]]])
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    reference_time  datetime64[ns] 2014-09-05
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1

The valid keys include each listed coordinate and data variable.

Data and coordinate variables are also contained separately in the data_vars and coords dictionary-like attributes:

In [42]: ds.data_vars
Out[42]: 
Data variables:
    precipitation  (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...
    temperature    (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...

In [43]: ds.coords
Out[43]: 
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    reference_time  datetime64[ns] 2014-09-05
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1

Finally, like data arrays, datasets also store arbitrary metadata in the form of attributes:

In [44]: ds.attrs
Out[44]: OrderedDict()

In [45]: ds.attrs['title'] = 'example attribute'

In [46]: ds
Out[46]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    reference_time  datetime64[ns] 2014-09-05
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
Data variables:
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
Attributes:
    title: example attribute

xarray does not enforce any restrictions on attributes, but serialization to some file formats may fail if you use objects that are not strings, numbers or numpy.ndarray objects.

As a useful shortcut, you can use attribute style access for reading (but not setting) variables and attributes:

In [47]: ds.temperature
Out[47]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041,  23.574,  20.772],
        [  9.346,   6.683,  17.175]],

       [[ 11.6  ,  19.536,  17.21 ],
        [  6.301,   9.61 ,  15.909]]])
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    reference_time  datetime64[ns] 2014-09-05
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1

This is particularly useful in an exploratory context, because you can tab-complete these variable names with tools like IPython.

Dictionary-like methods

We can update a dataset in-place using Python’s standard dictionary syntax. For example, to create this example dataset from scratch, we could have written:

In [48]: ds = xr.Dataset()

In [49]: ds['temperature'] = (('x', 'y', 'time'), temp)

In [50]: ds['precipitation'] = (('x', 'y', 'time'), precip)

In [51]: ds.coords['lat'] = (('x', 'y'), lat)

In [52]: ds.coords['lon'] = (('x', 'y'), lon)

In [53]: ds.coords['time'] = pd.date_range('2014-09-06', periods=3)

In [54]: ds.coords['reference_time'] = pd.Timestamp('2014-09-05')

To change the variables in a Dataset, you can use all the standard dictionary methods, including values, items, __delitem__, get and update(). Note that assigning a DataArray or pandas object to a Dataset variable using __setitem__ or update will automatically align the array(s) to the original dataset’s indexes.
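
Here is a sketch of that automatic alignment, using a throwaway dataset (the names are illustrative):

other = xr.Dataset({'foo': ('x', [1.0, 2.0, 3.0])}, coords={'x': [0, 1, 2]})
other['bar'] = xr.DataArray([10.0, 30.0], coords=[('x', [2, 0])])
# other['bar'] is reindexed to x = 0, 1, 2, giving [30.0, nan, 10.0]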

You can copy a Dataset by calling the copy() method. By default, the copy is shallow, so only the container will be copied: the arrays in the Dataset will still be stored in the same underlying numpy.ndarray objects. You can copy all data by calling ds.copy(deep=True).
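
For example, continuing with the ds built above:

shallow = ds.copy()
deep = ds.copy(deep=True)

shallow['temperature'].values[0, 0, 0] = np.nan  # also changes ds, since the array is shared
deep['temperature'].values[0, 0, 0] = 0.0        # ds is unaffected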

Transforming datasets

In addition to dictionary-like methods (described above), xarray has additional methods (like pandas) for transforming datasets into new objects.

For removing variables, you can select and drop an explicit list of variables by indexing with a list of names or by using the drop() method to return a new Dataset. These operations keep around coordinates:

In [55]: list(ds[['temperature']])
Out[55]: ['temperature', 'lat', 'time', 'y', 'x', 'reference_time', 'lon']

In [56]: list(ds[['x']])
Out[56]: ['x', 'reference_time']

In [57]: list(ds.drop('temperature'))
Out[57]: ['time', 'x', 'y', 'precipitation', 'lat', 'lon', 'reference_time']

If a dimension name is given as an argument to drop, it also drops all variables that use that dimension:

In [58]: list(ds.drop('time'))
Out[58]: ['x', 'y', 'lat', 'lon', 'reference_time']

As an alternative to dictionary-like modifications, you can use assign() and assign_coords(). These methods return a new dataset with additional (or replaced) values:

In [59]: ds.assign(temperature2 = 2 * ds.temperature)
Out[59]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    temperature2    (x, y, time) float64 22.08 47.15 41.54 18.69 13.37 34.35 ...

There is also the pipe() method that allows you to use a method call with an external function (e.g., ds.pipe(func)) instead of simply calling it (e.g., func(ds)). This allows you to write pipelines for transforming your data (using “method chaining”) instead of writing hard-to-follow nested function calls:

# these lines are equivalent, but with pipe we can make the logic flow
# entirely from left to right
In [60]: plt.plot((2 * ds.temperature.sel(x=0)).mean('y'))
Out[60]: [<matplotlib.lines.Line2D at 0x7f0c29c59850>]

In [61]: (ds.temperature
   ....:  .sel(x=0)
   ....:  .pipe(lambda x: 2 * x)
   ....:  .mean('y')
   ....:  .pipe(plt.plot))
   ....: 
Out[61]: [<matplotlib.lines.Line2D at 0x7f0c29c59e10>]

Both pipe and assign replicate the pandas methods of the same names (DataFrame.pipe and DataFrame.assign).

With xarray, there is no performance penalty for creating new datasets, even if variables are lazily loaded from a file on disk. Creating new objects instead of mutating existing objects often results in easier to understand code, so we encourage using this approach.

Renaming variables

Another useful option is the rename() method to rename dataset variables:

In [62]: ds.rename({'temperature': 'temp', 'precipitation': 'precip'})
Out[62]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
Data variables:
    temp            (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precip          (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...

The related swap_dims() method allows you to swap dimension and non-dimension variables:

In [63]: ds.coords['day'] = ('time', [6, 7, 8])

In [64]: ds.swap_dims({'time': 'day'})
Out[64]: 
<xarray.Dataset>
Dimensions:         (day: 3, x: 2, y: 2)
Coordinates:
    time            (day) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
  * day             (day) int64 6 7 8
Data variables:
    temperature     (x, y, day) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, day) float64 5.904 2.453 3.404 9.847 9.195 0.3777 ...

Coordinates

Coordinates are ancillary variables stored for DataArray and Dataset objects in the coords attribute:

In [65]: ds.coords
Out[65]: 
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8

Unlike attributes, xarray does interpret and persist coordinates in operations that transform xarray objects.

One dimensional coordinates with a name equal to their sole dimension (marked by * when printing a dataset or data array) take on a special meaning in xarray. They are used for label based indexing and alignment, like the index found on a pandas DataFrame or Series. Indeed, these “dimension” coordinates use a pandas.Index internally to store their values.

Other than for indexing, xarray does not make any direct use of the values associated with coordinates. Coordinates with names not matching a dimension are not used for alignment or indexing, nor are they required to match when doing arithmetic (see Coordinates).

Modifying coordinates

To entirely add or remove coordinate arrays, you can use dictionary-like syntax, as shown above.

To convert back and forth between data and coordinates, you can use the set_coords() and reset_coords() methods:

In [66]: ds.reset_coords()
Out[66]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
Data variables:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8

In [67]: ds.set_coords(['temperature', 'precipitation'])
Out[67]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    temperature     (x, y, time) float64 11.04 23.57 20.77 9.346 6.683 17.17 ...
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x               (x) int64 0 1
  * y               (y) int64 0 1
    precipitation   (x, y, time) float64 5.904 2.453 3.404 9.847 9.195 ...
    lat             (x, y) float64 42.25 42.21 42.63 42.59
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8
Data variables:
    *empty*

In [68]: ds['temperature'].reset_coords(drop=True)
Out[68]: 
<xarray.DataArray 'temperature' (x: 2, y: 2, time: 3)>
array([[[ 11.041,  23.574,  20.772],
        [  9.346,   6.683,  17.175]],

       [[ 11.6  ,  19.536,  17.21 ],
        [  6.301,   9.61 ,  15.909]]])
Coordinates:
  * time     (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * x        (x) int64 0 1
  * y        (y) int64 0 1

Notice that these operations skip coordinates with names given by dimensions, as used for indexing. This is mostly because we are not entirely sure how to design the interface around the fact that xarray cannot store a coordinate and a variable with the same name but different values in the same dictionary. But we do recognize that supporting something like this would be useful.

Coordinates methods

Coordinates objects also have a few useful methods, mostly for converting them into dataset objects:

In [69]: ds.coords.to_dataset()
Out[69]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2)
Coordinates:
    lat             (x, y) float64 42.25 42.21 42.63 42.59
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * y               (y) int64 0 1
  * x               (x) int64 0 1
    reference_time  datetime64[ns] 2014-09-05
    lon             (x, y) float64 -99.83 -99.32 -99.79 -99.23
    day             (time) int64 6 7 8
Data variables:
    *empty*

The merge method is particularly interesting, because it implements the same logic used for merging coordinates in arithmetic operations (see Computation):

In [70]: alt = xr.Dataset(coords={'z': [10], 'lat': 0, 'lon': 0})

In [71]: ds.coords.merge(alt.coords)
Out[71]: 
<xarray.Dataset>
Dimensions:         (time: 3, x: 2, y: 2, z: 1)
Coordinates:
  * time            (time) datetime64[ns] 2014-09-06 2014-09-07 2014-09-08
  * y               (y) int64 0 1
  * x               (x) int64 0 1
    reference_time  datetime64[ns] 2014-09-05
    day             (time) int64 6 7 8
  * z               (z) int64 10
Data variables:
    *empty*

The coords.merge method may be useful if you want to implement your own binary operations that act on xarray objects. In the future, we hope to write more helper functions so that you can easily make your functions act like xarray’s built-in arithmetic.

Indexes

To convert a coordinate (or any DataArray) into an actual pandas.Index, use the to_index() method:

In [72]: ds['time'].to_index()
Out[72]: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name=u'time', freq='D')

A useful shortcut is the indexes property (on both DataArray and Dataset), which lazily constructs a dictionary whose keys are given by each dimension and whose values are Index objects:

In [73]: ds.indexes
Out[73]: 
time: DatetimeIndex(['2014-09-06', '2014-09-07', '2014-09-08'], dtype='datetime64[ns]', name=u'time', freq='D')
x: Int64Index([0, 1], dtype='int64', name=u'x')
y: Int64Index([0, 1], dtype='int64', name=u'y')
[1] Latitude and longitude are 2D arrays because the dataset uses projected coordinates. reference_time refers to the reference time at which the forecast was made, rather than time, which is the valid time for which the forecast applies.

Indexing and selecting data

Similarly to pandas objects, xarray objects support both integer and label based lookups along each dimension. However, xarray objects also have named dimensions, so you can optionally use dimension names instead of relying on the positional ordering of dimensions.

Thus in total, xarray supports four different kinds of indexing, as described below and summarized in this table:

Dimension lookup   Index lookup   DataArray syntax                                    Dataset syntax
Positional         By integer     arr[:, 0]                                           not available
Positional         By label       arr.loc[:, 'IA']                                    not available
By name            By integer     arr.isel(space=0) or arr[dict(space=0)]             ds.isel(space=0) or ds[dict(space=0)]
By name            By label       arr.sel(space='IA') or arr.loc[dict(space='IA')]    ds.sel(space='IA') or ds.loc[dict(space='IA')]

Positional indexing

Indexing a DataArray directly works (mostly) just like it does for numpy arrays, except that the returned object is always another DataArray:

In [1]: arr = xr.DataArray(np.random.rand(4, 3),
   ...:                    [('time', pd.date_range('2000-01-01', periods=4)),
   ...:                     ('space', ['IA', 'IL', 'IN'])])
   ...: 

In [2]: arr[:2]
Out[2]: 
<xarray.DataArray (time: 2, space: 3)>
array([[ 0.127,  0.967,  0.26 ],
       [ 0.897,  0.377,  0.336]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) |S2 'IA' 'IL' 'IN'

In [3]: arr[0, 0]
Out[3]: 
<xarray.DataArray ()>
array(0.12696983303810094)
Coordinates:
    time     datetime64[ns] 2000-01-01
    space    |S2 'IA'

In [4]: arr[:, [2, 1]]
Out[4]: 
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.26 ,  0.967],
       [ 0.336,  0.377],
       [ 0.123,  0.84 ],
       [ 0.448,  0.373]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IN' 'IL'

Attributes are persisted in all indexing operations.

Warning

Positional indexing deviates from NumPy behavior when indexing with multiple arrays like arr[[0, 1], [0, 1]], as described in Orthogonal (outer) vs. vectorized indexing. See Pointwise indexing for how to achieve this functionality in xarray.

xarray also supports label-based indexing, just like pandas. Because we use a pandas.Index under the hood, label based indexing is very fast. To do label based indexing, use the loc attribute:

In [5]: arr.loc['2000-01-01':'2000-01-02', 'IA']
Out[5]: 
<xarray.DataArray (time: 2)>
array([ 0.127,  0.897])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
    space    |S2 'IA'

You can perform any of the label indexing operations supported by pandas, including indexing with individual labels, slices and arrays of labels, as well as indexing with boolean arrays. Like pandas, label based indexing in xarray is inclusive of both the start and stop bounds.
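
For example, continuing with arr from above (a brief sketch, not an exhaustive list):

arr.loc[:, ['IA', 'IN']]                        # an array of labels along 'space'
arr[arr['time'] > np.datetime64('2000-01-02')]  # a boolean array along 'time'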

Setting values with label based indexing is also supported:

In [6]: arr.loc['2000-01-01', ['IL', 'IN']] = -10

In [7]: arr
Out[7]: 
<xarray.DataArray (time: 4, space: 3)>
array([[  0.127, -10.   , -10.   ],
       [  0.897,   0.377,   0.336],
       [  0.451,   0.84 ,   0.123],
       [  0.543,   0.373,   0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'

Indexing with labeled dimensions

With labeled dimensions, we do not have to rely on dimension order and can use them explicitly to slice data. There are two ways to do this:

  1. Use a dictionary as the argument for positional or label-based array indexing:

    # index by integer array indices
    In [8]: arr[dict(space=0, time=slice(None, 2))]
    Out[8]: 
    <xarray.DataArray (time: 2)>
    array([ 0.127,  0.897])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
        space    |S2 'IA'
    
    # index by dimension coordinate labels
    In [9]: arr.loc[dict(time=slice('2000-01-01', '2000-01-02'))]
    Out[9]: 
    <xarray.DataArray (time: 2, space: 3)>
    array([[  0.127, -10.   , -10.   ],
           [  0.897,   0.377,   0.336]])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
      * space    (space) |S2 'IA' 'IL' 'IN'
    
  2. Use the sel() and isel() convenience methods:

    # index by integer array indices
    In [10]: arr.isel(space=0, time=slice(None, 2))
    Out[10]: 
    <xarray.DataArray (time: 2)>
    array([ 0.127,  0.897])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
        space    |S2 'IA'
    
    # index by dimension coordinate labels
    In [11]: arr.sel(time=slice('2000-01-01', '2000-01-02'))
    Out[11]: 
    <xarray.DataArray (time: 2, space: 3)>
    array([[  0.127, -10.   , -10.   ],
           [  0.897,   0.377,   0.336]])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02
      * space    (space) |S2 'IA' 'IL' 'IN'
    

The arguments to these methods can be any objects that could index the array along the dimension given by the keyword, e.g., labels for an individual value, Python slice() objects or 1-dimensional arrays.
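
For example, each of the following selects along time with a different kind of indexer (a sketch using the arr from above):

arr.isel(time=0)               # an integer
arr.isel(time=slice(None, 2))  # a slice object
arr.isel(time=[0, 2])          # a 1-dimensional array of integers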

Note

We would love to be able to do indexing with labeled dimension names inside brackets, but unfortunately, Python does not yet support indexing with keyword arguments like arr[space=0].

Warning

Do not try to assign values to the output of the indexing methods isel, isel_points, sel or sel_points:

# DO NOT do this
arr.isel(space=0)[...] = 0

Depending on whether the underlying numpy indexing returns a copy or a view, the assignment may modify a temporary copy rather than the original array, and when it does, it fails silently. Instead, you should use normal index assignment:

# this is safe
arr[dict(space=0)] = 0

Pointwise indexing

xarray pointwise indexing supports indexing along multiple labeled dimensions using list-like objects. While isel() performs orthogonal indexing, the isel_points() method provides numpy-like indexing behavior, as if you were using multiple lists to index an array (e.g., arr[[0, 1], [0, 1]]):

# index by integer array indices
In [12]: da = xr.DataArray(np.arange(56).reshape((7, 8)), dims=['x', 'y'])

In [13]: da
Out[13]: 
<xarray.DataArray (x: 7, y: 8)>
array([[ 0,  1,  2, ...,  5,  6,  7],
       [ 8,  9, 10, ..., 13, 14, 15],
       [16, 17, 18, ..., 21, 22, 23],
       ..., 
       [32, 33, 34, ..., 37, 38, 39],
       [40, 41, 42, ..., 45, 46, 47],
       [48, 49, 50, ..., 53, 54, 55]])
Coordinates:
  * x        (x) int64 0 1 2 3 4 5 6
  * y        (y) int64 0 1 2 3 4 5 6 7

In [14]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0])
Out[14]: 
<xarray.DataArray (points: 3)>
array([ 0,  9, 48])
Coordinates:
    y        (points) int64 0 1 0
    x        (points) int64 0 1 6
  * points   (points) int64 0 1 2

There is also sel_points(), which analogously allows you to do point-wise indexing by label:

In [15]: times = pd.to_datetime(['2000-01-03', '2000-01-02', '2000-01-01'])

In [16]: arr.sel_points(space=['IA', 'IL', 'IN'], time=times)
Out[16]: 
<xarray.DataArray (points: 3)>
array([  0.451,   0.377, -10.   ])
Coordinates:
    time     (points) datetime64[ns] 2000-01-03 2000-01-02 2000-01-01
    space    (points) |S2 'IA' 'IL' 'IN'
  * points   (points) int64 0 1 2

The equivalent pandas method to sel_points is lookup().

Dataset indexing

We can also use these methods to index all variables in a dataset simultaneously, returning a new dataset:

In [17]: ds = arr.to_dataset(name='foo')

In [18]: ds.isel(space=[0], time=[0])
Out[18]: 
<xarray.Dataset>
Dimensions:  (space: 1, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA'
Data variables:
    foo      (time, space) float64 0.127

In [19]: ds.sel(time='2000-01-01')
Out[19]: 
<xarray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    foo      (space) float64 0.127 -10.0 -10.0

In [20]: ds2 = da.to_dataset(name='bar')

In [21]: ds2.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
Out[21]: 
<xarray.Dataset>
Dimensions:  (points: 3)
Coordinates:
    y        (points) int64 0 1 0
    x        (points) int64 0 1 6
  * points   (points) int64 0 1 2
Data variables:
    bar      (points) int64 0 9 48

Positional indexing on a dataset is not supported because the ordering of dimensions in a dataset is somewhat ambiguous (it can vary between different arrays). However, you can do normal indexing with labeled dimensions:

In [22]: ds[dict(space=[0], time=[0])]
Out[22]: 
<xarray.Dataset>
Dimensions:  (space: 1, time: 1)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA'
Data variables:
    foo      (time, space) float64 0.127

In [23]: ds.loc[dict(time='2000-01-01')]
Out[23]: 
<xarray.Dataset>
Dimensions:  (space: 3)
Coordinates:
    time     datetime64[ns] 2000-01-01
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    foo      (space) float64 0.127 -10.0 -10.0

Using indexing to assign values to a subset of dataset (e.g., ds[dict(space=0)] = 1) is not yet supported.

Dropping labels

The drop() method returns a new object with the listed index labels along a dimension dropped:

In [24]: ds.drop(['IN', 'IL'], dim='space')
Out[24]: 
<xarray.Dataset>
Dimensions:  (space: 1, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA'
Data variables:
    foo      (time, space) float64 0.127 0.8972 0.4514 0.543

drop is both a Dataset and DataArray method.
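
For example, the same call works directly on the array (a sketch using the arr from above):

# drop the 'IA' column, leaving a 4 x 2 DataArray
arr.drop(['IA'], dim='space')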

Nearest neighbor lookups

The label based selection methods sel(), reindex() and reindex_like() all support the method and tolerance keyword arguments. The method parameter enables nearest neighbor (inexact) lookups via the options 'pad', 'backfill' or 'nearest':

In [25]: data = xr.DataArray([1, 2, 3], dims='x')

In [26]: data.sel(x=[1.1, 1.9], method='nearest')
Out[26]: 
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
  * x        (x) int64 1 2

In [27]: data.sel(x=0.1, method='backfill')
Out[27]: 
<xarray.DataArray ()>
array(2)
Coordinates:
    x        int64 1

In [28]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
Out[28]: 
<xarray.DataArray (x: 5)>
array([1, 2, 2, 3, 3])
Coordinates:
  * x        (x) float64 0.5 1.0 1.5 2.0 2.5

Tolerance limits the maximum distance for valid matches with an inexact lookup:

In [29]: data.reindex(x=[1.1, 1.5], method='nearest', tolerance=0.2)
Out[29]: 
<xarray.DataArray (x: 2)>
array([  2.,  nan])
Coordinates:
  * x        (x) float64 1.1 1.5

Using method='nearest' or a scalar argument with .sel() requires pandas version 0.16 or newer. Using tolerance requires pandas version 0.17 or newer.

The method parameter is not yet supported if any of the arguments to .sel() is a slice object:

In [30]: data.sel(x=slice(1, 3), method='nearest')
NotImplementedError

However, you don’t need to use method to do inexact slicing. Slicing already returns all values inside the range (inclusive), as long as the index labels are monotonic increasing:

In [31]: data.sel(x=slice(0.9, 3.1))
Out[31]: 
<xarray.DataArray (x: 2)>
array([2, 3])
Coordinates:
  * x        (x) int64 1 2

Indexing axes with monotonic decreasing labels also works, as long as the slice or .loc arguments are also decreasing:

In [32]: reversed_data = data[::-1]

In [33]: reversed_data.loc[3.1:0.9]
Out[33]: 
<xarray.DataArray (x: 2)>
array([3, 2])
Coordinates:
  * x        (x) int64 2 1

Masking with where

Indexing methods on xarray objects generally return a subset of the original data. However, it is sometimes useful to select an object with the same shape as the original data, but with some elements masked. To do this type of selection in xarray, use where():

In [34]: arr2 = xr.DataArray(np.arange(16).reshape(4, 4), dims=['x', 'y'])

In [35]: arr2.where(arr2.x + arr2.y < 4)
Out[35]: 
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,   2.,   3.],
       [  4.,   5.,   6.,  nan],
       [  8.,   9.,  nan,  nan],
       [ 12.,  nan,  nan,  nan]])
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3

This is particularly useful for ragged indexing of multi-dimensional data, e.g., to apply a 2D mask to an image. Note that where follows all the usual xarray broadcasting and alignment rules for binary operations (e.g., +) between the object being indexed and the condition, as described in Computation:

In [36]: arr2.where(arr2.y < 2)
Out[36]: 
<xarray.DataArray (x: 4, y: 4)>
array([[  0.,   1.,  nan,  nan],
       [  4.,   5.,  nan,  nan],
       [  8.,   9.,  nan,  nan],
       [ 12.,  13.,  nan,  nan]])
Coordinates:
  * x        (x) int64 0 1 2 3
  * y        (y) int64 0 1 2 3

Multi-dimensional indexing

xarray does not yet support efficient routines for generalized multi-dimensional indexing or regridding. However, we are definitely interested in adding support for this in the future (see GH475 for the ongoing discussion).

Copies vs. views

Whether array indexing returns a view or a copy of the underlying data depends on the nature of the labels. For positional (integer) indexing, xarray follows the same rules as NumPy:

  • Positional indexing with only integers and slices returns a view.
  • Positional indexing with arrays or lists returns a copy.

The rules for label based indexing are more complex:

  • Label-based indexing with only slices returns a view.
  • Label-based indexing with arrays returns a copy.
  • Label-based indexing with scalars returns a view or a copy, depending upon if the corresponding positional indexer can be represented as an integer or a slice object. The exact rules are determined by pandas.

Whether data is a copy or a view is more predictable in xarray than in pandas, so unlike pandas, xarray does not produce SettingWithCopy warnings. However, you should still avoid assignment with chained indexing.
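
For example, here is the kind of chained assignment to avoid (a sketch using the arr from above):

# indexing with a list returns a copy, so this assignment is silently lost
arr[[0, 1]].loc[:, 'IA'] = -1
# assign through a single indexing expression instead
arr.loc['2000-01-01':'2000-01-02', 'IA'] = -1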

Orthogonal (outer) vs. vectorized indexing

Indexing with xarray objects has one important difference from indexing numpy arrays: you can only use one-dimensional arrays to index xarray objects, and each indexer is applied “orthogonally” along independent axes, instead of using numpy’s broadcasting rules to vectorize indexers. This means you can do indexing like this, which would require slightly more awkward syntax with numpy arrays:

In [37]: arr[arr['time.day'] > 1, arr['space'] != 'IL']
Out[37]: 
<xarray.DataArray (time: 3, space: 2)>
array([[ 0.897,  0.336],
       [ 0.451,  0.123],
       [ 0.543,  0.448]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IN'

This is a much simpler model than numpy’s advanced indexing. If you would like to do advanced-style array indexing in xarray, you have several options: pointwise indexing with isel_points() or sel_points(), masking with where(), or indexing the underlying numpy array directly:

In [38]: arr.values[arr.values > 0.5]
Out[38]: array([ 0.897,  0.84 ,  0.543])

Align and reindex

xarray’s reindex, reindex_like and align impose a DataArray or Dataset onto a new set of coordinates corresponding to dimensions. Values are kept for index labels shared by the old and new coordinates, and values corresponding to new labels not found in the original object are filled in with NaN.

xarray operations that combine multiple objects generally automatically align their arguments to share the same indexes. However, manual alignment can be useful for greater control and for increased performance.
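
For example, a minimal sketch of aligning once up front (the arrays here are hypothetical):

x = xr.DataArray([1, 2, 3], [('points', [0, 1, 2])])
y = xr.DataArray([4, 5, 6], [('points', [1, 2, 3])])
# align once before the loop; every operation inside then sees identical indexes
x, y = xr.align(x, y, join='inner')
for _ in range(1000):
    result = x + y  # realignment is now a cheap no-op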

To reindex a particular dimension, use reindex():

In [39]: arr.reindex(space=['IA', 'CA'])
Out[39]: 
<xarray.DataArray (time: 4, space: 2)>
array([[ 0.127,    nan],
       [ 0.897,    nan],
       [ 0.451,    nan],
       [ 0.543,    nan]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'CA'

The reindex_like() method is a useful shortcut. To demonstrate, we will make a subset DataArray with new values:

In [40]: foo = arr.rename('foo')

In [41]: baz = (10 * arr[:2, :2]).rename('baz')

In [42]: baz
Out[42]: 
<xarray.DataArray 'baz' (time: 2, space: 2)>
array([[   1.27 , -100.   ],
       [   8.972,    3.767]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) |S2 'IA' 'IL'

Reindexing foo with baz selects out the first two values along each dimension:

In [43]: foo.reindex_like(baz)
Out[43]: 
<xarray.DataArray 'foo' (time: 2, space: 2)>
array([[  0.127, -10.   ],
       [  0.897,   0.377]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) object 'IA' 'IL'

The opposite operation asks us to reindex to a larger shape, so we fill in the missing values with NaN:

In [44]: baz.reindex_like(foo)
Out[44]: 
<xarray.DataArray 'baz' (time: 4, space: 3)>
array([[   1.27 , -100.   ,      nan],
       [   8.972,    3.767,      nan],
       [     nan,      nan,      nan],
       [     nan,      nan,      nan]])
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) object 'IA' 'IL' 'IN'

The align() function lets us perform more flexible database-like 'inner', 'outer', 'left' and 'right' joins:

In [45]: xr.align(foo, baz, join='inner')
Out[45]: 
(<xarray.DataArray 'foo' (time: 2, space: 2)>
 array([[  0.127, -10.   ],
        [  0.897,   0.377]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02
   * space    (space) object 'IA' 'IL',
 <xarray.DataArray 'baz' (time: 2, space: 2)>
 array([[   1.27 , -100.   ],
        [   8.972,    3.767]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02
   * space    (space) object 'IA' 'IL')

In [46]: xr.align(foo, baz, join='outer')
Out[46]: 
(<xarray.DataArray 'foo' (time: 4, space: 3)>
 array([[  0.127, -10.   , -10.   ],
        [  0.897,   0.377,   0.336],
        [  0.451,   0.84 ,   0.123],
        [  0.543,   0.373,   0.448]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
   * space    (space) object 'IA' 'IL' 'IN',
 <xarray.DataArray 'baz' (time: 4, space: 3)>
 array([[   1.27 , -100.   ,      nan],
        [   8.972,    3.767,      nan],
        [     nan,      nan,      nan],
        [     nan,      nan,      nan]])
 Coordinates:
   * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
   * space    (space) object 'IA' 'IL' 'IN')

Both reindex_like and align work interchangeably between DataArray and Dataset objects, and with any number of matching dimension names:

In [47]: ds
Out[47]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    foo      (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...

In [48]: ds.reindex_like(baz)
Out[48]: 
<xarray.Dataset>
Dimensions:  (space: 2, time: 2)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02
  * space    (space) object 'IA' 'IL'
Data variables:
    foo      (time, space) float64 0.127 -10.0 0.8972 0.3767

In [49]: other = xr.DataArray(['a', 'b', 'c'], dims='other')

# this is a no-op, because there are no shared dimension names
In [50]: ds.reindex_like(other)
Out[50]: 
<xarray.Dataset>
Dimensions:  (space: 3, time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04
  * space    (space) |S2 'IA' 'IL' 'IN'
Data variables:
    foo      (time, space) float64 0.127 -10.0 -10.0 0.8972 0.3767 0.3362 ...

Computation

The labels associated with DataArray and Dataset objects enable some powerful shortcuts for computation, notably including aggregation and broadcasting by dimension names.

Basic array math

Arithmetic operations with a single DataArray automatically vectorize (like numpy) over all array values:

In [1]: arr = xr.DataArray(np.random.randn(2, 3),
   ...:                    [('x', ['a', 'b']), ('y', [10, 20, 30])])
   ...: 

In [2]: arr - 3
Out[2]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-2.5308877 , -3.28286334, -4.5090585 ],
       [-4.13563237, -1.78788797, -3.17321465]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

In [3]: abs(arr)
Out[3]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 ,  0.28286334,  1.5090585 ],
       [ 1.13563237,  1.21211203,  0.17321465]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

You can also use any of numpy’s or scipy’s many ufunc functions directly on a DataArray:

In [4]: np.sin(arr)
Out[4]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.45209466, -0.27910634, -0.99809483],
       [-0.90680094,  0.9363595 , -0.17234978]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

Data arrays also implement many numpy.ndarray methods:

In [5]: arr.round(2)
Out[5]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.47, -0.28, -1.51],
       [-1.14,  1.21, -0.17]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

In [6]: arr.T
Out[6]: 
<xarray.DataArray (y: 3, x: 2)>
array([[ 0.4691123 , -1.13563237],
       [-0.28286334,  1.21211203],
       [-1.5090585 , -0.17321465]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

Missing values

xarray objects borrow the isnull(), notnull(), count(), dropna() and fillna() methods for working with missing data from pandas:

In [7]: x = xr.DataArray([0, 1, np.nan, np.nan, 2], dims=['x'])

In [8]: x.isnull()
Out[8]: 
<xarray.DataArray (x: 5)>
array([False, False,  True,  True, False], dtype=bool)
Coordinates:
  * x        (x) int64 0 1 2 3 4

In [9]: x.notnull()
Out[9]: 
<xarray.DataArray (x: 5)>
array([ True,  True, False, False,  True], dtype=bool)
Coordinates:
  * x        (x) int64 0 1 2 3 4

In [10]: x.count()
Out[10]: 
<xarray.DataArray ()>
array(3)

In [11]: x.dropna(dim='x')
Out[11]: 
<xarray.DataArray (x: 3)>
array([ 0.,  1.,  2.])
Coordinates:
  * x        (x) int64 0 1 4

In [12]: x.fillna(-1)
Out[12]: 
<xarray.DataArray (x: 5)>
array([ 0.,  1., -1., -1.,  2.])
Coordinates:
  * x        (x) int64 0 1 2 3 4

Like pandas, xarray uses the float value np.nan (not-a-number) to represent missing values.

Aggregation

Aggregation methods have been updated to take a dim argument instead of axis. This allows for very intuitive syntax for aggregation methods that are applied along particular dimension(s):

In [13]: arr.sum(dim='x')
Out[13]: 
<xarray.DataArray (y: 3)>
array([-0.66652007,  0.92924868, -1.68227315])
Coordinates:
  * y        (y) int64 10 20 30

In [14]: arr.std(['x', 'y'])
Out[14]: 
<xarray.DataArray ()>
array(0.9156385956757354)

In [15]: arr.min()
Out[15]: 
<xarray.DataArray ()>
array(-1.5090585031735124)

If you need to figure out the axis number for a dimension yourself (say, for wrapping code designed to work with numpy arrays), you can use the get_axis_num() method:

In [16]: arr.get_axis_num('y')
Out[16]: 1

These operations automatically skip missing values, like in pandas:

In [17]: xr.DataArray([1, 2, np.nan, 3]).mean()
Out[17]: 
<xarray.DataArray ()>
array(2.0)

If desired, you can disable this behavior by invoking the aggregation method with skipna=False.
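
For example (a sketch):

# missing values propagate when skipna=False
xr.DataArray([1, 2, np.nan, 3]).mean(skipna=False)  # evaluates to nan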

Rolling window operations

DataArray objects include a rolling() method. This method supports rolling window aggregation:

In [18]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
   ....:                    dims=('x', 'y'))
   ....: 

In [19]: arr
Out[19]: 
<xarray.DataArray (x: 3, y: 5)>
array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
       [ 2.5,  3. ,  3.5,  4. ,  4.5],
       [ 5. ,  5.5,  6. ,  6.5,  7. ]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4

rolling() is applied along one dimension using the name of the dimension as a key (e.g. y) and the window size as the value (e.g. 3). We get back a Rolling object:

In [20]: arr.rolling(y=3)
Out[20]: DataArrayRolling [window->3,center->False,dim->y]

The label position and minimum number of periods in the rolling window are controlled by the center and min_periods arguments:

In [21]: arr.rolling(y=3, min_periods=2, center=True)
Out[21]: DataArrayRolling [window->3,min_periods->2,center->True,dim->y]

Aggregation and summary methods can be applied directly to the Rolling object:

In [22]: r = arr.rolling(y=3)

In [23]: r.mean()
Out[23]: 
<xarray.DataArray (y: 5, x: 3)>
array([[ nan,  nan,  nan],
       [ nan,  nan,  nan],
       [ 0.5,  3. ,  5.5],
       [ 1. ,  3.5,  6. ],
       [ 1.5,  4. ,  6.5]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4

In [24]: r.reduce(np.std)
Out[24]: 
<xarray.DataArray (y: 5, x: 3)>
array([[        nan,         nan,         nan],
       [        nan,         nan,         nan],
       [ 0.40824829,  0.40824829,  0.40824829],
       [ 0.40824829,  0.40824829,  0.40824829],
       [ 0.40824829,  0.40824829,  0.40824829]])
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4

Note that rolling window aggregations are much faster (both asymptotically and because they avoid a loop in Python) when bottleneck is installed. Otherwise, we fall back to a slower, pure Python implementation.

Finally, we can manually iterate through Rolling objects:

In [25]: for label, arr_window in r:
   ....:     # arr_window is a view of arr
   ....:     pass
   ....: 

Broadcasting by dimension name

DataArray objects automatically align themselves (“broadcasting” in the numpy parlance) by dimension name instead of axis order. With xarray, you do not need to transpose arrays or insert dimensions of length 1 to get array operations to work, as commonly done in numpy with np.reshape() or np.newaxis.

This is best illustrated by a few examples. Consider two one-dimensional arrays with different sizes aligned along different dimensions:

In [26]: a = xr.DataArray([1, 2], [('x', ['a', 'b'])])

In [27]: a
Out[27]: 
<xarray.DataArray (x: 2)>
array([1, 2])
Coordinates:
  * x        (x) |S1 'a' 'b'

In [28]: b = xr.DataArray([-1, -2, -3], [('y', [10, 20, 30])])

In [29]: b
Out[29]: 
<xarray.DataArray (y: 3)>
array([-1, -2, -3])
Coordinates:
  * y        (y) int64 10 20 30

With xarray, we can apply binary mathematical operations to these arrays, and their dimensions are expanded automatically:

In [30]: a * b
Out[30]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-1, -2, -3],
       [-2, -4, -6]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

Moreover, dimensions are always reordered to the order in which they first appeared:

In [31]: c = xr.DataArray(np.arange(6).reshape(3, 2), [b['y'], a['x']])

In [32]: c
Out[32]: 
<xarray.DataArray (y: 3, x: 2)>
array([[0, 1],
       [2, 3],
       [4, 5]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

In [33]: a + c
Out[33]: 
<xarray.DataArray (x: 2, y: 3)>
array([[1, 3, 5],
       [3, 5, 7]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

This means, for example, that you can always subtract an array from its transpose:

In [34]: c - c.T
Out[34]: 
<xarray.DataArray (y: 3, x: 2)>
array([[0, 0],
       [0, 0],
       [0, 0]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

You can explicitly broadcast xarray data structures by using the broadcast() function:

a2, b2 = xr.broadcast(a, b)
# a2 and b2 are now both two-dimensional, with dimensions ('x', 'y')

Automatic alignment

xarray enforces alignment between index Coordinates (that is, coordinates with the same name as a dimension, marked by *) on objects used in binary operations.

Similarly to pandas, this alignment is automatic for binary arithmetic operations. Note that unlike pandas, the result of a binary operation is indexed by the intersection (not the union) of the coordinate labels:

In [35]: arr + arr[:1]
Out[35]: 
<xarray.DataArray (x: 1, y: 5)>
array([[ 0.,  1.,  2.,  3.,  4.]])
Coordinates:
  * x        (x) int64 0
  * y        (y) int64 0 1 2 3 4

If the result would be empty, an error is raised instead:

In [36]: arr[:2] + arr[2:]
ValueError: no overlapping labels for some dimensions: ['x']

Before loops or performance critical code, it’s a good idea to align arrays explicitly (e.g., by putting them in the same Dataset or using align()) to avoid the overhead of repeated alignment with each operation. See Align and reindex for more details.

Note

There is no automatic alignment between arguments when performing in-place arithmetic operations such as +=. You will need to use manual alignment. This ensures in-place arithmetic never needs to modify data types.
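
A minimal sketch of manual alignment before an in-place operation (the arrays here are hypothetical):

x = xr.DataArray([1.0, 2.0, 3.0], [('t', [0, 1, 2])])
y = xr.DataArray([10.0, 20.0], [('t', [1, 2])])
# reindex y onto x's labels first, then operate in place
x += y.reindex_like(x).fillna(0.0)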

Coordinates

Although index coordinates are aligned, other coordinates are not, and if their values conflict, they will be dropped. This is necessary, for example, because indexing turns 1D coordinates into scalar coordinates:

In [37]: arr[0]
Out[37]: 
<xarray.DataArray (y: 5)>
array([ 0. ,  0.5,  1. ,  1.5,  2. ])
Coordinates:
    x        int64 0
  * y        (y) int64 0 1 2 3 4

In [38]: arr[1]
Out[38]: 
<xarray.DataArray (y: 5)>
array([ 2.5,  3. ,  3.5,  4. ,  4.5])
Coordinates:
    x        int64 1
  * y        (y) int64 0 1 2 3 4

# notice that the scalar coordinate 'x' is silently dropped
In [39]: arr[1] - arr[0]
Out[39]: 
<xarray.DataArray (y: 5)>
array([ 2.5,  2.5,  2.5,  2.5,  2.5])
Coordinates:
  * y        (y) int64 0 1 2 3 4

Still, xarray will persist other coordinates in arithmetic, as long as there are no conflicting values:

# only one argument has the 'x' coordinate
In [40]: arr[0] + 1
Out[40]: 
<xarray.DataArray (y: 5)>
array([ 1. ,  1.5,  2. ,  2.5,  3. ])
Coordinates:
    x        int64 0
  * y        (y) int64 0 1 2 3 4

# both arguments have the same 'x' coordinate
In [41]: arr[0] - arr[0]
Out[41]: 
<xarray.DataArray (y: 5)>
array([ 0.,  0.,  0.,  0.,  0.])
Coordinates:
    x        int64 0
  * y        (y) int64 0 1 2 3 4

Math with datasets

Datasets support arithmetic operations by automatically looping over all data variables:

In [42]: ds = xr.Dataset({'x_and_y': (('x', 'y'), np.random.randn(3, 5)),
   ....:                  'x_only': ('x', np.random.randn(3))},
   ....:                  coords=arr.coords)
   ....: 

In [43]: ds > 0
Out[43]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * y        (y) int64 0 1 2 3 4
  * x        (x) int64 0 1 2
Data variables:
    x_only   (x) bool True False True
    x_and_y  (x, y) bool True False False False False True True False False ...

Datasets support most of the same methods found on data arrays:

In [44]: ds.mean(dim='x')
Out[44]: 
<xarray.Dataset>
Dimensions:  (y: 5)
Coordinates:
  * y        (y) int64 0 1 2 3 4
Data variables:
    x_only   float64 -0.2799
    x_and_y  (y) float64 0.2553 0.08145 -0.4308 -1.411 -0.2989

In [45]: abs(ds)
Out[45]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * y        (y) int64 0 1 2 3 4
  * x        (x) int64 0 1 2
Data variables:
    x_only   (x) float64 0.1136 1.478 0.525
    x_and_y  (x, y) float64 0.1192 1.044 0.8618 2.105 0.4949 1.072 0.7216 ...

Unfortunately, a limitation of the current version of numpy means that we cannot override ufuncs for datasets, because datasets cannot be written as a single array [1]. apply() works around this limitation by applying the given function to each variable in the dataset:

In [46]: ds.apply(np.sin)
Out[46]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * x        (x) int64 0 1 2
  * y        (y) int64 0 1 2 3 4
Data variables:
    x_only   (x) float64 0.1134 -0.9957 0.5012
    x_and_y  (x, y) float64 0.1189 -0.8645 -0.759 -0.8609 -0.475 0.8781 ...

Datasets also use looping over variables for broadcasting in binary arithmetic. You can do arithmetic between any DataArray and a dataset:

In [47]: ds + arr
Out[47]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * y        (y) int64 0 1 2 3 4
  * x        (x) int64 0 1 2
Data variables:
    x_only   (x, y) float64 0.1136 0.6136 1.114 1.614 2.114 1.022 1.522 ...
    x_and_y  (x, y) float64 0.1192 -0.5442 0.1382 -0.6046 1.505 3.572 3.722 ...

Arithmetic between two datasets matches data variables of the same name:

In [48]: ds2 = xr.Dataset({'x_and_y': 0, 'x_only': 100})

In [49]: ds - ds2
Out[49]: 
<xarray.Dataset>
Dimensions:  (x: 3, y: 5)
Coordinates:
  * y        (y) int64 0 1 2 3 4
  * x        (x) int64 0 1 2
Data variables:
    x_only   (x) float64 -99.89 -101.5 -99.48
    x_and_y  (x, y) float64 0.1192 -1.044 -0.8618 -2.105 -0.4949 1.072 ...

Similarly to index based alignment, the result has the intersection of all matching variables, and ValueError is raised if the result would be empty.

[1] In some future version of NumPy, we should be able to override ufuncs for datasets by making use of __numpy_ufunc__.

GroupBy: split-apply-combine

xarray supports “group by” operations with the same API as pandas to implement the split-apply-combine strategy:

  • Split your data into multiple independent groups.
  • Apply some function to each group.
  • Combine your groups back into a single data object.

Group by operations work on both Dataset and DataArray objects. Currently, you can only group by a single one-dimensional variable (eventually, we hope to remove this limitation). Also, note that for one-dimensional data, it is usually faster to rely on pandas’ implementation of the same pipeline.

Split

Let’s create a simple example dataset:

In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 3))},
   ...:                 coords={'x': [10, 20, 30, 40],
   ...:                         'letters': ('x', list('abba'))})
   ...: 

In [2]: arr = ds['foo']

In [3]: ds
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'
  * y        (y) int64 0 1 2
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

If we groupby the name of a variable or coordinate in a dataset (we can also use a DataArray directly), we get back a GroupBy object:

In [4]: ds.groupby('letters')
Out[4]: <xarray.core.groupby.DatasetGroupBy at 0x7f0c293abfd0>

This object works very similarly to a pandas GroupBy object. You can view the group indices with the groups attribute:

In [5]: ds.groupby('letters').groups
Out[5]: {'a': [0, 3], 'b': [1, 2]}

You can also iterate over groups in (label, group) pairs:

In [6]: list(ds.groupby('letters'))
Out[6]: 
[('a', <xarray.Dataset>
  Dimensions:  (x: 2, y: 3)
  Coordinates:
    * x        (x) int64 10 40
      letters  (x) |S1 'a' 'a'
    * y        (y) int64 0 1 2
  Data variables:
      foo      (x, y) float64 0.127 0.9667 0.2605 0.543 0.373 0.448),
 ('b', <xarray.Dataset>
  Dimensions:  (x: 2, y: 3)
  Coordinates:
    * x        (x) int64 20 30
      letters  (x) |S1 'b' 'b'
    * y        (y) int64 0 1 2
  Data variables:
      foo      (x, y) float64 0.8972 0.3767 0.3362 0.4514 0.8403 0.1231)]

Just like in pandas, creating a GroupBy object is cheap: it does not actually split the data until you access particular values.

Apply

To apply a function to each group, you can use the flexible apply() method. The resulting objects are automatically concatenated back together along the group axis:

In [7]: def standardize(x):
   ...:     return (x - x.mean()) / x.std()
   ...: 

In [8]: arr.groupby('letters').apply(standardize)
Out[8]: 
<xarray.DataArray 'foo' (x: 4, y: 3)>
array([[-1.23 ,  1.937, -0.726],
       [ 1.42 , -0.46 , -0.607],
       [-0.191,  1.214, -1.376],
       [ 0.339, -0.302, -0.019]])
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'

GroupBy objects also have a reduce() method and methods like mean() as shortcuts for applying an aggregation function:

In [9]: arr.groupby('letters').mean(dim='x')
Out[9]: 
<xarray.DataArray 'foo' (letters: 2, y: 3)>
array([[ 0.335,  0.67 ,  0.354],
       [ 0.674,  0.609,  0.23 ]])
Coordinates:
  * y        (y) int64 0 1 2
  * letters  (letters) object 'a' 'b'

Using a groupby is thus also a convenient shortcut for aggregating over all dimensions other than the provided one:

In [10]: ds.groupby('x').std()
Out[10]: 
<xarray.Dataset>
Dimensions:  (x: 4)
Coordinates:
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'
Data variables:
    foo      (x) float64 0.3684 0.2554 0.2931 0.06957

First and last

There are two special aggregation operations that are currently only found on groupby objects: first and last. These provide the first or last occurrence of values for each group along the grouped dimension:

In [11]: ds.groupby('letters').first()
Out[11]: 
<xarray.Dataset>
Dimensions:  (letters: 2, y: 3)
Coordinates:
  * y        (y) int64 0 1 2
  * letters  (letters) object 'a' 'b'
Data variables:
    foo      (letters, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362

By default, they skip missing values (control this with skipna).

Grouped arithmetic

GroupBy objects also support a limited set of binary arithmetic operations, as a shortcut for mapping over all unique labels. Binary arithmetic is supported for (GroupBy, Dataset) and (GroupBy, DataArray) pairs, as long as the dataset or data array uses the unique grouped values as one of its index coordinates. For example:

In [12]: alt = arr.groupby('letters').mean()

In [13]: alt
Out[13]: 
<xarray.DataArray 'foo' (letters: 2)>
array([ 0.453,  0.504])
Coordinates:
  * letters  (letters) object 'a' 'b'

In [14]: ds.groupby('letters') - alt
Out[14]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) int64 10 20 30 40
    letters  (x) |S1 'a' 'b' 'b' 'a'
Data variables:
    foo      (x, y) float64 -0.3261 0.5137 -0.1926 0.3931 -0.1274 -0.1679 ...

This last line is roughly equivalent to the following:

results = []
for label, group in ds.groupby('letters'):
    results.append(group - alt.sel(letters=label))
xr.concat(results, dim='x')

Squeezing

When grouping over a dimension, you can control whether the dimension is squeezed out or if it should remain with length one on each group by using the squeeze parameter:

In [15]: next(iter(arr.groupby('x')))
Out[15]: 
(10, <xarray.DataArray 'foo' (y: 3)>
 array([ 0.127,  0.967,  0.26 ])
 Coordinates:
     x        int64 10
     letters  |S1 'a'
   * y        (y) int64 0 1 2)

In [16]: next(iter(arr.groupby('x', squeeze=False)))
Out[16]: 
(10, <xarray.DataArray 'foo' (x: 1, y: 3)>
 array([[ 0.127,  0.967,  0.26 ]])
 Coordinates:
   * x        (x) int64 10
     letters  (x) |S1 'a'
   * y        (y) int64 0 1 2)

Although xarray will attempt to automatically transpose dimensions back into their original order when you use apply, it is sometimes useful to set squeeze=False to guarantee that all original dimensions remain unchanged.

You can always squeeze explicitly later with the Dataset or DataArray squeeze() methods.
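
For example (a sketch using the arr from above):

# take the first group without squeezing, then squeeze explicitly
label, group = next(iter(arr.groupby('x', squeeze=False)))
group.squeeze('x')  # back to a 1D array over y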

Reshaping and reorganizing data

These methods allow you to reorganize your data by reordering dimensions, converting between Dataset and DataArray objects, stacking and unstacking dimensions, and shifting or rolling values along a dimension.

Reordering dimensions

To reorder dimensions on a DataArray or across all variables on a Dataset, use transpose() or the .T property:

In [1]: ds = xr.Dataset({'foo': (('x', 'y', 'z'), [[[42]]]), 'bar': (('y', 'z'), [[24]])})

In [2]: ds.transpose('y', 'z', 'x')
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Coordinates:
  * x        (x) int64 0
  * y        (y) int64 0
  * z        (z) int64 0
Data variables:
    foo      (y, z, x) int64 42
    bar      (y, z) int64 24

In [3]: ds.T
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Coordinates:
  * x        (x) int64 0
  * y        (y) int64 0
  * z        (z) int64 0
Data variables:
    foo      (z, y, x) int64 42
    bar      (z, y) int64 24

Converting between datasets and arrays

To convert from a Dataset to a DataArray, use to_array():

In [4]: arr = ds.to_array()

In [5]: arr
Out[5]: 
<xarray.DataArray (variable: 2, x: 1, y: 1, z: 1)>
array([[[[42]]],


       [[[24]]]])
Coordinates:
  * y         (y) int64 0
  * x         (x) int64 0
  * z         (z) int64 0
  * variable  (variable) |S3 'foo' 'bar'

This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.

To convert back from a DataArray to a Dataset, use to_dataset():

In [6]: arr.to_dataset(dim='variable')
Out[6]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Coordinates:
  * y        (y) int64 0
  * x        (x) int64 0
  * z        (z) int64 0
Data variables:
    foo      (x, y, z) int64 42
    bar      (x, y, z) int64 24

The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:

In [7]: ds2 = xr.Dataset({'a': 0, 'b': ('x', [3, 4, 5])})

# the input dataset has 4 elements
In [8]: ds2
Out[8]: 
<xarray.Dataset>
Dimensions:  (x: 3)
Coordinates:
  * x        (x) int64 0 1 2
Data variables:
    a        int64 0
    b        (x) int64 3 4 5

# the resulting array has 6 elements
In [9]: ds2.to_array()
Out[9]: 
<xarray.DataArray (variable: 2, x: 3)>
array([[0, 0, 0],
       [3, 4, 5]])
Coordinates:
  * variable  (variable) |S1 'a' 'b'
  * x         (x) int64 0 1 2

Otherwise, the result could not be represented as an orthogonal array.

If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:

In [10]: arr.to_dataset(name='combined')
Out[10]: 
<xarray.Dataset>
Dimensions:   (variable: 2, x: 1, y: 1, z: 1)
Coordinates:
  * y         (y) int64 0
  * x         (x) int64 0
  * z         (z) int64 0
  * variable  (variable) |S3 'foo' 'bar'
Data variables:
    combined  (variable, x, y, z) int64 42 24

Stack and unstack

As part of xarray’s nascent support for pandas.MultiIndex, we have implemented stack() and unstack() methods for combining or splitting dimensions:

In [11]: array = xr.DataArray(np.random.randn(2, 3),
   ....:                      coords=[('x', ['a', 'b']), ('y', [0, 1, 2])])
   ....: 

In [12]: stacked = array.stack(z=('x', 'y'))

In [13]: stacked
Out[13]: 
<xarray.DataArray (z: 6)>
array([ 0.469, -0.283, -1.509, -1.136,  1.212, -0.173])
Coordinates:
  * z        (z) object ('a', 0) ('a', 1) ('a', 2) ('b', 0) ('b', 1) ('b', 2)

In [14]: stacked.unstack('z')
Out[14]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
       [-1.136,  1.212, -0.173]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2

These methods are modeled on the pandas.DataFrame methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns.

Like DataFrame.unstack, xarray’s unstack always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with NaN in the resulting object:

In [15]: stacked2 = stacked[::2]

In [16]: stacked2
Out[16]: 
<xarray.DataArray (z: 3)>
array([ 0.469, -1.509,  1.212])
Coordinates:
  * z        (z) object ('a', 0) ('a', 2) ('b', 1)

In [17]: stacked2.unstack('z')
Out[17]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469,    nan, -1.509],
       [   nan,  1.212,    nan]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2

However, xarray’s stack has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare:

In [18]: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=['x', 'y'])

In [19]: array.stack(z=('x', 'y'))
Out[19]: 
<xarray.DataArray (z: 4)>
array([ nan,   1.,   2.,   3.])
Coordinates:
  * z        (z) object (0, 0) (0, 1) (1, 0) (1, 1)

In [20]: array.to_pandas().stack()
Out[20]: 
x  y
0  1    1
1  0    2
   1    3
dtype: float64

We departed from pandas’s behavior here because predictable shapes for new array dimensions are necessary for Out of core computation with dask.
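
If you do want the pandas behavior, one option is to drop the missing values explicitly after stacking (a sketch using the array from above):

array.stack(z=('x', 'y')).dropna(dim='z')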

Shift and roll

To adjust coordinate labels, you can use the shift() and roll() methods:

In [21]: array = xr.DataArray([1, 2, 3, 4], dims='x')

In [22]: array.shift(x=2)
Out[22]: 
<xarray.DataArray (x: 4)>
array([ nan,  nan,   1.,   2.])
Coordinates:
  * x        (x) int64 0 1 2 3

In [23]: array.roll(x=2)
Out[23]: 
<xarray.DataArray (x: 4)>
array([3, 4, 1, 2])
Coordinates:
  * x        (x) int64 2 3 0 1

Combining data

  • For combining datasets or data arrays along a dimension, see concatenate.
  • For combining datasets with different variables, see merge.

Concatenate

To combine arrays along an existing or new dimension into a larger array, you can use concat(). concat takes an iterable of DataArray or Dataset objects, as well as a dimension name, and concatenates along that dimension:

In [1]: arr = xr.DataArray(np.random.randn(2, 3),
   ...:                    [('x', ['a', 'b']), ('y', [10, 20, 30])])
   ...: 

In [2]: arr[:, :1]
Out[2]: 
<xarray.DataArray (x: 2, y: 1)>
array([[ 0.4691123 ],
       [-1.13563237]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10

# this resembles how you would use np.concatenate
In [3]: xr.concat([arr[:, :1], arr[:, 1:]], dim='y')
Out[3]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

In addition to combining along an existing dimension, concat can create a new dimension by stacking lower dimensional arrays together:

In [4]: arr[0]
Out[4]: 
<xarray.DataArray (y: 3)>
array([ 0.4691123 , -0.28286334, -1.5090585 ])
Coordinates:
    x        |S1 'a'
  * y        (y) int64 10 20 30

# to combine these 1d arrays into a 2d array in numpy, you would use np.array
In [5]: xr.concat([arr[0], arr[1]], 'x')
Out[5]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'

If the second argument to concat is a new dimension name, the arrays will be concatenated along that new dimension, which is always inserted as the first dimension:

In [6]: xr.concat([arr[0], arr[1]], 'new_dim')
Out[6]: 
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * y        (y) int64 10 20 30
    x        (new_dim) |S1 'a' 'b'
  * new_dim  (new_dim) int64 0 1

The second argument to concat can also be an Index or DataArray object as well as a string, in which case it is used to label the values along the new dimension:

In [7]: xr.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
Out[7]: 
<xarray.DataArray (new_dim: 2, y: 3)>
array([[ 0.4691123 , -0.28286334, -1.5090585 ],
       [-1.13563237,  1.21211203, -0.17321465]])
Coordinates:
  * y        (y) int64 10 20 30
    x        (new_dim) |S1 'a' 'b'
  * new_dim  (new_dim) int64 -90 -100

Of course, concat also works on Dataset objects:

In [8]: ds = arr.to_dataset(name='foo')

In [9]: xr.concat([ds.sel(x='a'), ds.sel(x='b')], 'x')
Out[9]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) int64 10 20 30
  * x        (x) |S1 'a' 'b'
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

concat() has a number of options which provide deeper control over which variables are concatenated and how it handles conflicting variables between datasets. With the default parameters, xarray will load some coordinate variables into memory to compare them between datasets. This may be prohibitively expensive if you are manipulating your dataset lazily using Out of core computation with dask.

Merge

To combine variables and coordinates between multiple Datasets, you can use the merge() and update() methods. Merge checks for conflicting variables before merging and by default it returns a new Dataset:

In [10]: ds.merge({'hello': ('space', np.arange(3) + 10)})
Out[10]: 
<xarray.Dataset>
Dimensions:  (space: 3, x: 2, y: 3)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30
  * space    (space) int64 0 1 2
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    hello    (space) int64 10 11 12

If you merge another dataset (or a dictionary including data array objects), by default the resulting dataset will be aligned on the union of all index coordinates:

In [11]: other = xr.Dataset({'bar': ('x', [1, 2, 3, 4]), 'x': list('abcd')})

In [12]: ds.merge(other)
Out[12]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 3)
Coordinates:
  * x        (x) object 'a' 'b' 'c' 'd'
  * y        (y) int64 10 20 30
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732 nan ...
    bar      (x) int64 1 2 3 4

This ensures that the merge is non-destructive.

The same non-destructive merging between DataArray index coordinates is used in the Dataset constructor:

In [13]: xr.Dataset({'a': arr[:-1], 'b': arr[1:]})
Out[13]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 10 20 30
Data variables:
    a        (x, y) float64 0.4691 -0.2829 -1.509 nan nan nan
    b        (x, y) float64 nan nan nan -1.136 1.212 -0.1732

Update

In contrast to merge, update modifies a dataset in-place without checking for conflicts, and will overwrite any existing variables with new values:

In [14]: ds.update({'space': ('space', [10.2, 9.4, 3.9])})
Out[14]: 
<xarray.Dataset>
Dimensions:  (space: 3, x: 2, y: 3)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30
  * space    (space) float64 10.2 9.4 3.9
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

However, dimensions are still required to be consistent between different Dataset variables, so you cannot change the size of a dimension unless you replace all dataset variables that use it.

update also performs automatic alignment if necessary. Unlike merge, it maintains the alignment of the original array instead of merging indexes:

In [15]: ds.update(other)
Out[15]: 
<xarray.Dataset>
Dimensions:  (space: 3, x: 2, y: 3)
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 10 20 30
  * space    (space) float64 10.2 9.4 3.9
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    bar      (x) int64 1 2

The exact same alignment logic is used when setting a variable with __setitem__ syntax:

In [16]: ds['baz'] = xr.DataArray([9, 9, 9, 9, 9], coords=[('x', list('abcde'))])

In [17]: ds.baz
Out[17]: 
<xarray.DataArray 'baz' (x: 2)>
array([9, 9])
Coordinates:
  * x        (x) object 'a' 'b'

Equals and identical

xarray objects can be compared by using the equals(), identical() and broadcast_equals() methods. These methods are used by the optional compat argument on concat and merge.

equals checks dimension names, indexes and array values:

In [18]: arr.equals(arr.copy())
Out[18]: True

identical also checks attributes, and the name of each object:

In [19]: arr.identical(arr.rename('bar'))
Out[19]: False

broadcast_equals does a more relaxed form of equality check that allows variables to have different dimensions, as long as values are constant along those new dimensions:

In [20]: left = xr.Dataset(coords={'x': 0})

In [21]: right = xr.Dataset({'x': [0, 0, 0]})

In [22]: left.broadcast_equals(right)
Out[22]: True

Like pandas objects, two xarray objects are still equal or identical if they have missing values marked by NaN in the same locations.

In contrast, the == operation performs element-wise comparison (like numpy):

In [23]: arr == arr.copy()
Out[23]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ True,  True,  True],
       [ True,  True,  True]], dtype=bool)
Coordinates:
  * x        (x) |S1 'a' 'b'
  * y        (y) int64 10 20 30

Note that NaN does not compare equal to NaN in element-wise comparison; you may need to deal with missing values explicitly.
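
One way to treat NaN as equal in an element-wise comparison is to combine == with isnull() (a sketch with hypothetical arrays):

a = xr.DataArray([1.0, np.nan])
b = xr.DataArray([1.0, np.nan])
a == b                                # False at the NaN position
(a == b) | (a.isnull() & b.isnull())  # True at every position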

Time series data

A major use case for xarray is multi-dimensional time-series data. Accordingly, we’ve copied to xarray many of the features that make working with time-series data in pandas such a joy. In most cases, we rely on pandas for the core functionality.

Creating datetime64 data

xarray uses the numpy dtypes datetime64[ns] and timedelta64[ns] to represent datetime data, which offer vectorized (if sometimes buggy) operations with numpy and smooth integration with pandas.

To convert to or create regular arrays of datetime64 data, we recommend using pandas.to_datetime() and pandas.date_range():

In [1]: pd.to_datetime(['2000-01-01', '2000-02-02'])
Out[1]: DatetimeIndex(['2000-01-01', '2000-02-02'], dtype='datetime64[ns]', freq=None)

In [2]: pd.date_range('2000-01-01', periods=365)
Out[2]: 
DatetimeIndex(['2000-01-01', '2000-01-02', '2000-01-03', '2000-01-04',
               '2000-01-05', '2000-01-06', '2000-01-07', '2000-01-08',
               '2000-01-09', '2000-01-10',
               ...
               '2000-12-21', '2000-12-22', '2000-12-23', '2000-12-24',
               '2000-12-25', '2000-12-26', '2000-12-27', '2000-12-28',
               '2000-12-29', '2000-12-30'],
              dtype='datetime64[ns]', length=365, freq='D')

Alternatively, you can supply arrays of Python datetime objects. These get converted automatically when used as arguments in xarray objects:

In [3]: import datetime

In [4]: xr.Dataset({'time': datetime.datetime(2000, 1, 1)})
Out[4]: 
<xarray.Dataset>
Dimensions:  ()
Coordinates:
    *empty*
Data variables:
    time     datetime64[ns] 2000-01-01

When reading or writing netCDF files, xarray automatically decodes datetime and timedelta arrays using CF conventions (that is, by using a units attribute like 'days since 2000-01-01').

You can manually decode arrays in this form by passing a dataset to decode_cf():

In [5]: attrs = {'units': 'hours since 2000-01-01'}

In [6]: ds = xr.Dataset({'time': ('time', [0, 1, 2, 3], attrs)})

In [7]: xr.decode_cf(ds)
Out[7]: 
<xarray.Dataset>
Dimensions:  (time: 4)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
    *empty*

One unfortunate limitation of using datetime64[ns] is that it limits the native representation of dates to those that fall between the years 1678 and 2262. When a netCDF file contains dates outside of these bounds, dates will be returned as arrays of netcdftime.datetime objects.

Datetime indexing

xarray borrows powerful indexing machinery from pandas (see Indexing and selecting data).

This allows for several useful and succinct forms of indexing, particularly for datetime64 data. For example, we support indexing with strings for single items and with slice objects:

In [8]: time = pd.date_range('2000-01-01', freq='H', periods=365 * 24)

In [9]: ds = xr.Dataset({'foo': ('time', np.arange(365 * 24)), 'time': time})

In [10]: ds.sel(time='2000-01')
Out[10]: 
<xarray.Dataset>
Dimensions:  (time: 744)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...
Data variables:
    foo      (time) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...

In [11]: ds.sel(time=slice('2000-06-01', '2000-06-10'))
Out[11]: 
<xarray.Dataset>
Dimensions:  (time: 240)
Coordinates:
  * time     (time) datetime64[ns] 2000-06-01 2000-06-01T01:00:00 ...
Data variables:
    foo      (time) int64 3648 3649 3650 3651 3652 3653 3654 3655 3656 3657 ...

You can also select a particular time by indexing with a datetime.time object:

In [12]: ds.sel(time=datetime.time(12))
Out[12]: 
<xarray.Dataset>
Dimensions:  (time: 365)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01T12:00:00 2000-01-02T12:00:00 ...
Data variables:
    foo      (time) int64 12 36 60 84 108 132 156 180 204 228 252 276 300 ...

For more details, read the pandas documentation.

Datetime components

xarray supports a notion of “virtual” or “derived” coordinates for datetime components implemented by pandas, including “year”, “month”, “day”, “hour”, “minute”, “second”, “dayofyear”, “week”, “dayofweek”, “weekday” and “quarter”:

In [13]: ds['time.month']
Out[13]: 
<xarray.DataArray 'month' (time: 8760)>
array([ 1,  1,  1, ..., 12, 12, 12], dtype=int32)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...

In [14]: ds['time.dayofyear']
Out[14]: 
<xarray.DataArray 'dayofyear' (time: 8760)>
array([  1,   1,   1, ..., 365, 365, 365], dtype=int32)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...

xarray adds 'season' to the list of datetime components supported by pandas:

In [15]: ds['time.season']
Out[15]: 
<xarray.DataArray 'season' (time: 8760)>
array(['DJF', 'DJF', 'DJF', ..., 'DJF', 'DJF', 'DJF'], 
      dtype='|S3')
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T01:00:00 ...

The set of valid seasons consists of ‘DJF’, ‘MAM’, ‘JJA’ and ‘SON’, labeled by the first letters of the corresponding months.

You can use these shortcuts with both Datasets and DataArray coordinates.

Resampling and grouped operations

Datetime components couple particularly well with grouped operations (see GroupBy: split-apply-combine) for analyzing features that repeat over time. Here’s how to calculate the mean by time of day:

In [16]: ds.groupby('time.hour').mean()
Out[16]: 
<xarray.Dataset>
Dimensions:  (hour: 24)
Coordinates:
  * hour     (hour) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 ...
Data variables:
    foo      (hour) float64 4.368e+03 4.369e+03 4.37e+03 4.371e+03 4.372e+03 ...

For upsampling or downsampling temporal resolutions, xarray offers a resample() method building on the core functionality offered by the pandas method of the same name. Resample uses essentially the same API as resample in pandas.

For example, we can downsample our dataset from hourly to 6-hourly:

In [17]: ds.resample('6H', dim='time', how='mean')
Out[17]: 
<xarray.Dataset>
Dimensions:  (time: 1460)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T06:00:00 ...
Data variables:
    foo      (time) float64 2.5 8.5 14.5 20.5 26.5 32.5 38.5 44.5 50.5 56.5 ...

Resample also works for upsampling, in which case intervals without any values are marked by NaN:

In [18]: ds.resample('30Min', 'time')
Out[18]: 
<xarray.Dataset>
Dimensions:  (time: 17519)
Coordinates:
  * time     (time) datetime64[ns] 2000-01-01 2000-01-01T00:30:00 ...
Data variables:
    foo      (time) float64 0.0 nan 1.0 nan 2.0 nan 3.0 nan 4.0 nan 5.0 nan ...

Of course, all of these resampling and groupby operations work on both Dataset and DataArray objects with any number of additional dimensions.

For more examples of using grouped operations on a time dimension, see Toy weather data.

Working with pandas

One of the most important features of xarray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built in to pandas itself or provided by the pandas aware libraries such as Seaborn.

Hierarchical and tidy data

Tabular data is easiest to work with when it meets the criteria for tidy data:

  • Each column holds a different variable.
  • Each row holds a different observation.

In this “tidy data” format, we can represent any Dataset and DataArray in terms of pandas.DataFrame and pandas.Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.

Dataset and DataFrame

To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:

In [1]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.randn(2, 3))},
   ...:                  coords={'x': [10, 20], 'y': ['a', 'b', 'c'],
   ...:                          'along_x': ('x', np.random.randn(2)),
   ...:                          'scalar': 123})
   ...: 

In [2]: ds
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) |S1 'a' 'b' 'c'
  * x        (x) int64 10 20
    scalar   int64 123
    along_x  (x) float64 0.1192 -1.044
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732

In [3]: df = ds.to_dataframe()

In [4]: df
Out[4]: 
           foo  scalar   along_x
x  y                            
10 a  0.469112     123  0.119209
   b -0.282863     123  0.119209
   c -1.509059     123  0.119209
20 a -1.135632     123 -1.044236
   b  1.212112     123 -1.044236
   c -0.173215     123 -1.044236

We see that each variable and coordinate in the Dataset is now a column in the DataFrame, with the exception of indexes, which end up in the DataFrame’s index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
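
For instance, here is a minimal sketch (reusing the df from above) that moves the MultiIndex levels back into ordinary columns:

flat = df.reset_index()  # 'x' and 'y' become ordinary columns alongside 'foo'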

To create a Dataset from a DataFrame, use the from_dataframe() class method:

In [5]: xr.Dataset.from_dataframe(df)
Out[5]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'
Data variables:
    foo      (x, y) float64 0.4691 -0.2829 -1.509 -1.136 1.212 -0.1732
    scalar   (x, y) int64 123 123 123 123 123 123
    along_x  (x, y) float64 0.1192 0.1192 0.1192 -1.044 -1.044 -1.044

Notice that the dimensions of the variables in the Dataset have expanded after the round-trip conversion to a DataFrame. This is because every object in a DataFrame must have the same indices, so we need to broadcast the data of each array to the full size of the new MultiIndex.

Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.

DataArray and Series

DataArray objects have a complementary representation in terms of a pandas.Series. Using a Series preserves the Dataset to DataArray relationship, because DataFrames are dict-like containers of Series. The methods are very similar to those for working with DataFrames:

In [6]: s = ds['foo'].to_series()

In [7]: s
Out[7]: 
x   y
10  a    0.469112
    b   -0.282863
    c   -1.509059
20  a   -1.135632
    b    1.212112
    c   -0.173215
Name: foo, dtype: float64

In [8]: xr.DataArray.from_series(s)
Out[8]: 
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
       [-1.136,  1.212, -0.173]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'

Both the from_series and from_dataframe methods use reindexing, so they work even if the hierarchical index is not a full tensor product:

In [9]: s[::2]
Out[9]: 
x   y
10  a    0.469112
    c   -1.509059
20  b    1.212112
Name: foo, dtype: float64

In [10]: xr.DataArray.from_series(s[::2])
Out[10]: 
<xarray.DataArray 'foo' (x: 2, y: 3)>
array([[ 0.469,    nan, -1.509],
       [   nan,  1.212,    nan]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'

Multi-dimensional data

DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality (i.e., a 1D array is converted to a Series, 2D to DataFrame and 3D to Panel):

In [11]: arr = xr.DataArray(np.random.randn(2, 3),
   ....:                    coords=[('x', [10, 20]), ('y', ['a', 'b', 'c'])])
   ....: 

In [12]: df = arr.to_pandas()

In [13]: df
Out[13]: 
y          a         b         c
x                               
10 -0.861849 -2.104569 -0.494929
20  1.071804  0.721555 -0.706771

To perform the inverse operation of converting any pandas objects into a data array with the same shape, simply use the DataArray constructor:

In [14]: xr.DataArray(df)
Out[14]: 
<xarray.DataArray (x: 2, y: 3)>
array([[-0.862, -2.105, -0.495],
       [ 1.072,  0.722, -0.707]])
Coordinates:
  * x        (x) int64 10 20
  * y        (y) object 'a' 'b' 'c'

xarray objects do not yet support hierarchical indexes, so if your data has a hierarchical index, you will either need to unstack it first or use the from_series() or from_dataframe() constructors described above.
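
For instance, here is a minimal sketch of both approaches, reusing the series s from above:

# unstack the inner MultiIndex level into columns, then convert
xr.DataArray(s.unstack('y'))
# or let from_series handle the MultiIndex directly
xr.DataArray.from_series(s)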

Serialization and IO

xarray supports direct serialization and IO to several file formats. For more options, consider exporting your objects to pandas (see the preceding section) and using its broad range of IO tools.

Pickle

The simplest way to serialize an xarray object is to use Python’s built-in pickle module:

In [1]: import cPickle as pickle

In [2]: ds = xr.Dataset({'foo': (('x', 'y'), np.random.rand(4, 5))},
   ...:                 coords={'x': [10, 20, 30, 40],
   ...:                         'y': pd.date_range('2000-01-01', periods=5),
   ...:                         'z': ('x', list('abcd'))})
   ...: 

# use the highest protocol (-1) because it is way faster than the default
# text based pickle format
In [3]: pkl = pickle.dumps(ds, protocol=-1)

In [4]: pickle.loads(pkl)
Out[4]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
  * x        (x) int64 10 20 30 40
    z        (x) |S1 'a' 'b' 'c' 'd'
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

Pickle support is important because it doesn’t require any external libraries and lets you use xarray objects with Python modules like multiprocessing. However, there are two important caveats:

  1. To simplify serialization, xarray’s support for pickle currently loads all array values into memory before dumping an object. This means it is not suitable for serializing datasets too big to load into memory (e.g., from netCDF or OPeNDAP).
  2. Pickle will only work as long as the internal data structure of xarray objects remains unchanged. Because the internal design of xarray is still being refined, we make no guarantees (at this point) that objects pickled with this version of xarray will work in future versions.

netCDF

Currently, the only disk based serialization format that xarray directly supports is netCDF. netCDF is a file format for fully self-described datasets that is widely used in the geosciences and supported on almost all platforms. We use netCDF because xarray was based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects. Recent versions of netCDF are based on the even more widely used HDF5 file format.

Reading and writing netCDF files with xarray requires the netCDF4-Python library or scipy to be installed.

We can save a Dataset to disk using the Dataset.to_netcdf method:

In [5]: ds.to_netcdf('saved_on_disk.nc')

By default, the file is saved as netCDF4 (assuming netCDF4-Python is installed). You can control the format and engine used to write the file with the format and engine arguments.

We can load netCDF files to create a new Dataset using open_dataset():

In [6]: ds_disk = xr.open_dataset('saved_on_disk.nc')

In [7]: ds_disk
Out[7]: 
<xarray.Dataset>
Dimensions:  (x: 4, y: 5)
Coordinates:
  * y        (y) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03 2000-01-04 ...
  * x        (x) int32 10 20 30 40
    z        (x) |S1 'a' 'b' 'c' 'd'
Data variables:
    foo      (x, y) float64 0.127 0.9667 0.2605 0.8972 0.3767 0.3362 0.4514 ...

A dataset can also be loaded or written to a specific group within a netCDF file. To load from a group, pass a group keyword argument to the open_dataset function. The group can be specified as a path-like string, e.g., to access subgroup ‘bar’ within group ‘foo’ pass ‘/foo/bar’ as the group argument. When writing multiple groups in one file, pass mode='a' to to_netcdf to ensure that each call does not delete the file.
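
For example, here is a minimal sketch of writing two groups into one file and reading one back (the file and group names are illustrative):

ds.to_netcdf('groups.nc', group='/foo/bar')             # creates the file
ds.to_netcdf('groups.nc', group='/foo/baz', mode='a')   # appends a second group
bar = xr.open_dataset('groups.nc', group='/foo/bar')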

Data is loaded lazily from netCDF files. You can manipulate, slice and subset Dataset and DataArray objects, and no array values are loaded into memory until you try to perform some sort of actual computation. For an example of how these lazy arrays work, see the OPeNDAP section below.

It is important to note that when you modify values of a Dataset, even one linked to files on disk, only the in-memory copy you are manipulating in xarray is modified: the original file on disk is never touched.

Tip

xarray’s lazy loading of remote or on-disk datasets is often but not always desirable. Before performing computationally intense operations, it is often a good idea to load a dataset entirely into memory by invoking the load() method.

Datasets have a close() method to close the associated netCDF file. However, it’s often cleaner to use a with statement:

# this automatically closes the dataset after use
In [8]: with xr.open_dataset('saved_on_disk.nc') as ds:
   ...:     print(ds.keys())
   ...: 
['y', 'x', 'foo', 'z']

Reading encoded data

NetCDF files follow some conventions for encoding datetime arrays (as numbers with a “units” attribute) and for packing and unpacking data (as described by the “scale_factor” and “add_offset” attributes). If the argument decode_cf=True (default) is given to open_dataset, xarray will attempt to automatically decode the values in the netCDF objects according to CF conventions. Sometimes this will fail, for example, if a variable has an invalid “units” or “calendar” attribute. For these cases, you can turn this decoding off manually.
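
For example, a minimal sketch of turning decoding off, either entirely or only for times:

raw = xr.open_dataset('saved_on_disk.nc', decode_cf=False)           # no CF decoding at all
raw_times = xr.open_dataset('saved_on_disk.nc', decode_times=False)  # times stay numeric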

You can view this encoding information (among others) in the DataArray.encoding attribute:

In [9]: ds_disk['y'].encoding
Out[9]: 
{'calendar': u'proleptic_gregorian',
 'chunksizes': None,
 'complevel': 0,
 'contiguous': True,
 'dtype': dtype('float64'),
 'fletcher32': False,
 'least_significant_digit': None,
 'shuffle': False,
 'source': 'saved_on_disk.nc',
 'units': u'days since 2000-01-01 00:00:00',
 'zlib': False}

Note that all operations that manipulate variables other than indexing will remove encoding information.

Writing encoded data

Conversely, you can customize how xarray writes netCDF files on disk by providing explicit encodings for each dataset variable. The encoding argument takes a dictionary with variable names as keys and variable specific encodings as values. These encodings are saved as attributes on the netCDF variables on disk, which allows xarray to faithfully read encoded data back into memory.

It is important to note that using encodings is entirely optional: if you do not supply any of these encoding options, xarray will write data to disk using a default encoding, or the options in the encoding attribute, if set. This works perfectly fine in most cases, but encoding can be useful for additional control, especially for enabling compression.

In the file on disk, these encodings are saved as attributes on each variable, which allows xarray and other CF-compliant tools for working with netCDF files to read the data correctly.

Scaling and type conversions

These encoding options work on any version of the netCDF file format:

  • dtype: Any valid NumPy dtype or string convertible to a dtype, e.g., 'int16' or 'float32'. This controls the type of the data written on disk.
  • _FillValue: Values of NaN in xarray variables are remapped to this value when saved on disk. This is important when converting floating point data with missing values to integers on disk, because NaN is not a valid value for integer dtypes.
  • scale_factor and add_offset: Used to convert from encoded data on disk to the decoded data in memory, according to the formula decoded = scale_factor * encoded + add_offset.

These parameters can be fruitfully combined to compress discretized data on disk. For example, to save the variable foo with a precision of 0.1 in 16-bit integers while converting NaN to -9999, we would use encoding={'foo': {'dtype': 'int16', 'scale_factor': 0.1, '_FillValue': -9999}}. Compression and decompression with such discretization is extremely fast.
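
Putting that together, a minimal sketch of saving foo with this encoding (the file name is illustrative):

ds.to_netcdf('discretized.nc',
             encoding={'foo': {'dtype': 'int16',
                               'scale_factor': 0.1,
                               '_FillValue': -9999}})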

Chunk based compression

zlib, complevel, fletcher32, contiguous and chunksizes can be used for enabling netCDF4/HDF5’s chunk based compression, as described in the documentation for createVariable for netCDF4-Python. This only works for netCDF4 files and thus requires using format='netCDF4' and either engine='netcdf4' or engine='h5netcdf'.

Chunk based gzip compression can yield impressive space savings, especially for sparse data, but it comes with significant performance overhead. HDF5 libraries can only read complete chunks back into memory, and maximum decompression speed is in the range of 50-100 MB/s. Worse, HDF5’s compression and decompression currently cannot be parallelized with dask. For these reasons, we recommend trying discretization based compression (described above) first.
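
If you do want chunk based compression, a minimal sketch might look like this (the file name and compression level are illustrative):

ds.to_netcdf('compressed.nc', format='netCDF4', engine='netcdf4',
             encoding={'foo': {'zlib': True, 'complevel': 4}})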

Time units

The units and calendar attributes control how xarray serializes datetime64 and timedelta64 arrays to datasets on disk as numeric values. The units encoding should be a string like 'days since 1900-01-01' for datetime64 data or a string like 'days' for timedelta64 data. calendar should be one of the calendar types supported by netCDF4-python: ‘standard’, ‘gregorian’, ‘proleptic_gregorian’, ‘noleap’, ‘365_day’, ‘360_day’, ‘julian’, ‘all_leap’, ‘366_day’.

By default, xarray uses the ‘proleptic_gregorian’ calendar and units of the smallest time difference between values, with a reference time of the first time value.
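
To override these defaults, a minimal sketch for the datetime coordinate 'y' from the dataset above:

ds.to_netcdf('times.nc',
             encoding={'y': {'units': 'days since 2000-01-01',
                             'calendar': 'proleptic_gregorian'}})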

OPeNDAP

xarray includes support for OPeNDAP (via the netCDF4 library or Pydap), which lets us access large datasets over HTTP.

For example, we can open a connection to GBs of weather data produced by the PRISM project, and hosted by IRI at Columbia:

In [10]: remote_data = xr.open_dataset(
   ....:     'http://iridl.ldeo.columbia.edu/SOURCES/.OSU/.PRISM/.monthly/dods',
   ....:     decode_times=False)
   ....: 

In [11]: remote_data
Out[11]: 
<xarray.Dataset>
Dimensions:  (T: 1422, X: 1405, Y: 621)
Coordinates:
  * X        (X) float32 -125.0 -124.958 -124.917 -124.875 -124.833 -124.792 -124.75 ...
  * T        (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 -772.5 -771.5 ...
  * Y        (Y) float32 49.9167 49.875 49.8333 49.7917 49.75 49.7083 49.6667 49.625 ...
Data variables:
    ppt      (T, Y, X) float64 ...
    tdmean   (T, Y, X) float64 ...
    tmax     (T, Y, X) float64 ...
    tmin     (T, Y, X) float64 ...
Attributes:
    Conventions: IRIDL
    expires: 1375315200

Note

Like many real-world datasets, this dataset does not entirely follow CF conventions. Unexpected formats will usually cause xarray’s automatic decoding to fail. The way to work around this is to either set decode_cf=False in open_dataset to turn off all use of CF conventions, or by only disabling the troublesome parser. In this case, we set decode_times=False because the time axis here provides the calendar attribute in a format that xarray does not expect (the integer 360 instead of a string like '360_day').

We can select and slice this data any number of times, and nothing is loaded over the network until we look at particular values:

In [12]: tmax = remote_data['tmax'][:500, ::3, ::3]

In [13]: tmax
Out[13]: 
<xarray.DataArray 'tmax' (T: 500, Y: 207, X: 469)>
[48541500 values with dtype=float64]
Coordinates:
  * Y        (Y) float32 49.9167 49.7917 49.6667 49.5417 49.4167 49.2917 ...
  * X        (X) float32 -125.0 -124.875 -124.75 -124.625 -124.5 -124.375 ...
  * T        (T) float32 -779.5 -778.5 -777.5 -776.5 -775.5 -774.5 -773.5 ...
Attributes:
    pointwidth: 120
    standard_name: air_temperature
    units: Celsius_scale
    expires: 1443657600

# the data is downloaded automatically when we make the plot
In [14]: tmax[0].plot()
_images/opendap-prism-tmax.png

Formats supported by PyNIO

xarray can also read GRIB, HDF4 and other file formats supported by PyNIO, if PyNIO is installed. To use PyNIO to read such files, supply engine='pynio' to open_dataset().

We recommend installing PyNIO via conda:

conda install -c dbrown pynio
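
A minimal sketch of reading such a file (the file name is hypothetical):

ds_grib = xr.open_dataset('forecast.grb', engine='pynio')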

Combining multiple files

NetCDF files are often encountered in collections, e.g., with different files corresponding to different model runs. xarray can straightforwardly combine such files into a single Dataset by making use of concat().

Note

Version 0.5 includes experimental support for manipulating datasets that don’t fit into memory with dask. If you have dask installed, you can open multiple files simultaneously using open_mfdataset():

xr.open_mfdataset('my/files/*.nc')

This function automatically concatenates and merges the files into a single xarray Dataset. For more details, see Reading and writing data.

For example, here’s how we could approximate MFDataset from the netCDF4 library:

from glob import glob
import xarray as xr

def read_netcdfs(files, dim):
    # glob expands paths with * to a list of files, like the unix shell
    paths = sorted(glob(files))
    datasets = [xr.open_dataset(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

read_netcdfs('/all/my/files/*.nc', dim='time')

This function will work in many cases, but it’s not very robust. First, it never closes files, which means it will fail once you need to load more than a few thousand files. Second, it assumes that you want all the data from each file and that it can all fit into memory. In many situations, you only need a small subset or an aggregated summary of the data from each file.

Here’s a slightly more sophisticated example of how to remedy these deficiencies:

def read_netcdfs(files, dim, transform_func=None):
    def process_one_path(path):
        # use a context manager, to ensure the file gets closed after use
        with xr.open_dataset(path) as ds:
            # transform_func should do some sort of selection or
            # aggregation
            if transform_func is not None:
                ds = transform_func(ds)
            # load all data from the transformed dataset, to ensure we can
            # use it after closing each original file
            ds.load()
            return ds

    paths = sorted(glob(files))
    datasets = [process_one_path(p) for p in paths]
    combined = xr.concat(datasets, dim)
    return combined

# here we suppose we only care about the combined mean of each file;
# you might also use indexing operations like .sel to subset datasets
read_netcdfs('/all/my/files/*.nc', dim='time',
             transform_func=lambda ds: ds.mean())

This pattern works well and is very robust. We’ve used similar code to process tens of thousands of files constituting 100s of GB of data.

Out of core computation with dask

xarray integrates with dask to support streaming computation on datasets that don’t fit into memory.

Currently, dask is an entirely optional feature for xarray. However, the benefits of using dask are sufficiently strong that dask may become a required dependency in a future version of xarray.

For a full example of how to use xarray’s dask integration, read the blog post introducing xarray and dask.

What is a dask array?


Dask divides arrays into many small pieces, called chunks, each of which is presumed to be small enough to fit into memory.

Unlike NumPy, which has eager evaluation, operations on dask arrays are lazy. Operations queue up a series of tasks mapped over blocks, and no computation is performed until you actually ask for values to be computed (e.g., to print results to your screen or write to disk). At that point, data is loaded into memory and computation proceeds in a streaming fashion, block-by-block.

The actual computation is controlled by a multiprocessing or thread pool, which allows dask to take full advantage of the multiple processors available on most modern computers.

For more details on dask, read its documentation.

Reading and writing data

The usual way to create a dataset filled with dask arrays is to load the data from a netCDF file or files. You can do this by supplying a chunks argument to open_dataset() or using the open_mfdataset() function.

In [1]: ds = xr.open_dataset('example-data.nc', chunks={'time': 10})

In [2]: ds
Out[2]: 
<xarray.Dataset>
Dimensions:      (latitude: 180, longitude: 360, time: 365)
Coordinates:
  * latitude     (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
  * time         (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude    (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
    temperature  (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...

In this example latitude and longitude do not appear in the chunks dict, so only one chunk will be used along those dimensions. It is also entirely equivalent to open a dataset using open_dataset and then chunk the data using the chunk method, e.g., xr.open_dataset('example-data.nc').chunk({'time': 10}).

To open multiple files simultaneously, use open_mfdataset():

xr.open_mfdataset('my/files/*.nc')

This function will automatically concatenate and merge datasets into one in the simple cases that it understands (see auto_combine() for the full disclaimer). By default, open_mfdataset will chunk each netCDF file into a single dask array; again, supply the chunks argument to control the size of the resulting dask arrays. In more complex cases, you can open each file individually using open_dataset and merge the result, as described in Combining data.

You’ll notice that printing a dataset still shows a preview of array values, even if they are actually dask arrays. We can do this quickly with dask because we only need to compute the first few values (typically from the first block). To reveal the true nature of an array, print a DataArray:

In [3]: ds.temperature
Out[3]: 
<xarray.DataArray 'temperature' (time: 365, latitude: 180, longitude: 360)>
dask.array<example..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>
Coordinates:
  * latitude   (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
  * time       (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude  (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 ...

Once you’ve manipulated a dask array, you can still write a dataset too big to fit into memory back to disk by using to_netcdf() in the usual way.

Using dask with xarray

Nearly all existing xarray methods (including those for indexing, computation, concatenating and grouped operations) have been extended to work automatically with dask arrays. When you load data as a dask array in an xarray data structure, almost all xarray operations will keep it as a dask array; when this is not possible, they will raise an exception rather than unexpectedly loading data into memory. Converting a dask array into memory generally requires an explicit conversion step. One notable exception is indexing operations: to enable label based indexing, xarray will automatically load coordinate labels into memory.

The easiest way to convert an xarray data structure from lazy dask arrays into eager, in-memory numpy arrays is to use the load() method:

In [4]: ds.load()
Out[4]: 
<xarray.Dataset>
Dimensions:      (latitude: 180, longitude: 360, time: 365)
Coordinates:
  * latitude     (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
  * time         (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
  * longitude    (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
Data variables:
    temperature  (time, latitude, longitude) float64 0.4691 -0.2829 -1.509 ...

You can also access values, which will always be a numpy array:

In [5]: ds.temperature.values
Out[5]: 
array([[[  4.691e-01,  -2.829e-01, ...,  -5.577e-01,   3.814e-01],
        [  1.337e+00,  -1.531e+00, ...,   8.726e-01,  -1.538e+00],
        ...
# truncated for brevity

Explicit conversion by wrapping a DataArray with np.asarray also works:

In [6]: np.asarray(ds.temperature)
Out[6]: 
array([[[  4.691e-01,  -2.829e-01, ...,  -5.577e-01,   3.814e-01],
        [  1.337e+00,  -1.531e+00, ...,   8.726e-01,  -1.538e+00],
        ...

With the current version of dask, there is no automatic alignment of chunks when performing operations between dask arrays with different chunk sizes. If your computation involves multiple dask arrays with different chunks, you may need to explicitly rechunk each array to ensure compatibility. With xarray, both converting data to dask arrays and changing the chunk sizes of existing dask arrays are done with the chunk() method:

In [7]: rechunked = ds.chunk({'latitude': 100, 'longitude': 100})

You can view the size of existing chunks on an array by viewing the chunks attribute:

In [8]: rechunked.chunks
Out[8]: Frozen(SortedKeysDict({'latitude': (100, 80), 'longitude': (100, 100, 100, 60), 'time': (10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 10, 5)}))

If the chunk sizes are not consistent across all the arrays in a dataset along a particular dimension, an exception is raised when you try to access .chunks.

Note

In the future, we would like to enable automatic alignment of dask chunksizes (but not the other way around). We might also require that all arrays in a dataset share the same chunking alignment. Neither of these are currently done.

NumPy ufuncs like np.sin currently only work on eagerly evaluated arrays (this will change with the next major NumPy release). We have provided replacements that also work on all xarray objects, including those that store lazy dask arrays, in the xarray.ufuncs module:

In [9]: import xarray.ufuncs as xu

In [10]: xu.sin(rechunked)
Out[10]: 
<xarray.Dataset>
Dimensions:      (latitude: 180, longitude: 360, time: 365)
Coordinates:
  * latitude     (latitude) float64 89.5 88.5 87.5 86.5 85.5 84.5 83.5 82.5 ...
  * longitude    (longitude) int64 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 ...
  * time         (time) datetime64[ns] 2015-01-01 2015-01-02 2015-01-03 ...
Data variables:
    temperature  (time, latitude, longitude) float64 0.4521 -0.2791 -0.9981 ...

To access dask arrays directly, use the new DataArray.data attribute. This attribute exposes array data either as a dask array or as a numpy array, depending on whether it has been loaded into dask or not:

In [11]: ds.temperature.data
Out[11]: dask.array<xarray-..., shape=(365, 180, 360), dtype=float64, chunksize=(10, 180, 360)>

Note

In the future, we may extend .data to support other “computable” array backends beyond dask and numpy (e.g., to support sparse arrays).

Chunking and performance

The chunks parameter has critical performance implications when using dask arrays. If your chunks are too small, queueing up operations will be extremely slow, because dask translates each operation into a huge number of tasks mapped across chunks. Computation on dask arrays with small chunks can also be slow, because each operation on a chunk has some fixed overhead from the Python interpreter and the dask task executor.

Conversely, if your chunks are too big, some of your computation may be wasted, because dask only computes results one chunk at a time.

A good rule of thumb is to create arrays with a minimum chunk size of at least one million elements (e.g., a 1000x1000 matrix). With large arrays (10+ GB), the cost of queueing up dask operations can be noticeable, and you may need even larger chunk sizes.
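
For example, here is a sketch of chunk sizes that land near one million elements per chunk for the example dataset above:

ds = xr.open_dataset('example-data.nc',
                     chunks={'time': 100, 'latitude': 100, 'longitude': 100})
# each chunk holds at most 100 * 100 * 100 = 1,000,000 float64 values (~8 MB)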

Plotting

Introduction

Labeled data enables expressive computations. These same labels can also be used to easily create informative plots.

xarray’s plotting capabilities are centered around xarray.DataArray objects. To plot xarray.Dataset objects, simply access the relevant DataArrays, i.e., dset['var1']. Here we focus mostly on arrays that are 2d or larger. If your data fits nicely into a pandas DataFrame, then you’re better off using one of the more developed tools there.

xarray plotting functionality is a thin wrapper around the popular matplotlib library. Matplotlib syntax and function names were copied as much as possible, which makes for an easy transition between the two. Matplotlib must be installed before xarray can plot.

For more extensive plotting applications consider the following projects:

  • Seaborn: “provides a high-level interface for drawing attractive statistical graphics.” Integrates well with pandas.
  • Holoviews: “Composable, declarative data structures for building even complex visualizations easily.” Works for 2d datasets.
  • Cartopy: Provides cartographic tools.

Imports

The following imports are necessary for all of the examples.

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: import matplotlib.pyplot as plt

In [4]: import xarray as xr

For these examples we’ll use the North American air temperature dataset.

In [5]: airtemps = xr.tutorial.load_dataset('air_temperature')

In [6]: airtemps
Out[6]: 
<xarray.Dataset>
Dimensions:  (lat: 25, lon: 53, time: 2920)
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * time     (time) datetime64[ns] 2013-01-01 2013-01-01T06:00:00 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...
Data variables:
    air      (time, lat, lon) float64 241.2 242.5 243.5 244.0 244.1 243.9 ...
Attributes:
    platform: Model
    Conventions: COARDS
    references: http://www.esrl.noaa.gov/psd/data/gridded/data.ncep.reanalysis.html
    description: Data is from NMC initialized reanalysis
(4x/day).  These are the 0.9950 sigma level values.
    title: 4x daily NMC reanalysis (1948)

# Convert to celsius
In [7]: air = airtemps.air - 273.15

One Dimension

Simple Example

xarray uses the coordinate name to label the x axis.

In [8]: air1d = air.isel(lat=10, lon=10)

In [9]: air1d.plot()
Out[9]: [<matplotlib.lines.Line2D at 0x7f0c29208b90>]
_images/plotting_1d_simple.png

Additional Arguments

Additional arguments are passed directly to the matplotlib function which does the work. For example, xarray.plot.line() calls matplotlib.pyplot.plot, passing in the index and the array values as x and y, respectively. So to make a line plot with blue triangles, a matplotlib format string can be used:

In [10]: air1d[:200].plot.line('b-^')
Out[10]: [<matplotlib.lines.Line2D at 0x7f0c292577d0>]
_images/plotting_1d_additional_args.png

Note

Not all xarray plotting methods support passing positional arguments to the wrapped matplotlib functions, but they do all support keyword arguments.

Keyword arguments work the same way, and are more explicit.

In [11]: air1d[:200].plot.line(color='purple', marker='o')
Out[11]: [<matplotlib.lines.Line2D at 0x7f0c297f2710>]
_images/plotting_example_sin3.png

Adding to Existing Axis

To add the plot to an existing axis pass in the axis as a keyword argument ax. This works for all xarray plotting methods. In this example axes is an array consisting of the left and right axes created by plt.subplots.

In [12]: fig, axes = plt.subplots(ncols=2)

In [13]: axes
Out[13]: 
array([<matplotlib.axes._subplots.AxesSubplot object at 0x7f0c29ec5c10>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x7f0c2bd3a890>], dtype=object)

In [14]: air1d.plot(ax=axes[0])
Out[14]: [<matplotlib.lines.Line2D at 0x7f0c2985a590>]

In [15]: air1d.plot.hist(ax=axes[1])
Out[15]: 
(array([   9.,   38.,  255.,  584.,  542.,  489.,  368.,  258.,  327.,   50.]),
 array([  0.95 ,   2.719,   4.488, ...,  15.102,  16.871,  18.64 ]),
 <a list of 10 Patch objects>)

In [16]: plt.tight_layout()

In [17]: plt.show()
_images/plotting_example_existing_axes.png

On the right is a histogram created by xarray.plot.hist().

Two Dimensions

Simple Example

The default method xarray.DataArray.plot() sees that the data is 2 dimensional and calls xarray.plot.pcolormesh().

In [18]: air2d = air.isel(time=500)

In [19]: air2d.plot()
Out[19]: <matplotlib.collections.QuadMesh at 0x7f0c2bdcfb10>
_images/2d_simple.png

All 2d plots in xarray allow the use of the keyword arguments yincrease and xincrease.

In [20]: air2d.plot(yincrease=False)
Out[20]: <matplotlib.collections.QuadMesh at 0x7f0c3c643f90>
_images/2d_simple_yincrease.png

Note

We use xarray.plot.pcolormesh() as the default two-dimensional plot method because it is more flexible than xarray.plot.imshow(). However, for large arrays, imshow can be much faster than pcolormesh. If speed is important to you and you are plotting a regular mesh, consider using imshow.
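
For example, a minimal sketch of the same plot using the faster image method:

air2d.plot.imshow()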

Missing Values

xarray plots data with missing values.

In [21]: bad_air2d = air2d.copy()

In [22]: bad_air2d[dict(lat=slice(0, 10), lon=slice(0, 25))] = np.nan

In [23]: bad_air2d.plot()
Out[23]: <matplotlib.collections.QuadMesh at 0x7f0c287fda90>
_images/plotting_missing_values.png

Nonuniform Coordinates

It’s not necessary for the coordinates to be evenly spaced. Both xarray.plot.pcolormesh() (default) and xarray.plot.contourf() can produce plots with nonuniform coordinates.

In [24]: b = air2d.copy()

# Apply a nonlinear transformation to one of the coords
In [25]: b.coords['lat'] = np.log(b.coords['lat'])

In [26]: b.plot()
Out[26]: <matplotlib.collections.QuadMesh at 0x7f0c299a5a50>
_images/plotting_nonuniform_coords.png

Calling Matplotlib

Since this is a thin wrapper around matplotlib, all the functionality of matplotlib is available.

In [27]: air2d.plot(cmap=plt.cm.Blues)
Out[27]: <matplotlib.collections.QuadMesh at 0x7f0c3c24a890>

In [28]: plt.title('These colors prove North America\nhas fallen in the ocean')
Out[28]: <matplotlib.text.Text at 0x7f0c28767710>

In [29]: plt.ylabel('latitude')
Out[29]: <matplotlib.text.Text at 0x7f0c29968a90>

In [30]: plt.xlabel('longitude')
Out[30]: <matplotlib.text.Text at 0x7f0c2990ae90>

In [31]: plt.tight_layout()

In [32]: plt.show()
_images/plotting_2d_call_matplotlib.png

Note

xarray methods update label information and generally play around with the axes, so any kind of update to the plot should be done after the call to xarray’s plot function. In the example below, plt.xlabel effectively does nothing, since air2d.plot() updates the xlabel.

In [33]: plt.xlabel('Never gonna see this.')
Out[33]: <matplotlib.text.Text at 0x7f0c28638d10>

In [34]: air2d.plot()
Out[34]: <matplotlib.collections.QuadMesh at 0x7f0c285af850>

In [35]: plt.show()
_images/plotting_2d_call_matplotlib2.png

Colormaps

xarray borrows logic from Seaborn to infer what kind of color map to use. For example, consider the original data in Kelvins rather than Celsius:

In [36]: airtemps.air.isel(time=0).plot()
Out[36]: <matplotlib.collections.QuadMesh at 0x7f0c2857d490>
_images/plotting_kelvin.png

The Celsius data contain 0, so a diverging color map was used. The Kelvin data do not contain 0, so the default color map was used.

Robust

Outliers often have an extreme effect on the output of the plot. Here we add two bad data points. This affects the color scale, washing out the plot.

In [37]: air_outliers = airtemps.air.isel(time=0).copy()

In [38]: air_outliers[0, 0] = 100

In [39]: air_outliers[-1, -1] = 400

In [40]: air_outliers.plot()
Out[40]: <matplotlib.collections.QuadMesh at 0x7f0c283aacd0>
_images/plotting_robust1.png

This plot shows that we have outliers. The easy way to visualize the data without the outliers is to pass the parameter robust=True. This will use the 2nd and 98th percentiles of the data to compute the color limits.

In [41]: air_outliers.plot(robust=True)
Out[41]: <matplotlib.collections.QuadMesh at 0x7f0c282df310>
_images/plotting_robust2.png

Observe that the ranges of the color bar have changed. The arrows on the color bar indicate that the colors include data points outside the bounds.

Discrete Colormaps

It is often useful, when visualizing 2d data, to use a discrete colormap, rather than the default continuous colormaps that matplotlib uses. The levels keyword argument can be used to generate plots with discrete colormaps. For example, to make a plot with 8 discrete color intervals:

In [42]: air2d.plot(levels=8)
Out[42]: <matplotlib.collections.QuadMesh at 0x7f0c28453950>
_images/plotting_discrete_levels.png

It is also possible to use a list of levels to specify the boundaries of the discrete colormap:

In [43]: air2d.plot(levels=[0, 12, 18, 30])
Out[43]: <matplotlib.collections.QuadMesh at 0x7f0c28826fd0>
_images/plotting_listed_levels.png

You can also specify a list of discrete colors through the colors argument:

In [44]: flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71"]

In [45]: air2d.plot(levels=[0, 12, 18, 30], colors=flatui)
Out[45]: <matplotlib.collections.QuadMesh at 0x7f0c280c72d0>
_images/plotting_custom_colors_levels.png

Finally, if you have Seaborn installed, you can also specify a seaborn color palette to the cmap argument. Note that levels must be specified with seaborn color palettes if using imshow or pcolormesh (but not with contour or contourf, since levels are chosen automatically).

In [46]: air2d.plot(levels=10, cmap='husl')
Out[46]: <matplotlib.collections.QuadMesh at 0x7f0c1e1ff050>
_images/plotting_seaborn_palette.png

Faceting

Faceting here refers to splitting an array along one or two dimensions and plotting each group. xarray’s basic plotting is useful for plotting two dimensional arrays. What about three or four dimensional arrays? That’s where facets become helpful.

Consider the temperature data set. There are 4 observations per day for two years, which makes for 2920 values along the time dimension. One way to visualize this data is to make a separate plot for each time period.

The faceted dimension should not have too many values; faceting on the time dimension would produce 2920 plots, which is far too many to be helpful. To handle this situation, try performing an operation that reduces the size of the data in some way. For example, we could compute the average air temperature for each month and reduce the time dimension from 2920 values to 12 (see the sketch below). A simpler way is to just take a slice on that dimension, so let’s use a slice to pick 6 times throughout the first year.
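
The monthly reduction mentioned above is a one-liner; here is a minimal sketch (the examples that follow use the slice approach instead):

monthly = air.groupby('time.month').mean('time')  # time: 2920 -> month: 12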

In [47]: t = air.isel(time=slice(0, 365 * 4, 250))

In [48]: t.coords
Out[48]: 
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
  * time     (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00 2013-05-06 ...
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...

Simple Example

The easiest way to create faceted plots is to pass in row or col arguments to the xarray plotting methods/functions. This returns an xarray.plot.FacetGrid object.

In [49]: g_simple = t.plot(x='lon', y='lat', col='time', col_wrap=3)
_images/plot_facet_dataarray.png

4 dimensional

For 4 dimensional arrays we can use the rows and columns of the grids. Here we create a 4 dimensional array by taking the original data and adding a fixed amount. Now we can see how the temperature maps would compare if one were much hotter.

In [50]: t2 = t.isel(time=slice(0, 2))

In [51]: t4d = xr.concat([t2, t2 + 40], pd.Index(['normal', 'hot'], name='fourth_dim'))

# This is a 4d array
In [52]: t4d.coords
Out[52]: 
Coordinates:
  * lat         (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 ...
  * time        (time) datetime64[ns] 2013-01-01 2013-03-04T12:00:00
  * lon         (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 ...
  * fourth_dim  (fourth_dim) object 'normal' 'hot'

In [53]: t4d.plot(x='lon', y='lat', col='time', row='fourth_dim')
Out[53]: <xarray.plot.facetgrid.FacetGrid at 0x7f0c1de79410>
_images/plot_facet_4d.png

Other features

Faceted plotting supports other arguments common to xarray 2d plots.

In [54]: hasoutliers = t.isel(time=slice(0, 5)).copy()

In [55]: hasoutliers[0, 0, 0] = -100

In [56]: hasoutliers[-1, -1, -1] = 400

In [57]: g = hasoutliers.plot.pcolormesh('lon', 'lat', col='time', col_wrap=3,
   ....:                                 robust=True, cmap='viridis')
   ....: 
_images/plot_facet_robust.png

FacetGrid Objects

xarray.plot.FacetGrid is used to control the behavior of the multiple plots. It borrows an API and code from Seaborn. The structure is contained within the axes and name_dicts attributes, both 2d NumPy object arrays.

In [58]: g.axes
Out[58]: 
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f0c1e09a990>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0c1d9f2f50>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0c1d937ad0>],
       [<matplotlib.axes._subplots.AxesSubplot object at 0x7f0c1d926850>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0c1d8a7650>,
        <matplotlib.axes._subplots.AxesSubplot object at 0x7f0c1d7f5e50>]], dtype=object)

In [59]: g.name_dicts
Out[59]: 
array([[{'time': numpy.datetime64('2013-01-01T00:00:00.000000000+0000')},
        {'time': numpy.datetime64('2013-03-04T12:00:00.000000000+0000')},
        {'time': numpy.datetime64('2013-05-06T00:00:00.000000000+0000')}],
       [{'time': numpy.datetime64('2013-07-07T12:00:00.000000000+0000')},
        {'time': numpy.datetime64('2013-09-08T00:00:00.000000000+0000')}, None]], dtype=object)

It’s possible to select the xarray.DataArray or xarray.Dataset corresponding to the FacetGrid through the name_dicts.

In [60]: g.data.loc[g.name_dicts[0, 0]]
Out[60]: 
<xarray.DataArray 'air' (lat: 25, lon: 53)>
array([[-100.  ,  -30.65,  -29.65, ...,  -40.35,  -37.65,  -34.55],
       [ -29.35,  -28.65,  -28.45, ...,  -40.35,  -37.85,  -33.85],
       [ -23.15,  -23.35,  -24.26, ...,  -39.95,  -36.76,  -31.45],
       ..., 
       [  23.45,   23.05,   23.25, ...,   22.25,   21.95,   21.55],
       [  22.75,   23.05,   23.64, ...,   22.75,   22.75,   22.05],
       [  23.14,   23.64,   23.95, ...,   23.75,   23.64,   23.45]])
Coordinates:
  * lat      (lat) float32 75.0 72.5 70.0 67.5 65.0 62.5 60.0 57.5 55.0 52.5 ...
    time     datetime64[ns] 2013-01-01
  * lon      (lon) float32 200.0 202.5 205.0 207.5 210.0 212.5 215.0 217.5 ...

Here is an example of using the lower level API and then modifying the axes after they have been plotted.

In [61]: g = t.plot.imshow('lon', 'lat', col='time', col_wrap=3, robust=True)

In [62]: for i, ax in enumerate(g.axes.flat):
   ....:     ax.set_title('Air Temperature %d' % i)
   ....: 

In [63]: bottomright = g.axes[-1, -1]

In [64]: bottomright.annotate('bottom right', (240, 40))
Out[64]: <matplotlib.text.Annotation at 0x7f0c1d75f3d0>

In [65]: plt.show()
_images/plot_facet_iterator.png

TODO: add an example of using the map method to plot dataset variables (e.g., with plt.quiver).

Maps

To follow this section you’ll need to have Cartopy installed and working.

This script will plot the air temperature on a map.

import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs


air = (xr.tutorial
       .load_dataset('air_temperature')
       .air
       .isel(time=0))

ax = plt.axes(projection=ccrs.Orthographic(-80, 35))
ax.set_global()
air.plot.contourf(ax=ax, transform=ccrs.PlateCarree())
ax.coastlines()

plt.savefig('cartopy_example.png')

Here is the resulting image:

_images/cartopy_example.png

Details

Ways to Use

There are three ways to use the xarray plotting functionality:

  1. Use plot as a convenience method for a DataArray.
  2. Access a specific plotting method from the plot attribute of a DataArray.
  3. Directly from the xarray plot submodule.

These are provided for user convenience; they all call the same code.

In [66]: import xarray.plot as xplt

In [67]: da = xr.DataArray(range(5))

In [68]: fig, axes = plt.subplots(ncols=2, nrows=2)

In [69]: da.plot(ax=axes[0, 0])
Out[69]: [<matplotlib.lines.Line2D at 0x7f0c1d1cad10>]

In [70]: da.plot.line(ax=axes[0, 1])
Out[70]: [<matplotlib.lines.Line2D at 0x7f0c1d1cab50>]

In [71]: xplt.plot(da, ax=axes[1, 0])
Out[71]: [<matplotlib.lines.Line2D at 0x7f0c1d1d6b10>]

In [72]: xplt.line(da, ax=axes[1, 1])
Out[72]: [<matplotlib.lines.Line2D at 0x7f0c29ef1310>]

In [73]: plt.tight_layout()

In [74]: plt.show()
_images/plotting_ways_to_use.png

Here the output is the same. Since the data is 1 dimensional the line plot was used.

The convenience method xarray.DataArray.plot() dispatches to an appropriate plotting function based on the dimensions of the DataArray and whether the coordinates are sorted and uniformly spaced. This table describes what gets plotted:

Dimensions     Plotting function
1              xarray.plot.line()
2              xarray.plot.pcolormesh()
Anything else  xarray.plot.hist()

Coordinates

If you’d like to find out what’s really going on in the coordinate system, read on.

In [75]: a0 = xr.DataArray(np.zeros((4, 3, 2)), dims=('y', 'x', 'z'),
   ....:                   name='temperature')
   ....: 

In [76]: a0[0, 0, 0] = 1

In [77]: a = a0.isel(z=0)

In [78]: a
Out[78]: 
<xarray.DataArray 'temperature' (y: 4, x: 3)>
array([[ 1.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])
Coordinates:
  * y        (y) int64 0 1 2 3
  * x        (x) int64 0 1 2
    z        int64 0

The plot will produce an image corresponding to the values of the array. Hence the top left pixel will be a different color than the others. Before reading on, you may want to look at the coordinates and think carefully about what the limits, labels, and orientation for each of the axes should be.

In [79]: a.plot()
Out[79]: <matplotlib.collections.QuadMesh at 0x7f0c1d0cbd90>
_images/plotting_example_2d_simple.png

It may seem strange that the values on the y axis are decreasing with -0.5 on the top. This is because the pixels are centered over their coordinates, and the axis labels and ranges correspond to the values of the coordinates.

API reference

This page provides an auto-generated summary of xarray’s API. For more details and examples, refer to the relevant chapters in the main part of the documentation.

Top-level functions

align(*objects[, join, copy]) Given any number of Dataset and/or DataArray objects, returns new objects with aligned indexes.
broadcast(*args) Explicitly broadcast any number of DataArray or Dataset objects against one another.
concat(objs[, dim, data_vars, coords, ...]) Concatenate xarray objects along a new or existing dimension.
empty_like
set_options(**kwargs) Set global state within a controlled context

Dataset

Creating a dataset
Dataset([data_vars, coords, attrs, compat]) A multi-dimensional, in memory, array database.
decode_cf(obj[, concat_characters, ...]) Decode the given Dataset or Datastore according to CF conventions into a new Dataset.
Attributes
Dataset.dims Mapping from dimension names to lengths.
Dataset.data_vars Dictionary of xarray.DataArray objects corresponding to data variables
Dataset.coords Dictionary of xarray.DataArray objects corresponding to coordinate
Dataset.attrs Dictionary of global attributes on this dataset
Dictionary interface

Datasets implement the mapping interface with keys given by variable names and values given by DataArray objects.

Dataset.__getitem__(key) Access variables or coordinates of this dataset as a DataArray.
Dataset.__setitem__(key, value) Add an array to this dataset.
Dataset.__delitem__(key) Remove a variable from this dataset.
Dataset.update(other[, inplace]) Update this dataset’s variables with those from another dataset.
Dataset.iteritems(...)
Dataset.itervalues(...)
Dataset contents
Dataset.copy([deep]) Returns a copy of this dataset.
Dataset.assign(**kwargs) Assign new data variables to a Dataset, returning a new object with all the original variables in addition to the new ones.
Dataset.assign_coords(**kwargs) Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates.
Dataset.pipe(func, *args, **kwargs) Apply func(self, *args, **kwargs)
Dataset.merge(other[, inplace, ...]) Merge the arrays of two datasets into a single dataset.
Dataset.rename(name_dict[, inplace]) Returns a new object with renamed variables and dimensions.
Dataset.swap_dims(dims_dict[, inplace]) Returns a new object with swapped dimensions.
Dataset.drop(labels[, dim]) Drop variables or index labels from this dataset.
Dataset.set_coords(names[, inplace]) Given names of one or more variables, set them as coordinates
Dataset.reset_coords([names, drop, inplace]) Given names of coordinates, reset them to become variables
Comparisons
Dataset.equals(other) Two Datasets are equal if they have matching variables and coordinates, all of which are equal.
Dataset.identical(other) Like equals, but also checks all dataset attributes and the attributes on all variables and coordinates.
Dataset.broadcast_equals(other) Two Datasets are broadcast equal if they are equal after broadcasting all variables against each other.
Indexing
Dataset.loc Attribute for location based indexing.
Dataset.isel(**indexers) Returns a new dataset with each array indexed along the specified dimension(s).
Dataset.sel([method, tolerance]) Returns a new dataset with each array indexed by tick labels along the specified dimension(s).
Dataset.isel_points([dim]) Returns a new dataset with each array indexed pointwise along the specified dimension(s).
Dataset.sel_points([dim, method, tolerance]) Returns a new dataset with each array indexed pointwise by tick labels along the specified dimension(s).
Dataset.squeeze([dim]) Returns a new dataset with squeezed data.
Dataset.reindex([indexers, method, ...]) Conform this object onto a new set of indexes, filling in missing values with NaN.
Dataset.reindex_like(other[, method, ...]) Conform this object onto the indexes of another object, filling in missing values with NaN.
Computation
Dataset.apply(func[, keep_attrs, args]) Apply a function over the data variables in this dataset.
Dataset.reduce(func[, dim, keep_attrs, ...]) Reduce this dataset by applying func along some dimension(s).
Dataset.groupby(group[, squeeze]) Returns a GroupBy object for performing grouped operations.
Dataset.resample(freq, dim[, how, skipna, ...]) Resample this object to a new temporal resolution.
Dataset.diff(dim[, n, label]) Calculate the n-th order discrete difference along given axis.

Aggregation: all any argmax argmin max mean median min prod sum std var

Missing values: isnull notnull count dropna fillna where

ndarray methods: argsort clip conj conjugate imag round real T

Grouped operations: assign assign_coords first last fillna where

Reshaping and reorganizing
Dataset.transpose(*dims) Return a new Dataset object with all array dimensions transposed.
Dataset.stack(**dimensions) Stack any number of existing dimensions into a single new dimension.
Dataset.unstack(dim) Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions.
Dataset.shift(**shifts) Shift this dataset by an offset along one or more dimensions.
Dataset.roll(**shifts) Roll this dataset by an offset along one or more dimensions.

DataArray

DataArray(data[, coords, dims, name, attrs, ...]) N-dimensional array with labeled coordinates and dimensions.
Attributes
DataArray.values The array’s data as a numpy.ndarray
DataArray.data The array’s data as a dask or numpy array
DataArray.coords Dictionary-like container of coordinate arrays.
DataArray.dims Dimension names associated with this array.
DataArray.name The name of this array.
DataArray.attrs Dictionary storing arbitrary metadata with this array.
DataArray.encoding Dictionary of format-specific settings for how this array should be serialized.

ndarray attributes: ndim shape size dtype

DataArray contents
DataArray.assign_coords(**kwargs) Assign new coordinates to this object, returning a new object with all the original data in addition to the new coordinates.
DataArray.rename(new_name_or_name_dict) Returns a new DataArray with renamed coordinates and/or a new name.
DataArray.swap_dims(dims_dict) Returns a new DataArray with swapped dimensions.
DataArray.drop(labels[, dim]) Drop coordinates or index labels from this DataArray.
DataArray.reset_coords([names, drop, inplace]) Given names of coordinates, reset them to become variables.
DataArray.copy([deep]) Returns a copy of this array.

ndarray methods: astype item

Indexing
DataArray.__getitem__(key)
DataArray.__setitem__(key, value)
DataArray.loc Attribute for location based indexing like pandas.
DataArray.isel(**indexers) Return a new DataArray whose dataset is given by integer indexing along the specified dimension(s).
DataArray.sel([method, tolerance]) Return a new DataArray whose dataset is given by selecting index labels along the specified dimension(s).
DataArray.isel_points([dim]) Return a new DataArray whose dataset is given by pointwise integer indexing along the specified dimension(s).
DataArray.sel_points([dim, method, tolerance]) Return a new DataArray whose dataset is given by pointwise selection of index labels along the specified dimension(s).
DataArray.squeeze([dim]) Return a new DataArray object with squeezed data.
DataArray.reindex([method, tolerance, copy]) Conform this object onto a new set of indexes, filling in missing values with NaN.
DataArray.reindex_like(other[, method, ...]) Conform this object onto the indexes of another object, filling in missing values with NaN.
Comparisons
DataArray.equals(other) True if two DataArrays have the same dimensions, coordinates and values; otherwise False.
DataArray.identical(other) Like equals, but also checks the array name and attributes, and attributes on all coordinates.
DataArray.broadcast_equals(other) Two DataArrays are broadcast equal if they are equal after broadcasting them against each other such that they have the same dimensions.
Computation
DataArray.reduce(func[, dim, axis, keep_attrs]) Reduce this array by applying func along some dimension(s).
DataArray.groupby(group[, squeeze]) Returns a GroupBy object for performing grouped operations.
DataArray.rolling([min_periods, center]) Rolling window object.
DataArray.resample(freq, dim[, how, skipna, ...]) Resample this object to a new temporal resolution.
DataArray.get_axis_num(dim) Return axis number(s) corresponding to dimension(s) in this array.
DataArray.diff(dim[, n, label]) Calculate the n-th order discrete difference along given axis.
DataArray.dot(other) Perform dot product of two DataArrays along their shared dims.

Aggregation: all any argmax argmin max mean median min prod sum std var

Missing values: isnull notnull count dropna fillna where

ndarray methods: argsort clip conj conjugate imag searchsorted round real T

Grouped operations: assign_coords first last fillna where

Reshaping and reorganizing
DataArray.transpose(*dims) Return a new DataArray object with transposed dimensions.
DataArray.stack(**dimensions) Stack any number of existing dimensions into a single new dimension.
DataArray.unstack(dim) Unstack an existing dimension corresponding to a MultiIndex into multiple new dimensions.
DataArray.shift(**shifts) Shift this array by an offset along one or more dimensions.
DataArray.roll(**shifts) Roll this array by an offset along one or more dimensions.

Universal functions

These functions are copied from NumPy, but extended to work on NumPy arrays, dask arrays and all xarray objects. You can find them in the xarray.ufuncs module:

angle arccos arccosh arcsin arcsinh arctan arctan2 arctanh ceil conj copysign cos cosh deg2rad degrees exp expm1 fabs fix floor fmax fmin fmod frexp hypot imag iscomplex isfinite isinf isnan isreal ldexp log log10 log1p log2 logaddexp logaddexp2 logical_and logical_not logical_or logical_xor maximum minimum nextafter rad2deg radians real rint sign signbit sin sinh sqrt square tan tanh trunc

IO / Conversion

Dataset methods
open_dataset(filename_or_obj[, group, ...]) Load and decode a dataset from a file or file-like object.
open_mfdataset(paths[, chunks, concat_dim, ...]) Open multiple files as a single dataset.
Dataset.to_netcdf([path, mode, format, ...]) Write dataset contents to a netCDF file.
save_mfdataset(datasets, paths[, mode, ...]) Write multiple datasets to disk as netCDF files simultaneously.
Dataset.to_array([dim, name]) Convert this dataset into an xarray.DataArray
Dataset.to_dataframe() Convert this dataset into a pandas.DataFrame.
Dataset.from_dataframe(dataframe) Convert a pandas.DataFrame into an xarray.Dataset
Dataset.close() Close any files linked to this dataset
Dataset.load() Manually trigger loading of this dataset’s data from disk or a remote source into memory and return this dataset.
Dataset.chunk([chunks, name_prefix, token, lock]) Coerce all arrays in this dataset into dask arrays with the given chunks.
DataArray methods
DataArray.to_dataset([dim, name]) Convert a DataArray to a Dataset.
DataArray.to_pandas() Convert this array into a pandas object with the same shape.
DataArray.to_series() Convert this array into a pandas.Series.
DataArray.to_dataframe([name]) Convert this array and its coordinates into a tidy pandas.DataFrame.
DataArray.to_index() Convert this variable to a pandas.Index.
DataArray.to_masked_array([copy]) Convert this array into a numpy.ma.MaskedArray
DataArray.to_cdms2() Convert this array into a cdms2.Variable
DataArray.from_series(series) Convert a pandas.Series into an xarray.DataArray.
DataArray.from_cdms2(variable) Convert a cdms2.Variable into an xarray.DataArray
DataArray.load() Manually trigger loading of this array’s data from disk or a remote source into memory and return this array.
DataArray.chunk([chunks]) Coerce this array’s data into a dask array with the given chunks.
Backends (experimental)

These backends provide a low-level interface for lazily loading data from external file-formats or protocols, and can be manually invoked to create arguments for the from_store and dump_to_store Dataset methods.

backends.NetCDF4DataStore(filename[, mode, ...]) Store for reading and writing data via the Python-NetCDF4 library.
backends.H5NetCDFStore(filename[, mode, ...]) Store for reading and writing data via h5netcdf
backends.PydapDataStore(url) Store for accessing OpenDAP datasets with pydap.
backends.ScipyDataStore(filename_or_obj[, ...]) Store for reading and writing data via scipy.io.netcdf.

Plotting

plot.plot(darray[, row, col, col_wrap, ax, ...]) Default plot of DataArray using matplotlib.pyplot.
plot.contourf(darray[, x, y, ax, row, col, ...]) Filled contour plot of 2d DataArray
plot.contour(darray[, x, y, ax, row, col, ...]) Contour plot of 2d DataArray
plot.hist(darray[, ax]) Histogram of DataArray
plot.imshow(darray[, x, y, ax, row, col, ...]) Image plot of 2d DataArray using matplotlib.pyplot
plot.line(darray, *args, **kwargs) Line plot of 1 dimensional DataArray index against values
plot.pcolormesh(darray[, x, y, ax, row, ...]) Pseudocolor plot of 2d DataArray
plot.FacetGrid(data[, col, row, col_wrap, ...]) Initialize the matplotlib figure and FacetGrid object.

Frequently Asked Questions

Why is pandas not enough?

pandas, thanks to its unrivaled speed and flexibility, has emerged as the premier Python package for working with labeled arrays. So why are we contributing to further fragmentation in the ecosystem for working with data arrays in Python?

Sometimes, we really want to work with collections of higher dimensional arrays (ndim > 2), or arrays for which the order of dimensions (e.g., columns vs rows) shouldn’t really matter. For example, climate and weather data is often natively expressed in 4 or more dimensions: time, x, y and z.

Pandas does support N-dimensional panels, but the implementation is very limited:

  • You need to create a new factory type for each dimensionality.
  • You can’t do math between NDPanels with different dimensionality.
  • Each dimension in an NDPanel has a name (e.g., ‘labels’, ‘items’, ‘major_axis’, etc.) but the dimension names refer to order, not their meaning. You can’t specify an operation to be applied along the “time” axis.

Fundamentally, the N-dimensional panel is limited by its context in pandas’s tabular model, which treats a 2D DataFrame as a collection of 1D Series, a 3D Panel as a collection of 2D DataFrames, and so on. In my experience, it is usually easier to work with a DataFrame with a hierarchical index rather than to use higher dimensional (N > 3) data structures in pandas.

Another use case is handling collections of arrays with different numbers of dimensions. For example, suppose you have a 2D array and a handful of associated 1D arrays that share one of the same axes. Storing these in one pandas object is possible but awkward – you can either upcast all the 1D arrays to 2D and store everything in a Panel, or put everything in a DataFrame, where the first few columns have a different meaning than the other columns. In contrast, this sort of data structure fits very naturally in an xarray Dataset.

Pandas gets a lot of things right, but scientific users need fully multi-dimensional data structures.

How do xarray data structures differ from those found in pandas?

The main distinguishing feature of xarray’s DataArray over labeled arrays in pandas is that dimensions can have names (e.g., “time”, “latitude”, “longitude”). Names are much easier to keep track of than axis numbers, and xarray uses dimension names for indexing, aggregation and broadcasting. Not only can you write x.sel(time='2000-01-01') and x.mean(dim='time'), but operations like x - x.mean(dim='time') always work, no matter the order of the “time” dimension. You never need to reshape arrays (e.g., with np.newaxis) to align them for arithmetic operations in xarray.
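
For example, here is a minimal sketch of broadcasting by dimension name (the array values and dimension names are illustrative):

    import numpy as np
    import xarray as xr

    x = xr.DataArray(np.arange(6).reshape(2, 3), dims=('time', 'space'))

    # aggregate by dimension name rather than axis number
    anomaly = x - x.mean(dim='time')

    # the same expression works regardless of dimension order
    anomaly2 = x.transpose('space', 'time') - x.mean(dim='time')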

Should I use xarray instead of pandas?

It’s not an either/or choice! xarray provides robust support for converting back and forth between the tabular data-structures of pandas and its own multi-dimensional data-structures.
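
For instance, here is a minimal sketch of a round trip through pandas (the values and index name are illustrative):

    import pandas as pd
    import xarray as xr

    series = pd.Series([1, 2, 3], index=pd.Index([10, 20, 30], name='x'))

    # pandas -> xarray: the index name becomes the dimension name
    da = xr.DataArray.from_series(series)

    # xarray -> pandas
    round_tripped = da.to_series()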

That said, you should only bother with xarray if some aspect of your data is fundamentally multi-dimensional. If your data is unstructured or one-dimensional, stick with pandas, which is a more developed toolkit for doing data analysis in Python.

What is your approach to metadata?

We are firm believers in the power of labeled data! In addition to dimensions and coordinates, xarray supports arbitrary metadata in the form of global (Dataset) and variable specific (DataArray) attributes (attrs).

Automatic interpretation of labels is powerful but also reduces flexibility. With xarray, we draw a firm line between labels that the library understands (dims and coords) and labels for users and user code (attrs). For example, we do not automatically interpret and enforce units or CF conventions. (An exception is serialization to and from netCDF files.)

An implication of this choice is that we do not propagate attrs through most operations unless explicitly flagged (some methods have a keep_attrs option). Similarly, xarray does not check for conflicts between attrs when combining arrays and datasets, unless explicitly requested with the option compat='identical'. The guiding principle is that metadata should not be allowed to get in the way.
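
As a minimal sketch of these defaults (the attribute values are illustrative):

    import xarray as xr

    da = xr.DataArray([1.0, 2.0, 3.0], dims='x', attrs={'units': 'kelvin'})

    da.mean().attrs                 # empty: attrs are dropped by default
    da.mean(keep_attrs=True).attrs  # {'units': 'kelvin'}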

What’s New

v0.7.2 (13 March 2016)

This release includes two new, entirely backwards compatible features and several bug fixes.

Enhancements
  • New DataArray method DataArray.dot() for calculating the dot product of two DataArrays along shared dimensions. By Dean Pospisil.
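
    A minimal sketch of what this enables (values illustrative): dot sums over the dimensions the two arrays share.

    import numpy as np
    import xarray as xr

    a = xr.DataArray(np.arange(6).reshape(2, 3), dims=('x', 'y'))
    b = xr.DataArray(np.arange(3), dims='y')

    a.dot(b)  # sums over the shared dimension 'y': dims ('x',), values [5, 14]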

  • Rolling window operations on DataArray objects are now supported via a new DataArray.rolling() method. For example:

    In [1]: import xarray as xr; import numpy as np
    
    In [2]: arr = xr.DataArray(np.arange(0, 7.5, 0.5).reshape(3, 5),
                               dims=('x', 'y'))
    
    In [3]: arr
    Out[3]: 
    <xarray.DataArray (x: 3, y: 5)>
    array([[ 0. ,  0.5,  1. ,  1.5,  2. ],
           [ 2.5,  3. ,  3.5,  4. ,  4.5],
           [ 5. ,  5.5,  6. ,  6.5,  7. ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4
    
    In [4]: arr.rolling(y=3, min_periods=2).mean()
    Out[4]: 
    <xarray.DataArray (x: 3, y: 5)>
    array([[  nan,  0.25,  0.5 ,  1.  ,  1.5 ],
           [  nan,  2.75,  3.  ,  3.5 ,  4.  ],
           [  nan,  5.25,  5.5 ,  6.  ,  6.5 ]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1 2 3 4
    

    See Rolling window operations for more details. By Joe Hamman.

Bug fixes
  • Fixed an issue where plots using pcolormesh and Cartopy axes were being distorted by the inference of the axis interval breaks. This change chooses not to modify the coordinate variables when the axes have the attribute projection, allowing Cartopy to handle the extent of pcolormesh plots (GH781). By Joe Hamman.
  • 2D plots now better handle additional coordinates which are not DataArray dimensions (GH788). By Fabien Maussion.

v0.7.1 (16 February 2016)

This is a bug fix release that includes two small, backwards compatible enhancements. We recommend that all users upgrade.

Enhancements
  • Numerical operations now return empty objects when there are no overlapping labels, rather than raising ValueError (GH739).
  • Series is now supported as a valid input to the Dataset constructor (GH740).
Bug fixes
  • Restore checks for shape consistency between data and coordinates in the DataArray constructor (GH758).
  • Single dimension variables no longer transpose as part of a broader .transpose. This behavior was causing pandas.PeriodIndex dimensions to lose their type (GH749).
  • Dataset labels remain as their native type on .to_dataset. Previously they were coerced to strings (GH745).
  • Fixed a bug where replacing a DataArray index coordinate would improperly align the coordinate (GH725).
  • DataArray.reindex_like now maintains the dtype of complex numbers when reindexing leads to NaN values (GH738).
  • Dataset.rename and DataArray.rename support the old and new names being the same (GH724).
  • Fix from_dataframe() for DataFrames with a Categorical column and a MultiIndex index (GH737).
  • Fixes to ensure xarray works properly after the upcoming pandas v0.18 and NumPy v1.11 releases.
Acknowledgments

The following individuals contributed to this release:

  • Edward Richards
  • Maximilian Roos
  • Rafael Guedes
  • Spencer Hill
  • Stephan Hoyer

v0.7.0 (21 January 2016)

This major release includes redesign of DataArray internals, as well as new methods for reshaping, rolling and shifting data. It includes preliminary support for pandas.MultiIndex, as well as a number of other features and bug fixes, several of which offer improved compatibility with pandas.

New name

The project formerly known as “xray” is now “xarray”, pronounced “x-array”! This avoids a namespace conflict with the entire field of x-ray science. Renaming our project seemed like the right thing to do, especially because some scientists who work with actual x-rays are interested in using this project in their work. Thanks for your understanding and patience in this transition. You can now find our documentation and code repository at new URLs.

To ease the transition, we have simultaneously released v0.7.0 of both xray and xarray on the Python Package Index. These packages are identical. For now, import xray still works, except it issues a deprecation warning. This will be the last xray release. Going forward, we recommend switching your import statements to import xarray as xr.

Breaking changes
  • The internal data model used by DataArray has been rewritten to fix several outstanding issues (GH367, GH634, this stackoverflow report). Internally, DataArray is now implemented in terms of ._variable and ._coords attributes instead of holding variables in a Dataset object.

    This refactor ensures that if a DataArray has the same name as one of its coordinates, the array and the coordinate no longer share the same data.

    In practice, this means that creating a DataArray with the same name as one of its dimensions no longer automatically uses that array to label the corresponding coordinate. You will now need to provide coordinate labels explicitly. Here’s the old behavior:

    In [5]: xray.DataArray([4, 5, 6], dims='x', name='x')
    Out[5]: 
    <xray.DataArray 'x' (x: 3)>
    array([4, 5, 6])
    Coordinates:
      * x        (x) int64 4 5 6
    

    and the new behavior (compare the values of the x coordinate):

    In [6]: xray.DataArray([4, 5, 6], dims='x', name='x')
    Out[6]: 
    <xray.DataArray 'x' (x: 3)>
    array([4, 5, 6])
    Coordinates:
      * x        (x) int64 0 1 2
    
  • It is no longer possible to convert a DataArray to a Dataset with xray.DataArray.to_dataset() if it is unnamed. This will now raise ValueError. If the array is unnamed, you need to supply the name argument.
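
    A minimal sketch of the new requirement (the name 'foo' is illustrative):

    import xray

    arr = xray.DataArray([1, 2, 3], dims='x')

    # arr.to_dataset() would now raise ValueError, since arr is unnamed
    ds = arr.to_dataset(name='foo')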

Enhancements
  • Basic support for MultiIndex coordinates on xray objects, including indexing, stack() and unstack():

    In [7]: df = pd.DataFrame({'foo': range(3),
       ...:                    'x': ['a', 'b', 'b'],
       ...:                    'y': [0, 0, 1]})
       ...: 
    
    In [8]: s = df.set_index(['x', 'y'])['foo']
    
    In [9]: arr = xray.DataArray(s, dims='z')
    
    In [10]: arr
    Out[10]: 
    <xray.DataArray 'foo' (z: 3)>
    array([0, 1, 2])
    Coordinates:
      * z        (z) object ('a', 0) ('b', 0) ('b', 1)
    
    In [11]: arr.indexes['z']
    Out[11]: 
    MultiIndex(levels=[[u'a', u'b'], [0, 1]],
               labels=[[0, 1, 1], [0, 0, 1]],
               names=[u'x', u'y'])
    
    In [12]: arr.unstack('z')
    Out[12]: 
    <xray.DataArray 'foo' (x: 2, y: 2)>
    array([[  0.,  nan],
           [  1.,   2.]])
    Coordinates:
      * x        (x) object 'a' 'b'
      * y        (y) int64 0 1
    
    In [13]: arr.unstack('z').stack(z=('x', 'y'))
    Out[13]: 
    <xray.DataArray 'foo' (z: 4)>
    array([  0.,  nan,   1.,   2.])
    Coordinates:
      * z        (z) object ('a', 0) ('a', 1) ('b', 0) ('b', 1)
    

    See Stack and unstack for more details.

    Warning

xray’s MultiIndex support is still experimental, and we have a long to-do list of desired additions (GH719), including better display of multi-index levels when printing a Dataset, and support for saving datasets with a MultiIndex to a netCDF file. User contributions in this area would be greatly appreciated.

  • Support for reading GRIB, HDF4 and other file formats via PyNIO. See Formats supported by PyNIO for more details.

  • Better error message when a variable is supplied with the same name as one of its dimensions.

  • Plotting: more control on colormap parameters (GH642). vmin and vmax will not be silently ignored anymore. Setting center=False prevents automatic selection of a divergent colormap.

  • New shift() and roll() methods for shifting/rotating datasets or arrays along a dimension:

    In [14]: array = xray.DataArray([5, 6, 7, 8], dims='x')
    
    In [15]: array.shift(x=2)
    Out[15]: 
    <xarray.DataArray (x: 4)>
    array([ nan,  nan,   5.,   6.])
    Coordinates:
      * x        (x) int64 0 1 2 3
    
    In [16]: array.roll(x=2)
    Out[16]: 
    <xarray.DataArray (x: 4)>
    array([7, 8, 5, 6])
    Coordinates:
      * x        (x) int64 2 3 0 1
    

    Notice that shift moves data independently of coordinates, but roll moves both data and coordinates.

  • Assigning a pandas object directly as a Dataset variable is now permitted. Its index names correspond to the dims of the Dataset, and its data is aligned.

  • Passing a pandas.DataFrame or pandas.Panel to a Dataset constructor is now permitted.
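
    A minimal sketch of the assignment case (the variable and index names are illustrative):

    import pandas as pd
    import xray

    s = pd.Series([1.0, 2.0, 3.0], index=pd.Index(['a', 'b', 'c'], name='x'))

    ds = xray.Dataset({'x': ['a', 'b', 'c']})
    ds['foo'] = s  # the index name 'x' matches the Dataset dim; data is aligned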

  • New function broadcast() for explicitly broadcasting DataArray and Dataset objects against each other. For example:

    In [17]: a = xray.DataArray([1, 2, 3], dims='x')
    
    In [18]: b = xray.DataArray([5, 6], dims='y')
    
    In [19]: a
    Out[19]: 
    <xarray.DataArray (x: 3)>
    array([1, 2, 3])
    Coordinates:
      * x        (x) int64 0 1 2
    
    In [20]: b
    Out[20]: 
    <xarray.DataArray (y: 2)>
    array([5, 6])
    Coordinates:
      * y        (y) int64 0 1
    
    In [21]: a2, b2 = xray.broadcast(a, b)
    
    In [22]: a2
    Out[22]: 
    <xarray.DataArray (x: 3, y: 2)>
    array([[1, 1],
           [2, 2],
           [3, 3]])
    Coordinates:
      * x        (x) int64 0 1 2
      * y        (y) int64 0 1
    
    In [23]: b2
    Out[23]: 
    <xarray.DataArray (x: 3, y: 2)>
    array([[5, 6],
           [5, 6],
           [5, 6]])
    Coordinates:
      * y        (y) int64 0 1
      * x        (x) int64 0 1 2
    
Bug fixes
  • Fixes for several issues found on DataArray objects with the same name as one of their coordinates (see Breaking changes for more details).
  • DataArray.to_masked_array always returns a masked array with mask being an array (not a scalar value) (GH684).
  • Allows for (imperfect) repr of Coords when the underlying index is PeriodIndex (GH645).
  • Attempting to assign a Dataset or DataArray variable/attribute using attribute-style syntax (e.g., ds.foo = 42) now raises an error rather than silently failing (GH656, GH714).
  • You can now pass pandas objects with non-numpy dtypes (e.g., categorical or datetime64 with a timezone) into xray without an error (GH716).
Acknowledgments

The following individuals contributed to this release:

  • Antony Lee
  • Fabien Maussion
  • Joe Hamman
  • Maximilian Roos
  • Stephan Hoyer
  • Takeshi Kanmae
  • femtotrader

v0.6.1 (21 October 2015)

This release contains a number of bug and compatibility fixes, as well as enhancements to plotting, indexing and writing files to disk.

Note that the minimum required version of dask for use with xray is now version 0.6.

API Changes
  • The handling of colormaps and discrete color lists for 2D plots in plot() was changed to provide more compatibility with matplotlib’s contour and contourf functions (GH538). Now discrete lists of colors should be specified using the colors keyword, rather than cmap.
Enhancements
  • Faceted plotting through FacetGrid and the plot() method. See Faceting for more details and examples.

  • sel() and reindex() now support the tolerance argument for controlling nearest-neighbor selection (GH629):

    In [24]: array = xray.DataArray([1, 2, 3], dims='x')
    
    In [25]: array.reindex(x=[0.9, 1.5], method='nearest', tolerance=0.2)
    Out[25]: 
    <xray.DataArray (x: 2)>
    array([  2.,  nan])
    Coordinates:
      * x        (x) float64 0.9 1.5
    

    This feature requires pandas v0.17 or newer.

  • New encoding argument in to_netcdf() for writing netCDF files with compression, as described in the new documentation section on Writing encoded data.
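
    For example, a minimal sketch with the netCDF4 backend (the variable name, file name and compression level are illustrative):

    import xray

    ds = xray.Dataset({'temperature': ('x', [15.0, 16.5, 14.2])})
    ds.to_netcdf('compressed.nc',
                 encoding={'temperature': {'zlib': True, 'complevel': 4}})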

  • Add real and imag attributes to Dataset and DataArray (GH553).
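
    A minimal sketch (values illustrative):

    import numpy as np
    import xray

    da = xray.DataArray(np.array([1 + 2j, 3 + 4j]), dims='x')
    da.real  # DataArray of the real parts: [1., 3.]
    da.imag  # DataArray of the imaginary parts: [2., 4.]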

  • More informative error message with from_dataframe() if the frame has duplicate columns.

  • xray now uses deterministic names for dask arrays it creates or opens from disk. This allows xray users to take advantage of dask’s nascent support for caching intermediate computation results. See GH555 for an example.

Bug fixes
  • Forwards compatibility with the latest pandas release (v0.17.0). We were using some internal pandas routines for datetime conversion, which unfortunately have now changed upstream (GH569).
  • Aggregation functions now correctly skip NaN for data with complex128 dtype (GH554).
  • Fixed indexing 0d arrays with unicode dtype (GH568).
  • DataArray.name and Dataset keys must be a string or None to be written to netCDF (GH533).
  • where() now uses dask instead of numpy if either the array or other is a dask array. Previously, if other was a numpy array the method was evaluated eagerly.
  • Global attributes are now handled more consistently when loading remote datasets using engine='pydap' (GH574).
  • It is now possible to assign to the .data attribute of DataArray objects.
  • coordinates attribute is now kept in the encoding dictionary after decoding (GH610).
  • Compatibility with numpy 1.10 (GH617).
Acknowledgments

The following individuals contributed to this release:

  • Ryan Abernathey
  • Pete Cable
  • Clark Fitzgerald
  • Joe Hamman
  • Stephan Hoyer
  • Scott Sinclair

v0.6.0 (21 August 2015)

This release includes numerous bug fixes and enhancements. Highlights include the introduction of a plotting module and the new Dataset and DataArray methods isel_points(), sel_points(), where() and diff(). There are no breaking changes from v0.5.2.

Enhancements
  • Plotting methods have been implemented on DataArray objects via plot(), through integration with matplotlib (GH185). For an introduction, see Plotting.

  • Variables in netCDF files with multiple missing values are now decoded as NaN, after issuing a warning, if open_dataset is called with mask_and_scale=True.

  • We clarified our rules for when the result from an xray operation is a copy vs. a view (see Copies vs. views for more details).

  • Dataset variables are now written to netCDF files in order of appearance when using the netcdf4 backend (GH479).

  • Added isel_points() and sel_points() to support pointwise indexing of Datasets and DataArrays (GH475).

    In [26]: da = xray.DataArray(np.arange(56).reshape((7, 8)),
       ....:                     coords={'x': list('abcdefg'),
       ....:                             'y': 10 * np.arange(8)},
       ....:                     dims=['x', 'y'])
       ....: 
    
    In [27]: da
    Out[27]: 
    <xray.DataArray (x: 7, y: 8)>
    array([[ 0,  1,  2,  3,  4,  5,  6,  7],
           [ 8,  9, 10, 11, 12, 13, 14, 15],
           [16, 17, 18, 19, 20, 21, 22, 23],
           [24, 25, 26, 27, 28, 29, 30, 31],
           [32, 33, 34, 35, 36, 37, 38, 39],
           [40, 41, 42, 43, 44, 45, 46, 47],
           [48, 49, 50, 51, 52, 53, 54, 55]])
    Coordinates:
    * y        (y) int64 0 10 20 30 40 50 60 70
    * x        (x) |S1 'a' 'b' 'c' 'd' 'e' 'f' 'g'
    
    # we can index by position along each dimension
    In [28]: da.isel_points(x=[0, 1, 6], y=[0, 1, 0], dim='points')
    Out[28]: 
    <xray.DataArray (points: 3)>
    array([ 0,  9, 48])
    Coordinates:
        y        (points) int64 0 10 0
        x        (points) |S1 'a' 'b' 'g'
      * points   (points) int64 0 1 2
    
    # or equivalently by label
    In [29]: da.sel_points(x=['a', 'b', 'g'], y=[0, 10, 0], dim='points')
    Out[29]: 
    <xray.DataArray (points: 3)>
    array([ 0,  9, 48])
    Coordinates:
        y        (points) int64 0 10 0
        x        (points) |S1 'a' 'b' 'g'
      * points   (points) int64 0 1 2
    
  • New where() method for masking xray objects according to some criteria. This works particularly well with multi-dimensional data:

    In [30]: ds = xray.Dataset(coords={'x': range(100), 'y': range(100)})
    
    In [31]: ds['distance'] = np.sqrt(ds.x ** 2 + ds.y ** 2)
    
    In [32]: ds.distance.where(ds.distance < 100).plot()
    Out[32]: <matplotlib.collections.QuadMesh at 0x7f0c2920c5d0>
    
    _images/where_example.png
  • Added new methods DataArray.diff and Dataset.diff for finite difference calculations along a given axis.

  • New to_masked_array() convenience method for returning a numpy.ma.MaskedArray.

    In [33]: da = xray.DataArray(np.random.random_sample(size=(5, 4)))
    
    In [34]: da.where(da < 0.5)
    Out[34]: 
    <xarray.DataArray (dim_0: 5, dim_1: 4)>
    array([[ 0.127,    nan,  0.26 ,    nan],
           [ 0.377,  0.336,  0.451,    nan],
           [ 0.123,    nan,  0.373,  0.448],
           [ 0.129,    nan,    nan,  0.352],
           [ 0.229,    nan,    nan,  0.138]])
    Coordinates:
      * dim_0    (dim_0) int64 0 1 2 3 4
      * dim_1    (dim_1) int64 0 1 2 3
    
    In [35]: da.where(da < 0.5).to_masked_array(copy=True)
    Out[35]: 
    masked_array(data =
     [[0.12696983303810094 -- 0.26047600586578334 --]
     [0.37674971618967135 0.33622174433445307 0.45137647047539964 --]
     [0.12310214428849964 -- 0.37301222522143085 0.4479968246859435]
     [0.12944067971751294 -- -- 0.35205353914802473]
     [0.2288873043216132 -- -- 0.1375535565632705]],
                 mask =
     [[False  True False  True]
     [False False False  True]
     [False  True False False]
     [False  True  True False]
     [False  True  True False]],
           fill_value = 1e+20)
    
  • Added new flag “drop_variables” to open_dataset() for excluding variables from being parsed. This may be useful to drop variables with problems or inconsistent values.
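
    A minimal sketch (the file and variable names are illustrative):

    import xray

    ds = xray.open_dataset('example.nc', drop_variables=['bad_var'])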

Bug fixes
  • Fixed aggregation functions (e.g., sum and mean) on big-endian arrays when bottleneck is installed (GH489).
  • Dataset aggregation functions dropped variables with unsigned integer dtype (GH505).
  • .any() and .all() were not lazy when used on xray objects containing dask arrays.
  • Fixed an error when attempting to save datetime64 variables to netCDF files when the first element is NaT (GH528).
  • Fix pickle on DataArray objects (GH515).
  • Fixed unnecessary coercion of float64 to float32 when using netcdf3 and netcdf4_classic formats (GH526).

v0.5.2 (16 July 2015)

This release contains bug fixes, several additional options for opening and saving netCDF files, and a backwards incompatible rewrite of the advanced options for xray.concat.

Backwards incompatible changes
  • The optional arguments concat_over and mode in concat() have been removed and replaced by data_vars and coords. The new arguments are both more easily understood and more robustly implemented, and allowed us to fix a bug where concat accidentally loaded data into memory. If you set values for these optional arguments manually, you will need to update your code. The default behavior should be unchanged.
Enhancements
  • open_mfdataset() now supports a preprocess argument for preprocessing datasets prior to concatenation. This is useful if datasets cannot be otherwise merged automatically, e.g., if the original datasets have conflicting index coordinates (GH443).
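
    A minimal sketch (the file pattern, function name and coordinate name are illustrative):

    import xray

    def cleanup(ds):
        # e.g., drop a conflicting coordinate before concatenation
        return ds.drop('station_name')

    ds = xray.open_mfdataset('data/*.nc', preprocess=cleanup)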

  • open_dataset() and open_mfdataset() now use a global thread lock by default for reading from netCDF files with dask. This avoids possible segmentation faults for reading from netCDF4 files when HDF5 is not configured properly for concurrent access (GH444).

  • Added support for serializing arrays of complex numbers with engine='h5netcdf'.

  • The new save_mfdataset() function allows for saving multiple datasets to disk simultaneously. This is useful when processing large datasets with dask.array. For example, to save a dataset too big to fit into memory to one file per year, we could write:

    In [36]: years, datasets = zip(*ds.groupby('time.year'))
    
    In [37]: paths = ['%s.nc' % y for y in years]
    
    In [38]: xray.save_mfdataset(datasets, paths)
    
Bug fixes
  • Fixed min, max, argmin and argmax for arrays with string or unicode types (GH453).
  • open_dataset() and open_mfdataset() support supplying chunks as a single integer.
  • Fixed a bug in serializing scalar datetime variables to netCDF.
  • Fixed a bug that could occur in serialization of 0-dimensional integer arrays.
  • Fixed a bug where concatenating DataArrays was not always lazy (GH464).
  • When reading datasets with h5netcdf, bytes attributes are decoded to strings. This allows conventions decoding to work properly on Python 3 (GH451).

v0.5.1 (15 June 2015)

This minor release fixes a few bugs and an inconsistency with pandas. It also adds the pipe method, copied from pandas.

Enhancements
  • Added pipe(), replicating the new pandas method in version 0.16.2. See Transforming datasets for more details.
  • assign() and assign_coords() now assign new variables in sorted (alphabetical) order, mirroring the behavior in pandas. Previously, the order was arbitrary.
Bug fixes
  • Fixed an edge case in xray.concat involving identical coordinate variables (GH425).
  • We now decode variables loaded from netCDF3 files with the scipy engine using native endianness (GH416). This resolves an issue when aggregating these arrays with bottleneck installed.

v0.5 (1 June 2015)

Highlights

The headline feature in this release is experimental support for out-of-core computing (data that doesn’t fit into memory) with dask. This includes a new top-level function open_mfdataset() that makes it easy to open a collection of netCDF files (using dask) as a single xray.Dataset object. For more on dask, read the blog post introducing xray + dask and the new documentation section Out of core computation with dask.

Dask makes it possible to harness parallelism and manipulate gigantic datasets with xray. It is currently an optional dependency, but it may become required in the future.

Backwards incompatible changes
  • The logic used for choosing which variables are concatenated with concat() has changed. Previously, by default any variables which were equal across a dimension were not concatenated. This led to some surprising behavior, where the behavior of groupby and concat operations could depend on runtime values (GH268). For example:

    In [39]: ds = xray.Dataset({'x': 0})
    
    In [40]: xray.concat([ds, ds], dim='y')
    Out[40]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        x        int64 0
    

    Now, the default always concatenates data variables:

    In [41]: xray.concat([ds, ds], dim='y')
    Out[41]: 
    <xarray.Dataset>
    Dimensions:  (y: 2)
    Coordinates:
      * y        (y) int64 0 1
    Data variables:
        x        (y) int64 0 0
    

    To obtain the old behavior, supply the argument concat_over=[].

Enhancements
  • New to_array() and enhanced to_dataset() methods make it easy to switch back and forth between arrays and datasets:

    In [42]: ds = xray.Dataset({'a': 1, 'b': ('x', [1, 2, 3])},
       ....:                   coords={'c': 42}, attrs={'Conventions': 'None'})
       ....: 
    
    In [43]: ds.to_array()
    Out[43]: 
    <xarray.DataArray (variable: 2, x: 3)>
    array([[1, 1, 1],
           [1, 2, 3]])
    Coordinates:
      * variable  (variable) |S1 'a' 'b'
      * x         (x) int64 0 1 2
        c         int64 42
    Attributes:
        Conventions: None
    
    In [44]: ds.to_array().to_dataset(dim='variable')
    Out[44]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
        c        int64 42
    Data variables:
        a        (x) int64 1 1 1
        b        (x) int64 1 2 3
    Attributes:
        Conventions: None
    
  • New fillna() method to fill missing values, modeled off the pandas method of the same name:

    In [45]: array = xray.DataArray([np.nan, 1, np.nan, 3], dims='x')
    
    In [46]: array.fillna(0)
    Out[46]: 
    <xarray.DataArray (x: 4)>
    array([ 0.,  1.,  0.,  3.])
    Coordinates:
      * x        (x) int64 0 1 2 3
    

    fillna works on both Dataset and DataArray objects, and uses index based alignment and broadcasting like standard binary operations. It also can be applied by group, as illustrated in Fill missing values with climatology.

  • New assign() and assign_coords() methods patterned off the new DataFrame.assign method in pandas:

    In [47]: ds = xray.Dataset({'y': ('x', [1, 2, 3])})
    
    In [48]: ds.assign(z = lambda ds: ds.y ** 2)
    Out[48]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
    Data variables:
        y        (x) int64 1 2 3
        z        (x) int64 1 4 9
    
    In [49]: ds.assign_coords(z = ('x', ['a', 'b', 'c']))
    Out[49]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
        z        (x) |S1 'a' 'b' 'c'
    Data variables:
        y        (x) int64 1 2 3
    

    These methods return a new Dataset (or DataArray) with updated data or coordinate variables.

  • sel() now supports the method parameter, which works like the parameter of the same name on reindex(). It provides a simple interface for doing nearest-neighbor interpolation:

    In [50]: ds.sel(x=1.1, method='nearest')
    Out[50]: 
    <xray.Dataset>
    Dimensions:  ()
    Coordinates:
        x        int64 1
    Data variables:
        y        int64 2
    
    In [51]: ds.sel(x=[1.1, 2.1], method='pad')
    Out[51]: 
    <xray.Dataset>
    Dimensions:  (x: 2)
    Coordinates:
      * x        (x) int64 1 2
    Data variables:
        y        (x) int64 2 3
    

    See Nearest neighbor lookups for more details.

  • You can now control the underlying backend used for accessing remote datasets (via OPeNDAP) by specifying engine='netcdf4' or engine='pydap'.

  • xray now provides experimental support for reading and writing netCDF4 files directly via h5py with the h5netcdf package, avoiding the netCDF4-Python package. You will need to install h5netcdf and specify engine='h5netcdf' to try this feature.

  • Accessing data from remote datasets now has retrying logic (with exponential backoff) that should make it robust to occasional bad responses from DAP servers.

  • You can control the width of the Dataset repr with xray.set_options. It can be used either as a context manager, in which case the default is restored outside the context:

    In [52]: ds = xray.Dataset({'x': np.arange(1000)})
    
    In [53]: with xray.set_options(display_width=40):
       ....:     print(ds)
       ....: 
    <xarray.Dataset>
    Dimensions:  (x: 1000)
    Coordinates:
      * x        (x) int64 0 1 2 3 4 5 6 ...
    Data variables:
        *empty*
    

    Or to set a global option:

    In [54]: xray.set_options(display_width=80)
    

    The default value for the display_width option is 80.

Deprecations
  • The method load_data() has been renamed to the more succinct load().

v0.4.1 (18 March 2015)

The release contains bug fixes and several new features. All changes should be fully backwards compatible.

Enhancements
  • New documentation sections on Time series data and Combining multiple files.

  • resample() lets you resample a dataset or data array to a new temporal resolution. The syntax is the same as pandas, except you need to supply the time dimension explicitly:

    In [55]: time = pd.date_range('2000-01-01', freq='6H', periods=10)
    
    In [56]: array = xray.DataArray(np.arange(10), [('time', time)])
    
    In [57]: array.resample('1D', dim='time')
    Out[57]: 
    <xarray.DataArray (time: 3)>
    array([ 1.5,  5.5,  8.5])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    

    You can specify how to do the resampling with the how argument; other options such as closed and label let you control labeling:

    In [58]: array.resample('1D', dim='time', how='sum', label='right')
    Out[58]: 
    <xarray.DataArray (time: 3)>
    array([ 6, 22, 17])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-02 2000-01-03 2000-01-04
    

    If the desired temporal resolution is higher than the original data (upsampling), xray will insert missing values:

    In [59]: array.resample('3H', 'time')
    Out[59]: 
    <xarray.DataArray (time: 19)>
    array([  0.,  nan,   1., ...,   8.,  nan,   9.])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-01T03:00:00 ...
    
  • first and last methods on groupby objects let you take the first or last examples from each group along the grouped axis:

    In [60]: array.groupby('time.day').first()
    Out[60]: 
    <xarray.DataArray (day: 3)>
    array([0, 4, 8])
    Coordinates:
      * day      (day) int64 1 2 3
    

    These methods combine well with resample:

    In [61]: array.resample('1D', dim='time', how='first')
    Out[61]: 
    <xarray.DataArray (time: 3)>
    array([0, 4, 8])
    Coordinates:
      * time     (time) datetime64[ns] 2000-01-01 2000-01-02 2000-01-03
    
  • swap_dims() allows for easily swapping one dimension out for another:

    In [62]: ds = xray.Dataset({'x': range(3), 'y': ('x', list('abc'))})
    
    In [63]: ds
    Out[63]: 
    <xarray.Dataset>
    Dimensions:  (x: 3)
    Coordinates:
      * x        (x) int64 0 1 2
    Data variables:
        y        (x) |S1 'a' 'b' 'c'
    
    In [64]: ds.swap_dims({'x': 'y'})
    Out[64]: 
    <xarray.Dataset>
    Dimensions:  (y: 3)
    Coordinates:
      * y        (y) |S1 'a' 'b' 'c'
        x        (y) int64 0 1 2
    Data variables:
        *empty*
    

    This was possible in earlier versions of xray, but required some contortions.

  • open_dataset() and to_netcdf() now accept an engine argument to explicitly select which underlying library (netcdf4 or scipy) is used for reading/writing a netCDF file.
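
    A minimal sketch (the file names are illustrative):

    import xray

    ds = xray.open_dataset('example.nc', engine='scipy')  # netCDF3 via scipy
    ds.to_netcdf('copy.nc', engine='netcdf4')             # write via netCDF4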

Bug fixes
  • Fixed a bug where netCDF data variables read from disk with engine='scipy' could still be associated with the file on disk, even after closing the file (GH341). This manifested itself in warnings about mmapped arrays and segmentation faults (if the data was accessed).
  • Silenced spurious warnings about all-NaN slices when using nan-aware aggregation methods (GH344).
  • Dataset aggregations with keep_attrs=True now preserve attributes on data variables, not just the dataset itself.
  • Tests for xray now pass when run on Windows (GH360).
  • Fixed a regression in v0.4 where saving to netCDF could fail with the error ValueError: could not automatically determine time units.

v0.4 (2 March, 2015)

This is one of the biggest releases yet for xray: it includes some major changes that may break existing code, along with the usual collection of minor enhancements and bug fixes. On the plus side, this release includes all hitherto planned breaking changes, so the upgrade path for xray should be smoother going forward.

Breaking changes
  • We now automatically align index labels in arithmetic, dataset construction, merging and updating. This means the need for manually invoking methods like align() and reindex_like() should be vastly reduced.

    For arithmetic, we align based on the intersection of labels:

    In [65]: lhs = xray.DataArray([1, 2, 3], [('x', [0, 1, 2])])
    
    In [66]: rhs = xray.DataArray([2, 3, 4], [('x', [1, 2, 3])])
    
    In [67]: lhs + rhs
    Out[67]: 
    <xarray.DataArray (x: 2)>
    array([4, 6])
    Coordinates:
      * x        (x) int64 1 2
    

    For dataset construction and merging, we align based on the union of labels:

    In [68]: xray.Dataset({'foo': lhs, 'bar': rhs})
    Out[68]: 
    <xarray.Dataset>
    Dimensions:  (x: 4)
    Coordinates:
      * x        (x) int64 0 1 2 3
    Data variables:
        foo      (x) float64 1.0 2.0 3.0 nan
        bar      (x) float64 nan 2.0 3.0 4.0
    

    For update and __setitem__, we align based on the original object:

    In [69]: lhs.coords['rhs'] = rhs
    
    In [70]: lhs
    Out[70]: 
    <xarray.DataArray (x: 3)>
    array([1, 2, 3])
    Coordinates:
      * x        (x) int64 0 1 2
        rhs      (x) float64 nan 2.0 3.0
    
  • Aggregations like mean or median now skip missing values by default:

    In [71]: xray.DataArray([1, 2, np.nan, 3]).mean()
    Out[71]: 
    <xarray.DataArray ()>
    array(2.0)
    

    You can turn this behavior off by supplying the keyword argument skipna=False.

    These operations are lightning fast thanks to integration with bottleneck, which is a new optional dependency for xray (numpy is used if bottleneck is not installed).

  • Scalar coordinates no longer conflict with constant arrays with the same value (e.g., in arithmetic, merging datasets and concat), even if they have different shape (GH243). For example, the coordinate c here persists through arithmetic, even though it has different shapes on each DataArray:

    In [72]: a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')
    
    In [73]: b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')
    
    In [74]: (a + b).coords
    Out[74]: 
    Coordinates:
      * x        (x) int64 0 1
        c        (x) int64 0 0
    

    This functionality can be controlled through the compat option, which has also been added to the Dataset constructor.

  • Datetime shortcuts such as 'time.month' now return a DataArray with the name 'month', not 'time.month' (GH345). This makes it easier to index the resulting arrays when they are used with groupby:

    In [75]: time = xray.DataArray(pd.date_range('2000-01-01', periods=365),
       ....:                       dims='time', name='time')
       ....: 
    
    In [76]: counts = time.groupby('time.month').count()
    
    In [77]: counts.sel(month=2)
    Out[77]: 
    <xarray.DataArray 'time' ()>
    array(29)
    Coordinates:
        month    int64 2
    

    Previously, you would need to use something like counts.sel(**{'time.month': 2}), which is much more awkward.

  • The season datetime shortcut now returns an array of string labels such as ‘DJF’:

    In [78]: ds = xray.Dataset({'t': pd.date_range('2000-01-01', periods=12, freq='M')})
    
    In [79]: ds['t.season']
    Out[79]: 
    <xarray.DataArray 'season' (t: 12)>
    array(['DJF', 'DJF', 'MAM', ..., 'SON', 'SON', 'DJF'], 
          dtype='|S3')
    Coordinates:
      * t        (t) datetime64[ns] 2000-01-31 2000-02-29 2000-03-31 2000-04-30 ...
    

    Previously, it returned numbered seasons 1 through 4.

  • We have updated our use of the terms “coordinates” and “variables”. What were known in previous versions of xray as “coordinates” and “variables” are now referred to throughout the documentation as “coordinate variables” and “data variables”. This brings xray in closer alignment to CF Conventions. The only visible change besides the documentation is that Dataset.vars has been renamed Dataset.data_vars.

  • You will need to update your code if you have been ignoring deprecation warnings: methods and attributes that were deprecated in xray v0.3 or earlier (e.g., dimensions, attributes) have gone away.

Enhancements
  • Support for reindex() with a fill method. This provides a useful shortcut for upsampling:

    In [80]: data = xray.DataArray([1, 2, 3], dims='x')
    
    In [81]: data.reindex(x=[0.5, 1, 1.5, 2, 2.5], method='pad')
    Out[81]: 
    <xarray.DataArray (x: 5)>
    array([1, 2, 2, 3, 3])
    Coordinates:
      * x        (x) float64 0.5 1.0 1.5 2.0 2.5
    

    This will be especially useful once pandas 0.16 is released, at which point xray will immediately support reindexing with method='nearest'.

  • Use functions that return generic ndarrays with DataArray.groupby.apply and Dataset.apply (GH327 and GH329). Thanks Jeff Gerard!

  • Consolidated the functionality of dumps (writing a dataset to a netCDF3 bytestring) into to_netcdf() (GH333).

  • to_netcdf() now supports writing to groups in netCDF4 files (GH333). It also finally has a full docstring – you should read it!

  • open_dataset() and to_netcdf() now work on netCDF3 files when netcdf4-python is not installed as long as scipy is available (GH333).

  • The new Dataset.drop and DataArray.drop methods make it easy to drop explicitly listed variables or index labels:

    # drop variables
    In [82]: ds = xray.Dataset({'x': 0, 'y': 1})
    
    In [83]: ds.drop('x')
    Out[83]: 
    <xarray.Dataset>
    Dimensions:  ()
    Coordinates:
        *empty*
    Data variables:
        y        int64 1
    
    # drop index labels
    In [84]: arr = xray.DataArray([1, 2, 3], coords=[('x', list('abc'))])
    
    In [85]: arr.drop(['a', 'c'], dim='x')
    Out[85]: 
    <xarray.DataArray (x: 1)>
    array([2])
    Coordinates:
      * x        (x) |S1 'b'
    
  • broadcast_equals() has been added to correspond to the new compat option.
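
    A minimal sketch, mirroring the compat example above:

    import xray

    a = xray.DataArray([1, 2], coords={'c': 0}, dims='x')
    b = xray.DataArray([1, 2], coords={'c': ('x', [0, 0])}, dims='x')

    a.equals(b)            # False: the coordinate 'c' differs in shape
    a.broadcast_equals(b)  # True: 'c' matches after broadcasting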

  • Long attributes are now truncated at 500 characters when printing a dataset (GH338). This should make things more convenient for working with datasets interactively.

  • Added a new documentation example, Calculating Seasonal Averages from Timeseries of Monthly Means. Thanks Joe Hamman!

Bug fixes
  • Several bug fixes related to decoding time units from netCDF files (GH316, GH330). Thanks Stefan Pfenninger!
  • xray no longer requires decode_coords=False when reading datasets with unparseable coordinate attributes (GH308).
  • Fixed DataArray.loc indexing with ... (GH318).
  • Fixed an edge case that resulted in an error when reindexing multi-dimensional variables (GH315).
  • Fixed slicing with negative step sizes (GH312).
  • Fixed invalid conversion of string arrays to numeric dtype (GH305).
  • Fixed repr() on dataset objects with non-standard dates (GH347).
Deprecations
  • dump and dumps have been deprecated in favor of to_netcdf().
  • drop_vars has been deprecated in favor of drop().
Future plans

The biggest feature I’m excited about working toward in the immediate future is supporting out-of-core operations in xray using Dask, a part of the Blaze project. For a preview of using Dask with weather data, read this blog post by Matthew Rocklin. See GH328 for more details.

v0.3.2 (23 December, 2014)

This release focused on bug-fixes, speedups and resolving some niggling inconsistencies.

There are a few cases where the behavior of xray differs from the previous version. However, I expect that in almost all cases your code will continue to run unmodified.

Warning

xray now requires pandas v0.15.0 or later. This was necessary for supporting TimedeltaIndex without too many painful hacks.

Backwards incompatible changes
  • Arrays of datetime.datetime objects are now automatically cast to datetime64[ns] arrays when stored in an xray object, using machinery borrowed from pandas:

    In [86]: from datetime import datetime
    
    In [87]: xray.Dataset({'t': [datetime(2000, 1, 1)]})
    Out[87]: 
    <xarray.Dataset>
    Dimensions:  (t: 1)
    Coordinates:
      * t        (t) datetime64[ns] 2000-01-01
    Data variables:
        *empty*
    
  • xray now has support (including serialization to netCDF) for TimedeltaIndex. datetime.timedelta objects are thus accordingly cast to timedelta64[ns] objects when appropriate.

  • Masked arrays are now properly coerced to use NaN as a sentinel value (GH259).

Enhancements
  • Due to popular demand, we have added experimental attribute style access as a shortcut for dataset variables, coordinates and attributes:

     In [88]: ds = xray.Dataset({'tmin': ([], 25, {'units': 'celsius'})})
    
    In [89]: ds.tmin.units
     Out[89]: 'celsius'
    

    Tab-completion for these variables should work in editors such as IPython. However, setting variables or attributes in this fashion is not yet supported because there are some unresolved ambiguities (GH300).

  • You can now use a dictionary for indexing with labeled dimensions. This provides a safe way to do assignment with labeled dimensions:

    In [90]: array = xray.DataArray(np.zeros(5), dims=['x'])
    
    In [91]: array[dict(x=slice(3))] = 1
    
    In [92]: array
    Out[92]: 
    <xarray.DataArray (x: 5)>
    array([ 1.,  1.,  1.,  0.,  0.])
    Coordinates:
      * x        (x) int64 0 1 2 3 4
    
  • Non-index coordinates can now be faithfully written to and restored from netCDF files. This is done according to CF conventions when possible by using the coordinates attribute on a data variable. When not possible, xray defines a global coordinates attribute.

  • Preliminary support for converting xray.DataArray objects to and from CDAT cdms2 variables.

  • We sped up any operation that involves creating a new Dataset or DataArray (e.g., indexing, aggregation, arithmetic) by 30 to 50%. The full speed up requires cyordereddict to be installed.

Bug fixes
  • Fix for to_dataframe() with 0d string/object coordinates (GH287)
  • Fix for to_netcdf with 0d string variable (GH284)
  • Fix writing datetime64 arrays to netcdf if NaT is present (GH270)
  • Fix align silently upcasting data arrays when NaNs are inserted (GH264)
Future plans
  • I am contemplating switching to the terms “coordinate variables” and “data variables” instead of the (currently used) “coordinates” and “variables”, following their use in CF Conventions (GH293). This would mostly have implications for the documentation, but I would also change the Dataset attribute vars to data.
  • I am no longer certain that automatic label alignment for arithmetic would be a good idea for xray – it is a feature from pandas that I have not missed (GH186).
  • The main API breakage that I do anticipate in the next release is finally making all aggregation operations skip missing values by default (GH130). I’m pretty sick of writing ds.reduce(np.nanmean, 'time').
  • The next version of xray (0.4) will remove deprecated features and aliases whose use currently raises a warning.

If you have opinions about any of these anticipated changes, I would love to hear them – please add a note to any of the referenced GitHub issues.

v0.3.1 (22 October, 2014)

This is mostly a bug-fix release to make xray compatible with the latest release of pandas (v0.15).

We added several features to better support working with missing values and exporting xray objects to pandas. We also reorganized the internal API for serializing and deserializing datasets, but this change should be almost entirely transparent to users.

Other than breaking the experimental DataStore API, there should be no backwards incompatible changes.

New features
  • Added count() and dropna() methods, copied from pandas, for working with missing values (GH247, GH58); see the sketch after this list.
  • Added DataArray.to_pandas for converting a data array into the pandas object with the same dimensionality (1D to Series, 2D to DataFrame, etc.) (GH255).
  • Support for reading gzipped netCDF3 files (GH239).
  • Reduced memory usage when writing netCDF files (GH251).
  • ‘missing_value’ is now supported as an alias for the ‘_FillValue’ attribute on netCDF variables (GH245).
  • Trivial indexes, equivalent to range(n) where n is the length of the dimension, are no longer written to disk (GH245).
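
A minimal sketch of the new count() and dropna() methods (values illustrative):

    import numpy as np
    import xray

    arr = xray.DataArray([1.0, np.nan, 3.0], dims='x')
    arr.count()      # number of non-missing values: 2
    arr.dropna('x')  # drops the NaN entry along 'x'
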
Bug fixes
  • Compatibility fixes for pandas v0.15 (GH262).
  • Fixes for display and indexing of NaT (not-a-time) (GH238, GH240).
  • Fix slicing by label when the argument is a data array (GH250).
  • Test data is now shipped with the source distribution (GH253).
  • Ensure order does not matter when doing arithmetic with scalar data arrays (GH254).
  • Order of dimensions preserved with DataArray.to_dataframe (GH260).

v0.3 (21 September 2014)

New features
  • Revamped coordinates: “coordinates” now refer to all arrays that are not used to index a dimension. Coordinates are intended to allow for keeping track of arrays of metadata that describe the grid on which the points in “variable” arrays lie. They are preserved (when unambiguous) even through mathematical operations.
  • Dataset math: Dataset objects now support all arithmetic operations directly. Dataset-array operations map across all dataset variables; dataset-dataset operations act on each pair of variables with the same name.
  • GroupBy math: binary operations now work between grouped and ungrouped objects, providing a convenient shortcut for normalizing by the average value of a group.
  • The dataset __repr__ method has been entirely overhauled; dataset objects now show their values when printed.
  • You can now index a dataset with a list of variables to return a new dataset: ds[['foo', 'bar']].
Backwards incompatible changes
  • Dataset.__eq__ and Dataset.__ne__ are now element-wise operations instead of comparing all values to obtain a single boolean. Use the method equals() instead.
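
    A minimal sketch of the new semantics (values illustrative):

    import xray

    ds1 = xray.Dataset({'a': ('x', [1, 2, 3])})
    ds2 = xray.Dataset({'a': ('x', [1, 2, 0])})

    ds1 == ds2       # element-wise: a Dataset of booleans
    ds1.equals(ds2)  # a single boolean: False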
Deprecations
  • Dataset.noncoords is deprecated: use Dataset.vars instead.
  • Dataset.select_vars deprecated: index a Dataset with a list of variable names instead.
  • DataArray.select_vars and DataArray.drop_vars deprecated: use reset_coords() instead.

v0.2 (14 August 2014)

This is a major release that includes some new features and quite a few bug fixes. Here are the highlights:

  • There is now a direct constructor for DataArray objects, which makes it possible to create a DataArray without using a Dataset. This is highlighted in the refreshed tutorial.
  • You can perform aggregation operations like mean directly on Dataset objects, thanks to Joe Hamman. These aggregation methods also work on grouped datasets.
  • xray now works on Python 2.6, thanks to Anna Kuznetsova.
  • A number of methods and attributes were given more sensible (usually shorter) names: labeled -> sel, indexed -> isel, select -> select_vars, unselect -> drop_vars, dimensions -> dims, coordinates -> coords, attributes -> attrs.
  • New load_data() and close() methods for datasets facilitate lower-level control of data loaded from disk.

v0.1.1 (20 May 2014)

xray 0.1.1 is a bug-fix release that includes changes that should be almost entirely backwards compatible with v0.1:

  • Python 3 support (GH53)
  • Required numpy version relaxed to 1.7 (GH129)
  • Return numpy.datetime64 arrays for non-standard calendars (GH126)
  • Support for opening datasets associated with NetCDF4 groups (GH127)
  • Bug-fixes for concatenating datetime arrays (GH134)

Special thanks to new contributors Thomas Kluyver, Joe Hamman and Alistair Miles.

v0.1 (2 May 2014)

Initial release.

See also

Get in touch

  • To ask questions or discuss xarray, use the mailing list.
  • Report bugs, suggest feature ideas or view the source code on GitHub.
  • For interactive discussion, we have a chatroom on Gitter.

License

xarray is available under the open source Apache License.

History

xarray is an evolution of an internal tool developed at The Climate Corporation. It was originally written by Climate Corp researchers Stephan Hoyer, Alex Kleeman and Eugene Brevdo and was released as open source in May 2014. The project was renamed from “xray” in January 2016.