Reshaping and reorganizing data

These methods allow you to reorganize

Reordering dimensions

To reorder dimensions on a DataArray or across all variables on a Dataset, use transpose(). An ellipsis () can be use to represent all other dimensions:

In [1]: ds = xr.Dataset({"foo": (("x", "y", "z"), [[[42]]]), "bar": (("y", "z"), [[24]])})

In [2]: ds.transpose("y", "z", "x")
Out[2]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (y, z, x) int64 42
    bar      (y, z) int64 24

In [3]: ds.transpose(..., "x")  # equivalent
Out[3]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (y, z, x) int64 42
    bar      (y, z) int64 24

In [4]: ds.transpose()  # reverses all dimensions
Out[4]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (z, y, x) int64 42
    bar      (z, y) int64 24

Expand and squeeze dimensions

To expand a DataArray or all variables on a Dataset along a new dimension, use expand_dims()

In [5]: expanded = ds.expand_dims("w")

In [6]: expanded
Out[6]: 
<xarray.Dataset>
Dimensions:  (w: 1, x: 1, y: 1, z: 1)
Dimensions without coordinates: w, x, y, z
Data variables:
    foo      (w, x, y, z) int64 42
    bar      (w, y, z) int64 24

This method attaches a new dimension with size 1 to all data variables.

To remove such a size-1 dimension from the DataArray or Dataset, use squeeze()

In [7]: expanded.squeeze("w")
Out[7]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (x, y, z) int64 42
    bar      (y, z) int64 24

Converting between datasets and arrays

To convert from a Dataset to a DataArray, use to_array():

In [8]: arr = ds.to_array()

In [9]: arr
Out[9]: 
<xarray.DataArray (variable: 2, x: 1, y: 1, z: 1)>
array([[[[42]]],


       [[[24]]]])
Coordinates:
  * variable  (variable) <U3 'foo' 'bar'
Dimensions without coordinates: x, y, z

This method broadcasts all data variables in the dataset against each other, then concatenates them along a new dimension into a new array while preserving coordinates.

To convert back from a DataArray to a Dataset, use to_dataset():

In [10]: arr.to_dataset(dim="variable")
Out[10]: 
<xarray.Dataset>
Dimensions:  (x: 1, y: 1, z: 1)
Dimensions without coordinates: x, y, z
Data variables:
    foo      (x, y, z) int64 42
    bar      (x, y, z) int64 24

The broadcasting behavior of to_array means that the resulting array includes the union of data variable dimensions:

In [11]: ds2 = xr.Dataset({"a": 0, "b": ("x", [3, 4, 5])})

# the input dataset has 4 elements
In [12]: ds2
Out[12]: 
<xarray.Dataset>
Dimensions:  (x: 3)
Dimensions without coordinates: x
Data variables:
    a        int64 0
    b        (x) int64 3 4 5

# the resulting array has 6 elements
In [13]: ds2.to_array()
Out[13]: 
<xarray.DataArray (variable: 2, x: 3)>
array([[0, 0, 0],
       [3, 4, 5]])
Coordinates:
  * variable  (variable) <U1 'a' 'b'
Dimensions without coordinates: x

Otherwise, the result could not be represented as an orthogonal array.

If you use to_dataset without supplying the dim argument, the DataArray will be converted into a Dataset of one variable:

In [14]: arr.to_dataset(name="combined")
Out[14]: 
<xarray.Dataset>
Dimensions:   (variable: 2, x: 1, y: 1, z: 1)
Coordinates:
  * variable  (variable) <U3 'foo' 'bar'
Dimensions without coordinates: x, y, z
Data variables:
    combined  (variable, x, y, z) int64 42 24

Stack and unstack

As part of xarray’s nascent support for pandas.MultiIndex, we have implemented stack() and unstack() method, for combining or splitting dimensions:

In [15]: array = xr.DataArray(
   ....:     np.random.randn(2, 3), coords=[("x", ["a", "b"]), ("y", [0, 1, 2])]
   ....: )
   ....: 

In [16]: stacked = array.stack(z=("x", "y"))

In [17]: stacked
Out[17]: 
<xarray.DataArray (z: 6)>
array([ 0.469, -0.283, -1.509, -1.136,  1.212, -0.173])
Coordinates:
  * z        (z) MultiIndex
  - x        (z) object 'a' 'a' 'a' 'b' 'b' 'b'
  - y        (z) int64 0 1 2 0 1 2

In [18]: stacked.unstack("z")
Out[18]: 
<xarray.DataArray (x: 2, y: 3)>
array([[ 0.469, -0.283, -1.509],
       [-1.136,  1.212, -0.173]])
Coordinates:
  * x        (x) object 'a' 'b'
  * y        (y) int64 0 1 2

As elsewhere in xarray, an ellipsis () can be used to represent all unlisted dimensions:

In [19]: stacked = array.stack(z=[..., "x"])

In [20]: stacked
Out[20]: 
<xarray.DataArray (z: 6)>
array([ 0.469, -1.136, -0.283,  1.212, -1.509, -0.173])
Coordinates:
  * z        (z) MultiIndex
  - y        (z) int64 0 0 1 1 2 2
  - x        (z) object 'a' 'b' 'a' 'b' 'a' 'b'

These methods are modeled on the pandas.DataFrame methods of the same name, although in xarray they always create new dimensions rather than adding to the existing index or columns.

Like DataFrame.unstack, xarray’s unstack always succeeds, even if the multi-index being unstacked does not contain all possible levels. Missing levels are filled in with NaN in the resulting object:

In [21]: stacked2 = stacked[::2]

In [22]: stacked2
Out[22]: 
<xarray.DataArray (z: 3)>
array([ 0.469, -0.283, -1.509])
Coordinates:
  * z        (z) MultiIndex
  - y        (z) int64 0 1 2
  - x        (z) object 'a' 'a' 'a'

In [23]: stacked2.unstack("z")
Out[23]: 
<xarray.DataArray (y: 3, x: 1)>
array([[ 0.469],
       [-0.283],
       [-1.509]])
Coordinates:
  * y        (y) int64 0 1 2
  * x        (x) object 'a'

However, xarray’s stack has an important difference from pandas: unlike pandas, it does not automatically drop missing values. Compare:

In [24]: array = xr.DataArray([[np.nan, 1], [2, 3]], dims=["x", "y"])

In [25]: array.stack(z=("x", "y"))
Out[25]: 
<xarray.DataArray (z: 4)>
array([nan,  1.,  2.,  3.])
Coordinates:
  * z        (z) MultiIndex
  - x        (z) int64 0 0 1 1
  - y        (z) int64 0 1 0 1

In [26]: array.to_pandas().stack()
Out[26]: 
x  y
0  1    1.0
1  0    2.0
   1    3.0
dtype: float64

We departed from pandas’s behavior here because predictable shapes for new array dimensions is necessary for Parallel computing with Dask.

Stacking different variables together

These stacking and unstacking operations are particularly useful for reshaping xarray objects for use in machine learning packages, such as scikit-learn, that usually require two-dimensional numpy arrays as inputs. For datasets with only one variable, we only need stack and unstack, but combining multiple variables in a xarray.Dataset is more complicated. If the variables in the dataset have matching numbers of dimensions, we can call to_array() and then stack along the the new coordinate. But to_array() will broadcast the dataarrays together, which will effectively tile the lower dimensional variable along the missing dimensions. The method xarray.Dataset.to_stacked_array() allows combining variables of differing dimensions without this wasteful copying while xarray.DataArray.to_unstacked_dataset() reverses this operation. Just as with xarray.Dataset.stack() the stacked coordinate is represented by a pandas.MultiIndex object. These methods are used like this:

In [27]: data = xr.Dataset(
   ....:     data_vars={"a": (("x", "y"), [[0, 1, 2], [3, 4, 5]]), "b": ("x", [6, 7])},
   ....:     coords={"y": ["u", "v", "w"]},
   ....: )
   ....: 

In [28]: data
Out[28]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) <U1 'u' 'v' 'w'
Dimensions without coordinates: x
Data variables:
    a        (x, y) int64 0 1 2 3 4 5
    b        (x) int64 6 7

In [29]: stacked = data.to_stacked_array("z", sample_dims=["x"])

In [30]: stacked
Out[30]: 
<xarray.DataArray 'a' (x: 2, z: 4)>
array([[0, 1, 2, 6],
       [3, 4, 5, 7]])
Coordinates:
  * z         (z) MultiIndex
  - variable  (z) object 'a' 'a' 'a' 'b'
  - y         (z) object 'u' 'v' 'w' nan
Dimensions without coordinates: x

In [31]: unstacked = stacked.to_unstacked_dataset("z")

In [32]: unstacked
Out[32]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 3)
Coordinates:
  * y        (y) object 'u' 'v' 'w'
Dimensions without coordinates: x
Data variables:
    a        (x, y) int64 0 1 2 3 4 5
    b        (x) int64 6 7

In this example, stacked is a two dimensional array that we can easily pass to a scikit-learn or another generic numerical method.

Note

Unlike with stack, in to_stacked_array, the user specifies the dimensions they do not want stacked. For a machine learning task, these unstacked dimensions can be interpreted as the dimensions over which samples are drawn, whereas the stacked coordinates are the features. Naturally, all variables should possess these sampling dimensions.

Set and reset index

Complementary to stack / unstack, xarray’s .set_index, .reset_index and .reorder_levels allow easy manipulation of DataArray or Dataset multi-indexes without modifying the data and its dimensions.

You can create a multi-index from several 1-dimensional variables and/or coordinates using set_index():

In [33]: da = xr.DataArray(
   ....:     np.random.rand(4),
   ....:     coords={
   ....:         "band": ("x", ["a", "a", "b", "b"]),
   ....:         "wavenumber": ("x", np.linspace(200, 400, 4)),
   ....:     },
   ....:     dims="x",
   ....: )
   ....: 

In [34]: da
Out[34]: 
<xarray.DataArray (x: 4)>
array([0.123, 0.543, 0.373, 0.448])
Coordinates:
    band        (x) <U1 'a' 'a' 'b' 'b'
    wavenumber  (x) float64 200.0 266.7 333.3 400.0
Dimensions without coordinates: x

In [35]: mda = da.set_index(x=["band", "wavenumber"])

In [36]: mda
Out[36]: 
<xarray.DataArray (x: 4)>
array([0.123, 0.543, 0.373, 0.448])
Coordinates:
  * x           (x) MultiIndex
  - band        (x) object 'a' 'a' 'b' 'b'
  - wavenumber  (x) float64 200.0 266.7 333.3 400.0

These coordinates can now be used for indexing, e.g.,

In [37]: mda.sel(band="a")
Out[37]: 
<xarray.DataArray (wavenumber: 2)>
array([0.123, 0.543])
Coordinates:
  * wavenumber  (wavenumber) float64 200.0 266.7

Conversely, you can use reset_index() to extract multi-index levels as coordinates (this is mainly useful for serialization):

In [38]: mda.reset_index("x")
Out[38]: 
<xarray.DataArray (x: 4)>
array([0.123, 0.543, 0.373, 0.448])
Coordinates:
    band        (x) object 'a' 'a' 'b' 'b'
    wavenumber  (x) float64 200.0 266.7 333.3 400.0
Dimensions without coordinates: x

reorder_levels() allows changing the order of multi-index levels:

In [39]: mda.reorder_levels(x=["wavenumber", "band"])
Out[39]: 
<xarray.DataArray (x: 4)>
array([0.123, 0.543, 0.373, 0.448])
Coordinates:
  * x           (x) MultiIndex
  - wavenumber  (x) float64 200.0 266.7 333.3 400.0
  - band        (x) object 'a' 'a' 'b' 'b'

As of xarray v0.9 coordinate labels for each dimension are optional. You can also use .set_index / .reset_index to add / remove labels for one or several dimensions:

In [40]: array = xr.DataArray([1, 2, 3], dims="x")

In [41]: array
Out[41]: 
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Dimensions without coordinates: x

In [42]: array["c"] = ("x", ["a", "b", "c"])

In [43]: array.set_index(x="c")
Out[43]: 
<xarray.DataArray (x: 3)>
array([1, 2, 3])
Coordinates:
  * x        (x) object 'a' 'b' 'c'

In [44]: array = array.set_index(x="c")

In [45]: array = array.reset_index("x", drop=True)

Shift and roll

To adjust coordinate labels, you can use the shift() and roll() methods:

In [46]: array = xr.DataArray([1, 2, 3, 4], dims="x")

In [47]: array.shift(x=2)
Out[47]: 
<xarray.DataArray (x: 4)>
array([nan, nan,  1.,  2.])
Dimensions without coordinates: x

In [48]: array.roll(x=2, roll_coords=True)
Out[48]: 
<xarray.DataArray (x: 4)>
array([3, 4, 1, 2])
Dimensions without coordinates: x

Sort

One may sort a DataArray/Dataset via sortby() and sortby(). The input can be an individual or list of 1D DataArray objects:

In [49]: ds = xr.Dataset(
   ....:     {"A": (("x", "y"), [[1, 2], [3, 4]]), "B": (("x", "y"), [[5, 6], [7, 8]])},
   ....:     coords={"x": ["b", "a"], "y": [1, 0]},
   ....: )
   ....: 

In [50]: dax = xr.DataArray([100, 99], [("x", [0, 1])])

In [51]: day = xr.DataArray([90, 80], [("y", [0, 1])])

In [52]: ds.sortby([day, dax])
Out[52]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) object 'b' 'a'
  * y        (y) int64 1 0
Data variables:
    A        (x, y) int64 1 2 3 4
    B        (x, y) int64 5 6 7 8

As a shortcut, you can refer to existing coordinates by name:

In [53]: ds.sortby("x")
Out[53]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 1 0
Data variables:
    A        (x, y) int64 3 4 1 2
    B        (x, y) int64 7 8 5 6

In [54]: ds.sortby(["y", "x"])
Out[54]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) <U1 'a' 'b'
  * y        (y) int64 0 1
Data variables:
    A        (x, y) int64 4 3 2 1
    B        (x, y) int64 8 7 6 5

In [55]: ds.sortby(["y", "x"], ascending=False)
Out[55]: 
<xarray.Dataset>
Dimensions:  (x: 2, y: 2)
Coordinates:
  * x        (x) <U1 'b' 'a'
  * y        (y) int64 1 0
Data variables:
    A        (x, y) int64 1 2 3 4
    B        (x, y) int64 5 6 7 8