Working with pandas#

One of the most important features of xarray is the ability to convert to and from pandas objects to interact with the rest of the PyData ecosystem. For example, for plotting labeled data, we highly recommend using the visualization built into pandas itself or provided by pandas-aware libraries such as Seaborn.

Hierarchical and tidy data#

Tabular data is easiest to work with when it meets the criteria for tidy data:

Each column holds a different variable.
Each row holds a different observation.

In this “tidy data” format, we can represent any Dataset and DataArray in terms of DataFrame and Series, respectively (and vice-versa). The representation works by flattening non-coordinates to 1D, and turning the tensor product of coordinate indexes into a pandas.MultiIndex.
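As a rough illustration of this mapping, the sketch below converts a small DataArray to a Series and checks that the resulting index matches the product of the two coordinate indexes. The array values and the names `arr`, `series` and `expected` are illustrative assumptions, not part of the original example.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Illustrative 2 x 3 array with two indexed dimensions (values are made up).
arr = xr.DataArray(
    np.arange(6).reshape(2, 3),
    coords=[("x", [10, 20]), ("y", ["a", "b", "c"])],
    name="foo",
)

# Flatten the values to 1D; the index becomes a MultiIndex over (x, y).
series = arr.to_series()

# The same index built by hand as the product of the coordinate indexes.
expected = pd.MultiIndex.from_product(
    [[10, 20], ["a", "b", "c"]], names=["x", "y"]
)

print(series)
print(list(series.index) == list(expected))
```

The comparison goes through plain lists of tuples so that it does not depend on how pandas stores the level dtypes.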

Dataset and DataFrame#

To convert any dataset to a DataFrame in tidy form, use the Dataset.to_dataframe() method:

df = ds.to_dataframe()
df

           foo   along_x  scalar
x  y
10 a  0.469112  0.119209     123
   b -0.282863  0.119209     123
   c -1.509059  0.119209     123
20 a -1.135632 -1.044236     123
   b  1.212112 -1.044236     123
   c -0.173215 -1.044236     123

We see that each variable and coordinate in the Dataset is now a column in the DataFrame, except for the indexes, which become levels of the DataFrame's index. To convert the DataFrame to any other convenient representation, use DataFrame methods like reset_index(), stack() and unstack().
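A minimal sketch of two such reshaping steps, assuming a dataset with the same layout as the table above. The values are deterministic stand-ins rather than the random numbers shown there, and the names `flat` and `wide` are illustrative.

```python
import numpy as np
import xarray as xr

# Illustrative dataset: one 2D variable, one coordinate along x, one scalar.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.5, 1.5]),
        "scalar": 123,
    },
)
df = ds.to_dataframe()

# Move the index levels x and y into ordinary columns.
flat = df.reset_index()

# Pivot the inner index level y into columns: one row per x, one column per y.
wide = df["foo"].unstack("y")

print(flat.columns.tolist())
print(wide)
```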

For datasets containing dask arrays where the data should be lazily loaded, see the Dataset.to_dask_dataframe() method.

To create a Dataset from a DataFrame, use the Dataset.from_dataframe() class method or the equivalent pandas.DataFrame.to_xarray() method:
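A minimal sketch of the round trip, assuming a small dataset with the same layout as the table above (the values are illustrative, not those of the original example):

```python
import numpy as np
import xarray as xr

# Illustrative dataset: foo is 2D, along_x is a non-index coordinate on x,
# scalar is a 0D coordinate.
ds = xr.Dataset(
    {"foo": (("x", "y"), np.arange(6.0).reshape(2, 3))},
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.5, 1.5]),
        "scalar": 123,
    },
)
df = ds.to_dataframe()

# Two spellings of the DataFrame -> Dataset conversion.
roundtrip = xr.Dataset.from_dataframe(df)
via_pandas = df.to_xarray()

# Compare the dimensions of along_x before and after the round trip.
print(ds["along_x"].dims, "->", roundtrip["along_x"].dims)
print(sorted(roundtrip.data_vars))
print(sorted(roundtrip.coords))
```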

Notice that the dimensions of variables in the Dataset have now expanded after the round-trip conversion to a DataFrame. This is because every column in a DataFrame must share the same index, so we need to broadcast the data of each array to the full size of the new MultiIndex.

Likewise, all the coordinates (other than indexes) ended up as variables, because pandas does not distinguish non-index coordinates.

Lossless and reversible conversion#

The previous Dataset example shows that the conversion is not reversible (lossy roundtrip) and that the size of the Dataset increases.

In particular, after a round trip, the following deviations can be observed:

a non-dimension Dataset coordinate is converted into a variable
a non-dimension DataArray coordinate is not converted
the dtype is not always preserved (e.g. “str” is converted to “object”)
attrs metadata is not preserved
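Two of these deviations can be reproduced with a short sketch, assuming a DataArray that is sent through a pandas Series and back. The coordinate `along_x`, the `units` attribute and the values are hypothetical, chosen only for illustration.

```python
import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "along_x": ("x", [0.5, 1.5]),  # non-dimension coordinate (hypothetical)
    },
    dims=("x", "y"),
    name="foo",
    attrs={"units": "m"},  # hypothetical metadata
)

# Round trip through a pandas Series.
restored = xr.DataArray.from_series(arr.to_series())

# The data values survive, but the extra coordinate and the attrs do not.
print("along_x" in arr.coords, "along_x" in restored.coords)
print(arr.attrs, restored.attrs)
```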

To avoid these problems, the third-party ntv-pandas library offers lossless and reversible conversions between Dataset/DataArray and pandas DataFrame objects.

This solution is particularly interesting for converting any DataFrame into a Dataset (the converter finds the multidimensional structure hidden by the tabular structure).

The ntv-pandas examples show how to improve the conversion for the previous Dataset example and for more complex examples.

Multi-dimensional data#

Tidy data is great, but sometimes you want to preserve dimensions instead of automatically stacking them into a MultiIndex.

DataArray.to_pandas() is a shortcut that lets you convert a DataArray directly into a pandas object with the same dimensionality, if available in pandas (i.e., a 1D array is converted to a Series and a 2D array to a DataFrame):

import numpy as np
import xarray as xr

arr = xr.DataArray(
    np.random.randn(2, 3), coords=[("x", [10, 20]), ("y", ["a", "b", "c"])]
)
df = arr.to_pandas()
df

y          a         b         c
x
10 -0.861849 -2.104569 -0.494929
20  1.071804  0.721555 -0.706771
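The 1D case, and what happens when pandas has no object of matching dimensionality, can be sketched as follows. The arrays are illustrative, and the use of `ValueError` for arrays with more than two dimensions is stated here as an assumption about current xarray behavior rather than something the text above confirms.

```python
import numpy as np
import pandas as pd
import xarray as xr

one_d = xr.DataArray([1.0, 2.0, 3.0], coords=[("x", [10, 20, 30])])
three_d = xr.DataArray(np.zeros((2, 2, 2)), dims=("x", "y", "z"))

# A 1D array becomes a Series indexed by its single coordinate.
series = one_d.to_pandas()
print(type(series).__name__)

# pandas has no 3D container, so the conversion is expected to fail.
try:
    three_d.to_pandas()
    raised = False
except ValueError as err:
    raised = True
    print(err)
```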

To perform the inverse operation of converting any pandas object into a DataArray with the same shape, simply use the DataArray constructor:

xr.DataArray(df)

<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[-0.86184896, -2.10456922, -0.49492927],
       [ 1.07180381,  0.72155516, -0.70677113]])
Coordinates:
  * x        (x) int64 16B 10 20
  * y        (y) object 24B 'a' 'b' 'c'

Both the DataArray and Dataset constructors directly convert pandas objects into xarray objects with the same shape. This means that they preserve all use of multi-indexes:
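A minimal sketch, assuming a Series with a two-level MultiIndex (the level names `one` and `two` and the values are illustrative). Only the DataArray constructor is shown here.

```python
import pandas as pd
import xarray as xr

index = pd.MultiIndex.from_arrays(
    [["a", "a", "b"], [0, 1, 2]], names=["one", "two"]
)
series = pd.Series([1.0, 2.0, 3.0], index=index)

# The constructor keeps the data one-dimensional; the MultiIndex is not
# unstacked into separate dimensions.
da = xr.DataArray(series)

print(da.shape)
# Converting back yields a Series that still carries both index levels.
print(da.to_series().index.names)
```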

However, you will need to set dimension names explicitly, either with the dims argument in the DataArray constructor or by calling rename on the new object.
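Both options can be sketched as follows, assuming a DataFrame whose index and columns carry no names, so that the constructor falls back to generic dimension names. The names `unnamed`, `named` and `renamed` are illustrative.

```python
import numpy as np
import pandas as pd
import xarray as xr

# DataFrame whose index and columns have no names.
df = pd.DataFrame(
    np.arange(6).reshape(2, 3), index=[10, 20], columns=["a", "b", "c"]
)

# Without names, the dimensions get generic labels.
unnamed = xr.DataArray(df)

# Option 1: pass dims to the constructor.
named = xr.DataArray(df, dims=["x", "y"])

# Option 2: rename the generic dimensions afterwards.
renamed = unnamed.rename({"dim_0": "x", "dim_1": "y"})

print(unnamed.dims, named.dims, renamed.dims)
```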
