.. currentmodule:: xarray
.. _pandas:
===================
Working with pandas
===================
One of the most important features of xarray is the ability to convert to and
from :py:mod:`pandas` objects to interact with the rest of the PyData
ecosystem. For example, for plotting labeled data, we highly recommend
using the `visualization built in to pandas itself`__ or provided by the pandas
aware libraries such as `Seaborn`__.
__ https://pandas.pydata.org/pandas-docs/stable/visualization.html
__ https://seaborn.pydata.org/
.. ipython:: python
:suppress:
import numpy as np
import pandas as pd
import xarray as xr
np.random.seed(123456)
Hierarchical and tidy data
~~~~~~~~~~~~~~~~~~~~~~~~~~
Tabular data is easiest to work with when it meets the criteria for
`tidy data`__:
* Each column holds a different variable.
* Each rows holds a different observation.
__ https://www.jstatsoft.org/v59/i10/
In this "tidy data" format, we can represent any :py:class:`Dataset` and
:py:class:`DataArray` in terms of :py:class:`~pandas.DataFrame` and
:py:class:`~pandas.Series`, respectively (and vice-versa). The representation
works by flattening non-coordinates to 1D, and turning the tensor product of
coordinate indexes into a :py:class:`pandas.MultiIndex`.
Dataset and DataFrame
---------------------
To convert any dataset to a ``DataFrame`` in tidy form, use the
:py:meth:`Dataset.to_dataframe()` method:
.. ipython:: python
ds = xr.Dataset(
{"foo": (("x", "y"), np.random.randn(2, 3))},
coords={
"x": [10, 20],
"y": ["a", "b", "c"],
"along_x": ("x", np.random.randn(2)),
"scalar": 123,
},
)
ds
df = ds.to_dataframe()
df
We see that each variable and coordinate in the Dataset is now a column in the
DataFrame, with the exception of indexes which are in the index.
To convert the ``DataFrame`` to any other convenient representation,
use ``DataFrame`` methods like :py:meth:`~pandas.DataFrame.reset_index`,
:py:meth:`~pandas.DataFrame.stack` and :py:meth:`~pandas.DataFrame.unstack`.
For datasets containing dask arrays where the data should be lazily loaded, see the
:py:meth:`Dataset.to_dask_dataframe()` method.
To create a ``Dataset`` from a ``DataFrame``, use the
:py:meth:`Dataset.from_dataframe` class method or the equivalent
:py:meth:`pandas.DataFrame.to_xarray` method:
.. ipython:: python
xr.Dataset.from_dataframe(df)
Notice that that dimensions of variables in the ``Dataset`` have now
expanded after the round-trip conversion to a ``DataFrame``. This is because
every object in a ``DataFrame`` must have the same indices, so we need to
broadcast the data of each array to the full size of the new ``MultiIndex``.
Likewise, all the coordinates (other than indexes) ended up as variables,
because pandas does not distinguish non-index coordinates.
DataArray and Series
--------------------
``DataArray`` objects have a complementary representation in terms of a
:py:class:`~pandas.Series`. Using a Series preserves the ``Dataset`` to
``DataArray`` relationship, because ``DataFrames`` are dict-like containers
of ``Series``. The methods are very similar to those for working with
DataFrames:
.. ipython:: python
s = ds["foo"].to_series()
s
# or equivalently, with Series.to_xarray()
xr.DataArray.from_series(s)
Both the ``from_series`` and ``from_dataframe`` methods use reindexing, so they
work even if the hierarchical index is not a full tensor product:
.. ipython:: python
s[::2]
s[::2].to_xarray()
Lossless and reversible conversion
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The previous ``Dataset`` example shows that the conversion is not reversible (lossy roundtrip) and
that the size of the ``Dataset`` increases.
Particularly after a roundtrip, the following deviations are noted:
- a non-dimension Dataset ``coordinate`` is converted into ``variable``
- a non-dimension DataArray ``coordinate`` is not converted
- ``dtype`` is not always the same (e.g. "str" is converted to "object")
- ``attrs`` metadata is not conserved
To avoid these problems, the third-party `ntv-pandas `__ library offers lossless and reversible conversions between
``Dataset``/ ``DataArray`` and pandas ``DataFrame`` objects.
This solution is particularly interesting for converting any ``DataFrame`` into a ``Dataset`` (the converter finds the multidimensional structure hidden by the tabular structure).
The `ntv-pandas examples `__ show how to improve the conversion for the previous ``Dataset`` example and for more complex examples.
Multi-dimensional data
~~~~~~~~~~~~~~~~~~~~~~
Tidy data is great, but it sometimes you want to preserve dimensions instead of
automatically stacking them into a ``MultiIndex``.
:py:meth:`DataArray.to_pandas()` is a shortcut that lets you convert a
DataArray directly into a pandas object with the same dimensionality, if
available in pandas (i.e., a 1D array is converted to a
:py:class:`~pandas.Series` and 2D to :py:class:`~pandas.DataFrame`):
.. ipython:: python
arr = xr.DataArray(
np.random.randn(2, 3), coords=[("x", [10, 20]), ("y", ["a", "b", "c"])]
)
df = arr.to_pandas()
df
To perform the inverse operation of converting any pandas objects into a data
array with the same shape, simply use the :py:class:`DataArray`
constructor:
.. ipython:: python
xr.DataArray(df)
Both the ``DataArray`` and ``Dataset`` constructors directly convert pandas
objects into xarray objects with the same shape. This means that they
preserve all use of multi-indexes:
.. ipython:: python
index = pd.MultiIndex.from_arrays(
[["a", "a", "b"], [0, 1, 2]], names=["one", "two"]
)
df = pd.DataFrame({"x": 1, "y": 2}, index=index)
ds = xr.Dataset(df)
ds
However, you will need to set dimension names explicitly, either with the
``dims`` argument on in the ``DataArray`` constructor or by calling
:py:class:`~Dataset.rename` on the new object.
.. _panel transition:
Transitioning from pandas.Panel to xarray
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
``Panel``, pandas' data structure for 3D arrays, was always a second class
data structure compared to the Series and DataFrame. To allow pandas
developers to focus more on its core functionality built around the
DataFrame, pandas removed ``Panel`` in favor of directing users who use
multi-dimensional arrays to xarray.
Xarray has most of ``Panel``'s features, a more explicit API (particularly around
indexing), and the ability to scale to >3 dimensions with the same interface.
As discussed in the :ref:`data structures section of the docs `, there are two primary data structures in
xarray: ``DataArray`` and ``Dataset``. You can imagine a ``DataArray`` as a
n-dimensional pandas ``Series`` (i.e. a single typed array), and a ``Dataset``
as the ``DataFrame`` equivalent (i.e. a dict of aligned ``DataArray`` objects).
So you can represent a Panel, in two ways:
- As a 3-dimensional ``DataArray``,
- Or as a ``Dataset`` containing a number of 2-dimensional DataArray objects.
Let's take a look:
.. ipython:: python
data = np.random.RandomState(0).rand(2, 3, 4)
items = list("ab")
major_axis = list("mno")
minor_axis = pd.date_range(start="2000", periods=4, name="date")
With old versions of pandas (prior to 0.25), this could stored in a ``Panel``:
.. ipython::
:verbatim:
In [1]: pd.Panel(data, items, major_axis, minor_axis)
Out[1]:
Dimensions: 2 (items) x 3 (major_axis) x 4 (minor_axis)
Items axis: a to b
Major_axis axis: m to o
Minor_axis axis: 2000-01-01 00:00:00 to 2000-01-04 00:00:00
To put this data in a ``DataArray``, write:
.. ipython:: python
array = xr.DataArray(data, [items, major_axis, minor_axis])
array
As you can see, there are three dimensions (each is also a coordinate). Two of
the axes of were unnamed, so have been assigned ``dim_0`` and ``dim_1``
respectively, while the third retains its name ``date``.
You can also easily convert this data into ``Dataset``:
.. ipython:: python
array.to_dataset(dim="dim_0")
Here, there are two data variables, each representing a DataFrame on panel's
``items`` axis, and labeled as such. Each variable is a 2D array of the
respective values along the ``items`` dimension.
While the xarray docs are relatively complete, a few items stand out for Panel users:
- A DataArray's data is stored as a numpy array, and so can only contain a single
type. As a result, a Panel that contains :py:class:`~pandas.DataFrame` objects
with multiple types will be converted to ``dtype=object``. A ``Dataset`` of
multiple ``DataArray`` objects each with its own dtype will allow original
types to be preserved.
- :ref:`Indexing ` is similar to pandas, but more explicit and
leverages xarray's naming of dimensions.
- Because of those features, making much higher dimensional data is very
practical.
- Variables in ``Dataset`` objects can use a subset of its dimensions. For
example, you can have one dataset with Person x Score x Time, and another with
Person x Score.
- You can use coordinates are used for both dimensions and for variables which
_label_ the data variables, so you could have a coordinate Age, that labelled
the Person dimension of a Dataset of Person x Score x Time.
While xarray may take some getting used to, it's worth it! If anything is unclear,
please `post an issue on GitHub `__ or
`StackOverflow `__,
and we'll endeavor to respond to the specific case or improve the general docs.