Hierarchical data#

Why Hierarchical Data?#

Many real-world datasets are composed of multiple differing components, and it can often be useful to think of these in terms of a hierarchy of related groups of data. Examples of data which one might want to organise in a grouped or hierarchical manner include:

  • Simulation data at multiple resolutions,

  • Observational data about the same system but from multiple different types of sensors,

  • Mixed experimental and theoretical data,

  • A systematic study recording the same experiment but with different parameters,

  • Heterogeneous data, such as demographic and meteorological data,

or even any combination of the above.

Often datasets like this cannot easily fit into a single Dataset object, or are more usefully thought of as groups of related Dataset objects. For this purpose we provide the xarray.DataTree class.

This page explains in detail how to understand and use the different features of the DataTree class for your own hierarchical data needs.

Node Relationships#

Creating a Family Tree#

The three main ways of creating a DataTree object are described briefly in Creating a DataTree. Here we go into more detail about how to create a tree node-by-node, using a famous family tree from the Simpsons cartoon as an example.

Let’s start by defining nodes representing the two siblings, Bart and Lisa Simpson:

In [1]: bart = xr.DataTree(name="Bart")

In [2]: lisa = xr.DataTree(name="Lisa")

Each of these node objects knows their own name, but they currently have no relationship to one another. We can connect them by creating another node representing a common parent, Homer Simpson:

In [3]: homer = xr.DataTree(name="Homer", children={"Bart": bart, "Lisa": lisa})

Here we set the children of Homer in the node’s constructor. We now have a small family tree

In [4]: homer
Out[4]: 
<xarray.DataTree 'Homer'>
Group: /
├── Group: /Bart
└── Group: /Lisa

where we can see how these individual Simpson family members are related to one another. The nodes representing Bart and Lisa are now connected - we can confirm their sibling rivalry by examining the siblings property:

In [5]: list(homer["Bart"].siblings)
Out[5]: ['Lisa']

But oops, we forgot Homer’s third child, Maggie! Let’s add her by updating Homer’s children property to include her:

In [6]: maggie = xr.DataTree(name="Maggie")

In [7]: homer.children = {"Bart": bart, "Lisa": lisa, "Maggie": maggie}

In [8]: homer
Out[8]: 
<xarray.DataTree 'Homer'>
Group: /
├── Group: /Bart
├── Group: /Lisa
└── Group: /Maggie

Let’s check that Maggie knows who her Dad is:

In [9]: maggie.parent.name
Out[9]: 'Homer'

That’s good - updating the properties of our nodes does not break the internal consistency of our tree, as changes of parentage are automatically reflected on both nodes.
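
To make the idea of two-way consistency concrete, here is a minimal sketch in plain Python. The Node class below is hypothetical and is not how DataTree is implemented; it only illustrates how a children setter can keep the parent link of every affected node up to date.

```python
class Node:
    """Hypothetical minimal tree node; an illustration, not xarray's DataTree."""

    def __init__(self, name):
        self.name = name
        self.parent = None
        self._children = {}

    @property
    def children(self):
        # Return a copy so callers cannot bypass the setter.
        return dict(self._children)

    @children.setter
    def children(self, new_children):
        # Detach the current children, then attach the new ones,
        # updating the parent link on every node that is affected.
        for child in self._children.values():
            child.parent = None
        for key, child in new_children.items():
            old_parent = child.parent
            if old_parent is not None and old_parent is not self:
                # Remove the child from its previous parent as well.
                old_parent._children = {
                    k: v for k, v in old_parent._children.items() if v is not child
                }
            child.name = key
            child.parent = self
        self._children = dict(new_children)


homer = Node("Homer")
bart, lisa, maggie = Node("Bart"), Node("Lisa"), Node("Maggie")
homer.children = {"Bart": bart, "Lisa": lisa}
homer.children = {"Bart": bart, "Lisa": lisa, "Maggie": maggie}
print(maggie.parent.name)
```

Because the setter is the only place where either link changes, the two views of the relationship cannot drift apart.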

These children obviously have another parent, Marge Simpson, but DataTree nodes can only have a maximum of one parent. Genealogical family trees are not even technically trees in the mathematical sense - the fact that distant relatives can mate makes them directed acyclic graphs. Trees of DataTree objects cannot represent this.

Homer is currently listed as having no parent (the so-called “root node” of this tree), but we can update his parent property:

In [10]: abe = xr.DataTree(name="Abe")

In [11]: abe.children = {"Homer": homer}

Abe is now the “root” of this tree, which we can see by examining the root property of any node in the tree:

In [12]: maggie.root.name
Out[12]: 'Abe'

We can see the whole tree by printing Abe’s node or just part of the tree by printing Homer’s node:

In [13]: abe
Out[13]: 
<xarray.DataTree 'Abe'>
Group: /
└── Group: /Homer
    ├── Group: /Homer/Bart
    ├── Group: /Homer/Lisa
    └── Group: /Homer/Maggie

In [14]: abe["Homer"]
Out[14]: 
<xarray.DataTree 'Homer'>
Group: /Homer
├── Group: /Homer/Bart
├── Group: /Homer/Lisa
└── Group: /Homer/Maggie

In episode 28, Abe Simpson reveals that he had another son, Herbert “Herb” Simpson. We can add Herbert to the family tree without displacing Homer by assign()-ing another child to Abe:

In [15]: herbert = xr.DataTree(name="Herb")

In [16]: abe = abe.assign({"Herbert": herbert})

In [17]: abe
Out[17]: 
<xarray.DataTree 'Abe'>
Group: /
├── Group: /Homer
│   ├── Group: /Homer/Bart
│   ├── Group: /Homer/Lisa
│   └── Group: /Homer/Maggie
└── Group: /Herbert

In [18]: abe["Herbert"].name
Out[18]: 'Herbert'

In [19]: herbert.name
Out[19]: 'Herb'

Note

This example shows a subtlety - the returned tree has Homer’s brother listed as "Herbert", but the original node was named “Herb”. Not only are names overridden when stored as keys like this, but the new node is a copy, so that the original node that was referenced is unchanged (i.e. herbert.name == "Herb" still). In other words, nodes are copied into trees, not inserted into them. This is intentional, and mirrors the behaviour when storing named DataArray objects inside datasets.

Certain manipulations of our tree are forbidden if they would create an inconsistent result. In episode 51 of the show Futurama, Philip J. Fry travels back in time and accidentally becomes his own Grandfather. If we try similar time-travelling hijinks with Homer, an InvalidTreeError is raised:

In [20]: abe["Homer"].children = {"Abe": abe}
InvalidTreeError: Cannot set parent, as intended parent is already a descendant of this node.
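
The kind of check behind this error can be sketched in a few lines of plain Python. The helper names (is_ancestor, check_can_adopt) and the parent_of mapping are assumptions made for this illustration, and a plain ValueError stands in for InvalidTreeError.

```python
def is_ancestor(candidate, node, parent_of):
    """Return True if `candidate` lies on the path from `node` up to the root."""
    current = parent_of.get(node)
    while current is not None:
        if current == candidate:
            return True
        current = parent_of.get(current)
    return False


def check_can_adopt(parent, child, parent_of):
    """Raise if making `child` a child of `parent` would create a cycle."""
    if child == parent or is_ancestor(child, parent, parent_of):
        raise ValueError(f"{child!r} is already an ancestor of {parent!r}")


# Hypothetical child -> parent mapping for the family tree built above.
parent_of = {"Homer": "Abe", "Bart": "Homer", "Lisa": "Homer", "Maggie": "Homer"}

check_can_adopt("Abe", "Herb", parent_of)  # allowed: Herb is not above Abe

try:
    check_can_adopt("Homer", "Abe", parent_of)  # Abe is Homer's own parent
    cycle_rejected = False
except ValueError as err:
    cycle_rejected = True
    print(err)
```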

Ancestry in an Evolutionary Tree#

Let’s use a different example of a tree to discuss more complex relationships between nodes - the phylogenetic tree, or tree of life.

In [21]: vertebrates = xr.DataTree.from_dict(
   ....:     {
   ....:         "/Sharks": None,
   ....:         "/Bony Skeleton/Ray-finned Fish": None,
   ....:         "/Bony Skeleton/Four Limbs/Amphibians": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Rodents & Rabbits": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs": None,
   ....:         "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Birds": None,
   ....:     },
   ....:     name="Vertebrae",
   ....: )
   ....: 

In [22]: primates = vertebrates["/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates"]

In [23]: dinosaurs = vertebrates[
   ....:     "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs"
   ....: ]
   ....: 

We have used the from_dict() constructor method as a preferred way to quickly create a whole tree, and Filesystem-like Paths (to be explained shortly) to select two nodes of interest.

In [24]: vertebrates
Out[24]: 
<xarray.DataTree 'Vertebrae'>
Group: /
├── Group: /Sharks
└── Group: /Bony Skeleton
    ├── Group: /Bony Skeleton/Ray-finned Fish
    └── Group: /Bony Skeleton/Four Limbs
        ├── Group: /Bony Skeleton/Four Limbs/Amphibians
        └── Group: /Bony Skeleton/Four Limbs/Amniotic Egg
            ├── Group: /Bony Skeleton/Four Limbs/Amniotic Egg/Hair
            │   ├── Group: /Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates
            │   └── Group: /Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Rodents & Rabbits
            └── Group: /Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae
                ├── Group: /Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs
                └── Group: /Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Birds

This tree shows various families of species, grouped by their common features (making it technically a “Cladogram”, rather than an evolutionary tree).

Here both the species and the features used to group them are represented by DataTree node objects - there is no distinction in types of node. We can however get a list of only the nodes we used to represent species by using the fact that all those nodes have no children - they are “leaf nodes”. We can check if a node is a leaf with the is_leaf property, and get a list of all leaves with the leaves property:

In [25]: primates.is_leaf
Out[25]: True

In [26]: [node.name for node in vertebrates.leaves]
Out[26]: 
['Sharks',
 'Ray-finned Fish',
 'Amphibians',
 'Primates',
 'Rodents & Rabbits',
 'Dinosaurs',
 'Birds']
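
As a rough illustration of what “leaf” means, the same idea can be expressed over plain path strings: a leaf is any node that is not the parent of another node. The helpers expand and leaf_names are hypothetical names used only in this sketch, which works on a shortened version of the tree above.

```python
from pathlib import PurePosixPath


def expand(paths):
    """Return every node implied by the paths, including intermediate groups."""
    nodes = set()
    for path in map(PurePosixPath, paths):
        nodes.add(path)
        nodes.update(path.parents)
    return nodes


def leaf_names(paths):
    """Names of the nodes that are not the parent of any other node."""
    nodes = expand(paths)
    parents = {node.parent for node in nodes if node != node.parent}
    return sorted(node.name for node in nodes if node not in parents)


paths = [
    "/Sharks",
    "/Bony Skeleton/Ray-finned Fish",
    "/Bony Skeleton/Four Limbs/Amphibians",
]
print(leaf_names(paths))
```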

Pretending that this is a true evolutionary tree for a moment, we can find the features of the evolutionary ancestors (so-called “ancestor” nodes), the distinguishing feature of the common ancestor of all vertebrate life (the root node), and even the distinguishing feature of the common ancestor of any two species (the common ancestor of two nodes):

In [27]: [node.name for node in reversed(primates.parents)]
Out[27]: ['Vertebrae', 'Bony Skeleton', 'Four Limbs', 'Amniotic Egg', 'Hair']

In [28]: primates.root.name
Out[28]: 'Vertebrae'

In [29]: primates.find_common_ancestor(dinosaurs).name
Out[29]: 'Amniotic Egg'

We can only find a common ancestor between two nodes that lie in the same tree. If we try to find the common evolutionary ancestor between primates and an Alien species that has no relationship to Earth’s evolutionary tree, an error will be raised.

In [30]: alien = xr.DataTree(name="Xenomorph")

In [31]: primates.find_common_ancestor(alien)
NotFoundInTreeError: Cannot find common ancestor because nodes do not lie within the same tree
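
For two nodes in the same tree, the common ancestor is simply the longest shared prefix of their paths. The function below is a sketch under that assumption, using only absolute path strings; common_ancestor is a hypothetical helper, not a DataTree method. Two paths written this way always share at least the root, which is why the error above can only arise for nodes from separate trees.

```python
from pathlib import PurePosixPath


def common_ancestor(path_a, path_b):
    """Return the deepest path that is a prefix of both absolute paths."""
    shared = []
    for part_a, part_b in zip(PurePosixPath(path_a).parts, PurePosixPath(path_b).parts):
        if part_a != part_b:
            break
        shared.append(part_a)
    return str(PurePosixPath(*shared))


primates_path = "/Bony Skeleton/Four Limbs/Amniotic Egg/Hair/Primates"
dinosaurs_path = "/Bony Skeleton/Four Limbs/Amniotic Egg/Two Fenestrae/Dinosaurs"
print(common_ancestor(primates_path, dinosaurs_path))
```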

Manipulating Trees#

Subsetting Tree Nodes#

We can subset our tree to select only nodes of interest in various ways.

As on a real filesystem, matching nodes by common patterns in their paths is often useful. We can use xarray.DataTree.match() for this:

In [52]: dt = xr.DataTree.from_dict(
   ....:     {
   ....:         "/a/A": None,
   ....:         "/a/B": None,
   ....:         "/b/A": None,
   ....:         "/b/B": None,
   ....:     }
   ....: )
   ....: 

In [53]: result = dt.match("*/B")

In [54]: result
Out[54]: 
<xarray.DataTree>
Group: /
├── Group: /a
│   └── Group: /a/B
└── Group: /b
    └── Group: /b/B
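
The pattern syntax is glob-like. As a sketch of the matching rule, assuming semantics like those of pathlib's PurePosixPath.match(), we can test plain path strings directly; match_paths is a hypothetical helper for this illustration.

```python
from pathlib import PurePosixPath


def match_paths(paths, pattern):
    """Keep the paths whose trailing components match the glob pattern."""
    return [path for path in paths if PurePosixPath(path).match(pattern)]


all_paths = ["/a", "/a/A", "/a/B", "/b", "/b/A", "/b/B"]
print(match_paths(all_paths, "*/B"))
```

Only the two B nodes match the pattern; in the DataTree result above, the groups /a and /b appear because they are needed to hold the matching nodes.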

We can also subset trees by the contents of the nodes. xarray.DataTree.filter() retains only the nodes of a tree that meet a certain condition. For example, we could recreate the Simpsons’ family tree with the ages of each individual, then filter for only the adults. First, let’s recreate the tree but with an age data variable in every node:

In [55]: simpsons = xr.DataTree.from_dict(
   ....:     {
   ....:         "/": xr.Dataset({"age": 83}),
   ....:         "/Herbert": xr.Dataset({"age": 40}),
   ....:         "/Homer": xr.Dataset({"age": 39}),
   ....:         "/Homer/Bart": xr.Dataset({"age": 10}),
   ....:         "/Homer/Lisa": xr.Dataset({"age": 8}),
   ....:         "/Homer/Maggie": xr.Dataset({"age": 1}),
   ....:     },
   ....:     name="Abe",
   ....: )
   ....: 

In [56]: simpsons
Out[56]: 
<xarray.DataTree 'Abe'>
Group: /
│   Dimensions:  ()
│   Data variables:
│       age      int64 8B 83
├── Group: /Herbert
│       Dimensions:  ()
│       Data variables:
│           age      int64 8B 40
└── Group: /Homer
    │   Dimensions:  ()
    │   Data variables:
    │       age      int64 8B 39
    ├── Group: /Homer/Bart
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 8B 10
    ├── Group: /Homer/Lisa
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 8B 8
    └── Group: /Homer/Maggie
            Dimensions:  ()
            Data variables:
                age      int64 8B 1

Now let’s filter out the minors:

In [57]: simpsons.filter(lambda node: node["age"] > 18)
Out[57]: 
<xarray.DataTree 'Abe'>
Group: /
│   Dimensions:  ()
│   Data variables:
│       age      int64 8B 83
├── Group: /Herbert
│       Dimensions:  ()
│       Data variables:
│           age      int64 8B 40
└── Group: /Homer
        Dimensions:  ()
        Data variables:
            age      int64 8B 39

The result is a new tree, containing only the nodes matching the condition.

(Yes, under the hood filter() is just syntactic sugar for the pattern we showed you in Iterating over trees!)
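
In spirit, the pattern is a comprehension over (path, node) pairs followed by rebuilding a tree from the pairs that survive. The sketch below mimics this with a plain dictionary of ages standing in for the tree; filter_nodes is a hypothetical helper and not the library's code.

```python
def filter_nodes(tree, predicate):
    """Keep only the (path, data) pairs whose data satisfies the predicate."""
    return {path: data for path, data in tree.items() if predicate(data)}


ages = {
    "/": 83,
    "/Herbert": 40,
    "/Homer": 39,
    "/Homer/Bart": 10,
    "/Homer/Lisa": 8,
    "/Homer/Maggie": 1,
}
adults = filter_nodes(ages, lambda age: age > 18)
print(adults)
```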

Tree Contents#

Hollow Trees#

A concept that can sometimes be useful is that of a “Hollow Tree”, which means a tree with data stored only at the leaf nodes. This matters because certain tree manipulation operations only make sense for hollow trees.

You can check if a tree is a hollow tree by using the is_hollow property. We can see that the Simpsons’ family tree is not hollow because the data variable "age" is present at some nodes which have children (i.e. Abe and Homer).

In [58]: simpsons.is_hollow
Out[58]: False
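
The definition can be restated as a small check: a tree is hollow when no node that has children also holds data. The following sketch applies that rule to a dictionary mapping absolute paths to data (with None meaning “no data”); the function and the example dictionaries are assumptions made for illustration.

```python
from pathlib import PurePosixPath


def is_hollow(data_by_path):
    """True if every node that has children holds no data (None)."""
    parents = set()
    for path in data_by_path:
        parents.update(PurePosixPath(path).parents)
    return all(
        data is None
        for path, data in data_by_path.items()
        if PurePosixPath(path) in parents
    )


simpsons_ages = {"/": 83, "/Homer": 39, "/Homer/Bart": 10, "/Herbert": 40}
leaf_only = {"/": None, "/a": 1, "/b": 2}
print(is_hollow(simpsons_ages), is_hollow(leaf_only))
```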

Computation#

DataTree objects are also useful for performing computations, not just for organizing data.

Operations and Methods on Trees#

To show how applying operations across a whole tree at once can be useful, let’s first create an example scientific dataset.

In [59]: def time_stamps(n_samples, T):
   ....:     """Create an array of evenly-spaced time stamps"""
   ....:     return xr.DataArray(
   ....:         data=np.linspace(0, 2 * np.pi * T, n_samples), dims=["time"]
   ....:     )
   ....: 

In [60]: def signal_generator(t, f, A, phase):
   ....:     """Generate an example electrical-like waveform"""
   ....:     return A * np.sin(f * t.data + phase)
   ....: 

In [61]: time_stamps1 = time_stamps(n_samples=15, T=1.5)

In [62]: time_stamps2 = time_stamps(n_samples=10, T=1.0)

In [63]: voltages = xr.DataTree.from_dict(
   ....:     {
   ....:         "/oscilloscope1": xr.Dataset(
   ....:             {
   ....:                 "potential": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps1, f=2, A=1.2, phase=0.5),
   ....:                 ),
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps1, f=2, A=1.2, phase=1),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps1},
   ....:         ),
   ....:         "/oscilloscope2": xr.Dataset(
   ....:             {
   ....:                 "potential": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.2),
   ....:                 ),
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps2},
   ....:         ),
   ....:     }
   ....: )
   ....: 

In [64]: voltages
Out[64]: 
<xarray.DataTree>
Group: /
├── Group: /oscilloscope1
│       Dimensions:    (time: 15)
│       Coordinates:
│         * time       (time) float64 120B 0.0 0.6732 1.346 2.02 ... 8.078 8.752 9.425
│       Data variables:
│           potential  (time) float64 120B 0.5753 1.155 -0.06141 ... -0.8987 0.5753
│           current    (time) float64 120B 1.01 0.8568 -0.6285 ... -1.191 -0.4074 1.01
└── Group: /oscilloscope2
        Dimensions:    (time: 10)
        Coordinates:
          * time       (time) float64 80B 0.0 0.6981 1.396 2.094 ... 4.887 5.585 6.283
        Data variables:
            potential  (time) float64 80B 0.3179 1.549 1.04 ... 1.578 0.4555 -1.179
            current    (time) float64 80B 1.031 1.552 0.3297 ... 1.259 -0.3356 -1.553

Most xarray computation methods also exist as methods on datatree objects, so you can for example take the mean value of these two timeseries at once:

In [65]: voltages.mean(dim="time")
Out[65]: 
<xarray.DataTree>
Group: /
├── Group: /oscilloscope1
│       Dimensions:    ()
│       Data variables:
│           potential  float64 8B 0.03835
│           current    float64 8B 0.06732
└── Group: /oscilloscope2
        Dimensions:    ()
        Data variables:
            potential  float64 8B 0.169
            current    float64 8B 0.1025

This works by mapping the standard xarray.Dataset.mean() method over the dataset stored in each node of the tree one-by-one.

The arguments passed to the method are used for every node, so the values of the arguments you pass might be valid for one node and invalid for another:

In [66]: voltages.isel(time=12)
IndexError: index 12 is out of bounds for axis 0 with size 10

Notice that the full error message (abbreviated above) helpfully indicates which node of the tree the operation failed on.
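
To illustrate both points (one call applied to every node, and errors that report the offending node), here is a sketch that maps a function over a dictionary of plain lists. The helper map_over_nodes and the sample data are hypothetical; DataTree's own machinery differs in detail.

```python
def map_over_nodes(tree, func):
    """Apply func to the data at every node, reporting the path on failure."""
    result = {}
    for path, data in tree.items():
        try:
            result[path] = func(data)
        except Exception as err:
            raise RuntimeError(f"failed at node {path!r}: {err}") from err
    return result


samples = {"/oscilloscope1": list(range(15)), "/oscilloscope2": list(range(10))}

# The same function is valid for both nodes.
means = map_over_nodes(samples, lambda xs: sum(xs) / len(xs))

# Index 12 exists in the first node but not in the second.
try:
    map_over_nodes(samples, lambda xs: xs[12])
    message = ""
except RuntimeError as err:
    message = str(err)
print(message)
```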

Arithmetic Methods on Trees#

Arithmetic methods are also implemented, so you can e.g. add a scalar to every dataset in the tree at once. For example, we can advance the timeline of the Simpsons by a decade just by adding 10 to the tree:

In [67]: simpsons + 10
Out[67]: 
<xarray.DataTree 'Abe'>
Group: /
│   Dimensions:  ()
│   Data variables:
│       age      int64 8B 93
├── Group: /Herbert
│       Dimensions:  ()
│       Data variables:
│           age      int64 8B 50
└── Group: /Homer
    │   Dimensions:  ()
    │   Data variables:
    │       age      int64 8B 49
    ├── Group: /Homer/Bart
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 8B 20
    ├── Group: /Homer/Lisa
    │       Dimensions:  ()
    │       Data variables:
    │           age      int64 8B 18
    └── Group: /Homer/Maggie
            Dimensions:  ()
            Data variables:
                age      int64 8B 11

See that the same change (fast-forwarding by adding 10 years to the age of each character) has been applied to every node.

Mapping Custom Functions Over Trees#

You can map custom computation over each node in a tree using xarray.DataTree.map_over_datasets(). You can map any function, so long as it takes xarray.Dataset objects as one (or more) of the input arguments, and returns one (or more) xarray datasets.

Note

Functions passed to map_over_datasets() cannot alter nodes in-place. Instead they must return new xarray.Dataset objects.

For example, we can define a function to calculate the Root Mean Square of a timeseries:

In [68]: def rms(signal):
   ....:     return np.sqrt(np.mean(signal**2))
   ....: 

Then calculate the RMS value of these signals:

In [69]: voltages.map_over_datasets(rms)
Out[69]: 
<xarray.DataTree>
Group: /
├── Group: /oscilloscope1
│       Dimensions:    ()
│       Data variables:
│           potential  float64 8B 0.8331
│           current    float64 8B 0.8602
└── Group: /oscilloscope2
        Dimensions:    ()
        Data variables:
            potential  float64 8B 1.099
            current    float64 8B 1.158

We can also use map_over_datasets() to apply a function over the data in multiple trees, by passing the trees as positional arguments.

Operating on Multiple Trees#

The examples so far have involved mapping functions or methods over the nodes of a single tree, but we can generalize this to mapping functions over multiple trees at once.

Iterating Over Multiple Trees#

To iterate over the corresponding nodes in multiple trees, use group_subtrees() instead of subtree_with_keys. This combines well with xarray.DataTree.from_dict() to build a new tree:

In [70]: dt1 = xr.DataTree.from_dict({"a": xr.Dataset({"x": 1}), "b": xr.Dataset({"x": 2})})

In [71]: dt2 = xr.DataTree.from_dict(
   ....:     {"a": xr.Dataset({"x": 10}), "b": xr.Dataset({"x": 20})}
   ....: )
   ....: 

In [72]: result = {}

In [73]: for path, (node1, node2) in xr.group_subtrees(dt1, dt2):
   ....:     result[path] = node1.dataset + node2.dataset
   ....: 

In [74]: xr.DataTree.from_dict(result)
Out[74]: 
<xarray.DataTree>
Group: /
├── Group: /a
│       Dimensions:  ()
│       Data variables:
│           x        int64 8B 11
└── Group: /b
        Dimensions:  ()
        Data variables:
            x        int64 8B 22

Alternatively, you can apply a function directly to paired datasets at every node using xarray.map_over_datasets():

In [75]: xr.map_over_datasets(lambda x, y: x + y, dt1, dt2)
Out[75]: 
<xarray.DataTree>
Group: /
├── Group: /a
│       Dimensions:  ()
│       Data variables:
│           x        int64 8B 11
└── Group: /b
        Dimensions:  ()
        Data variables:
            x        int64 8B 22
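
Conceptually, both approaches pair up the nodes that share a path and then combine each pair. A sketch of that pairing with plain dictionaries might look like this; group_by_path is a hypothetical stand-in, and a ValueError is used here in place of the library's own error type.

```python
def group_by_path(*trees):
    """Yield (path, values) for each path, requiring all trees to share paths."""
    first, *rest = trees
    for other in rest:
        if set(other) != set(first):
            raise ValueError("trees do not have the same set of node paths")
    for path in first:
        yield path, tuple(tree[path] for tree in trees)


tree1 = {"/a": 1, "/b": 2}
tree2 = {"/a": 10, "/b": 20}
summed = {path: x + y for path, (x, y) in group_by_path(tree1, tree2)}
print(summed)
```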

Comparing Trees for Isomorphism#

For it to make sense to map a single non-unary function over the nodes of multiple trees at once, each tree needs to have the same structure. Specifically, two trees can only be considered similar, or “isomorphic”, if the full paths to all of their descendant nodes are the same.

Applying group_subtrees() to trees with different structures raises TreeIsomorphismError:

In [76]: tree = xr.DataTree.from_dict({"a": None, "a/b": None, "a/c": None})

In [77]: simple_tree = xr.DataTree.from_dict({"a": None})

In [78]: for _ in xr.group_subtrees(tree, simple_tree):
   ....:     ...
   ....: 
TreeIsomorphismError: children at node 'a' do not match: ['b', 'c'] vs []

We can also explicitly check if any two trees are isomorphic using the isomorphic() method:

In [79]: tree.isomorphic(simple_tree)
Out[79]: False

Corresponding tree nodes do not need to have the same data in order to be considered isomorphic:

In [80]: tree_with_data = xr.DataTree.from_dict({"a": xr.Dataset({"foo": 1})})

In [81]: simple_tree.isomorphic(tree_with_data)
Out[81]: True

They also do not need to define child nodes in the same order:

In [82]: reordered_tree = xr.DataTree.from_dict({"a": None, "a/c": None, "a/b": None})

In [83]: tree.isomorphic(reordered_tree)
Out[83]: True
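
Put differently, isomorphism in this sense depends only on the set of node paths, not on the data or on the order in which children were defined. The sketch below expresses that rule over collections of path keys like those passed to from_dict() above; node_paths and isomorphic are hypothetical helpers written for this illustration.

```python
from pathlib import PurePosixPath


def node_paths(keys):
    """Return the set of all node paths implied by the keys, including the root."""
    root = PurePosixPath("/")
    nodes = {root}
    for key in keys:
        path = root / key  # treat keys as relative to the root
        nodes.add(path)
        nodes.update(path.parents)
    return nodes


def isomorphic(keys_a, keys_b):
    """Two trees match if they contain exactly the same node paths."""
    return node_paths(keys_a) == node_paths(keys_b)


tree_keys = ["a", "a/b", "a/c"]
simple_keys = ["a"]
reordered_keys = ["a", "a/c", "a/b"]
print(isomorphic(tree_keys, simple_keys), isomorphic(tree_keys, reordered_keys))
```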

Arithmetic Between Multiple Trees#

Arithmetic operations like multiplication are binary operations, so as long as we have two isomorphic trees, we can do arithmetic between them.

In [84]: currents = xr.DataTree.from_dict(
   ....:     {
   ....:         "/oscilloscope1": xr.Dataset(
   ....:             {
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps1, f=2, A=1.2, phase=1),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps1},
   ....:         ),
   ....:         "/oscilloscope2": xr.Dataset(
   ....:             {
   ....:                 "current": (
   ....:                     "time",
   ....:                     signal_generator(time_stamps2, f=1.6, A=1.6, phase=0.7),
   ....:                 ),
   ....:             },
   ....:             coords={"time": time_stamps2},
   ....:         ),
   ....:     }
   ....: )
   ....: 

In [85]: currents
Out[85]: 
<xarray.DataTree>
Group: /
├── Group: /oscilloscope1
│       Dimensions:  (time: 15)
│       Coordinates:
│         * time     (time) float64 120B 0.0 0.6732 1.346 2.02 ... 8.078 8.752 9.425
│       Data variables:
│           current  (time) float64 120B 1.01 0.8568 -0.6285 ... -1.191 -0.4074 1.01
└── Group: /oscilloscope2
        Dimensions:  (time: 10)
        Coordinates:
          * time     (time) float64 80B 0.0 0.6981 1.396 2.094 ... 4.887 5.585 6.283
        Data variables:
            current  (time) float64 80B 1.031 1.552 0.3297 ... 1.259 -0.3356 -1.553

In [86]: currents.isomorphic(voltages)
Out[86]: True

We could use this feature to quickly calculate the electrical power in our signal, P=IV.

In [87]: power = currents * voltages

In [88]: power
Out[88]: 
<xarray.DataTree>
Group: /
├── Group: /oscilloscope1
│       Dimensions:  (time: 15)
│       Coordinates:
│         * time     (time) float64 120B 0.0 0.6732 1.346 2.02 ... 8.078 8.752 9.425
│       Data variables:
│           current  (time) float64 120B 1.02 0.7341 0.395 1.292 ... 1.419 0.166 1.02
└── Group: /oscilloscope2
        Dimensions:  (time: 10)
        Coordinates:
          * time     (time) float64 80B 0.0 0.6981 1.396 2.094 ... 4.887 5.585 6.283
        Data variables:
            current  (time) float64 80B 1.062 2.408 0.1087 1.594 ... 1.585 0.1126 2.412

Alignment and Coordinate Inheritance#

Data Alignment#

The data in different datatree nodes are not totally independent. In particular, dimensions (and indexes) in child nodes must be exactly aligned with those in their parent nodes. Exact alignment means that shared dimensions must be the same length, and indexes along those dimensions must be equal.

Note

If you were a previous user of the prototype xarray-contrib/datatree package, this is different from what you’re used to! In that package the data model was that the data stored in each node actually was completely unrelated. The data model is now slightly stricter. This allows us to provide features like Coordinate Inheritance.

To demonstrate, let’s first generate some example datasets which are not aligned with one another:

# (drop the attributes just to make the printed representation shorter)
In [89]: ds = xr.tutorial.open_dataset("air_temperature").drop_attrs()

In [90]: ds_daily = ds.resample(time="D").mean("time")

In [91]: ds_weekly = ds.resample(time="W").mean("time")

In [92]: ds_monthly = ds.resample(time="ME").mean("time")

These datasets have different lengths along the time dimension, and are therefore not aligned along that dimension.

In [93]: ds_daily.sizes
Out[93]: Frozen({'time': 730, 'lat': 25, 'lon': 53})

In [94]: ds_weekly.sizes
Out[94]: Frozen({'time': 105, 'lat': 25, 'lon': 53})

In [95]: ds_monthly.sizes
Out[95]: Frozen({'time': 24, 'lat': 25, 'lon': 53})

We cannot store these non-alignable variables on a single Dataset object, because they do not exactly align:

In [96]: xr.align(ds_daily, ds_weekly, ds_monthly, join="exact")
ValueError: cannot align objects with join='exact' where index/labels/sizes are not equal along these coordinates (dimensions): 'time' ('time',)

But we previously said that multi-resolution data is a good use case for DataTree, so surely we should be able to store these in a single DataTree? If we first try to create a DataTree with these different-length time dimensions present in both parents and children, we will still get an alignment error:

In [97]: xr.DataTree.from_dict({"daily": ds_daily, "daily/weekly": ds_weekly})
ValueError: group '/daily/weekly' is not aligned with its parents:
Group:
    Dimensions:  (time: 105, lat: 25, lon: 53)
    Coordinates:
      * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
      * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
      * time     (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
    Data variables:
        air      (time, lat, lon) float64 1MB 245.3 245.2 245.0 ... 296.6 296.2
From parents:
    Dimensions:  (time: 730, lat: 25, lon: 53)
    Coordinates:
      * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
      * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
      * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31

This is because DataTree checks that data in child nodes align exactly with their parents.

Note

This requirement of aligned dimensions is similar to netCDF’s concept of inherited dimensions, as in netCDF-4 files dimensions are visible to all child groups.

This alignment check is performed up through the tree, all the way to the root, and is therefore equivalent to requiring that this align() command succeeds:

xr.align(child.dataset, *(parent.dataset for parent in child.parents), join="exact")
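
A simplified version of this rule can be written without xarray at all. In the sketch below each node is described only by a mapping from dimension name to its index values (an assumption made to keep the example small), and every node is compared against each of its ancestors. The function name and the ValueError message are hypothetical.

```python
from pathlib import PurePosixPath


def check_alignment(indexes_by_path):
    """Raise if a node's index differs from an ancestor's along a shared dimension."""
    for path, indexes in indexes_by_path.items():
        for ancestor in PurePosixPath(path).parents:
            parent_indexes = indexes_by_path.get(str(ancestor), {})
            for dim in indexes.keys() & parent_indexes.keys():
                if indexes[dim] != parent_indexes[dim]:
                    raise ValueError(
                        f"group {path!r} is not aligned with {str(ancestor)!r} along {dim!r}"
                    )


daily = {"time": tuple(range(730))}
weekly = {"time": tuple(range(105))}

# Siblings are never compared with each other, so this passes.
siblings = {"/daily": daily, "/weekly": weekly}
check_alignment(siblings)

# A child with a different-length "time" index than its parent fails.
nested = {"/daily": daily, "/daily/weekly": weekly}
try:
    check_alignment(nested)
    nested_ok = True
except ValueError as err:
    nested_ok = False
    print(err)
```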

To represent our unalignable data in a single DataTree, we must instead place all variables which are a function of these different-length dimensions into nodes that are not direct descendants of one another, e.g. organize them as siblings.

In [98]: dt = xr.DataTree.from_dict(
   ....:     {"daily": ds_daily, "weekly": ds_weekly, "monthly": ds_monthly}
   ....: )
   ....: 

In [99]: dt
Out[99]: 
<xarray.DataTree>
Group: /
├── Group: /daily
│       Dimensions:  (time: 730, lat: 25, lon: 53)
│       Coordinates:
│         * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│         * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
│         * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│       Data variables:
│           air      (time, lat, lon) float64 8MB 241.9 242.3 242.7 ... 295.9 295.5
├── Group: /weekly
│       Dimensions:  (time: 105, lat: 25, lon: 53)
│       Coordinates:
│         * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│         * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
│         * time     (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│       Data variables:
│           air      (time, lat, lon) float64 1MB 245.3 245.2 245.0 ... 296.6 296.2
└── Group: /monthly
        Dimensions:  (time: 24, lat: 25, lon: 53)
        Coordinates:
          * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
          * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
          * time     (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
        Data variables:
            air      (time, lat, lon) float64 254kB 244.5 244.7 244.7 ... 297.7 297.7

Now we have a valid DataTree structure which contains all the data at each different time frequency, stored in a separate group.

This is a useful way to organise our data because we can still operate on all the groups at once. For example we can extract all three timeseries at a specific lat-lon location:

In [100]: dt.sel(lat=75, lon=300)
Out[100]: 
<xarray.DataTree>
Group: /
├── Group: /daily
│       Dimensions:  (time: 730)
│       Coordinates:
│           lat      float32 4B 75.0
│           lon      float32 4B 300.0
│         * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│       Data variables:
│           air      (time) float64 6kB 242.7 245.6 244.9 249.8 ... 254.8 255.6 256.8
├── Group: /weekly
│       Dimensions:  (time: 105)
│       Coordinates:
│           lat      float32 4B 75.0
│           lon      float32 4B 300.0
│         * time     (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│       Data variables:
│           air      (time) float64 840B 247.2 251.7 256.2 261.4 ... 249.8 248.2 255.7
└── Group: /monthly
        Dimensions:  (time: 24)
        Coordinates:
            lat      float32 4B 75.0
            lon      float32 4B 300.0
          * time     (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
        Data variables:
            air      (time) float64 192B 254.0 252.8 256.9 258.7 ... 265.1 261.8 251.7

or compute the standard deviation of each timeseries to find out how it varies with sampling frequency:

In [101]: dt.std(dim="time")
Out[101]: 
<xarray.DataTree>
Group: /
├── Group: /daily
│       Dimensions:  (lat: 25, lon: 53)
│       Coordinates:
│         * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│         * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
│       Data variables:
│           air      (lat, lon) float64 11kB 11.63 11.57 11.57 ... 1.715 1.82 1.899
├── Group: /weekly
│       Dimensions:  (lat: 25, lon: 53)
│       Coordinates:
│         * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│         * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
│       Data variables:
│           air      (lat, lon) float64 11kB 11.29 11.22 11.21 ... 1.651 1.744 1.818
└── Group: /monthly
        Dimensions:  (lat: 25, lon: 53)
        Coordinates:
          * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
          * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
        Data variables:
            air      (lat, lon) float64 11kB 10.8 10.75 10.76 ... 1.608 1.693 1.763

Coordinate Inheritance#

Notice that in the trees we constructed above there is some redundancy - the lat and lon variables appear in each sibling group, but are identical across the groups.

In [102]: dt
Out[102]: 
<xarray.DataTree>
Group: /
├── Group: /daily
│       Dimensions:  (time: 730, lat: 25, lon: 53)
│       Coordinates:
│         * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│         * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
│         * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│       Data variables:
│           air      (time, lat, lon) float64 8MB 241.9 242.3 242.7 ... 295.9 295.5
├── Group: /weekly
│       Dimensions:  (time: 105, lat: 25, lon: 53)
│       Coordinates:
│         * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│         * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
│         * time     (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│       Data variables:
│           air      (time, lat, lon) float64 1MB 245.3 245.2 245.0 ... 296.6 296.2
└── Group: /monthly
        Dimensions:  (time: 24, lat: 25, lon: 53)
        Coordinates:
          * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
          * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
          * time     (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
        Data variables:
            air      (time, lat, lon) float64 254kB 244.5 244.7 244.7 ... 297.7 297.7

We can use “Coordinate Inheritance” to define these coordinates only once, in a parent group, which removes the redundancy whilst still allowing us to access the coordinate variables from the child groups.

Note

This is also a new feature relative to the prototype xarray-contrib/datatree package.

Let’s instead place only the time-dependent variables in the child groups, and put the non-time-dependent lat and lon variables in the parent (root) group:

In [103]: dt = xr.DataTree.from_dict(
   .....:     {
   .....:         "/": ds.drop_dims("time"),
   .....:         "daily": ds_daily.drop_vars(["lat", "lon"]),
   .....:         "weekly": ds_weekly.drop_vars(["lat", "lon"]),
   .....:         "monthly": ds_monthly.drop_vars(["lat", "lon"]),
   .....:     }
   .....: )
   .....: 

In [104]: dt
Out[104]: 
<xarray.DataTree>
Group: /
│   Dimensions:  (lat: 25, lon: 53)
│   Coordinates:
│     * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│     * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
├── Group: /daily
│       Dimensions:  (time: 730, lat: 25, lon: 53)
│       Coordinates:
│         * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│       Data variables:
│           air      (time, lat, lon) float64 8MB 241.9 242.3 242.7 ... 295.9 295.5
├── Group: /weekly
│       Dimensions:  (time: 105, lat: 25, lon: 53)
│       Coordinates:
│         * time     (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│       Data variables:
│           air      (time, lat, lon) float64 1MB 245.3 245.2 245.0 ... 296.6 296.2
└── Group: /monthly
        Dimensions:  (time: 24, lat: 25, lon: 53)
        Coordinates:
          * time     (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
        Data variables:
            air      (time, lat, lon) float64 254kB 244.5 244.7 244.7 ... 297.7 297.7

This representation is preferable to the previous one because it makes clear that all of these datasets share a common spatial grid. Defining the common coordinates just once also ensures that the spatial coordinates of the groups cannot fall out of sync with one another during operations.

We can still access the coordinates defined in the parent groups from any of the child groups as if they were actually present on the child groups:

In [105]: dt.daily.coords
Out[105]: 
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
  * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
  * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31

In [106]: dt["daily/lat"]
Out[106]: 
<xarray.DataArray 'lat' (lat: 25)> Size: 100B
array([75. , 72.5, 70. , 67.5, 65. , 62.5, 60. , 57.5, 55. , 52.5, 50. , 47.5,
       45. , 42.5, 40. , 37.5, 35. , 32.5, 30. , 27.5, 25. , 22.5, 20. , 17.5,
       15. ], dtype=float32)
Coordinates:
  * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0

As we can still access them, we say that the lat and lon coordinates in the child groups have been “inherited” from their common parent group.

If we print just one of the child nodes, it will still display inherited coordinates, but explicitly mark them as such:

In [107]: print(dt["/daily"])
<xarray.DataTree 'daily'>
Group: /daily
    Dimensions:  (lat: 25, lon: 53, time: 730)
    Coordinates:
      * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
    Inherited coordinates:
      * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
      * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
    Data variables:
        air      (time, lat, lon) float64 8MB 241.9 242.3 242.7 ... 295.9 295.5

This makes it easy to distinguish the variables defined on the node you are currently looking at from those defined somewhere above it.

We can still perform all of the same operations on the whole tree:

In [108]: dt.sel(lat=[75], lon=[300])
Out[108]: 
<xarray.DataTree>
Group: /
│   Dimensions:  (lat: 1, lon: 1)
│   Coordinates:
│     * lat      (lat) float32 4B 75.0
│     * lon      (lon) float32 4B 300.0
├── Group: /daily
│       Dimensions:  (time: 730, lat: 1, lon: 1)
│       Coordinates:
│         * time     (time) datetime64[ns] 6kB 2013-01-01 2013-01-02 ... 2014-12-31
│       Data variables:
│           air      (time, lat, lon) float64 6kB 242.7 245.6 244.9 ... 255.6 256.8
├── Group: /weekly
│       Dimensions:  (time: 105, lat: 1, lon: 1)
│       Coordinates:
│         * time     (time) datetime64[ns] 840B 2013-01-06 2013-01-13 ... 2015-01-04
│       Data variables:
│           air      (time, lat, lon) float64 840B 247.2 251.7 256.2 ... 248.2 255.7
└── Group: /monthly
        Dimensions:  (time: 24, lat: 1, lon: 1)
        Coordinates:
          * time     (time) datetime64[ns] 192B 2013-01-31 2013-02-28 ... 2014-12-31
        Data variables:
            air      (time, lat, lon) float64 192B 254.0 252.8 256.9 ... 261.8 251.7

In [109]: dt.std(dim="time")
Out[109]: 
<xarray.DataTree>
Group: /
│   Dimensions:  (lat: 25, lon: 53)
│   Coordinates:
│     * lat      (lat) float32 100B 75.0 72.5 70.0 67.5 65.0 ... 22.5 20.0 17.5 15.0
│     * lon      (lon) float32 212B 200.0 202.5 205.0 207.5 ... 325.0 327.5 330.0
├── Group: /daily
│       Dimensions:  (lat: 25, lon: 53)
│       Data variables:
│           air      (lat, lon) float64 11kB 11.63 11.57 11.57 ... 1.715 1.82 1.899
├── Group: /weekly
│       Dimensions:  (lat: 25, lon: 53)
│       Data variables:
│           air      (lat, lon) float64 11kB 11.29 11.22 11.21 ... 1.651 1.744 1.818
└── Group: /monthly
        Dimensions:  (lat: 25, lon: 53)
        Data variables:
            air      (lat, lon) float64 11kB 10.8 10.75 10.76 ... 1.608 1.693 1.763