
Applying unvectorized functions with apply_ufunc#

This example illustrates how to apply an unvectorized function func to xarray objects using apply_ufunc. func expects 1D numpy arrays and returns a 1D numpy array. Our goal is to apply this function along one dimension of xarray objects, which may or may not wrap dask arrays, through a wrapper with a simple signature.

We will illustrate this using np.interp:

Signature: np.interp(x, xp, fp, left=None, right=None, period=None)
Docstring:
    One-dimensional linear interpolation.

Returns the one-dimensional piecewise linear interpolant to a function
with given discrete data points (`xp`, `fp`), evaluated at `x`.

and write an xr_interp function with signature

xr_interp(xarray_object, dimension_name, new_coordinate_to_interpolate_to)
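As a quick reminder of how np.interp behaves on plain arrays, here is a minimal sketch with made-up sample points (the values are illustrative only and not taken from the dataset used below). Note that the points to evaluate at come first, followed by the known coordinates and the known values.

```python
import numpy as np

# Known data points: fp is sampled at the coordinates xp (xp must be increasing).
xp = np.array([0.0, 1.0, 2.0])
fp = np.array([0.0, 10.0, 20.0])

# Points at which we want interpolated values.
x = np.array([0.5, 1.5, 3.0])

# Argument order is (x, xp, fp): evaluation points first.
result = np.interp(x, xp, fp)
print(result)  # [ 5. 15. 20.]
```

The last evaluation point lies outside the range of xp, so with the default left and right arguments np.interp returns the nearest end value instead of extrapolating.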

Load data#

First, let's load an example dataset:

[1]:
import xarray as xr
import numpy as np

xr.set_options(display_style="html")  # fancy HTML repr

air = (
    xr.tutorial.load_dataset("air_temperature")
    .air.sortby("lat")  # np.interp needs coordinate in ascending order
    .isel(time=slice(4), lon=slice(3))
)  # choose a small subset for convenience
air
[1]:
<xarray.DataArray 'air' (time: 4, lat: 25, lon: 3)> Size: 2kB
array([[[296.29, 296.79, 297.1 ],
        [295.9 , 296.2 , 296.79],
        [296.6 , 296.2 , 296.4 ],
        [297.  , 296.7 , 296.1 ],
        [295.4 , 295.7 , 295.79],
        [293.79, 294.1 , 294.6 ],
        [293.1 , 293.29, 293.29],
        [290.2 , 290.79, 291.4 ],
        [287.9 , 288.  , 288.29],
        [286.5 , 286.5 , 285.7 ],
        [284.6 , 284.9 , 284.2 ],
        [282.79, 283.2 , 282.6 ],
        [280.  , 280.7 , 280.2 ],
        [278.4 , 279.  , 279.  ],
        [277.29, 277.4 , 277.79],
        [276.7 , 277.4 , 277.7 ],
        [275.9 , 276.9 , 276.9 ],
        [274.79, 275.2 , 275.6 ],
        [273.7 , 273.6 , 273.79],
        [272.1 , 270.9 , 270.  ],
...
        [293.  , 293.5 , 294.29],
        [291.9 , 291.9 , 292.2 ],
        [289.2 , 289.4 , 289.9 ],
        [286.6 , 287.1 , 287.9 ],
        [284.79, 284.79, 285.4 ],
        [282.79, 282.  , 282.7 ],
        [281.2 , 280.2 , 280.6 ],
        [279.5 , 278.7 , 278.6 ],
        [278.  , 277.7 , 277.6 ],
        [276.4 , 275.9 , 276.4 ],
        [275.6 , 275.7 , 276.1 ],
        [274.5 , 275.6 , 276.29],
        [273.4 , 274.5 , 275.5 ],
        [274.1 , 274.  , 273.5 ],
        [273.29, 272.6 , 271.5 ],
        [272.79, 272.4 , 271.9 ],
        [267.7 , 266.29, 264.4 ],
        [256.6 , 254.7 , 252.1 ],
        [246.3 , 245.3 , 244.2 ],
        [241.89, 241.8 , 241.8 ]]])
Coordinates:
  * lat      (lat) float32 100B 15.0 17.5 20.0 22.5 25.0 ... 67.5 70.0 72.5 75.0
  * lon      (lon) float32 12B 200.0 202.5 205.0
  * time     (time) datetime64[ns] 32B 2013-01-01 ... 2013-01-01T18:00:00
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

The function we will apply is np.interp, which expects 1D numpy arrays. This functionality is already implemented in xarray, so we use that capability to check that we are not making mistakes.

[2]:
newlat = np.linspace(15, 75, 100)
air.interp(lat=newlat)
[2]:
<xarray.DataArray 'air' (time: 4, lat: 100, lon: 3)> Size: 10kB
array([[[296.29      , 296.79      , 297.1       ],
        [296.19545455, 296.6469697 , 297.02484848],
        [296.10090909, 296.50393939, 296.94969697],
        ...,
        [242.46060606, 243.46969697, 244.08181818],
        [241.83030303, 242.98484848, 243.79090909],
        [241.2       , 242.5       , 243.5       ]],

       [[296.29      , 297.2       , 297.4       ],
        [296.26818182, 297.07878788, 297.25212121],
        [296.24636364, 296.95757576, 297.10424242],
        ...,
        [242.82727273, 243.37878788, 243.63333333],
        [242.46363636, 243.03939394, 243.36666667],
        [242.1       , 242.7       , 243.1       ]],

       [[296.4       , 296.29      , 296.4       ],
        [296.35151515, 296.34090909, 296.37333333],
        [296.3030303 , 296.39181818, 296.34666667],
        ...,
        [243.41515152, 243.26181818, 243.12424242],
        [242.85757576, 242.73090909, 242.71212121],
        [242.3       , 242.2       , 242.3       ]],

       [[297.5       , 297.7       , 297.5       ],
        [297.37878788, 297.65151515, 297.4030303 ],
        [297.25757576, 297.6030303 , 297.30606061],
        ...,
        [244.02818182, 243.4969697 , 242.96363636],
        [242.95909091, 242.64848485, 242.38181818],
        [241.89      , 241.8       , 241.8       ]]])
Coordinates:
  * lon      (lon) float32 12B 200.0 202.5 205.0
  * time     (time) datetime64[ns] 32B 2013-01-01 ... 2013-01-01T18:00:00
  * lat      (lat) float64 800B 15.0 15.61 16.21 16.82 ... 73.79 74.39 75.0
Attributes:
    long_name:     4xDaily Air temperature at sigma level 995
    units:         degK
    precision:     2
    GRIB_id:       11
    GRIB_name:     TMP
    var_desc:      Air temperature
    dataset:       NMC Reanalysis
    level_desc:    Surface
    statistic:     Individual Obs
    parent_stat:   Other
    actual_range:  [185.16 322.1 ]

Let’s define a function that works with one vector of data along lat at a time.

[3]:
def interp1d_np(data, x, xi):
    return np.interp(xi, x, data)


interped = interp1d_np(air.isel(time=0, lon=0), air.lat, newlat)
expected = air.interp(lat=newlat)

# no errors are raised if values are equal to within floating point precision
np.testing.assert_allclose(expected.isel(time=0, lon=0).values, interped)

No errors are raised so our interpolation is working.#
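For readers unfamiliar with np.testing.assert_allclose, the following small sketch (with made-up numbers) shows both outcomes: it returns silently when the arrays agree within tolerance and raises AssertionError when they do not.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])

# Differences far below the default relative tolerance pass silently.
np.testing.assert_allclose(a, a + 1e-12)

# A clearly different array raises AssertionError.
try:
    np.testing.assert_allclose(a, a + 0.1)
    raised = False
except AssertionError:
    raised = True

print(raised)  # True
```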

This function consumes and returns numpy arrays, which means we need to do a lot of work to convert the result back to an xarray object with meaningful metadata. This is where apply_ufunc is very useful.
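To make the "lot of work" concrete, here is a rough sketch of rewrapping a NumPy result by hand. It uses a small synthetic DataArray (the names da, new_lat and wrapped, and all values, are invented for illustration) rather than the tutorial dataset.

```python
import numpy as np
import xarray as xr

# Synthetic 1D stand-in for one latitude profile (values are made up).
lat = np.linspace(15, 75, 25)
da = xr.DataArray(
    280.0 - 0.5 * lat,
    dims="lat",
    coords={"lat": lat, "lon": 200.0},
    attrs={"units": "degK"},
    name="air",
)
new_lat = np.linspace(15, 75, 100)

# np.interp returns a bare NumPy array: dims, coords, attrs and name are gone.
raw = np.interp(new_lat, da["lat"].values, da.values)

# Everything has to be reattached manually.
wrapped = xr.DataArray(
    raw,
    dims="lat",
    coords={"lat": new_lat, "lon": float(da["lon"])},
    attrs=da.attrs,
    name=da.name,
)
```

Even in this 1D case every piece of metadata is copied across explicitly; with more dimensions the bookkeeping grows accordingly.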

apply_ufunc#

Apply a vectorized function for unlabeled arrays on xarray objects.

The function will be mapped over the data variable(s) of the input arguments using
xarray’s standard rules for labeled computation, including alignment, broadcasting,
looping over GroupBy/Dataset variables, and merging of coordinates.
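As a baseline, here is a minimal sketch of the simplest case: a function that is already vectorized and does not change the array shape. The DataArray is synthetic and its values are made up.

```python
import numpy as np
import xarray as xr

# Synthetic stand-in data (not the tutorial dataset).
da = xr.DataArray(
    np.arange(6.0).reshape(2, 3),
    dims=("time", "lat"),
    coords={"lat": [15.0, 17.5, 20.0]},
)

# np.square works elementwise and keeps the shape,
# so no core dimensions need to be declared.
squared = xr.apply_ufunc(np.square, da)

print(squared.dims)  # ('time', 'lat')
```

The result is again a DataArray with the same dimension names and the lat coordinate carried over. The extra arguments discussed in the rest of this example are only needed because our interpolation function is neither elementwise nor shape-preserving.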

apply_ufunc has many capabilities but for simplicity this example will focus on the common task of vectorizing 1D functions over nD xarray objects. We will iteratively build up the right set of arguments to apply_ufunc and read through many error messages in doing so.

[4]:
xr.apply_ufunc(
    interp1d_np,  # first the function
    air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1d_np'
    air.lat,
    newlat,
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[4], line 1
----> 1 xr.apply_ufunc(
      2     interp1d_np,  # first the function
      3     air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1_np'
      4     air.lat,
      5     newlat,
      6 )

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:1265, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, on_missing_core_dim, *args)
   1263 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1264 elif any(isinstance(a, DataArray) for a in args):
-> 1265     return apply_dataarray_vfunc(
   1266         variables_vfunc,
   1267         *args,
   1268         signature=signature,
   1269         join=join,
   1270         exclude_dims=exclude_dims,
   1271         keep_attrs=keep_attrs,
   1272     )
   1273 # feed Variables directly through apply_variable_ufunc
   1274 elif any(isinstance(a, Variable) for a in args):

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:307, in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    302 result_coords, result_indexes = build_output_coords_and_indexes(
    303     args, signature, exclude_dims, combine_attrs=keep_attrs
    304 )
    306 data_vars = [getattr(a, "variable", a) for a in args]
--> 307 result_var = func(*data_vars)
    309 out: tuple[DataArray, ...] | DataArray
    310 if signature.num_outputs > 1:

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:853, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    851 for dim, new_size in var.sizes.items():
    852     if dim in dim_sizes and new_size != dim_sizes[dim]:
--> 853         raise ValueError(
    854             f"size of dimension '{dim}' on inputs was unexpectedly "
    855             f"changed by applied function from {dim_sizes[dim]} to {new_size}. Only "
    856             "dimensions specified in ``exclude_dims`` with "
    857             "xarray.apply_ufunc are allowed to change size. "
    858             "The data returned was:\n\n"
    859             f"{short_array_repr(data)}"
    860         )
    862 var.attrs = attrs
    863 output.append(var)

ValueError: size of dimension 'lat' on inputs was unexpectedly changed by applied function from 25 to 100. Only dimensions specified in ``exclude_dims`` with xarray.apply_ufunc are allowed to change size. The data returned was:

array([296.29    , 296.195455, 296.100909, 296.006364, 295.911818, 296.048485,
       296.218182, 296.387879, 296.557576, 296.672727, 296.769697, 296.866667,
       296.963636, 296.757576, 296.369697, 295.981818, 295.593939, 295.204848,
       294.814545, 294.424242, 294.033939, 293.727273, 293.56    , 293.392727,
       293.225455, 292.924242, 292.221212, 291.518182, 290.815152, 290.130303,
       289.572727, 289.015152, 288.457576, 287.9     , 287.560606, 287.221212,
       286.881818, 286.542424, 286.09697 , 285.636364, 285.175758, 284.715152,
       284.270909, 283.832121, 283.393333, 282.954545, 282.367273, 281.690909,
       281.014545, 280.338182, 279.806061, 279.418182, 279.030303, 278.642424,
       278.299091, 278.03    , 277.760909, 277.491818, 277.254242, 277.111212,
       276.968182, 276.825152, 276.675758, 276.481818, 276.287879, 276.093939,
       275.9     , 275.630909, 275.361818, 275.092727, 274.823636, 274.558788,
       274.294545, 274.030303, 273.766061, 273.409091, 273.021212, 272.633333,
       272.245455, 272.463636, 273.045455, 273.627273, 274.209091, 273.530303,
       271.590909, 269.651515, 267.712121, 265.      , 261.      , 257.      ,
       253.      , 249.624242, 248.121212, 246.618182, 245.115152, 243.721212,
       243.090909, 242.460606, 241.830303, 241.2     ])

apply_ufunc needs to know a lot of information about what our function does so that it can reconstruct the outputs. In this case, the size of dimension lat has changed and we need to explicitly specify that this will happen. xarray helpfully tells us that we need to specify the kwarg exclude_dims.

exclude_dims#

exclude_dims : set, optional
        Core dimensions on the inputs to exclude from alignment and
        broadcasting entirely. Any input coordinates along these dimensions
        will be dropped. Each excluded dimension must also appear in
        ``input_core_dims`` for at least one argument. Only dimensions listed
        here are allowed to change size between input and output objects.
[5]:
xr.apply_ufunc(
    interp1d_np,  # first the function
    air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1d_np'
    air.lat,
    newlat,
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 xr.apply_ufunc(
      2     interp1d_np,  # first the function
      3     air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1_np'
      4     air.lat,
      5     newlat,
      6     exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
      7 )

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:1180, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, on_missing_core_dim, *args)
   1176         raise TypeError(
   1177             f"Expected exclude_dims to be a 'set'. Received '{type(exclude_dims).__name__}' instead."
   1178         )
   1179     if not exclude_dims <= signature.all_core_dims:
-> 1180         raise ValueError(
   1181             f"each dimension in `exclude_dims` must also be a "
   1182             f"core dimension in the function signature. "
   1183             f"Please make {(exclude_dims - signature.all_core_dims)} a core dimension"
   1184         )
   1186 # handle dask_gufunc_kwargs
   1187 if dask == "parallelized":

ValueError: each dimension in `exclude_dims` must also be a core dimension in the function signature. Please make {'lat'} a core dimension

Core dimensions#

Core dimensions are central to using apply_ufunc. In our case, our function expects to receive a 1D vector along lat — this is the dimension that is “core” to the function’s functionality. Multiple core dimensions are possible. apply_ufunc needs to know which dimensions of each variable are core dimensions.

input_core_dims : Sequence[Sequence], optional
    List of the same length as ``args`` giving the list of core dimensions
    on each input argument that should not be broadcast. By default, we
    assume there are no core dimensions on any input arguments.

    For example, ``input_core_dims=[[], ['time']]`` indicates that all
    dimensions on the first argument and all dimensions other than 'time'
    on the second argument should be broadcast.

    Core dimensions are automatically moved to the last axes of input
    variables before applying ``func``, which facilitates using NumPy style
    generalized ufuncs [2]_.

output_core_dims : List[tuple], optional
    List of the same length as the number of output arguments from
    ``func``, giving the list of core dimensions on each output that were
    not broadcast on the inputs. By default, we assume that ``func``
    outputs exactly one array, with axes corresponding to each broadcast
    dimension.

    Core dimensions are assumed to appear as the last dimensions of each
    output in the provided order.
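The statement that core dimensions are moved to the last axes can be pictured with plain NumPy. This is only an analogy under the assumption of a (time, lat, lon) layout like our data; it is not xarray's actual internal code.

```python
import numpy as np

# Dummy array laid out as (time, lat, lon), matching the shape of our subset.
data = np.zeros((4, 25, 3))

# Treating "lat" (axis 1) as the core dimension means moving it to the end.
moved = np.moveaxis(data, 1, -1)
print(moved.shape)  # (4, 3, 25)

# Each vector along the last axis is one 1D profile along "lat".
profile = moved[0, 0]
print(profile.shape)  # (25,)
```

A function that operates on the last axis can therefore be written without knowing where the core dimension sat in the original object.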

Next we specify "lat" as input_core_dims on both air and air.lat; newlat gets an empty list, meaning no core dimensions.

[6]:
xr.apply_ufunc(
    interp1d_np,  # first the function
    air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1d_np'
    air.lat,
    newlat,
    input_core_dims=[["lat"], ["lat"], []],
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[6], line 1
----> 1 xr.apply_ufunc(
      2     interp1d_np,  # first the function
      3     air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1_np'
      4     air.lat,
      5     newlat,
      6     input_core_dims=[["lat"], ["lat"], []],
      7     exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
      8 )

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:1265, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, on_missing_core_dim, *args)
   1263 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1264 elif any(isinstance(a, DataArray) for a in args):
-> 1265     return apply_dataarray_vfunc(
   1266         variables_vfunc,
   1267         *args,
   1268         signature=signature,
   1269         join=join,
   1270         exclude_dims=exclude_dims,
   1271         keep_attrs=keep_attrs,
   1272     )
   1273 # feed Variables directly through apply_variable_ufunc
   1274 elif any(isinstance(a, Variable) for a in args):

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:307, in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    302 result_coords, result_indexes = build_output_coords_and_indexes(
    303     args, signature, exclude_dims, combine_attrs=keep_attrs
    304 )
    306 data_vars = [getattr(a, "variable", a) for a in args]
--> 307 result_var = func(*data_vars)
    309 out: tuple[DataArray, ...] | DataArray
    310 if signature.num_outputs > 1:

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:843, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    841 data = as_compatible_data(data)
    842 if data.ndim != len(dims):
--> 843     raise ValueError(
    844         "applied function returned data with an unexpected "
    845         f"number of dimensions. Received {data.ndim} dimension(s) but "
    846         f"expected {len(dims)} dimensions with names {dims!r}, from:\n\n"
    847         f"{short_array_repr(data)}"
    848     )
    850 var = Variable(dims, data, fastpath=True)
    851 for dim, new_size in var.sizes.items():

ValueError: applied function returned data with an unexpected number of dimensions. Received 1 dimension(s) but expected 0 dimensions with names (), from:

array([296.29    , 296.195455, 296.100909, 296.006364, 295.911818, 296.048485,
       296.218182, 296.387879, 296.557576, 296.672727, 296.769697, 296.866667,
       296.963636, 296.757576, 296.369697, 295.981818, 295.593939, 295.204848,
       294.814545, 294.424242, 294.033939, 293.727273, 293.56    , 293.392727,
       293.225455, 292.924242, 292.221212, 291.518182, 290.815152, 290.130303,
       289.572727, 289.015152, 288.457576, 287.9     , 287.560606, 287.221212,
       286.881818, 286.542424, 286.09697 , 285.636364, 285.175758, 284.715152,
       284.270909, 283.832121, 283.393333, 282.954545, 282.367273, 281.690909,
       281.014545, 280.338182, 279.806061, 279.418182, 279.030303, 278.642424,
       278.299091, 278.03    , 277.760909, 277.491818, 277.254242, 277.111212,
       276.968182, 276.825152, 276.675758, 276.481818, 276.287879, 276.093939,
       275.9     , 275.630909, 275.361818, 275.092727, 274.823636, 274.558788,
       274.294545, 274.030303, 273.766061, 273.409091, 273.021212, 272.633333,
       272.245455, 272.463636, 273.045455, 273.627273, 274.209091, 273.530303,
       271.590909, 269.651515, 267.712121, 265.      , 261.      , 257.      ,
       253.      , 249.624242, 248.121212, 246.618182, 245.115152, 243.721212,
       243.090909, 242.460606, 241.830303, 241.2     ])

xarray is telling us that it expected to receive back a numpy array with 0 dimensions but instead received an array with 1 dimension corresponding to newlat. We can fix this by specifying output_core_dims:

[7]:
xr.apply_ufunc(
    interp1d_np,  # first the function
    air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1d_np'
    air.lat,
    newlat,
    input_core_dims=[["lat"], ["lat"], []],  # list with one entry per arg
    output_core_dims=[["lat"]],
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
)
[7]:
<xarray.DataArray (lat: 100)> Size: 800B
array([296.29      , 296.19545455, 296.10090909, 296.00636364,
       295.91181818, 296.04848485, 296.21818182, 296.38787879,
       296.55757576, 296.67272727, 296.76969697, 296.86666667,
       296.96363636, 296.75757576, 296.36969697, 295.98181818,
       295.59393939, 295.20484848, 294.81454545, 294.42424242,
       294.03393939, 293.72727273, 293.56      , 293.39272727,
       293.22545455, 292.92424242, 292.22121212, 291.51818182,
       290.81515152, 290.13030303, 289.57272727, 289.01515152,
       288.45757576, 287.9       , 287.56060606, 287.22121212,
       286.88181818, 286.54242424, 286.0969697 , 285.63636364,
       285.17575758, 284.71515152, 284.27090909, 283.83212121,
       283.39333333, 282.95454545, 282.36727273, 281.69090909,
       281.01454545, 280.33818182, 279.80606061, 279.41818182,
       279.03030303, 278.64242424, 278.29909091, 278.03      ,
       277.76090909, 277.49181818, 277.25424242, 277.11121212,
       276.96818182, 276.82515152, 276.67575758, 276.48181818,
       276.28787879, 276.09393939, 275.9       , 275.63090909,
       275.36181818, 275.09272727, 274.82363636, 274.55878788,
       274.29454545, 274.03030303, 273.76606061, 273.40909091,
       273.02121212, 272.63333333, 272.24545455, 272.46363636,
       273.04545455, 273.62727273, 274.20909091, 273.53030303,
       271.59090909, 269.65151515, 267.71212121, 265.        ,
       261.        , 257.        , 253.        , 249.62424242,
       248.12121212, 246.61818182, 245.11515152, 243.72121212,
       243.09090909, 242.46060606, 241.83030303, 241.2       ])
Coordinates:
    lon      float32 4B 200.0
    time     datetime64[ns] 8B 2013-01-01
Dimensions without coordinates: lat

Finally we get some output! Let’s check that it is right:

[8]:
interped = xr.apply_ufunc(
    interp1d_np,  # first the function
    air.isel(time=0, lon=0),  # now arguments in the order expected by 'interp1d_np'
    air.lat,
    newlat,
    input_core_dims=[["lat"], ["lat"], []],  # list with one entry per arg
    output_core_dims=[["lat"]],
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
)
interped["lat"] = newlat  # need to add this manually
xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)

No errors are raised so it is right!

Vectorization with np.vectorize#

Our function currently works on only one vector of data, which is not very useful given our 3D dataset. Let’s try passing the whole dataset. We add a print statement so we can see what our function receives.

[9]:
def interp1d_np(data, x, xi):
    print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
    return np.interp(xi, x, data)


interped = xr.apply_ufunc(
    interp1d_np,  # first the function
    air.isel(
        lon=slice(3), time=slice(4)
    ),  # now arguments in the order expected by 'interp1d_np'
    air.lat,
    newlat,
    input_core_dims=[["lat"], ["lat"], []],  # list with one entry per arg
    output_core_dims=[["lat"]],
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
)
interped["lat"] = newlat  # need to add this manually
xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)
data: (4, 3, 25) | x: (25,) | xi: (100,)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[9], line 6
      2     print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
      3     return np.interp(xi, x, data)
----> 6 interped = xr.apply_ufunc(
      7     interp1d_np,  # first the function
      8     air.isel(
      9         lon=slice(3), time=slice(4)
     10     ),  # now arguments in the order expected by 'interp1_np'
     11     air.lat,
     12     newlat,
     13     input_core_dims=[["lat"], ["lat"], []],  # list with one entry per arg
     14     output_core_dims=[["lat"]],
     15     exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
     16 )
     17 interped["lat"] = newlat  # need to add this manually
     18 xr.testing.assert_allclose(expected.isel(time=0, lon=0), interped)

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:1265, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, on_missing_core_dim, *args)
   1263 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1264 elif any(isinstance(a, DataArray) for a in args):
-> 1265     return apply_dataarray_vfunc(
   1266         variables_vfunc,
   1267         *args,
   1268         signature=signature,
   1269         join=join,
   1270         exclude_dims=exclude_dims,
   1271         keep_attrs=keep_attrs,
   1272     )
   1273 # feed Variables directly through apply_variable_ufunc
   1274 elif any(isinstance(a, Variable) for a in args):

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:307, in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    302 result_coords, result_indexes = build_output_coords_and_indexes(
    303     args, signature, exclude_dims, combine_attrs=keep_attrs
    304 )
    306 data_vars = [getattr(a, "variable", a) for a in args]
--> 307 result_var = func(*data_vars)
    309 out: tuple[DataArray, ...] | DataArray
    310 if signature.num_outputs > 1:

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:818, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    813     if vectorize:
    814         func = _vectorize(
    815             func, signature, output_dtypes=output_dtypes, exclude_dims=exclude_dims
    816         )
--> 818 result_data = func(*input_data)
    820 if signature.num_outputs == 1:
    821     result_data = (result_data,)

Cell In[9], line 3, in interp1d_np(data, x, xi)
      1 def interp1d_np(data, x, xi):
      2     print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
----> 3     return np.interp(xi, x, data)

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:1599, in interp(x, xp, fp, left, right, period)
   1596     xp = np.concatenate((xp[-1:]-period, xp, xp[0:1]+period))
   1597     fp = np.concatenate((fp[-1:], fp, fp[0:1]))
-> 1599 return interp_func(x, xp, fp, left, right)

ValueError: object too deep for desired array

That’s a hard-to-interpret error but our print call helpfully printed the shapes of the input data:

data: (4, 3, 25) | x: (25,) | xi: (100,)

We see that air has been passed as a 3D numpy array, which is not what np.interp expects. Instead we want to loop over all combinations of lon and time, and apply our function to each corresponding vector of data along lat. apply_ufunc makes this easy by specifying vectorize=True:

vectorize : bool, optional
    If True, then assume ``func`` only takes arrays defined over core
    dimensions as input and vectorize it automatically with
    :py:func:`numpy.vectorize`. This option exists for convenience, but is
    almost always slower than supplying a pre-vectorized function.
    Using this option requires NumPy version 1.12 or newer.

Also see the documentation for np.vectorize: https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html. Most importantly:

The vectorize function is provided primarily for convenience, not for performance.
The implementation is essentially a for loop.
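To see what "essentially a for loop" means, the sketch below compares an explicit loop over the leading axes with np.vectorize in plain NumPy, outside of xarray. The random data and the signature string "(n),(n),(m)->(m)" are choices made for this illustration; they are not necessarily what apply_ufunc constructs internally.

```python
import numpy as np


def interp1d_np(data, x, xi):
    return np.interp(xi, x, data)


rng = np.random.default_rng(0)
x = np.linspace(15, 75, 25)        # original coordinate
xi = np.linspace(15, 75, 100)      # new coordinate
data = rng.normal(size=(4, 3, 25)) # core dimension is the last axis

# Explicit loop over every combination of the two leading axes.
looped = np.empty(data.shape[:-1] + (xi.size,))
for idx in np.ndindex(*data.shape[:-1]):
    looped[idx] = interp1d_np(data[idx], x, xi)

# np.vectorize with a signature performs the same loop for us.
vec_interp = np.vectorize(interp1d_np, signature="(n),(n),(m)->(m)")
vectorized = vec_interp(data, x, xi)

np.testing.assert_allclose(looped, vectorized)
print(vectorized.shape)  # (4, 3, 100)
```

Both versions call interp1d_np once per 1D vector, which is why vectorizing this way is convenient but gives no speed-up over the loop.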
[10]:
def interp1d_np(data, x, xi):
    print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
    return np.interp(xi, x, data)


interped = xr.apply_ufunc(
    interp1d_np,  # first the function
    air,  # now arguments in the order expected by 'interp1d_np'
    air.lat,  # as above
    newlat,  # as above
    input_core_dims=[["lat"], ["lat"], []],  # list with one entry per arg
    output_core_dims=[["lat"]],  # returned data has one dimension
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
    vectorize=True,  # loop over non-core dims
)
interped["lat"] = newlat  # need to add this manually
xr.testing.assert_allclose(expected, interped)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[10], line 6
      2     print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
      3     return np.interp(xi, x, data)
----> 6 interped = xr.apply_ufunc(
      7     interp1d_np,  # first the function
      8     air,  # now arguments in the order expected by 'interp1_np'
      9     air.lat,  # as above
     10     newlat,  # as above
     11     input_core_dims=[["lat"], ["lat"], []],  # list with one entry per arg
     12     output_core_dims=[["lat"]],  # returned data has one dimension
     13     exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be set!
     14     vectorize=True,  # loop over non-core dims
     15 )
     16 interped["lat"] = newlat  # need to add this manually
     17 xr.testing.assert_allclose(expected, interped)

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:1265, in apply_ufunc(func, input_core_dims, output_core_dims, exclude_dims, vectorize, join, dataset_join, dataset_fill_value, keep_attrs, kwargs, dask, output_dtypes, output_sizes, meta, dask_gufunc_kwargs, on_missing_core_dim, *args)
   1263 # feed DataArray apply_variable_ufunc through apply_dataarray_vfunc
   1264 elif any(isinstance(a, DataArray) for a in args):
-> 1265     return apply_dataarray_vfunc(
   1266         variables_vfunc,
   1267         *args,
   1268         signature=signature,
   1269         join=join,
   1270         exclude_dims=exclude_dims,
   1271         keep_attrs=keep_attrs,
   1272     )
   1273 # feed Variables directly through apply_variable_ufunc
   1274 elif any(isinstance(a, Variable) for a in args):

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:307, in apply_dataarray_vfunc(func, signature, join, exclude_dims, keep_attrs, *args)
    302 result_coords, result_indexes = build_output_coords_and_indexes(
    303     args, signature, exclude_dims, combine_attrs=keep_attrs
    304 )
    306 data_vars = [getattr(a, "variable", a) for a in args]
--> 307 result_var = func(*data_vars)
    309 out: tuple[DataArray, ...] | DataArray
    310 if signature.num_outputs > 1:

File ~/checkouts/readthedocs.org/user_builds/xray/checkouts/stable/xarray/core/computation.py:818, in apply_variable_ufunc(func, signature, exclude_dims, dask, output_dtypes, vectorize, keep_attrs, dask_gufunc_kwargs, *args)
    813     if vectorize:
    814         func = _vectorize(
    815             func, signature, output_dtypes=output_dtypes, exclude_dims=exclude_dims
    816         )
--> 818 result_data = func(*input_data)
    820 if signature.num_outputs == 1:
    821     result_data = (result_data,)

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:2397, in vectorize.__call__(self, *args, **kwargs)
   2394     self._init_stage_2(*args, **kwargs)
   2395     return self
-> 2397 return self._call_as_normal(*args, **kwargs)

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:2390, in vectorize._call_as_normal(self, *args, **kwargs)
   2387     vargs = [args[_i] for _i in inds]
   2388     vargs.extend([kwargs[_n] for _n in names])
-> 2390 return self._vectorize_call(func=func, args=vargs)

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:2471, in vectorize._vectorize_call(self, func, args)
   2469 """Vectorized call to `func` over positional `args`."""
   2470 if self.signature is not None:
-> 2471     res = self._vectorize_call_with_signature(func, args)
   2472 elif not args:
   2473     res = func()

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:2499, in vectorize._vectorize_call_with_signature(self, func, args)
   2494     raise TypeError('wrong number of positional arguments: '
   2495                     'expected %r, got %r'
   2496                     % (len(input_core_dims), len(args)))
   2497 args = tuple(asanyarray(arg) for arg in args)
-> 2499 broadcast_shape, dim_sizes = _parse_input_dimensions(
   2500     args, input_core_dims)
   2501 input_shapes = _calculate_shapes(broadcast_shape, dim_sizes,
   2502                                  input_core_dims)
   2503 args = [np.broadcast_to(arg, shape, subok=True)
   2504         for arg, shape in zip(args, input_shapes)]

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_function_base_impl.py:2107, in _parse_input_dimensions(args, input_core_dims)
   2105     dummy_array = np.lib.stride_tricks.as_strided(0, arg.shape[:ndim])
   2106     broadcast_args.append(dummy_array)
-> 2107 broadcast_shape = np.lib._stride_tricks_impl._broadcast_shape(
   2108     *broadcast_args
   2109 )
   2110 return broadcast_shape, dim_sizes

File ~/checkouts/readthedocs.org/user_builds/xray/conda/stable/lib/python3.12/site-packages/numpy/lib/_stride_tricks_impl.py:431, in _broadcast_shape(*args)
    426 """Returns the shape of the arrays that would result from broadcasting the
    427 supplied arrays against each other.
    428 """
    429 # use the old-iterator because np.nditer does not handle size 0 arrays
    430 # consistently
--> 431 b = np.broadcast(*args[:32])
    432 # unfortunately, it cannot handle 32 or more arguments directly
    433 for pos in range(32, len(args), 31):
    434     # ironically, np.broadcast does not properly handle np.broadcast
    435     # objects (it treats them as scalars)
    436     # use broadcasting to avoid allocating the full array

ValueError: shape mismatch: objects cannot be broadcast to a single shape.  Mismatch is between arg 0 with shape (4, 3) and arg 2 with shape (100,).

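Unfortunately, this is another cryptic error from numpy. It says that the loop dimensions of the first argument, with shape (4, 3), cannot be broadcast against the 100 values of the third argument, newlat.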

Notice that newlat is not an xarray object. Let’s add a dimension name new_lat and modify the call. Note this cannot be lat because xarray expects dimensions to be the same size (or broadcastable) among all inputs. output_core_dims needs to be modified appropriately. We’ll manually rename new_lat back to lat for easy checking.
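To see where the error comes from, here is a minimal sketch that reproduces it with np.vectorize alone. The random array only mimics the shape of air, and the two signature strings are our assumption about what ends up being passed to np.vectorize with and without a core dimension for newlat; all names below are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 3, 25))  # stand-in for air: (time, lon, lat)
lat = np.linspace(15, 75, 25)
newlat = np.linspace(15, 75, 100)


def interp1d_np(data, x, xi):
    return np.interp(xi, x, data)


# Assumed "bad" signature: xi has no core dimension, so its length-100
# axis is treated as a loop dimension and broadcast against (4, 3).
bad = np.vectorize(interp1d_np, signature="(n),(n),()->()")
try:
    bad(data, lat, newlat)
    error_message = None
except ValueError as err:
    error_message = str(err)
print(error_message)

# Giving xi its own core dimension (m) removes the clash.
good = np.vectorize(interp1d_np, signature="(n),(n),(m)->(m)")
result = good(data, lat, newlat)
print(result.shape)  # (4, 3, 100)
```

The second signature is the numpy counterpart of adding the new_lat core dimension in the call below.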

[11]:
def interp1d_np(data, x, xi):
    print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
    return np.interp(xi, x, data)


interped = xr.apply_ufunc(
    interp1d_np,  # first the function
    air,  # now arguments in the order expected by 'interp1d_np'
    air.lat,  # as above
    newlat,  # as above
    input_core_dims=[["lat"], ["lat"], ["new_lat"]],  # list with one entry per arg
    output_core_dims=[["new_lat"]],  # returned data has one dimension
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be a set!
    vectorize=True,  # loop over non-core dims
)
interped = interped.rename({"new_lat": "lat"})
interped["lat"] = newlat  # need to add this manually
xr.testing.assert_allclose(
    expected.transpose(*interped.dims), interped  # order of dims is different
)
interped
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
[11]:
<xarray.DataArray (time: 4, lon: 3, lat: 100)> Size: 10kB
array([[[296.29      , 296.19545455, 296.10090909, ..., 242.46060606,
         241.83030303, 241.2       ],
        [296.79      , 296.6469697 , 296.50393939, ..., 243.46969697,
         242.98484848, 242.5       ],
        [297.1       , 297.02484848, 296.94969697, ..., 244.08181818,
         243.79090909, 243.5       ]],

       [[296.29      , 296.26818182, 296.24636364, ..., 242.82727273,
         242.46363636, 242.1       ],
        [297.2       , 297.07878788, 296.95757576, ..., 243.37878788,
         243.03939394, 242.7       ],
        [297.4       , 297.25212121, 297.10424242, ..., 243.63333333,
         243.36666667, 243.1       ]],

       [[296.4       , 296.35151515, 296.3030303 , ..., 243.41515152,
         242.85757576, 242.3       ],
        [296.29      , 296.34090909, 296.39181818, ..., 243.26181818,
         242.73090909, 242.2       ],
        [296.4       , 296.37333333, 296.34666667, ..., 243.12424242,
         242.71212121, 242.3       ]],

       [[297.5       , 297.37878788, 297.25757576, ..., 244.02818182,
         242.95909091, 241.89      ],
        [297.7       , 297.65151515, 297.6030303 , ..., 243.4969697 ,
         242.64848485, 241.8       ],
        [297.5       , 297.4030303 , 297.30606061, ..., 242.96363636,
         242.38181818, 241.8       ]]])
Coordinates:
  * lon      (lon) float32 12B 200.0 202.5 205.0
  * time     (time) datetime64[ns] 32B 2013-01-01 ... 2013-01-01T18:00:00
  * lat      (lat) float64 800B 15.0 15.61 16.21 16.82 ... 73.79 74.39 75.0

Notice that the printed input shapes are all 1D and correspond to one vector along the lat dimension.

The result is now an xarray object with coordinate values copied over from data. This is why apply_ufunc is so convenient; it takes care of a lot of boilerplate necessary to apply functions that consume and produce numpy arrays to xarray objects.

One final point: lat is now the last dimension in interped. This is a “property” of core dimensions: they are moved to the end before being sent to interp1d_np, as noted in the docstring for input_core_dims:

Core dimensions are automatically moved to the last axes of input
variables before applying ``func``, which facilitates using NumPy style
generalized ufuncs [2]_.
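A small sketch of this reordering, using a toy array instead of air. The helper record_shape and the list seen are hypothetical names introduced only to capture the shape that the wrapped function receives.

```python
import numpy as np
import xarray as xr

# Toy array with the core dimension "lat" in the middle.
da = xr.DataArray(np.zeros((4, 25, 3)), dims=("time", "lat", "lon"))

seen = []  # shapes received by the wrapped function


def record_shape(arr):
    seen.append(arr.shape)
    return arr


out = xr.apply_ufunc(
    record_shape,
    da,
    input_core_dims=[["lat"]],
    output_core_dims=[["lat"]],
)
print(seen)  # [(4, 3, 25)]: "lat" was moved to the last axis
print(out.dims)  # ('time', 'lon', 'lat')
```

If the original dimension order matters downstream, a transpose call on the result restores it.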

Parallelization with dask#

So far our function can only handle numpy arrays. A real benefit of apply_ufunc is the ability to easily parallelize over dask chunks when needed.

We want to apply this function in a vectorized fashion over each chunk of the dask array. This is possible using dask’s blockwise, map_blocks, or apply_gufunc. Xarray’s apply_ufunc wraps dask’s apply_gufunc, so mapping the function over chunks is as simple as specifying dask="parallelized". With this level of flexibility, we need to provide dask with some extra information:

  1. output_dtypes: dtypes of all returned objects, and

  2. output_sizes: lengths of any new dimensions.

Here we only need to specify output_dtypes, since apply_ufunc can infer the size of the new dimension new_lat from the argument corresponding to the third element in input_core_dims. I chose the chunk sizes to illustrate that np.vectorize is still applied, so that our function receives 1D vectors even though the blocks are 3D.

[12]:
def interp1d_np(data, x, xi):
    print(f"data: {data.shape} | x: {x.shape} | xi: {xi.shape}")
    return np.interp(xi, x, data)


interped = xr.apply_ufunc(
    interp1d_np,  # first the function
    air.chunk(
        {"time": 2, "lon": 2}
    ),  # now arguments in the order expected by 'interp1d_np'
    air.lat,  # as above
    newlat,  # as above
    input_core_dims=[["lat"], ["lat"], ["new_lat"]],  # list with one entry per arg
    output_core_dims=[["new_lat"]],  # returned data has one dimension
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be a set!
    vectorize=True,  # loop over non-core dims
    dask="parallelized",
    output_dtypes=[air.dtype],  # one per output
).rename({"new_lat": "lat"})
interped["lat"] = newlat  # need to add this manually
xr.testing.assert_allclose(expected.transpose(*interped.dims), interped)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)
data: (25,) | x: (25,) | xi: (100,)

Yay! Our function is receiving 1D vectors, so we’ve successfully parallelized applying a 1D function over a block. If you have a distributed dashboard up, you should see computes happening as equality is checked.
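Why is it safe to split the work this way? The following NumPy-only sketch (no dask involved, so it only mimics the idea) applies the vectorized function block by block, with blocks of size 2 along the two non-core axes and the core axis kept whole, and then stitches the pieces back together. The random input and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 3, 25))  # stand-in for air: (time, lon, lat)
lat = np.linspace(15, 75, 25)
newlat = np.linspace(15, 75, 100)

interp_vec = np.vectorize(
    lambda d, x, xi: np.interp(xi, x, d), signature="(n),(n),(m)->(m)"
)

# Reference: apply to the whole array at once.
full = interp_vec(data, lat, newlat)

# Block by block: chunks of 2 along time and lon, lat is never split.
rows = []
for t0 in range(0, data.shape[0], 2):
    pieces = [
        interp_vec(data[t0 : t0 + 2, l0 : l0 + 2], lat, newlat)
        for l0 in range(0, data.shape[1], 2)
    ]
    rows.append(np.concatenate(pieces, axis=1))
blockwise = np.concatenate(rows, axis=0)

print(blockwise.shape)  # (4, 3, 100)
print(np.allclose(full, blockwise))  # True
```

Because every 1D vector along the core dimension lives entirely inside one block, each block can be processed independently, which is what makes the chunked computation parallelizable.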

High performance vectorization: gufuncs, numba & guvectorize#

np.vectorize is very convenient but unfortunately slow: it is only marginally faster than writing the for loop in Python yourself. A common way to get around this is to write a base interpolation function that can handle nD arrays in a compiled language like Fortran and then pass that to apply_ufunc.
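A quick way to convince yourself that np.vectorize is a Python-level loop is to count how often the wrapped function is called. This sketch uses a random array with the same shape as air; call_log is a hypothetical helper list used only for counting.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(size=(4, 3, 25))  # stand-in for air: (time, lon, lat)
lat = np.linspace(15, 75, 25)
newlat = np.linspace(15, 75, 100)

call_log = []  # one entry per Python-level call


def interp1d_np(data, x, xi):
    call_log.append(data.shape)
    return np.interp(xi, x, data)


vectorized = np.vectorize(interp1d_np, signature="(n),(n),(m)->(m)")
result = vectorized(data, lat, newlat)

print(len(call_log))  # 12, one call per (time, lon) pair
print(result.shape)  # (4, 3, 100)
```

The 12 calls match the 12 printed lines in the outputs above. For larger arrays, the per-call Python overhead is what dominates the runtime.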

Another option is to use the numba package which provides a very convenient guvectorize decorator: https://numba.pydata.org/numba-doc/latest/user/vectorize.html#the-guvectorize-decorator

Any decorated function gets compiled and will loop over any non-core dimension in parallel when necessary. We need to specify some extra information:

  1. Our function cannot return a variable any more. Instead it must receive a variable (the last argument) whose contents the function will modify. So we change from def interp1d_np(data, x, xi) to def interp1d_np_gufunc(data, x, xi, out). Our computed results must be assigned to out. All values of out must be assigned explicitly.

  2. guvectorize needs to know the dtypes of the input and output. This is specified in string form as the first argument. Each element of the tuple corresponds to each argument of the function. In this case, we specify float64 for all inputs and outputs: "(float64[:], float64[:], float64[:], float64[:])" corresponding to data, x, xi, out

  3. Now we need to tell numba the size of the dimensions the function takes as inputs and returns as output, i.e. the core dimensions. This is done in symbolic form: data and x are vectors of the same length, say n; xi and the output out have a different length, say m. So the second argument is (again as a string) "(n), (n), (m) -> (m)", corresponding again to data, x, xi, out.
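Before bringing in numba, the calling convention from the list above can be sketched in plain NumPy. The kernel writes into out instead of returning a value, and a hand-written loop over the non-core dimensions stands in for what guvectorize generates. This is only an illustration of the convention, not how numba implements it; interp1d_kernel and loop_over_non_core are hypothetical names.

```python
import numpy as np


def interp1d_kernel(data, x, xi, out):
    # Layout "(n), (n), (m) -> (m)": fill every element of out in place.
    out[:] = np.interp(xi, x, data)


def loop_over_non_core(kernel, data, x, xi):
    # Core dimension is the last axis of data; everything else is looped over.
    loop_shape = data.shape[:-1]
    out = np.empty(loop_shape + xi.shape, dtype=np.float64)
    for idx in np.ndindex(*loop_shape):
        kernel(data[idx], x, xi, out[idx])  # out[idx] is a writable 1D view
    return out


rng = np.random.default_rng(0)
data = rng.normal(size=(4, 3, 25))
lat = np.linspace(15, 75, 25)
newlat = np.linspace(15, 75, 100)

result = loop_over_non_core(interp1d_kernel, data, lat, newlat)
print(result.shape)  # (4, 3, 100)
```

With guvectorize, the loop and the allocation of out are handled for us; we only write the kernel and declare the dtypes and the layout.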

[13]:
from numba import float64, guvectorize


@guvectorize("(float64[:], float64[:], float64[:], float64[:])", "(n), (n), (m) -> (m)")
def interp1d_np_gufunc(data, x, xi, out):
    # numba doesn't really like this print statement.
    # It doesn't seem to support f-strings, so do it the old way
    print(
        "data: " + str(data.shape) + " | x: " + str(x.shape) + " | xi: " + str(xi.shape)
    )
    out[:] = np.interp(xi, x, data)
    # gufuncs don't return data;
    # instead you assign to the last arg
    # return np.interp(xi, x, data)

The warnings are about object-mode compilation relating to the print statement. This means we don’t get much speed up: https://numba.pydata.org/numba-doc/latest/user/performance-tips.html#no-python-mode-vs-object-mode. We’ll keep the print statement temporarily to make sure that guvectorize acts like we want it to.

[14]:
interped = xr.apply_ufunc(
    interp1d_np_gufunc,  # first the function
    air.chunk(
        {"time": 2, "lon": 2}
    ),  # now arguments in the order expected by 'interp1d_np_gufunc'
    air.lat,  # as above
    newlat,  # as above
    input_core_dims=[["lat"], ["lat"], ["new_lat"]],  # list with one entry per arg
    output_core_dims=[["new_lat"]],  # returned data has one dimension
    exclude_dims=set(("lat",)),  # dimensions allowed to change size. Must be a set!
    # vectorize=True,  # not needed since numba takes care of vectorizing
    dask="parallelized",
    output_dtypes=[air.dtype],  # one per output
).rename({"new_lat": "lat"})
interped["lat"] = newlat  # need to add this manually
xr.testing.assert_allclose(expected.transpose(*interped.dims), interped)

Yay! Our function is receiving 1D vectors and is working automatically with dask arrays. Finally, let’s remove the print line, compile in nopython mode, and wrap everything up in a nice reusable function.

[15]:
from numba import float64, guvectorize


@guvectorize(
    "(float64[:], float64[:], float64[:], float64[:])",
    "(n), (n), (m) -> (m)",
    nopython=True,
)
def interp1d_np_gufunc(data, x, xi, out):
    out[:] = np.interp(xi, x, data)


def xr_interp(data, dim, newdim):
    interped = xr.apply_ufunc(
        interp1d_np_gufunc,  # first the function
        data,  # now arguments in the order expected by 'interp1d_np_gufunc'
        data[dim],  # as above
        newdim,  # as above
        input_core_dims=[[dim], [dim], ["__newdim__"]],  # list with one entry per arg
        output_core_dims=[["__newdim__"]],  # returned data has one dimension
        exclude_dims=set((dim,)),  # dimensions allowed to change size. Must be a set!
        # vectorize=True,  # not needed since numba takes care of vectorizing
        dask="parallelized",
        output_dtypes=[
            data.dtype
        ],  # one per output; could also be float or np.dtype("float64")
    ).rename({"__newdim__": dim})
    interped[dim] = newdim  # need to add this manually

    return interped


xr.testing.assert_allclose(
    expected.transpose(*interped.dims),
    xr_interp(air.chunk({"time": 2, "lon": 2}), "lat", newlat),
)

This technique is generalizable to any 1D function.
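As a sketch of that generalization, here is the same pattern applied to a different 1D function: removing a linear trend along one dimension. The function detrend_1d, the toy data, and the dimension names are all assumptions made up for this illustration. Since the output has the same length as the input, no new dimension name or exclude_dims is needed. We use vectorize=True here for simplicity; a guvectorize version would follow the steps above.

```python
import numpy as np
import xarray as xr


def detrend_1d(y, x):
    # Fit a straight line to one 1D vector and subtract it.
    slope, intercept = np.polyfit(x, y, 1)
    return y - (slope * x + intercept)


t = np.arange(10.0)
da = xr.DataArray(
    np.outer([1.0, 2.0, 3.0], t) + 5.0,  # three perfectly linear series
    dims=("station", "t"),
    coords={"t": t},
)

detrended = xr.apply_ufunc(
    detrend_1d,  # first the function
    da,  # now arguments in the order expected by 'detrend_1d'
    da["t"],  # as above
    input_core_dims=[["t"], ["t"]],  # list with one entry per arg
    output_core_dims=[["t"]],  # returned data has one dimension
    vectorize=True,  # loop over non-core dims
)
print(detrended.dims)  # ('station', 't')
```

For this toy input, every series is exactly linear, so the detrended values are zero up to floating point error.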