Alternative chunked array types#
Warning
This is a highly experimental feature. Please report any bugs or other difficulties on xarray’s issue tracker. In particular see discussion on xarray issue #6807
Xarray can wrap chunked dask arrays (see Parallel Computing with Dask), but can also wrap any other chunked array type that exposes the correct interface.
This allows us to support using other frameworks for distributed and out-of-core processing, with user code still written as xarray commands.
In particular xarray also supports wrapping cubed.Array
objects
(see Cubed’s documentation and the cubed-xarray package).
The basic idea is that by wrapping an array that has an explicit notion of .chunks
, xarray can expose control over
the choice of chunking scheme to users via methods like DataArray.chunk()
whilst the wrapped array actually
implements the handling of processing all of the chunks.
Chunked array methods and “core operations”#
A chunked array needs to meet all the requirements for normal duck arrays, but must also implement additional features.
Chunked arrays have additional attributes and methods, such as .chunks
and .rechunk
.
Furthermore, Xarray dispatches chunk-aware computations across one or more chunked arrays using special functions known
as “core operations”. Examples include map_blocks
, blockwise
, and apply_gufunc
.
The core operations are generalizations of functions first implemented in dask.array
.
The implementation of these functions is specific to the type of arrays passed to them. For example, when applying the
map_blocks
core operation, dask.array.Array
objects must be processed by dask.array.map_blocks()
,
whereas cubed.Array
objects must be processed by cubed.map_blocks()
.
In order to use the correct implementation of a core operation for the array type encountered, xarray dispatches to the
corresponding subclass of ChunkManagerEntrypoint
,
also known as a “Chunk Manager”. Therefore a full list of the operations that need to be defined is set by the
API of the ChunkManagerEntrypoint
abstract base class. Note that chunked array
methods are also currently dispatched using this class.
Chunked array creation is also handled by this class. As chunked array objects have a one-to-one correspondence with
in-memory numpy arrays, it should be possible to create a chunked array from a numpy array by passing the desired
chunking pattern to an implementation of from_array`
.
Note
The ChunkManagerEntrypoint
abstract base class is mostly just acting as a
namespace for containing the chunked-aware function primitives. Ideally in the future we would have an API standard
for chunked array types which codified this structure, making the entrypoint system unnecessary.
- class xarray.namedarray.parallelcompat.ChunkManagerEntrypoint[source]#
Interface between a particular parallel computing framework and xarray.
This abstract base class must be subclassed by libraries implementing chunked array types, and registered via the
chunkmanagers
entrypoint.Abstract methods on this class must be implemented, whereas non-abstract methods are only required in order to enable a subset of xarray functionality, and by default will raise a
NotImplementedError
if called.- array_cls#
Type of the array class this parallel computing framework provides.
Parallel frameworks need to provide an array class that supports the array API standard. This attribute is used for array instance type checking at runtime.
- Type
type[~T_ChunkedArray]
- abstract apply_gufunc(func, signature, *args, axes=None, keepdims=False, output_dtypes=None, vectorize=None, **kwargs)[source]#
Apply a generalized ufunc or similar python function to arrays.
signature
determines if the function consumes or produces core dimensions. The remaining dimensions in given input arrays (*args
) are considered loop dimensions and are required to broadcast naturally against each other.In other terms, this function is like
np.vectorize
, but for the blocks of chunked arrays. If the function itself shall also be vectorized usevectorize=True
for convenience.Called inside
xarray.apply_ufunc
, which is called internally for most xarray operations. Therefore this method must be implemented for the vast majority of xarray computations to be supported.- Parameters
func (
callable()
) – Function to call likefunc(*args, **kwargs)
on input arrays (*args
) that returns an array or tuple of arrays. If multiple arguments with non-matching dimensions are supplied, this function is expected to vectorize (broadcast) over axes of positional arguments in the style of NumPy universal functions 1 (if this is not the case, setvectorize=True
). If this function returns multiple outputs,output_core_dims
has to be set as well.signature (
string
) – Specifies what core dimensions are consumed and produced byfunc
. According to the specification of numpy.gufunc signature 2*args (
numeric
) – Input arrays or scalars to the callable function.axes (
List
oftuples
, optional,keyword only
) – A list of tuples with indices of axes a generalized ufunc should operate on. For instance, for a signature of"(i,j),(j,k)->(i,k)"
appropriate for matrix multiplication, the base elements are two-dimensional matrices and these are taken to be stored in the two last axes of each argument. The corresponding axes keyword would be[(-2, -1), (-2, -1), (-2, -1)]
. For simplicity, for generalized ufuncs that operate on 1-dimensional arrays (vectors), a single integer is accepted instead of a single-element tuple, and for generalized ufuncs for which all outputs are scalars, the output tuples can be omitted.keepdims (
bool
, optional,keyword only
) – If this is set to True, axes which are reduced over will be left in the result as a dimension with size one, so that the result will broadcast correctly against the inputs. This option can only be used for generalized ufuncs that operate on inputs that all have the same number of core dimensions and with outputs that have no core dimensions , i.e., with signatures like"(i),(i)->()"
or"(m,m)->()"
. If used, the location of the dimensions in the output can be controlled with axes and axis.output_dtypes (
Optional
,dtype
orlist
ofdtypes
,keyword only
) – Valid numpy dtype specification or list thereof. If not given, a call offunc
with a small set of data is performed in order to try to automatically determine the output dtypes.vectorize (
bool
,keyword only
) – If set toTrue
,np.vectorize
is applied tofunc
for convenience. Defaults toFalse
.**kwargs (
dict
) – Extra keyword arguments to pass to func
- Returns
Single chunked array
ortuple
ofchunked arrays
References
- property array_api#
Return the array_api namespace following the python array API standard.
See https://data-apis.org/array-api/latest/ . Currently used to access the array API function
full_like
, which is called within the xarray constructorsxarray.full_like
,xarray.ones_like
,xarray.zeros_like
, etc.See also
dask.array
,cubed.array_api
- blockwise(func, out_ind, *args, adjust_chunks=None, new_axes=None, align_arrays=True, **kwargs)[source]#
Tensor operation: Generalized inner and outer products.
A broad class of blocked algorithms and patterns can be specified with a concise multi-index notation. The
blockwise
function applies an in-memory function across multiple blocks of multiple inputs in a variety of ways. Many chunked array operations are special cases of blockwise including elementwise, broadcasting, reductions, tensordot, and transpose.Currently only called explicitly in xarray when performing multidimensional interpolation.
- Parameters
func (
callable()
) – Function to apply to individual tuples of blocksout_ind (iterable) – Block pattern of the output, something like ‘ijk’ or (1, 2, 3)
*args (sequence of
Array
,index pairs
) – You may also pass literal arguments, accompanied by None index e.g. (x, ‘ij’, y, ‘jk’, z, ‘i’, some_literal, None)**kwargs (
dict
) – Extra keyword arguments to pass to functionadjust_chunks (
dict
) – Dictionary mapping index to function to be applied to chunk sizesnew_axes (
dict
,keyword only
) – New indexes and their dimension lengthsalign_arrays (
bool
) – Whether or not to align chunks along equally sized dimensions when multiple arrays are provided. This allows for larger chunks in some arrays to be broken into smaller ones that match chunk sizes in other arrays such that they are compatible for block function mapping. If this is false, then an error will be thrown if arrays do not already have the same number of blocks in each dimension.
See also
dask.array.blockwise
,cubed.core.blockwise
- abstract chunks(data)[source]#
Return the current chunks of the given array.
Returns chunks explicitly as a tuple of tuple of ints.
Used internally by xarray objects’ .chunks and .chunksizes properties.
- Parameters
data (
chunked array
)- Returns
chunks (
tuple[tuple[int
,]
,]
)
See also
dask.array.Array.chunks
,cubed.Array.chunks
- abstract compute(*data, **kwargs)[source]#
Computes one or more chunked arrays, returning them as eager numpy arrays.
Called anytime something needs to computed, including multiple arrays at once. Used by .compute, .persist, .values.
- Parameters
*data (
object
) – Any number of objects. If an object is an instance of the chunked array type, it is computed and the in-memory result returned as a numpy array. All other types should be passed through unchanged.- Returns
objs
– The input, but with all chunked arrays now computed.
See also
- abstract from_array(data, chunks, **kwargs)[source]#
Create a chunked array from a non-chunked numpy-like array.
Generally input should have a
.shape
,.ndim
,.dtype
and support numpy-style slicing.Called when the .chunk method is called on an xarray object that is not already chunked. Also called within open_dataset (when chunks is not None) to create a chunked array from an xarray lazily indexed array.
- Parameters
data (array_like)
See also
- is_chunked_array(data)[source]#
Check if the given object is an instance of this type of chunked array.
Compares against the type stored in the array_cls attribute by default.
- Parameters
data (
Any
)- Returns
is_chunked (
bool
)
See also
- map_blocks(func, *args, dtype=None, chunks=None, drop_axis=None, new_axis=None, **kwargs)[source]#
Map a function across all blocks of a chunked array.
Called in elementwise operations, but notably not (currently) called within xarray.map_blocks.
- Parameters
func (
callable()
) – Function to apply to every block in the array. Iffunc
acceptsblock_info=
orblock_id=
as keyword arguments, these will be passed dictionaries containing information about input and output chunks/arrays during computation. See examples for details.args (
dask arrays
orother objects
)dtype (
np.dtype
, optional) – Thedtype
of the output array. It is recommended to provide this. If not provided, will be inferred by applying the function to a small set of fake data.chunks (
tuple
, optional) – Chunk shape of resulting blocks if the function does not preserve shape. If not provided, the resulting array is assumed to have the same block structure as the first input array.drop_axis (
number
or iterable, optional) – Dimensions lost by the function.new_axis (
number
or iterable, optional) – New dimensions created by the function. Note that these are applied afterdrop_axis
(if present).**kwargs – Other keyword arguments to pass to function. Values must be constants (not dask.arrays)
See also
- abstract normalize_chunks(chunks, shape=None, limit=None, dtype=None, previous_chunks=None)[source]#
Normalize given chunking pattern into an explicit tuple of tuples representation.
Exposed primarily because different chunking backends may want to make different decisions about how to automatically chunk along dimensions not given explicitly in the input chunks.
Called internally by xarray.open_dataset.
- Parameters
chunks (
tuple
,int
,dict
, orstring
) – The chunks to be normalized.shape (
Tuple[int]
) – The shape of the arraylimit (
int (optional)
) – The maximum block size to target in bytes, if freedom is given to choosedtype (
np.dtype
)previous_chunks (
Tuple[Tuple[int]]
, optional) – Chunks from a previous array that we should use for inspiration when rechunking dimensions automatically.
See also
- persist(*data, **kwargs)[source]#
Persist one or more chunked arrays in memory.
- Parameters
*data (
object
) – Any number of objects. If an object is an instance of the chunked array type, it is persisted as a chunked array in memory. All other types should be passed through unchanged.- Returns
objs
– The input, but with all chunked arrays now persisted in memory.
See also
- rechunk(data, chunks, **kwargs)[source]#
Changes the chunking pattern of the given array.
Called when the .chunk method is called on an xarray object that is already chunked.
- Parameters
- Returns
chunked array
See also
- reduction(arr, func, combine_func=None, aggregate_func=None, axis=None, dtype=None, keepdims=False)[source]#
A general version of array reductions along one or more axes.
Used inside some reductions like nanfirst, which is used by
groupby.first
.- Parameters
arr (
chunked array
) – Data to be reduced along one or more axes.func (
Callable(x_chunk
,axis
,keepdims)
) – First function to be executed when resolving the dask graph. This function is applied in parallel to all original chunks of x. See below for function parameters.combine_func (
Callable(x_chunk
,axis
,keepdims)
, optional) – Function used for intermediate recursive aggregation (see split_every below). If omitted, it defaults to aggregate_func.aggregate_func (
Callable(x_chunk
,axis
,keepdims)
) – Last function to be executed, producing the final output. It is always invoked, even when the reduced Array counts a single chunk along the reduced axes.axis (
int
or sequence ofints
, optional) – Axis or axes to aggregate upon. If omitted, aggregate along all axes.dtype (
np.dtype
) – data type of output. This argument was previously optional, but leaving asNone
will now raise an exception.keepdims (
boolean
, optional) – Whether the reduction function should preserve the reduced axes, leaving them at sizeoutput_size
, or remove them.
- Returns
chunked array
See also
dask.array.reduction
,cubed.core.reduction
- scan(func, binop, ident, arr, axis=None, dtype=None, **kwargs)[source]#
General version of a 1D scan, also known as a cumulative array reduction.
Used in
ffill
andbfill
in xarray.- Parameters
func (
callable()
) – Cumulative function like np.cumsum or np.cumprodbinop (
callable()
) – Associated binary operator likenp.cumsum->add
ornp.cumprod->mul
ident (
Number
) – Associated identity likenp.cumsum->0
ornp.cumprod->1
arr (
dask Array
)axis (
int
, optional)dtype (
dtype
)
- Returns
Chunked array
See also
dask.array.cumreduction
- store(sources, targets, **kwargs)[source]#
Store chunked arrays in array-like objects, overwriting data in target.
This stores chunked arrays into object that supports numpy-style setitem indexing (e.g. a Zarr Store). Allows storing values chunk by chunk so that it does not have to fill up memory. For best performance you likely want to align the block size of the storage target with the block size of your array.
Used when writing to any registered xarray I/O backend.
- Parameters
sources (
Array
orcollection
ofArrays
)targets (array-like or
collection
ofarray-likes
) – These should support setitem syntaxtarget[10:20] = ...
. If sources is a single item, targets must be a single item; if sources is a collection of arrays, targets must be a matching collection.kwargs – Parameters passed to compute/persist (only used if compute=True)
See also
- unify_chunks(*args, **kwargs)[source]#
Unify chunks across a sequence of arrays.
Called by xarray.unify_chunks.
- Parameters
*args (sequence of
Array
,index pairs
) – Sequence like (x, ‘ij’, y, ‘jk’, z, ‘i’)
See also
dask.array.core.unify_chunks
,cubed.core.unify_chunks
Registering a new ChunkManagerEntrypoint subclass#
Rather than hard-coding various chunk managers to deal with specific chunked array implementations, xarray uses an
entrypoint system to allow developers of new chunked array implementations to register their corresponding subclass of
ChunkManagerEntrypoint
.
To register a new entrypoint you need to add an entry to the setup.cfg
like this:
[options.entry_points]
xarray.chunkmanagers =
dask = xarray.namedarray.daskmanager:DaskManager
See also cubed-xarray for another example.
To check that the entrypoint has worked correctly, you may find it useful to display the available chunkmanagers using
the internal function list_chunkmanagers()
.
- xarray.namedarray.parallelcompat.list_chunkmanagers()[source]#
Return a dictionary of available chunk managers and their ChunkManagerEntrypoint subclass objects.
- Returns
chunkmanagers (
dict
) – Dictionary whose values are registered ChunkManagerEntrypoint subclass instances, and whose values are the strings under which they are registered.
User interface#
Once the chunkmanager subclass has been registered, xarray objects wrapping the desired array type can be created in 3 ways:
By manually passing the array type to the
DataArray
constructor, see the examples for numpy-like arrays,Calling
chunk()
, passing the keyword argumentschunked_array_type
andfrom_array_kwargs
,Calling
open_dataset()
, passing the keyword argumentschunked_array_type
andfrom_array_kwargs
.
The latter two methods ultimately call the chunkmanager’s implementation of .from_array
, to which they pass the from_array_kwargs
dict.
The chunked_array_type
kwarg selects which registered chunkmanager subclass to dispatch to. It defaults to 'dask'
if Dask is installed, otherwise it defaults to whichever chunkmanager is registered if only one is registered.
If multiple chunkmanagers are registered, the chunk_manager
configuration option (which can be set using set_options()
)
will be used to determine which chunkmanager to use, defaulting to 'dask'
.
Parallel processing without chunks#
To use a parallel array type that does not expose a concept of chunks explicitly, none of the information on this page is theoretically required. Such an array type (e.g. Ramba or Arkouda) could be wrapped using xarray’s existing support for numpy-like “duck” arrays.