xarray.core.accessor_str.StringAccessor.extract

StringAccessor.extract(pat, dim, case=None, flags=0)

Extract the first match of capture groups in the regex pat as a new dimension in a DataArray.

For each string in the DataArray, extract groups from the first match of regular expression pat.

If pat is array-like, it is broadcast against the array and applied elementwise.
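The per-string behaviour described above can be approximated with the standard library alone. The following is a minimal sketch, not xarray's implementation: the helper name first_match_groups is hypothetical, and replacing an absent match or a non-participating group with an empty string is an assumption based on the example output on this page.

```python
import re


def first_match_groups(text, pat, flags=0):
    """Return the capture groups of the first match of pat in text.

    Hypothetical helper for illustration only. Missing matches and groups
    that did not participate are reported as empty strings (an assumption).
    """
    compiled = re.compile(pat, flags)
    match = compiled.search(text)  # first match anywhere in the string
    if match is None:
        return [""] * compiled.groups
    return [grp if grp is not None else "" for grp in match.groups()]


pattern = r"(\w+)_Xy_(\d*)"
# "ab_xY_10" does not match case sensitively, so the first match starts at "bab"
print(first_match_groups("ab_xY_10-bab_Xy_110-baab_Xy_1100", pattern))  # ['bab', '110']
print(first_match_groups("", pattern))  # ['', '']
```

Only the first match contributes to the result; later matches in the same string, such as "baab_Xy_1100", are ignored.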

Parameters:

pat (str or re.Pattern or array-like of str or re.Pattern) – A string containing a regular expression or a compiled regular expression object. If array-like, it is broadcast.
dim (hashable or None) – Name of the new dimension to store the captured strings in. If None, the pattern must have only one capture group and the resulting DataArray will have the same size as the original.
case (bool, default: True) – If True, matching is case sensitive. If False, matching is case insensitive, which is equivalent to setting the re.IGNORECASE flag. Cannot be set if pat is a compiled regex.
flags (int, default: 0) – Flags to pass through to the re module, e.g. re.IGNORECASE. See the compilation flags section of the re documentation. 0 means no flags. Flags can be combined with the bitwise OR operator |. Cannot be set if pat is a compiled regex.
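As a rough illustration of how case and flags interact, the sketch below runs the same pattern three ways. The array labels and its contents are made up for this example, and the values in the comments are what the parameter descriptions above imply.

```python
import re

import xarray as xr

# Illustrative data: the second entry only matches when case is ignored
labels = xr.DataArray(["ab_xy_10", "cd_XY_7"], dims=["X"])
pat = r"(\w+)_xy_(\d*)"

# Default behaviour: case sensitive, so "cd_XY_7" yields empty strings
sensitive = labels.str.extract(pat, dim="match")

# case=False and flags=re.IGNORECASE are two ways to ask for the same thing
insensitive = labels.str.extract(pat, dim="match", case=False)
via_flags = labels.str.extract(pat, dim="match", flags=re.IGNORECASE)

print(sensitive.values.tolist())    # [['ab', '10'], ['', '']]
print(insensitive.values.tolist())  # [['ab', '10'], ['cd', '7']]
```

When pat is already compiled, both options have to be baked into the pattern itself, for example with re.compile(pat, re.IGNORECASE).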

Returns:

extracted (same type as values or object array)

Raises:

ValueError – pat has no capture groups.
ValueError – dim is None and there is more than one capture group.
ValueError – case is set when pat is a compiled regular expression.
KeyError – The given dimension is already present in the DataArray.
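The error conditions listed above can be exercised with a small harness. This is a sketch under assumptions: the helper try_extract and the array arr are hypothetical names introduced here, and only the exception types are checked, since the exact messages may differ between xarray versions.

```python
import re

import xarray as xr

arr = xr.DataArray(["a_Xy_0", "abc_Xy_01"], dims=["X"])


def try_extract(pat, **kwargs):
    """Hypothetical helper: return the name of the exception raised, or 'ok'."""
    try:
        arr.str.extract(pat, **kwargs)
    except (ValueError, KeyError) as err:
        return type(err).__name__
    return "ok"


errors = {
    # Pattern without any capture group
    "no groups": try_extract(r"\w+_Xy_\d*", dim="match"),
    # dim=None requires exactly one capture group
    "two groups, dim=None": try_extract(r"(\w+)_Xy_(\d*)", dim=None),
    # case cannot be combined with a compiled pattern
    "compiled + case": try_extract(
        re.compile(r"(\w+)_Xy_(\d*)"), dim="match", case=False
    ),
    # "X" is already a dimension of arr
    "existing dim": try_extract(r"(\w+)_Xy_(\d*)", dim="X"),
}
print(errors)
```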

Examples

Create a string array

>>> value = xr.DataArray(
...     [
...         [
...             "a_Xy_0",
...             "ab_xY_10-bab_Xy_110-baab_Xy_1100",
...             "abc_Xy_01-cbc_Xy_2210",
...         ],
...         [
...             "abcd_Xy_-dcd_Xy_33210-dccd_Xy_332210",
...             "",
...             "abcdef_Xy_101-fef_Xy_5543210",
...         ],
...     ],
...     dims=["X", "Y"],
... )

Extract matches

>>> value.str.extract(r"(\w+)_Xy_(\d*)", dim="match")
<xarray.DataArray (X: 2, Y: 3, match: 2)> Size: 288B
array([[['a', '0'],
        ['bab', '110'],
        ['abc', '01']],

       [['abcd', ''],
        ['', ''],
        ['abcdef', '101']]], dtype='<U6')
Dimensions without coordinates: X, Y, match
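A possible follow-up, sketched here on a smaller one-dimensional array chosen for this illustration: individual capture groups can be selected from the new dimension with isel, and a pattern with a single capture group can be used with dim=None to keep the original shape.

```python
import xarray as xr

value = xr.DataArray(["a_Xy_0", "abc_Xy_01-cbc_Xy_2210", ""], dims=["X"])

# Two capture groups, stored along the new "match" dimension
pairs = value.str.extract(r"(\w+)_Xy_(\d*)", dim="match")
prefixes = pairs.isel(match=0)  # first capture group
digits = pairs.isel(match=1)    # second capture group

# One capture group and dim=None: the result keeps the shape of `value`
single = value.str.extract(r"(\w+)_Xy_\d*", dim=None)

print(prefixes.values.tolist())  # ['a', 'abc', '']
print(digits.values.tolist())    # ['0', '01', '']
print(single.dims)               # ('X',)
```

Using dim=None avoids a length-one dimension when only one group is of interest.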