Add a ragged structure family #801

Open
danielballan opened this issue Oct 27, 2024 · 0 comments

tl;dr Let's add first-class support for ragged structures in Tiled, remixing some patterns we have used for array and awkward structures, to properly represent data from hardware that can give us irregularly-sized arrays.

Motivation

Sometimes hardware generates "ragged" array data: array data that cannot be represented in a rectangular array.

[
    [1, 2, 3],
    [4, 5],
    [7, 8, 9, 10],
]

In "Databroker v0" (2015, pre-dating Tiled) we gave users either an iterable of arrays or a pandas.Series of arrays. Both of these were able to accommodate irregularly sized arrays. By placing each item in its own Python object, this approach was not as efficient---in memory footprint or speed of common operations---as a single array. For the common case where data was regularly shaped, it was unnecessarily clumsy and slow.

In the pre-release "Databroker v2", we always give users a single array. For regularly shaped data, this is both more usable and more efficient than a generator or a pandas.Series. But when data is irregularly shaped (i.e. ragged), often due to intermittently flaky hardware, it becomes unrepresentable. Scientists sometimes elect to pad or trim it into a regular shape in order to access the data. This situation is not good: it ought to be possible to represent the actual measurements.

Ragged and Awkward

A new library, ragged, enables representing arrays with irregularly-sized dimensions. It is more space- and time-efficient than using nested Python lists. Like NumPy, it places all the data in a contiguous block of memory and holds metadata describing how a logical step through the N-dimensional array corresponds to a stride length through the linear block of memory. NumPy describes the stride length along each dimension with a single number:

>>> a = numpy.ones((3, 5))

>>> a.strides
(40, 8)

To describe the more complex structure of a ragged array, ragged needs to use a list of offsets for each ragged dimension. This is implemented by wrapping an AwkwardArray. (AwkwardArrays can describe tree-like (JSON-like) structures. Ragged arrays are a subset of AwkwardArrays.)

>>> x = ragged.array([[1,2,3], [4,5], [7,8,9,10]])

>>> x
ragged.array([
    [1, 2, 3],
    [4, 5],
    [7, 8, 9, 10]
])

>>> x.shape
(3, None)

>>> x._impl  # AwkwardArray tucked inside
<Array [[1, 2, 3], [4, 5], [7, 8, 9, 10]] type='3 * var * int64'>

>>> awkward.to_buffers(x._impl)  # backed by a buffer of values and a buffer of "offsets"
(ListOffsetForm('i64', NumpyForm('int64', form_key='node1'), form_key='node0'),
 3,
 {'node0-offsets': array([0, 3, 5, 9]),
  'node1-data': array([ 1,  2,  3,  4,  5,  7,  8,  9, 10])})

Unlike general AwkwardArrays, ragged arrays have shape. The example above has shape (3, None) where None denotes a variable-length dimension. As the documentation notes, this is technically compliant with the Array API specification, which allows for None in a shape tuple. However, the intent of the Array API was for None to mean "unknown", not "variable". If experiments with ragged arrays demonstrate that they are useful and can integrate well with array libraries, perhaps the Array API will be extended in the future to bless this interpretation of None or to add some new distinct token VAR.
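
To make the role of the offsets buffer concrete, here is a minimal pure-Python sketch. Plain lists stand in for the two buffers printed above, and the helpers `row` and `ragged_shape` are illustrative names, not part of ragged or awkward. Row i is the slice of the flat data between offsets[i] and offsets[i + 1]. The shape helper infers None by comparing row lengths, which is an assumption made for illustration; the library itself may derive shape differently.

```python
# Illustrative only: plain lists stand in for the 'node1-data' and
# 'node0-offsets' buffers shown above.
data = [1, 2, 3, 4, 5, 7, 8, 9, 10]
offsets = [0, 3, 5, 9]


def row(data, offsets, i):
    """Return row i: the slice of the flat buffer between two offsets."""
    return data[offsets[i]:offsets[i + 1]]


def ragged_shape(offsets):
    """Hypothetical helper: (n_rows, length), with None if lengths differ."""
    lengths = {stop - start for start, stop in zip(offsets, offsets[1:])}
    n_rows = len(offsets) - 1
    return (n_rows, lengths.pop() if len(lengths) == 1 else None)


rows = [row(data, offsets, i) for i in range(len(offsets) - 1)]
print(rows)
print(ragged_shape(offsets))
```

No values are copied or padded when the rows are stored: the nested structure lives entirely in the offsets.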

See Why does this library exist? for more context.

Adding ragged as a new structure family in Tiled

Tiled added support for awkward structures in #450 (July 2023). We could just use the existing awkward structure family, perhaps denoting ragged as a Tiled "spec", in the same way that xarray_dataset is just a "spec" on the underlying container structure family.

But I think it is worth adding a new structure family to Tiled to support ragged structures, distinctly from awkward ones. The general awkward structure description is quite complex. It involves a JSON description ("form") and several varieties of buffers, depending on the given structure. A ragged structure can be described more simply:

  • one data type
  • one shape tuple, which may contain None
  • one buffer of array values
  • N buffers of "offsets" into that values array, one per ragged dimension

I think it's possible to express chunking as well, just as we do with strided arrays, if we can accept that not all chunks are "full". Thus, we can describe a ragged structure as a tiny extension of ArrayStructure:

class RaggedStructure(ArrayStructure):
    shape: Tuple[None | int, ...]

This will make it easier for users not versed in AwkwardArray "form" to interpret the structure. I can imagine we might have some clients, not in Python, that want to represent ragged structures but may not want to take on the greater complexity of supporting general awkward structures.

New HTTP endpoints:

/ragged/full?slice=...
/ragged/block?block=i.j,...&slice=...

will have the same semantics as their array counterparts /array/full and /array/block. Their payloads will be simplified Awkward payloads: either a ZIP archive of named buffers or a JSON object of named arrays. The names will be data and offsets-N corresponding to the Awkward buffers that implement the ragged object.
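
As a rough sketch of the ZIP variant of such a payload, the following uses only the Python standard library. The buffer names data and offsets-N come from the description above; everything else is an assumption made for illustration: the helper names `pack_ragged` and `unpack_ragged`, numbering N from 0, 64-bit signed integers for every buffer, and raw native-byte-order bytes as the member contents. It does not reflect an existing Tiled endpoint or serializer.

```python
import io
import zipfile
from array import array


def pack_ragged(data, offsets_per_dim):
    """Write one 'data' member and one 'offsets-N' member per ragged dimension.

    Hypothetical layout: every buffer is stored as raw int64 bytes.
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("data", array("q", data).tobytes())
        for n, offsets in enumerate(offsets_per_dim):
            archive.writestr(f"offsets-{n}", array("q", offsets).tobytes())
    return buffer.getvalue()


def unpack_ragged(payload):
    """Read every named buffer back into a list of Python ints."""
    named = {}
    with zipfile.ZipFile(io.BytesIO(payload)) as archive:
        for name in archive.namelist():
            values = array("q")
            values.frombytes(archive.read(name))
            named[name] = values.tolist()
    return named


payload = pack_ragged([1, 2, 3, 4, 5, 7, 8, 9, 10], [[0, 3, 5, 9]])
named = unpack_ragged(payload)
print(sorted(named))
```

A client in another language would only need a ZIP reader and knowledge of the data type to reconstruct the ragged array, without parsing an Awkward "form".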

@jpivarski If you have any words of advice on this general approach I'd be glad to hear them.

Implications for Bluesky

Bluesky's document model does not yet have any way to deal with awkward or ragged data. While there has been some discussion of adding support for full awkward structures someday, to deal with the new event-based detectors we are beginning to see on the floor, that will be a big job that will require significant design discussion.

Ragged presents an opportunity to do something much easier to start:

  • Extend the Event Descriptor schema for shape to allow None and interpret this as "variable". The current schema entry permits only integers:
                "shape": {
                    "title": "Shape",
                    "description": "The shape of the data.  Empty list indicates scalar data.",
                    "type": "array",
                    "items": {
                        "type": "integer"
                    }
                },
  • Update databroker.mongo_normalized to switch from presenting array to ragged structures when the shape contains None.
  • Update the describe() method of ophyd objects of hardware that is expected to return ragged data.
  • When hardware was expected to return regular data but happened to return ragged data in one problematic instance, the shape metadata can be patched to set the offending dimension to None and make the data representable.
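
A minimal sketch of the first two steps, under stated assumptions: `PROPOSED_SHAPE_SCHEMA` shows one possible JSON Schema spelling that admits null alongside integers, and `structure_family_for` is a hypothetical helper illustrating the array-versus-ragged switch. Neither is taken from the actual event-model or databroker code.

```python
# Hypothetical revision of the "shape" schema entry: one possible way to
# admit null (None in Python) as a dimension, alongside integers.
PROPOSED_SHAPE_SCHEMA = {
    "title": "Shape",
    "description": (
        "The shape of the data.  Empty list indicates scalar data.  "
        "null indicates a variable-length dimension."
    ),
    "type": "array",
    "items": {"type": ["integer", "null"]},
}


def structure_family_for(shape):
    """Pick 'ragged' when any dimension is variable (None), else 'array'."""
    for dim in shape:
        # bool is a subclass of int in Python, so reject it explicitly.
        if dim is not None and (isinstance(dim, bool) or not isinstance(dim, int)):
            raise TypeError(f"shape entries must be int or None, got {dim!r}")
    return "ragged" if None in shape else "array"


print(structure_family_for([3, 5]))
print(structure_family_for([3, None]))
```

Under this scheme, patching a problematic run amounts to replacing the offending integer in its shape metadata with None, after which the same data is presented as ragged rather than array.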

As a convenience, we allow users to access a group of data as an xarray.Dataset. An xarray.Dataset requires its constituent arrays to have regular shapes. (If one tries to put a ragged array into an xarray.DataArray, one gets a TypeError related to the None in the shape tuple.) While there is discussion about adding support, it looks like significant work remains to be done.

For now, I think our best option is to make the Tiled Python client's DatasetClient raise an error if any of the items in the container happen to be ragged. The error message can recommend an optional keyword argument, like dsc.read(skip_ragged=True), and helpfully list the names of the ragged items. They can be separately accessed, as ragged arrays, via dsc[key][:].
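
The intended behavior could look roughly like the sketch below. `read_dataset` is a hypothetical stand-in that operates on a plain mapping of item names to shapes; it is not the DatasetClient implementation, and the exact error type and wording are assumptions.

```python
def read_dataset(shapes, skip_ragged=False):
    """Hypothetical stand-in for DatasetClient.read().

    `shapes` maps item names to shape tuples. Returns the names that would
    be included in the resulting xarray.Dataset.
    """
    ragged_names = sorted(name for name, shape in shapes.items() if None in shape)
    if ragged_names and not skip_ragged:
        raise ValueError(
            "Cannot build an xarray.Dataset because these items are ragged: "
            f"{ragged_names}. Pass skip_ragged=True to omit them, and access "
            "each one separately as a ragged array."
        )
    return [name for name in shapes if name not in ragged_names]


shapes = {"image": (10, 5), "spectrum": (10, None)}
print(read_dataset(shapes, skip_ragged=True))
```

Failing by default, rather than silently dropping items, keeps users from mistaking a partial dataset for a complete one.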
