tl;dr Let's add first-class support for ragged structures in Tiled, remixing some patterns we have used for array and awkward structures, to properly represent data from hardware that can give us irregularly-sized arrays.
Motivation
Sometimes hardware generates "ragged" array data: array data that cannot be represented in a rectangular array.
[
[1, 2, 3],
[4, 5],
[7, 8, 9, 10],
]
In "Databroker v0" (2015, pre-dating Tiled) we gave users either an iterable of arrays or a pandas.Series of arrays. Both of these were able to accommodate irregularly sized arrays. Because each item lived in its own Python object, this approach was less efficient than a single array, both in memory footprint and in the speed of common operations. For the common case where data was regularly shaped, it was unnecessarily clumsy and slow.
Now, in the pre-release "Databroker v2", we always give users a single array. For regularly shaped data, this is both more usable and more efficient than a generator or a pandas.Series. But when data is irregularly shaped (i.e. ragged), often due to intermittently flaky hardware, the data is now unrepresentable. Scientists sometimes elect to pad or trim it into a regular shape in order to access it. This situation is not good: it ought to be possible to represent the actual measurements.
Ragged and Awkward
A new library, ragged, enables representing arrays with irregularly-sized dimensions. It is more space- and time-efficient than using nested Python lists. Like numpy, it places all the data in a contiguous block of memory and holds metadata describing how a logical step through the N-dimensional array corresponds to a stride length through the linear block of memory. Numpy describes the stride-length along each dimension with a single number:
>>> a = numpy.ones((3, 5))
>>> a.strides
(40, 8)
To describe the more complex structure of a ragged array, ragged needs to use a list of offsets for each ragged dimension. This is implemented by wrapping an AwkwardArray. (AwkwardArrays can describe tree-like (JSON-like) structures. Ragged arrays are a subset of AwkwardArrays.)
>>> x = ragged.array([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
>>> x
ragged.array([
    [1, 2, 3],
    [4, 5],
    [7, 8, 9, 10]
])
>>> x.shape
(3, None)
>>> x._impl  # AwkwardArray tucked inside
<Array [[1, 2, 3], [4, 5], [7, 8, 9, 10]] type='3 * var * int64'>
>>> awkward.to_buffers(x._impl)  # backed by a buffer of values and a buffer of "offsets"
(ListOffsetForm('i64', NumpyForm('int64', form_key='node1'), form_key='node0'),
3,
{'node0-offsets': array([0, 3, 5, 9]),
'node1-data': array([ 1, 2, 3, 4, 5, 7, 8, 9, 10])})
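To make the values-plus-offsets layout concrete, here is a minimal pure-Python sketch of the same idea. It is not how ragged or Awkward implement it internally; the helper names `to_offsets` and `get_row` are made up for illustration.

```python
def to_offsets(rows):
    """Flatten variable-length rows into one values list plus an offsets list."""
    data, offsets = [], [0]
    for row in rows:
        data.extend(row)
        # Each offset records where the next row starts in the flat values.
        offsets.append(len(data))
    return data, offsets


def get_row(data, offsets, i):
    """Recover row i as the slice between two consecutive offsets."""
    return data[offsets[i]:offsets[i + 1]]


data, offsets = to_offsets([[1, 2, 3], [4, 5], [7, 8, 9, 10]])
print(offsets)                    # [0, 3, 5, 9]
print(get_row(data, offsets, 1))  # [4, 5]
```

The offsets match the `node0-offsets` buffer shown above: N rows need N + 1 offsets, and row lengths are the differences between neighbors.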
Unlike general AwkwardArrays, ragged arrays have shape. The example above has shape (3, None) where None denotes a variable-length dimension. As the documentation notes, this is technically compliant with the Array API specification, which allows for None in a shape tuple. However, the intent of the Array API was for None to mean "unknown", not "variable". If experiments with ragged arrays demonstrate that they are useful and can integrate well with array libraries, perhaps the Array API will be extended in the future to bless this interpretation of None or add some new distinct token VAR.
See "Why does this library exist?" for more context.
Adding ragged as a new structure family in Tiled
Tiled added support for awkward structures in #450 (July 2023). We could just use the existing awkward structure family, perhaps denoting ragged as a Tiled "spec", in the same way that xarray_dataset is just a "spec" on the underlying container structure family.
But I think it is worth adding a new structure family to Tiled to support ragged structures, distinctly from awkward ones. The general awkward structure description is quite complex. It involves a JSON description ("form") and several varieties of buffers, depending on the given structure. A ragged structure can be described more simply:
one data type
one shape tuple, which may contain None
one buffer of array values
N buffers of "offsets" into that values array, one per ragged dimension
I think it's possible to express chunking as well, just as we do with strided arrays, if we can accept that not all chunks are "full". Thus, we can describe a ragged structure as a tiny extension of ArrayStructure:
This will make it easier for users not versed in AwkwardArray "form" to interpret the structure. I can imagine we might have some clients, not in Python, that want to represent ragged structures but may not take on the greater complexity of supporting general awkward structures.
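As a rough illustration of how small such a description could be, here is a hypothetical sketch built from the four ingredients listed above. The class name `RaggedStructureSketch` and its fields are assumptions for illustration, not Tiled's actual API, and chunking is left out because its encoding for variable-length dimensions is not settled here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class RaggedStructureSketch:
    """Hypothetical description of a ragged structure (not a Tiled class)."""

    data_type: str                    # one data type for all values, e.g. "int64"
    shape: Tuple[Optional[int], ...]  # None marks a variable-length dimension

    @property
    def ragged_dims(self):
        """Indices of the dimensions whose length varies."""
        return tuple(i for i, n in enumerate(self.shape) if n is None)

    @property
    def num_offset_buffers(self):
        """One offsets buffer is needed per ragged dimension."""
        return len(self.ragged_dims)


structure = RaggedStructureSketch(data_type="int64", shape=(3, None))
print(structure.ragged_dims)         # (1,)
print(structure.num_offset_buffers)  # 1
```

A client in any language could read this description without knowing anything about Awkward "form".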
New HTTP endpoints for ragged structures will have the same semantics as their array counterparts, /array/full and /array/block. Their payloads will be simplified Awkward payloads: either a ZIP archive of named buffers or a JSON object of named arrays. The names will be data and offsets-N, corresponding to the Awkward buffers that implement the ragged object.
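A minimal sketch of what the ZIP flavor of such a payload might look like, using only the standard library. The byte layout (native-endian int64), the zero-based numbering in `offsets-N`, and the function names `pack_ragged` and `unpack_ragged` are all assumptions for illustration; the proposal does not fix these details.

```python
import io
import zipfile
from array import array


def pack_ragged(data, offsets_by_dim):
    """Write the values and each offsets buffer as named ZIP members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        # "q" is a signed 8-byte integer, standing in for int64 here.
        zf.writestr("data", array("q", data).tobytes())
        for n, offsets in enumerate(offsets_by_dim):
            zf.writestr(f"offsets-{n}", array("q", offsets).tobytes())
    return buf.getvalue()


def unpack_ragged(payload):
    """Read every named buffer back into a plain list of ints."""
    buffers = {}
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        for name in zf.namelist():
            values = array("q")
            values.frombytes(zf.read(name))
            buffers[name] = values.tolist()
    return buffers


payload = pack_ragged([1, 2, 3, 4, 5, 7, 8, 9, 10], [[0, 3, 5, 9]])
buffers = unpack_ragged(payload)
print(sorted(buffers))  # ['data', 'offsets-0']
```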
@jpivarski If you have any words of advice on this general approach I'd be glad to hear them.
Implications for Bluesky
Bluesky's document model does not yet have any way to deal with awkward or ragged data. While there has been some discussion of adding support for full awkward structures someday, to deal with the new event-based detectors we are beginning to see on the floor, that will be a big job that will require significant design discussion.
Ragged presents an opportunity to do something much easier to start:
Extend the Event Descriptor schema for shape to allow None and interpret this as "variable".
"shape": {
"title": "Shape",
"description": "The shape of the data. Empty list indicates scalar data.",
"type": "array",
"items": {
"type": "integer"
}
},
Update databroker.mongo_normalized to switch from presenting array to ragged structures when the shape contains None.
Update the describe() method of ophyd objects of hardware that is expected to return ragged data.
When hardware was expected to return regular data but happened to return ragged data in one problematic instance, the shape metadata can be patched to set the offending dimension to None and make the data representable.
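The first two steps above can be sketched as follows. The schema dict shows one possible extension (allowing JSON null, which is None in Python, as an item); the revised description text and the helper names `is_valid_shape` and `structure_family_for` are assumptions for illustration, not existing Bluesky or databroker code.

```python
# One possible extended schema: items may be an integer or null.
RAGGED_SHAPE_SCHEMA = {
    "title": "Shape",
    "description": (
        "The shape of the data. Empty list indicates scalar data. "
        "null indicates a variable-length dimension."
    ),
    "type": "array",
    "items": {"type": ["integer", "null"]},
}


def is_valid_shape(shape):
    """Hand-rolled check mirroring the schema above (no jsonschema needed)."""
    return isinstance(shape, list) and all(
        dim is None or (isinstance(dim, int) and not isinstance(dim, bool))
        for dim in shape
    )


def structure_family_for(shape):
    """Pick 'ragged' when any dimension is variable, else 'array'."""
    if not is_valid_shape(shape):
        raise ValueError(f"Invalid shape: {shape!r}")
    return "ragged" if None in shape else "array"


print(structure_family_for([3, 5]))     # array
print(structure_family_for([3, None]))  # ragged
```

Patching a problematic descriptor then amounts to replacing the offending integer in shape with None, which flips the result from "array" to "ragged".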
As a convenience, we allow users to access a group of data as an xarray.Dataset. An xarray.Dataset requires its constituent arrays to have regular shapes. (If one tries to put a ragged array into an xarray.DataArray, one gets a TypeError related to the None in the shape tuple.) While there is discussion about adding support, it looks like significant work remains to be done.
For now, I think our best option is to make the Tiled Python client's DatasetClient raise an error if any of the items in the container happen to be ragged. The error message can recommend an optional keyword argument, like dsc.read(skip_ragged=True), and helpfully list the names of the ragged items. They can be separately accessed, as ragged arrays, via dsc[key][:].
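A minimal sketch of that behavior, with the client reduced to a mapping from item name to shape. The function name `read_dataset_sketch`, the choice of ValueError, and the message wording are assumptions for illustration; only the skip_ragged keyword comes from the proposal above.

```python
def read_dataset_sketch(shapes, skip_ragged=False):
    """Return the names that would go into the xarray.Dataset.

    `shapes` maps item name to shape tuple; None marks a ragged dimension.
    """
    ragged_names = sorted(name for name, shape in shapes.items() if None in shape)
    if ragged_names and not skip_ragged:
        raise ValueError(
            "Cannot build an xarray.Dataset because these items are ragged: "
            f"{ragged_names}. Pass skip_ragged=True to omit them, and read "
            "each one separately as a ragged array."
        )
    # Keep the original ordering of the regular items.
    return [name for name in shapes if name not in ragged_names]


shapes = {"image": (10, 5), "spectrum": (10, None)}
print(read_dataset_sketch(shapes, skip_ragged=True))  # ['image']
```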