Include chunk shape as a parameter in stream resource for HDF dataset #544

Merged: 32 commits into bluesky:main on Sep 19, 2024

Conversation

jwlodek (Member) commented Aug 28, 2024

Tiled requires information on chunk sizes to read back data. See:
https://github.com/bluesky/bluesky/blob/978995b69c00cc0b9bef5103bdb31b665a6fbed9/src/bluesky/consolidators.py#L129

If the chunk size is None, 0, or absent, the entire dataset is assumed to be in a single chunk.

Otherwise, the chunk size defines the number of elements in each chunk; all chunks are equally sized except the last one, which may be smaller.
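
As a reference for the behaviour described above, here is a minimal sketch (not the actual bluesky consolidator code; names are illustrative) of how chunk lengths along the first axis could be derived:

# Illustrative sketch of the chunking rule above, for a dataset of length n.
def chunk_lengths(n, chunk_size):
    if not chunk_size:  # None or 0: the whole dataset is one chunk
        return [n]
    full, remainder = divmod(n, chunk_size)
    # Equal-sized chunks, with a smaller final chunk if n is not a multiple
    return [chunk_size] * full + ([remainder] if remainder else [])

chunk_lengths(467, None)   # -> [467]
chunk_lengths(1801, 1024)  # -> [1024, 777]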

Marking as a WIP until I have a chance to test w/ some real devices.

jwlodek (Member, Author) commented Aug 28, 2024

Note that in the tests the mock AD ophyd devices read the "num frames per chunk" signal as 0; with a real AD this cannot be set below 1.

jwlodek requested a review from coretl on August 28, 2024 at 17:43
Resolved review threads (outdated): tests/epics/adaravis/test_aravis.py, tests/fastcs/panda/test_hdf_panda.py, src/ophyd_async/core/_hdf_dataset.py
jwlodek marked this pull request as ready for review on September 4, 2024 at 18:36
jwlodek (Member, Author) commented Sep 4, 2024

Tested and working as expected with a real PandA and a real AD:

pass-315284 [21]: list(panda1._writer._file.stream_resources())
Out[21]: 
[{'uid': 'ea522eb1-e4b0-40e8-ada1-01f21bf12858',
  'data_key': 'Arm Timestamp',
  'mimetype': 'application/x-hdf5',
  'uri': 'file://localhost/nsls2/data/hex/proposals/2024-2/pass-315284/assets/panda1/2024/09/04/panda1_899280bb-5156-4558-989e-4a4de222b6bc.h5',
  'parameters': {'dataset': '/Arm Timestamp',
   'swmr': False,
   'multiplier': 1,
   'chunk_size': None}},
 {'uid': '8ecc70c5-c035-4c2e-98cc-f5b17951bdac',
  'data_key': 'Trigger Timestamp',
  'mimetype': 'application/x-hdf5',
  'uri': 'file://localhost/nsls2/data/hex/proposals/2024-2/pass-315284/assets/panda1/2024/09/04/panda1_899280bb-5156-4558-989e-4a4de222b6bc.h5',
  'parameters': {'dataset': '/Trigger Timestamp',
   'swmr': False,
   'multiplier': 1,
   'chunk_size': None}}]
pass-315284 [28]: list(kinetix1._writer._file.stream_resources())
Out[28]: 
[{'uid': '3f479788-341d-4ac0-913e-ec490bb6e084',
  'data_key': 'kinetix-det1',
  'mimetype': 'application/x-hdf5',
  'uri': 'file://localhost/home/xf27id1',
  'parameters': {'dataset': '/entry/data/data',
   'swmr': False,
   'multiplier': 1,
   'chunk_size': 1}}]

Resolved review threads (outdated): tests/epics/adaravis/test_aravis.py, tests/epics/advimba/test_vimba.py, tests/fastcs/panda/test_hdf_panda.py
coretl (Collaborator) commented Sep 5, 2024

I'm generally confused about chunk size.

h5py suggests that the chunk can be anything, so part of an image, or many images.
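
To illustrate (a minimal h5py sketch; the file name, dataset names, shapes, and dtype are made up), a chunk can hold several whole frames or only part of one:

import h5py

with h5py.File("example.h5", "w") as f:
    # One chunk holds 10 whole 1024x1024 frames
    f.create_dataset("many_frames_per_chunk", shape=(100, 1024, 1024),
                     chunks=(10, 1024, 1024), dtype="u2")
    # One chunk holds only a quarter of a single frame
    f.create_dataset("part_of_a_frame_per_chunk", shape=(100, 1024, 1024),
                     chunks=(1, 512, 512), dtype="u2")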

Tiled looks like it might be treating chunk_size as number of datums per chunk

AreaDetector uses NumFramesChunk to work out how many NDArrays to put in a chunk, but looks like only when ChunkSizeAuto is Yes

This PR gives chunk_size as an integer, but doesn't describe what it is. I suggest it is called something like datums_per_chunk or frames_per_chunk so that it is clear what it means.

We also need to document the parameters for a StreamResource mimetype somewhere, and refer to it in the HDFFile documentation.

@danielballan any opinions?

coretl (Collaborator) commented Sep 5, 2024

By the way, it would appear that h5py has selected some chunking for us in PandA:

$ h5ls -v panda2.h5
Opened "panda2.h5" with sec2 driver.
PCAP.GATE_DURATION.Value Dataset {467/Inf}
    Location:  1:195
    Links:     1
    Chunks:    {1024} 8192 bytes
    Storage:   3736 logical bytes, 8192 allocated bytes, 45.61% utilization
    Type:      native double
x-rbv                    Dataset {467/Inf}
    Location:  1:999
    Links:     1
    Chunks:    {1024} 8192 bytes
    Storage:   3736 logical bytes, 8192 allocated bytes, 45.61% utilization
    Type:      native double
x-rbv-max                Dataset {467/Inf}
    Location:  1:731
    Links:     1
    Chunks:    {1024} 8192 bytes
    Storage:   3736 logical bytes, 8192 allocated bytes, 45.61% utilization
    Type:      native double
x-rbv-min                Dataset {467/Inf}
    Location:  1:463
    Links:     1
    Chunks:    {1024} 8192 bytes
    Storage:   3736 logical bytes, 8192 allocated bytes, 45.61% utilization
    Type:      native double

So the PandA chunk size shouldn't be 0, but 1024. Maybe we need a PANDA:DATA:NUM_PER_CHUNK PV that queries the chunking information from h5py and provides this?

jwlodek (Member, Author) commented Sep 5, 2024

For AD, "chunk size auto" basically guarantees that the 2nd, 3rd, and (for color images) 4th chunking dimensions are just the dimensions of the frame (you can achieve this with chunk size auto = No as well; you just need to explicitly define the chunking dimensions). "Num frames chunks" is then the number of frames in each chunk along the first dimension.
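
For example, under that scheme (the values below are made up for illustration), the HDF5 chunk shape for a mono 2D detector works out as:

# Illustrative only: made-up values showing how AD settings map to a chunk shape
num_frames_chunks = 16        # AD "NumFramesChunks" setting
frame_shape = (2048, 2048)    # (SizeY, SizeX) of each NDArray
chunk_shape = (num_frames_chunks, *frame_shape)  # -> (16, 2048, 2048)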

For PandA, I admittedly have not looked at a dataset with a large number of points - all the ones I have looked at had all the data in one chunk, which for Tiled's purposes 0 or None would cover. If it is set to some preset size, I do think that exposing that via the IOC is the way to go, both to be able to tune the chunk size as desired and to have accurate size info in the doc.

I would be OK with a frames_per_chunk name; I think datums_per_chunk doesn't work as well, since currently we can emit a single datum doc that lists all the indices but still have many chunks. For example, if I do a software flyscan with a StandardDetector and collect 200 points, I get one datum doc that lists a start/stop of 0 and 199, but my chunk size is 1.

jwlodek (Member, Author) commented Sep 5, 2024

After looking at a PandA dataset with 1801 points, it does seem like the chunk size is being set to 1024 (presumably by h5py). I suggest we expose that as a PV in the IOC, and get the value from there - for now I will set it to 1024 for this PR with a TODO.

[screenshot: chunk information for the 1801-point PandA dataset showing 1024-element chunks]

coretl (Collaborator) commented Sep 5, 2024

Decided on chunk_size, typed as tuple[int, ...], e.g. (<nchunks>, <det_h>, <det_w>) for the 2D areaDetector example
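
As an illustration of that scheme, a hypothetical stream resource parameters block for a 2D areaDetector writing one frame per chunk (the dataset path and detector dimensions below are assumed) could look like:

parameters = {
    "dataset": "/entry/data/data",
    "swmr": False,
    "multiplier": 1,
    "chunk_size": (1, 1520, 2048),  # (<nchunks>, <det_h>, <det_w>)
}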

jwlodek (Member, Author) commented Sep 5, 2024

The test failures appear unrelated to the changes made here. They seem to be the same as the ones in the first commit in #559

Resolved review thread (outdated): src/ophyd_async/epics/adcore/_hdf_writer.py
# Represents explicit chunk size written to disk.
# The first dimension represents the number of frames/points
# per equally-sized chunk in the first chunking dimension.
chunk_size: tuple | None = None
Collaborator:

Is empty tuple instead of None ok @genematx ?

Suggested change
chunk_size: tuple | None = None
chunk_size: tuple[int, ...] = ()

Resolved review threads (outdated): src/ophyd_async/fastcs/panda/_writer.py, tests/epics/adaravis/test_aravis.py
jwlodek (Member, Author) commented Sep 9, 2024

The new changes make the tests use realistic values for shape/chunk size for standard detectors, and also make sure that the chunk size is always a tuple, potentially an empty one. I did find when testing with real devices that the tuples are currently not being handled correctly, so I think a PR into the bluesky repo will need to accompany this one. I will work on that now.

coretl (Collaborator) left a review:

One comment on the tests; apart from that, I think this looks good.

Comment on lines 11 to 9
from ophyd_async.epics import adkinetix
from ophyd_async.epics.adkinetix import KinetixDetector
Collaborator:

We actually imported adkinetix rather than from adkinetix import KinetixDetector, as that is the pattern we are about to recommend in all the tutorials, so we updated all the tests to follow it.

mrakitin (Member) left a review:

Looks good to me and good to go after @coretl's comment about importing is addressed.

Comment on lines +105 to +106
# TODO: Update chunk size to read signal once available in IOC
# Currently PandA IOC sets chunk size to 1024 points per chunk
Member:

Please make it an issue and link here.

async def generate_ad_standard_det(ad_standard_detector_class, number=1):
detector_name = ad_standard_detector_class.__name__
if detector_name.endswith("Detector"):
detector_name = detector_name[:-8]
Member:

Is it because of the len("Detector") == 8?

jwlodek (Member, Author) replied Sep 10, 2024:

Yep, I can maybe change it to detector_name[:-(len("Detector"))] to make it more obvious what it is doing
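
For the record, the two slicing forms discussed here are equivalent:

detector_name = "KinetixDetector"
assert detector_name[:-8] == detector_name[:-(len("Detector"))] == "Kinetix"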

Comment on lines 22 to 24
# Represents explicit chunk size written to disk.
# The first dimension represents the number of frames/points
# per equally-sized chunk in the first chunking dimension.
Collaborator:

I don't think the second half of this is actually always true. We happen to make it so for areaDetector, but in the general case the chunk size is just that: the size of the HDF chunk. It may or may not correspond to a multiple of images; it might be half an image in one direction and 15 in the other, but its magnitude is in the base datatype of the dataset.

I wonder if chunk_shape might be a better name to make people think of shape in the output of describe()?

Contributor:

I actually had the same thought yesterday about calling it chunk_shape. If it's not too late, I'd +1 to that.

jwlodek (Member, Author):

No preference from me about chunk_shape over chunk_size; perhaps let's discuss tomorrow at the sprint catchup and settle on a naming scheme.

jwlodek (Member, Author) commented Sep 17, 2024

I don't really understand the test failures here - I am seeing intermittent failures on sim detector, pilatus, and panda tests. Out of these, this PR only touches the PandA one, and even then it's a fairly minor change to just add chunk shape. Any ideas? @evalott100

Edit: The tests all pass if I change

await self.hdf.chunk_size_auto.set(True)

to

await asyncio.gather(self.hdf.chunk_size_auto.set(True))

I am not sure I understand why this behaves as it does, but I will leave it like this for now to facilitate merging.

jwlodek changed the title from "Include chunk size as a parameter in stream resource for HDF dataset" to "Include chunk shape as a parameter in stream resource for HDF dataset" on Sep 17, 2024
jwlodek merged commit 35c07fe into bluesky:main on Sep 19, 2024
11 checks passed