Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

H5 context management #24

Open
wants to merge 3 commits into
base: main
Choose a base branch
from
Open

H5 context management #24

wants to merge 3 commits into from

Conversation

d-v-b
Copy link
Contributor

@d-v-b d-v-b commented Apr 5, 2022

  • extends h5py.Dataset and h5py.Group to have __enter__ and __exit__ methods, where __exit__ closes the file.
  • use context-managed datasets and groups in tests
  • add a test for the h5 kwarg partitioning
  • (untested) access_h5 can optionally take an h5py.File instance as the first argument

Copy link
Contributor

@mkitti mkitti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the general direction.

track_order may need special handling. If it is set, perhaps just distribute it to the File, Group, and Dataset as needed.

Consider allowing path to be empty or None. In the case of empty, returning h5f may make sense. None is used to create anonymous datasets.

Anonymous datasets allow for temporary storage within the file, the create of spacers, and has been used by HDF5-EOS for storing some Zarr compat metadata.

"allow_unknown_filter",
)

H5_GROUP_KWDS = ("name", "track_order")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since track_order is a common keyword to Datasets, Group's, and Files perhaps this parameter should be passed to all objects when they are created.

for key in H5_DATASET_KWDS:
if key in file_kwargs:
for key in kwargs:
if key in H5_DATASET_KWDS:
dataset_kwargs[key] = file_kwargs.pop(key)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider keeping track_order as a File keyword. Also return Group keywords that may contain track_order.

dataset_kwargs[key] = file_kwargs.pop(key)

return file_kwargs, dataset_kwargs


def access_h5(
store: Pathlike, path: Pathlike, mode: str, **kwargs
store: Union[h5py.File, Pathlike], path: Pathlike, **kwargs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
store: Union[h5py.File, Pathlike], path: Pathlike, **kwargs
store: Union[h5py.File, Pathlike], path: Union[Pathlike, NoneType], **kwargs

None is a valid dataset name and is used to create anonymous datasets.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

interesting, how is that different from naming a dataset with the empty string?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For one, providing an empty string results in a TypeError in h5py:

In [22]: with h5py.File("test.hdf5", "w") as h5f:
    ...:     h5f.create_dataset(None, data = np.zeros((5,5)))
    ...:     print(list(h5f.keys()))
    ...:
    ...:
[]

In [23]: with h5py.File("test.hdf5", "w") as h5f:
    ...:     h5f.create_dataset("", data = np.zeros((5,5)))
    ...:     print(list(h5f.keys()))
    ...:
    ...:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-23-5230edf15b52> in <module>
      1 with h5py.File("test.hdf5", "w") as h5f:
----> 2     h5f.create_dataset("", data = np.zeros((5,5)))
      3     print(list(h5f.keys()))
      4
      5

~\.julia\conda\3\lib\site-packages\h5py\_hl\group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    147                     group = self.require_group(parent_path)
    148
--> 149             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150             dset = dataset.Dataset(dsid)
    151             return dset

~\.julia\conda\3\lib\site-packages\h5py\_hl\dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
    140
    141
--> 142     dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
    143
    144     if (data is not None) and (not isinstance(data, Empty)):

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\h5d.pyx in h5py.h5d.create()

TypeError: expected bytes, str found

In [24]: with h5py.File("test.hdf5", "w") as h5f:
    ...:     h5f.create_dataset(b"", data = np.zeros((5,5)))
    ...:     print(list(h5f.keys()))
    ...:
    ...:
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-24-7f1869f34ebe> in <module>
      1 with h5py.File("test.hdf5", "w") as h5f:
----> 2     h5f.create_dataset(b"", data = np.zeros((5,5)))
      3     print(list(h5f.keys()))
      4
      5

~\.julia\conda\3\lib\site-packages\h5py\_hl\group.py in create_dataset(self, name, shape, dtype, data, **kwds)
    147                     group = self.require_group(parent_path)
    148
--> 149             dsid = dataset.make_new_dset(group, shape, dtype, data, name, **kwds)
    150             dset = dataset.Dataset(dsid)
    151             return dset

~\.julia\conda\3\lib\site-packages\h5py\_hl\dataset.py in make_new_dset(parent, shape, dtype, data, name, chunks, compression, shuffle, fletcher32, maxshape, compression_opts, fillvalue, scaleoffset, track_times, external, track_order, dcpl, allow_unknown_filter)
    140
    141
--> 142     dset_id = h5d.create(parent.id, name, tid, sid, dcpl=dcpl)
    143
    144     if (data is not None) and (not isinstance(data, Empty)):

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\_objects.pyx in h5py._objects.with_phil.wrapper()

h5py\h5d.pyx in h5py.h5d.create()

ValueError: Unable to create dataset (no name given)

if "name" in dataset_kwargs:
warnings.warn(
'"Name" was provided to this function as a keyword argument. This value will be replaced with the second argument to this function.'
)
dataset_kwargs["name"] = path
result = h5f.create_dataset(**dataset_kwargs)
else:
result = h5f.require_group(path)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allow track_order to be passed to the group as well.

else:
h5f = h5py.File(store, **file_kwargs)

if mode in ("r", "r+", "a"):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps if path is empty, just return the File, h5f.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are in r+ or a mode, perhaps the path does not exist because we are trying to create it. In those cases, perhaps you should catch the error.

result.attrs.update(**attrs)

return result
if isinstance(result, h5py.Group):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A File is also a Group. In the case of a File, you do not need a new context manager.

**dataset_kwargs, **file_kwargs
)
assert file_kwargs == file_kwargs_out
assert dataset_kwargs == dataset_kwargs_out
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for common keyword arguments like track_order.

@mkitti
Copy link
Contributor

mkitti commented Apr 5, 2022

Here is the reference for the use of anonymous datasets to store Zarr chunk information:
http://www.hdfeos.org/workshops/ws23/presentations/axj.pdf

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants