
Getting a nasty crash with some data which seems to involve dimensions #3

Closed
bnlawrence opened this issue Jul 19, 2024 · 11 comments

@bnlawrence
Collaborator

What happened:

import h5netcdf
from s3fs import S3FileSystem

blocks_MB = 1
storage_options = {
    'key': "f2d55c6dcfc7618b2c34e00b58df3cef",
    'secret': "$/'#M{0{/4rVhp%n^(XeX$q@y#&(NM3W1->~N.Q6VP.5[@bLpi='nt]AfH)>78pT",
    'client_kwargs': {'endpoint_url': "https://uor-aces-o.ext.proxy.jc.rl.ac.uk/"},
    'default_fill_cache':False,
    'default_cache_type':"readahead",
    'default_block_size': blocks_MB * (2**20)
}

file_system = S3FileSystem(**storage_options)
f = file_system.open("bnl/ch330a.pc19790301-def.nc", "rb")

ds = h5netcdf.File(f, "r",
                   backend="pyfive",
                   decode_vlen_strings=True)
var = ds['time']
print(var.dimensions)

raises the following error

# ValueError: variable '/time' has no dimension scale associated with axis 0. 
# Use phony_dims='sort' for sorted naming or phony_dims='access' for per access naming.

instead of printing the dimensions.

@bnlawrence
Collaborator Author

bnlawrence commented Jul 19, 2024

Diving a little deeper, I think this may be due to a failure in how the dimensions are handled before we ever get to the dimension scale problem. With a simple file
(ncdump,
h5dump)
we don't get that error, but we do get lots of interesting warnings from pyfive via h5netcdf, which we can investigate with the following code:

import h5netcdf
import pyfive
import h5py

p5 = False  # use pyfive to look at the file
h5 = True   # use h5py to look at the file
# (with both False, h5netcdf is used to look at the file)
doit = True  # look at the dimensions

if p5:
    ds = pyfive.File('delme.nc', 'r')
elif h5:
    ds = h5py.File('delme.nc', 'r')
else:
    decode_vlen_strings = {'decode_vlen_strings': True}
    print(h5netcdf.__file__)
    ds = h5netcdf.File('delme.nc', 'r', backend='pyfive', **decode_vlen_strings)

var = ds['time']
var = ds['lat']

if doit:
    print('Now do dimensions')
    if p5 or h5:
        print(var.dims)
    else:
        print(var.dimensions)

We then see that

  • h5py just works, and reports var.dims as <Dimensions of HDF5 object at 4461599520>
  • pyfive works fine, unless we choose to look at the dimensions, in which case we get a warning: UserWarning: Attribute REFERENCE_LIST type not implemented, set to None. and then we see var.dims as <pyfive.high_level.DimensionManager object at 0x10333abd0> (i.e. similar to the h5py call, as you'd expect)
  • h5netcdf issues the same warning three times on opening, then once more when we make the dimension call, which suggests it is already making that dimension call several times internally.

My working assumption is that because of this, the dimension scale is not properly handled from an h5netcdf point of view, and we might need to implement the REFERENCE_LIST ...

@davidhassell
Collaborator

In a version of cfdm instrumented with print statements, here are the actual netCDF4-python calls that created the dimensions - pretty bog standard:

>>> import cfdm
>>> cfdm.write(cfdm.example_field(0), 'delme.nc')
CALL: parent_group.createDimension(lat, 5)
 str(parent_group) after call: 
 <class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    Conventions: CF-1.11
    dimensions(sizes): lat(5)
    variables(dimensions): 
    groups: 

CALL: parent_group.createDimension(bounds2, 2)
 str(parent_group) after call: 
 <class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    Conventions: CF-1.11
    dimensions(sizes): lat(5), bounds2(2)
    variables(dimensions): 
    groups: 

CALL: parent_group.createDimension(lon, 8)
 str(parent_group) after call: 
 <class 'netCDF4._netCDF4.Dataset'>
root group (NETCDF4 data model, file format HDF5):
    Conventions: CF-1.11
    dimensions(sizes): lat(5), bounds2(2), lon(8)
    variables(dimensions): float64 lat_bnds(lat, bounds2), float64 lat(lat)
    groups: 
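
For reference, a minimal sketch of the same three calls made directly with netCDF4-python (dimensions only; cfdm also writes the variables shown above):

# A minimal sketch of the createDimension calls traced above, made
# directly with netCDF4-python rather than via cfdm.
from netCDF4 import Dataset

nc = Dataset('delme.nc', 'w')       # NETCDF4 data model, HDF5 file format
nc.Conventions = 'CF-1.11'
nc.createDimension('lat', 5)
nc.createDimension('bounds2', 2)
nc.createDimension('lon', 8)
nc.close()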

@davidhassell
Collaborator

From https://docs.unidata.ucar.edu/netcdf-c/current/file_format_specifications.html (don't know if it's useful)

Attributes

Attributes in HDF5 and netCDF-4 correspond very closely. Each attribute in an HDF5 file is represented as an attribute in the netCDF-4 file, with the exception of the attributes below, which are hidden by the netCDF-4 API.

  • _Netcdf4Coordinates An integer array containing the dimension IDs of a variable which is a multi-dimensional coordinate variable.
  • _nc3_strict When this (scalar, H5T_NATIVE_INT) attribute exists in the root group of the HDF5 file, the netCDF API will enforce the netCDF classic model on the data file.
  • REFERENCE_LIST This attribute is created and maintained by the HDF5 dimension scale API.
  • CLASS This attribute is created and maintained by the HDF5 dimension scale API.
  • DIMENSION_LIST This attribute is created and maintained by the HDF5 dimension scale API.
  • NAME This attribute is created and maintained by the HDF5 dimension scale API.
  • _Netcdf4Dimid Holds a scalar H5T_NATIVE_INT that is the (zero-based) dimension ID for this dimension, needed when dimensions and coordinate variables are defined in different orders.
  • _NCProperties Holds provenance information about a file at the time it was created. It specifies the versions of the netCDF and HDF5 libraries used to create the file.
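
A minimal sketch for checking which of these hidden attributes are actually present, walking the test file from earlier in this thread with h5py (the attribute names are taken verbatim from the list above):

# Walk an HDF5/netCDF-4 file and report which of the netCDF-4 "hidden"
# attributes listed above each object carries.
import h5py

HIDDEN = {
    "_Netcdf4Coordinates", "_nc3_strict", "REFERENCE_LIST", "CLASS",
    "DIMENSION_LIST", "NAME", "_Netcdf4Dimid", "_NCProperties",
}

with h5py.File('delme.nc', 'r') as f:
    print('/', sorted(HIDDEN & set(f.attrs)))
    f.visititems(lambda name, obj: print(name, sorted(HIDDEN & set(obj.attrs))))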

@bnlawrence
Collaborator Author

bnlawrence commented Jul 19, 2024

I think at this point we need a tutorial on dimension scales, the nearest equivalent being the series of blog articles by John Caron which starts here. I'm gonna excerpt some key bits, but the real oil is there.

When we created the netCDF-4 file format on top of HDF5, we asked the HDF group to add shared dimensions. They said no, and instead added dimension scales, which at that point were in the HDF4 data model, but not in HDF5. In retrospect, I think we should have worked harder to come to a mutual agreement. The lack of shared dimensions in HDF5 makes HDF5 not a strict superset of netCDF-4.

the essence of shared dimensions is that they indicate that two variables have the same domain, and this is needed to assign coordinates for sampled functions ... HDF5 variables (aka datasets) don't use shared dimensions, but define their shape with a dataspace object, which is defined separately for each variable. So there is no formal way in the HDF5 data model to indicate that two variables share the same domain. As we'll see, dimension scales help some, but not enough.

Each variable in HDF5 defines its shape with a dataspace, which is essentially a list of private dimensions for the variable. A Dimension Scale is a special variable containing a set of references to dimensions in variables. Each referenced variable has a DIMENSION_LIST attribute that contains, for each dimension, a list of references to Dimension Scales. So we have a two-way, many-to-many linking between Dimension Scales and Dimensions.

The HDF5 Dimension Scale API mostly just maintains this two way linking, plus allows the links to be named.

So it appears that by using Dimension Scales, we now have shared dimensions in HDF5: namely, all the dimensions that share the same Dimension Scale are ... the same! Unfortunately nothing requires the "shared" dimensions to have the same length as the dimension scale, or have the same length as any of the other dimensions that are associated with the dimension scale, or that the dimension scale even has the same rank as an associated dimension. The HDF5 dimension scale design doc is quite explicit that any other semantics are not part of the HDF5 data model, and must be added by another layer.

Obviously, other application layers like netCDF-4 can layer shared dimensions on top of HDF5 Dimension Scales. The minimum requirements for shared dimensions are that:

  1. Dimensions are associated with only one Dimension Scale.
  2. A Dimension Scale is one dimensional.
  3. All dimensions have the same length as the shared Dimension Scale.

Those are the things that a program can check for. But the intention of the data writer is crucial, because the real requirement for shared dimensions is that the dimensions represent the domain of the function, and the dimension scale values represent the coordinates for that dimension.

(The links in the original were broken; I have added some that point to either what I hope is the original or something similar.)
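
To make the two-way linking concrete, here is a minimal sketch using h5py's dimension scale API (the file and dataset names are invented for illustration):

# A minimal sketch of the two-way linking described above, using h5py's
# dimension scale API.
import h5py

with h5py.File('scales_demo.h5', 'w') as f:
    f['time'] = [0.0, 1.0, 2.0]
    f['time'].make_scale('time')               # turn 'time' into a dimension scale
    f['temp'] = [10.0, 11.0, 12.0]
    f['temp'].dims[0].attach_scale(f['time'])  # link temp's axis 0 to the scale

    # the scale carries CLASS/NAME plus a REFERENCE_LIST back to 'temp' ...
    print(f['time'].attrs['CLASS'])            # b'DIMENSION_SCALE'
    print(f['time'].attrs['REFERENCE_LIST'])
    # ... and 'temp' carries a DIMENSION_LIST of references to its scales
    print(f['temp'].attrs['DIMENSION_LIST'])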

@bnlawrence
Collaborator Author

John goes on to detail exactly how HDF5 dimension scales work, and then how netCDF-4 uses them. A key part of the latter document seems to make it clear that netCDF-4 does NOT need the values of the REFERENCE_LIST, so implementing them should not be necessary. So the issue at hand would appear NOT to be related to our failure to implement REFERENCE_LIST in pyfive (or at least it should not be, so now we have to work out exactly what h5py is up to).

@bnlawrence
Collaborator Author

bnlawrence commented Jul 19, 2024

Ok, so first, let's address these annoying warnings while we wait for a better test for the actual crash.
The warnings are coming from within the h5netcdf Group instantiation (core.py 733):

     if v.attrs.get("CLASS") == b"DIMENSION_SCALE":

but only when it evaluates to true (because that's the only time there is a REFERENCE_LIST datatype message in the dataset messages). Diving further into the code, we see that this .attrs is accessed via a property on a BaseVariable, found in core.py 449:

@property
def attrs(self):
    """Return variable attributes."""
    return Attributes(
        self._h5ds.attrs, self._root._check_valid_netcdf_dtype, self._root._h5py
    )

and the Attributes class can be found in attrs.py (line 20); it is initialised with h5attrs, check_dtype, h5py_pckg. It is quite clear that you cannot get a REFERENCE_LIST item via this class and its methods (any attempt to do so is met with a KeyError), so we really do not need to implement this in pyfive.

So we just want to suppress this warning when the attributes are accessed via this pathway.
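
A minimal sketch of one way to do that from the calling side with the standard library, matching the warning text quoted earlier (rather than patching pyfive itself):

# Filter just this pyfive UserWarning around the open; warnings raised
# outside the with-block are unaffected.
import warnings

import h5netcdf

with warnings.catch_warnings():
    warnings.filterwarnings(
        "ignore",
        message="Attribute REFERENCE_LIST type not implemented",
        category=UserWarning,
    )
    ds = h5netcdf.File('delme.nc', 'r', backend='pyfive')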

@bnlawrence
Collaborator Author

Ok, and as to the actual error, I should have done some RTFM, where we find:

Datasets with missing dimension scales

By default (see below) h5netcdf raises a ValueError if variables with no dimension
scale associated with one of their axes are accessed.
You can set phony_dims='sort' when opening a file to let h5netcdf invent
phony dimensions according to netCDF behaviour.

  # mimic netCDF-behaviour for non-netcdf files
  f = h5netcdf.File('mydata.h5', mode='r', phony_dims='sort')

Note that this iterates once over the whole group hierarchy, which affects
performance if you rely on lazy group access.
You can set phony_dims='access' instead to defer phony dimension creation
to group access time. The created phony dimension naming will differ from
netCDF behaviour.

  f = h5netcdf.File('mydata.h5', mode='r', phony_dims='access')

(The keyword default setting is phony_dims=None for backwards compatibility.)

@bnlawrence
Collaborator Author

And, lo and behold, if we set phony_dims, indeed it runs fine. So now the question is: why did this file not have compliant dimension scales, given that it was written by netCDF4-python?
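
For the record, this is the opening call from the top of this issue with the keyword added (reusing the S3 file object f from the original snippet):

ds = h5netcdf.File(f, "r",
                   backend="pyfive",
                   decode_vlen_strings=True,
                   phony_dims="sort")
print(ds['time'].dimensions)   # no longer raises the ValueError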

@bnlawrence
Collaborator Author

This seems relevant, from the docs:

NetCDF-4 allows some interoperability with HDF5.

Reading and Editing NetCDF-4 Files with HDF5

The HDF5 Files produced by netCDF-4 are perfectly respectable HDF5 files, and can be read by any HDF5 application.

NetCDF-4 relies on several new features of HDF5, including dimension scales. The HDF5 dimension scales feature adds a bunch of attributes to the HDF5 file to keep track of the dimension information.

It is not just wrong, but wrong-headed, to modify these attributes except with the HDF5 dimension scale API. If you do so, then you will deserve what you get, which will be a mess.

Additionally, netCDF stores some extra information for dimensions without dimension scale information. (That is, a dimension without an associated coordinate variable). So HDF5 users should not write data to a netCDF-4 file which extends any unlimited dimension, or change any of the extra attributes used by netCDF to track dimension information.

Also there are some types allowed in HDF5, but not allowed in netCDF-4 (for example the time type). Using any such type in a netCDF-4 file will cause the file to become unreadable to netCDF-4. So don't do it.

NetCDF-4 ignores all HDF5 references. Can't make head nor tail of them. Also netCDF-4 assumes a strictly hierarchical group structure. No looping, you weirdo!

Attributes can be added (they must be one of the netCDF-4 types), modified, or even deleted, in HDF5.

Reading and Editing HDF5 Files with NetCDF-4

Assuming a HDF5 file is written in accordance with the netCDF-4 rules (i.e. no strange types, no looping groups), and assuming that every dataset has a dimension scale attached to each dimension, the netCDF-4 API can be used to read and edit the file, quite easily.

In HDF5 (version 1.8.0 and later), dimension scales are (generally) 1D datasets, that hold dimension data. A multi-dimensional dataset can then attach a dimension scale to any or all of its dimensions. For example, a user might have 1D dimension scales for lat and lon, and a 2D dataset which has lat attached to the first dimension, and lon to the second.

If dimension scales are not used, then netCDF-4 can still edit the file, and will invent anonymous dimensions for each variable shape. This is done by iterating through the space of each dataset. As each space size is encountered, a phony dimension of that size is checked for. If it does not exist, a new phony dimension is created for that size. In this way, an HDF5 file with datasets that are using shared dimensions can be seen properly in netCDF-4. (There is no shared dimension in HDF5, but data users will frequently write many datasets with the same shape, and intend these to be shared dimensions.)

Starting with version 4.7.3, if a dataset is encountered which uses the same size for two or more of its dataspace lengths, then a new phony dimension will be created for each. That is, a dataset with size [100][100] will result in two phony dimensions, each of size 100.
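
A minimal sketch of that invention scheme in Python, for an h5py-like group object (illustrative only, not the netCDF-C implementation):

from collections import Counter, defaultdict

def invent_phony_dims(group):
    """Invent phony dimension names for every dataset shape in a group."""
    phony = defaultdict(list)   # maps size -> phony dimension names of that size
    counter = 0
    for obj in group.values():
        if not hasattr(obj, 'shape'):   # skip subgroups in this sketch
            continue
        # Since 4.7.3, repeated equal sizes within one dataset each get
        # their own phony dimension, so ensure enough dims of each size.
        for size, n in Counter(obj.shape).items():
            while len(phony[size]) < n:
                phony[size].append(f'phony_dim_{counter}')
                counter += 1
    return dict(phony)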

@bnlawrence
Collaborator Author

There is additional information on dimension scales here.

@bnlawrence
Collaborator Author

Given this is all solved with a keyword, it seems that it's not really an issue ... for now.
