-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
netcdf4-python support for s3 #95
Comments
@bnlawrence @davidhassell @markgoddard looks like we are very, very close to an S3-abled netCDF4-python Unidata/netcdf4-python#1264 |
that is now possible via this biutiful thing done by my bud @zklaus conda-forge/libnetcdf-feedstock#178 - we'll have to wait for netCDF4-python folk to build new wheels and conda binaries, but that's now just a matter of time, and not a matter of "if" 😁 |
Cheers 😄. Note that conda-forge/libnetcdf-feedstock#178 is not quite enough; we also need conda-forge/libnetcdf-feedstock#180. |
Right on, cheers, Klaus! But the bigh heavy thing was that that you've done in 178, am assuming now it's just a matter of time, not of will and blah anymore 😁 I owes you a 🍺 bud |
also pay attention to what Klaus mentions here conda-forge/libnetcdf-feedstock#178 (comment) |
gentlemen @bnlawrence @davidhassell @markgoddard conda-forge/libnetcdf-feedstock#180 has now been merged, so in theory, we should be good to go to implement our use of netCDF4-python with S3 - also in light of #95 (comment) it looks like some mucking about will still be needed 😁 Of course, one needs to wait for wheels now ☸️ |
Something I noticed while working on byte order and missing data support - Kerchunk's SingleHdf5ToZarr uses h5py to read the netcdf metadata when creating a Zarr array. There definitely seems to be some oddness there in what we get back. In particular, when a byte order/endianness is specified on a variable, it always seems to show as big endian in the Zarr array. I have a workaround for this in #132, but I expect there are other issues, in particular around compressed data. |
in S3-netCDF4 news there are bad news - S3 support had to be reverted from libnetcdf since it was causing flakiness apparently conda-forge/libnetcdf-feedstock#188 - I created a test branch to see when/how that's ready to work https://github.com/valeriupredoi/PyActiveStorage/tree/try_netcdf4 |
OK in light of the bad news listed above, I got some fairly solid progress done on this: it proves out we don't actually need S3-support in netCDF4 per se, all we have to do is be clever about how we read the file as a remote file: with fs.open(s3_testfile_uri, 'rb') as s3f:
nc_bytes = s3f.read()
ds = netCDF4.Dataset(f'inmemory.nc', memory=nc_bytes) will read the S3 file into memory and dump it a very nice netCDF4 Dataset at the client: you can see it in action in this GA https://github.com/valeriupredoi/PyActiveStorage/actions/runs/6225016972/job/16894516376 - the problem is that calling |
I think the problem with that is that you've read the entire file across the network. Ideally we'd be using range-get and only getting metadata in the first instance, so it would remain lazy for actual data. |
You should be able to construct a simple https URL using bucket name and object name as The caveats are:
But for this testing, that is appropriate. Hopefully, we'll resolve the S3 support before things move into production, pending the resolution of Unidata/netcdf-c#2739. |
@bnlawrence exactly - I think there is a way to MacGyver that 😁 @zklaus cheers, but the problem is not solved by creating a public URL - we need to have support for |
Just bumping this ^ While it may be an easy enough switch to netcdf4-python in PyActiveStorage when reading the metadata, swapping out h5py for netcdf4-python in kerchunk would be more difficult, and potentially more contentious. |
Good point wrt kerchunk h5py dependency ... need to think about that. Without doing proper diligence, I suspect that the kerchunk to Zarr transition uses the endian specification to ensure that Zarr sees the data always with the same endianness, regardless of what is in the netcdf file ... IFIRC they do that in the chunk load step. |
good move to bump it! These things happen because we cobbled together this package using a bunch of other tools that have their own peculiarities for which others have workarounds for. Specifically about endianness no clue, but about the other thing @markgoddard has raised (chunk gets downloaded by SingleHDF5toZarr) I think we have to be very careful with how we build the fs that kerchunk uses |
In fact, I'd make this a separate issue 👍 |
+1 for a separate issue. |
just keeping this alive via a wee summary: here are the netcdf-C bother points that would need be fixed Unidata/netcdf-c#2739 and Unidata/netcdf-c#2777 and my comment #95 (comment) pointing to the reversal of S3 capability in libnetcdf due to flakiness As you can see from the second issue listed there, there are already strides to get Zarry support via reference FS and chunks |
We believe that netcdf-c should support natural operations on a file on an S3 storage system. We need to test and demonstrate that we can do this with netcdf4-python (perhaps via a version with a specially compiled c library).
(We need this to avoid having to rebuild all of cf-python with h5netcdf with unknown implications for functionality and performance).
The text was updated successfully, but these errors were encountered: