
reading and writing from generic IO objects #552

Open
ExpandingMan opened this issue May 21, 2019 · 22 comments
@ExpandingMan

It is important to be able to read and write from generic IO objects rather than just files. This matters whenever you need to store data over a network, for example in AWS S3 rather than on the local file system.

I don't know how cooperative the HDF5 library is going to be with this. Skimming through the code, it does not look like it will be easy.

@ggggggggg
Contributor

Can you be more specific? E.g., do you want to do something like this?

s3_bucket_with_hdf5_file_in_it = S3.openbucket()
h5 = HDF5.h5open(s3_bucket_with_hdf5_file_in_it, "rw")
h5["a"]=4
x = h5["b"]

@ExpandingMan
Author

In that case it would involve writing to an IOBuffer object, taking the Vector{UInt8} buffer, and sending it to AWS via HTTP. Something like

io = IOBuffer()
get_data_from_s3!(s3, io)  # hypothetical: fill `io` with the file's bytes from S3
h5 = h5open(io)            # the missing piece: open an HDF5 file from a generic IO

I suppose.

@baumgold
Contributor

Hi @ExpandingMan. Did you ever manage to find a solution to being able to read HDF5 files from an S3 object store? The HDF5/S3 connector may be useful here, but I'm not sure if this has since been solved in a different way. Thanks!

@mkitti
Member

mkitti commented Jul 27, 2022

The canonical way would be to use HDF5 virtual file drivers. https://docs.hdfgroup.org/hdf5/v1_12/_v_f_l.html
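In HDF5.jl, a virtual file driver is selected through a file access property list. A minimal sketch of that selection, using the Core (in-memory) driver as an example (nothing here is S3-specific; it only shows where a driver plugs in):

```julia
using HDF5

# Select a virtual file driver via a file access property list.
# Core keeps the whole file in memory; backing_store=false means
# nothing is ever flushed to disk.
fapl = HDF5.FileAccessProperties()
fapl.driver = HDF5.Drivers.Core(; backing_store=false)
```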

@baumgold
Contributor

@mkitti - that’s what I suspected. Any idea if this is available/integrated with HDF5.jl? My understanding is virtual file drivers need to be selected at HDF5 compile-time. I presume we’ll need some changes from HDF5_jll to get this support?

@mkitti
Member

mkitti commented Jul 27, 2022

We started adding support for drivers here:
https://github.com/JuliaIO/HDF5.jl/blob/master/src/drivers/drivers.jl

We may be able to use the Core driver to read an I/O stream completely into memory and use that.

@ExpandingMan
Author

I haven't had many occasions to use HDF5, but when I did I was certainly resorting to temp files, which is not ideal. On Linux it's easy to keep this all in memory (you can use /dev/shm or another in-memory directory), but there is probably still a lot of overhead to that, so it's in no way an ideal solution.

@gszep
Contributor

gszep commented Sep 11, 2022

At the very least, could this library support the ROS3 driver written by the HDF Group? Perhaps following the equivalent h5py PR: h5py/h5py#1755. I recommend aiming for the following solution:

h5open(s3path; driver=Drivers.ROS3()) do file
    file
end

@mkitti
Member

mkitti commented Sep 11, 2022

The ROS3 driver seems quite distinct from the rest of the issue. Could you create a new issue, please?

@denglerchr
Contributor

Hello, I am trying to send and receive HDF5 files over the network without writing to a file. I think the only way to do this would be to read from a generic IO object or a byte array. Is there any update on this? Can it be done at this point?

@mkitti
Member

mkitti commented Jun 1, 2023

I think we might be able to do this via H5FD_CORE via HDF5.Drivers.Core and HDF5.API.h5p_set_file_image
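A rough sketch of what that could look like for reading from a generic IO, assuming the stream's bytes form a complete HDF5 file. The function name is made up; the low-level calls are the ones named above (untested):

```julia
using HDF5

# Sketch: open an HDF5 "file" from any IO by slurping it into a byte
# buffer and handing that buffer to the Core driver as a file image.
function h5open_from_io(io::IO)
    buf = read(io)                                   # Vector{UInt8} of file bytes
    fapl = HDF5.FileAccessProperties()
    fapl.driver = HDF5.Drivers.Core(; backing_store=false)
    HDF5.API.h5p_set_file_image(fapl, buf, length(buf))
    # The "name" is only a label; no file with this name is touched on disk.
    fid = HDF5.API.h5f_open("inmem", HDF5.API.H5F_ACC_RDONLY, fapl)
    return HDF5.File(fid, "inmem")
end
```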

@mkitti
Member

mkitti commented Jun 1, 2023

See also https://portal.hdfgroup.org/display/HDF5/HDF5+File+Image+Operations#HDF5FileImageOperations-1.IntroductiontoHDF5FileImageOperations

Basically, I think we have exposed the underlying low-level C API to do this in Julia, but have not created a high level API for this.

@denglerchr
Contributor

Thanks. Unfortunately I am not familiar with the low-level API at all, but I'll see if I can get this to work somehow. I found that h5py supports this already; maybe one day this could work in Julia as well by just passing an IO object instead of a filename?
From https://docs.h5py.org/en/stable/high/file.html?highlight=driver#h5py.File.driver

"""Create an HDF5 file in memory and retrieve the raw bytes

This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py

bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['dataset'] = range(10)

data = bio.getvalue() # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])

@mkitti
Member

mkitti commented Jun 2, 2023

I looked into how they implemented that. They implemented a virtual file driver:
https://github.com/h5py/h5py/blob/2e95e93b1331fd6b9c43dea38c863642624d319c/h5py/h5fd.pyx#L87-L101

In [77]: with h5py.File(bio, "w") as f:
    ...:     print(f._id.get_access_plist().get_driver())
    ...: 
576460752303423496

In [78]: h5py.h5fd.fileobj_driver
Out[78]: 576460752303423496

This is a bit overkill if all you need to do is read it into memory though.

@denglerchr
Contributor

Would implementing something like h5py's file-object driver make it onto the roadmap of this package in the near future? We have a project that will require such data over the network, and HDF5 was used previously. An alternative might be to send the data flattened as vectors in the Arrow format, though.

@mkitti
Member

mkitti commented Jun 2, 2023

We have a project that would require such data over network in the future and HDF5 was used previously

Have you considered the ROS3 (read only S3) driver? Do you need write capability over the network as well?

Another approach is detailed here:
https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935

It might be good to fully understand what you mean by network here and what your requirements are for access. Are you using chunked datasets with compression? Is read-only OK, or do you need read-write? To what degree does this have to scale?

Would implementing something like in h5py make it to the roadmap of this package for the near future?

The custom file driver approach does not seem very hard to do. It's actually significantly easier to do from Julia, so it's mainly about time and priorities.

We're gearing up for a 0.17 breaking release, so that's where my focus is at the moment.

@simonbyrne
Collaborator

A simple alternative would be to write to a RAM disk, then copy it over.

@denglerchr
Contributor

Our application is an R&D project that involves a line scanner (basically a laser plus a high-definition camera), and reading from the network would be enough as a first step. The data is collected by a C++ program and then distributed to consumers in a batch approximately every second over MQTT (local network only). Some analysis of the collected batch is then to be done in Julia and the result forwarded over MQTT again. The data would be 4 matrices of around 300x4000 Float32 values every second; we wanted to use HDF5 files with blosc-lz4 compression.
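For data of that shape, one batch could be written roughly like this (file name, dataset names, and chunk sizes are made up; `BloscFilter` comes from H5Zblosc.jl, and its compressor configuration is left at the default here rather than explicitly lz4):

```julia
using HDF5, H5Zblosc

# Sketch: one ~1 s batch of four 300x4000 Float32 matrices,
# chunked and compressed with the Blosc filter.
h5open("batch.h5", "w") do f
    for i in 1:4
        m = rand(Float32, 300, 4000)   # stand-in for scanner data
        write_dataset(f, "matrix_$i", m;
                      chunk=(300, 500),
                      filters=BloscFilter())
    end
end
```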

@denglerchr
Contributor

The C++ part still has to be adapted anyway; I am currently working on a specification where I describe this data exchange. I am leaning towards Arrow tbh, but I will try a bit more with HDF5, as this was used in a similar project. Maybe PyCall and h5py would also be a solution.

@mkitti
Member

mkitti commented Jun 3, 2023

OK, you nerd-sniped me. Here is a demonstration of the Core driver:

julia> using HDF5, H5Zblosc, CRC32c

julia> checksum(dataset) = crc32c(copy(reinterpret(UInt8, dataset[:])))
checksum (generic function with 1 method)

julia> function create_file_inmemory(dataset = rand(1:10, 256, 256))
           @info "Dataset Checksum" checksum(dataset)
           
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
         
           # Create file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_create(name, HDF5.API.H5F_ACC_EXCL, HDF5.API.H5P_DEFAULT, fapl)
           h5f = HDF5.File(fid, name)
           write_dataset(h5f, "lz4_comp_dataset", dataset, chunk=(16,16), filters=BloscFilter())
           HDF5.API.h5f_flush(h5f, 1)
         
           # Get file image
           buf_len = HDF5.API.h5f_get_file_image(h5f, C_NULL, 0)
           inmemfile = Vector{UInt8}(undef, buf_len)
           HDF5.API.h5f_get_file_image(h5f, inmemfile, length(inmemfile))
           
           # Finish
           close(h5f)
           return inmemfile
       end
create_file_inmemory (generic function with 2 methods)

julia> function read_file_inmemory(inmemfile::Vector{UInt8})
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
           HDF5.API.h5p_set_file_image(fapl, inmemfile, length(inmemfile))
        
           # Open the file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_open(name, HDF5.API.H5F_ACC_RDONLY, fapl)
           h5f = HDF5.File(fid, name)
           display(h5f)
           dataset = h5f["lz4_comp_dataset"][]
           
           # Finish
           close(h5f)
           @info "Dataset Checksum" checksum(dataset)
           return dataset
       end
read_file_inmemory (generic function with 1 method)

julia> inmemfile = create_file_inmemory();
┌ Info: Dataset Checksum
└   checksum(dataset) = 0x3ecfcea2

julia> read_file_inmemory(inmemfile);
🗂️ HDF5.File: (read-only) inmemtest
└─ 🔢 lz4_comp_dataset
┌ Info: Dataset Checksum
└   checksum(dataset) = 0x3ecfcea2

julia> write("ondisk.h5", inmemfile)
117208

julia> run(`h5ls -v ondisk.h5`)
Opened "ondisk.h5" with sec2 driver.
lz4_comp_dataset         Dataset {256/256, 256/256}
    Location:  1:800
    Links:     1
    Chunks:    {16, 16} 2048 bytes
    Storage:   524288 logical bytes, 100352 allocated bytes, 522.45% utilization
    Filter-0:  blosc-32001 OPT {2, 2, 8, 2048, 5, 1, 0}
    Type:      native long
Process(`h5ls -v ondisk.h5`, ProcessExited(0))

@mkitti
Member

mkitti commented Jun 3, 2023

#1077 should make reading and writing files from memory easier.

@denglerchr
Contributor

Wow, thanks so much, this would have taken me quite a while to figure out, if I managed at all! You are the best @mkitti!
It is exactly what we need, and I think this should also be what @ExpandingMan was looking for.
