
reading and writing from generic IO objects #552

Open
ExpandingMan opened this issue May 21, 2019 · 22 comments
@ExpandingMan

It is important to be able to read and write from generic IO objects rather than just files. This matters whenever you need to store data over a network, for example in AWS S3 rather than on the local file system.

I don't know how cooperative the HDF5 library is going to be with this. Skimming through the code, it does not look like it will be easy.

@ggggggggg
Contributor

Can you be more specific? E.g., do you want to do something like this?

s3_bucket_with_hdf5_file_in_it = S3.openbucket()
h5 = HDF5.h5open(s3_bucket_with_hdf5_file_in_it, "rw")
h5["a"]=4
x = h5["b"]

@ExpandingMan
Author

In that case it would involve writing to an IOBuffer object, taking the Vector{UInt8} buffer, and sending it to AWS via HTTP. Something like

io = IOBuffer()
get_data_from_s3!(s3, io)  # hypothetical: fill `io` with the file's bytes from S3
h5 = h5open(io)            # the missing piece: open an HDF5 file from a generic IO

I suppose.

@baumgold
Contributor

Hi @ExpandingMan. Did you ever manage to find a solution to being able to read HDF5 files from an S3 object store? The HDF5/S3 connector may be useful here, but I'm not sure if this has since been solved in a different way. Thanks!

@mkitti
Member

mkitti commented Jul 27, 2022

The canonical way would be to use HDF5 virtual file drivers. https://docs.hdfgroup.org/hdf5/v1_12/_v_f_l.html
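In HDF5.jl, a virtual file driver is selected through a file access property list. A minimal sketch of that selection, using the Core (in-memory) driver as an example (nothing here is S3-specific; it only shows where a driver plugs in):

```julia
using HDF5

# Select a virtual file driver via a file access property list.
# Core keeps the whole file in memory; backing_store=false means
# nothing is ever flushed to disk.
fapl = HDF5.FileAccessProperties()
fapl.driver = HDF5.Drivers.Core(; backing_store=false)
```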

@baumgold
Contributor

@mkitti - that’s what I suspected. Any idea if this is available/integrated with HDF5.jl? My understanding is virtual file drivers need to be selected at HDF5 compile-time. I presume we’ll need some changes from HDF5_jll to get this support?

@mkitti
Member

mkitti commented Jul 27, 2022

We started adding support for drivers here:
https://github.com/JuliaIO/HDF5.jl/blob/master/src/drivers/drivers.jl

We may be able to use the Core driver to read an I/O stream completely into memory and use that.

@ExpandingMan
Author

I haven't had many occasions to use HDF5, but when I did I was certainly resorting to temp files, which is not ideal. On Linux it's easy to keep this all in memory (you can use /dev/shm or another in-memory directory), but there is probably still a lot of overhead to that, so it's in no way an ideal solution.

@gszep
Contributor

gszep commented Sep 11, 2022

At the very least, could this library support the ROS3 driver written by the HDF Group? Perhaps following the equivalent h5py PR: h5py/h5py#1755. I recommend aiming for the following solution:

h5open(s3path; driver=Drivers.ROS3()) do file
    file
end

@mkitti
Member

mkitti commented Sep 11, 2022

The ROS3 driver seems quite distinct from the rest of the issue. Could you create a new issue, please?

@denglerchr
Contributor

Hello, I am trying to send and receive HDF5 files over the network without writing to a file. I think the only way to do this would be to read from a generic IO object or a byte array. Is there any update on this? Can it be done at this point?

@mkitti
Member

mkitti commented Jun 1, 2023

I think we might be able to do this via H5FD_CORE via HDF5.Drivers.Core and HDF5.API.h5p_set_file_image
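A rough sketch of what that could look like for reading from a generic IO, assuming the stream's bytes form a complete HDF5 file. The function name is made up; the low-level calls are the ones named above (untested):

```julia
using HDF5

# Sketch: open an HDF5 "file" from any IO by slurping it into a byte
# buffer and handing that buffer to the Core driver as a file image.
function h5open_from_io(io::IO)
    buf = read(io)                                   # Vector{UInt8} of file bytes
    fapl = HDF5.FileAccessProperties()
    fapl.driver = HDF5.Drivers.Core(; backing_store=false)
    HDF5.API.h5p_set_file_image(fapl, buf, length(buf))
    # The "name" is only a label; no file with this name is touched on disk.
    fid = HDF5.API.h5f_open("inmem", HDF5.API.H5F_ACC_RDONLY, fapl)
    return HDF5.File(fid, "inmem")
end
```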

@mkitti
Member

mkitti commented Jun 1, 2023

See also https://portal.hdfgroup.org/display/HDF5/HDF5+File+Image+Operations#HDF5FileImageOperations-1.IntroductiontoHDF5FileImageOperations

Basically, I think we have exposed the underlying low-level C API to do this in Julia, but have not created a high level API for this.

@denglerchr
Contributor

Thanks. Unfortunately I am not familiar with the low-level API at all, but I'll see if I can get this to work somehow. I found that h5py supports this already; maybe one day this could work in Julia as well by just passing an IO object instead of a filename?
From https://docs.h5py.org/en/stable/high/file.html?highlight=driver#h5py.File.driver

"""Create an HDF5 file in memory and retrieve the raw bytes

This could be used, for instance, in a server producing small HDF5
files on demand.
"""
import io
import h5py

bio = io.BytesIO()
with h5py.File(bio, 'w') as f:
    f['dataset'] = range(10)

data = bio.getvalue() # data is a regular Python bytes object.
print("Total size:", len(data))
print("First bytes:", data[:10])

@mkitti
Member

mkitti commented Jun 2, 2023

I looked into how they implemented that. They implemented a virtual file driver:
https://github.com/h5py/h5py/blob/2e95e93b1331fd6b9c43dea38c863642624d319c/h5py/h5fd.pyx#L87-L101

In [77]: with h5py.File(bio, "w") as f:
    ...:     print(f._id.get_access_plist().get_driver())
    ...: 
576460752303423496

In [78]: h5py.h5fd.fileobj_driver
Out[78]: 576460752303423496

This is a bit overkill if all you need to do is read it into memory though.

@denglerchr
Contributor

Would implementing something like h5py's file-object driver make it onto the roadmap of this package in the near future? We have a project that will require such data over the network, and HDF5 was used previously. An alternative might be to send the data flattened as vectors in the Arrow format, though.

@mkitti
Member

mkitti commented Jun 2, 2023

We have a project that would require such data over network in the future and HDF5 was used previously

Have you considered the ROS3 (read only S3) driver? Do you need write capability over the network as well?

Another approach is detailed here:
https://medium.com/pangeo/cloud-performant-netcdf4-hdf5-with-zarr-fsspec-and-intake-3d3a3e7cb935

It might be good to fully understand what you mean by network here and what your requirements are for access. Are you using chunked datasets with compression? Is read-only OK, or do you need read-write? To what degree does this have to scale?

Would implementing something like in h5py make it to the roadmap of this package for the near future?

The custom file driver approach does not seem very hard to do. It's actually significantly easier to do from Julia, so it's mainly about time and priorities.

We're gearing up for a 0.17 breaking release, so that's where my focus is at the moment.

@simonbyrne
Collaborator

A simple alternative would be to write to a RAM disk, then copy it over.

@denglerchr
Contributor

Our application is an R&D project that involves a line scanner (basically a laser plus a high-definition camera), and reading from the network would be enough as a first step. The data is collected by a C++ program and then distributed to consumers in a batch approximately every second over MQTT (local network only). Some analysis of the collected batch is then to be done in Julia and the result forwarded over MQTT again. The data would be 4 matrices of around 300x4000 Float32 values every second; we wanted to use HDF5 files with blosc-lz4 compression.
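For data of that shape, one batch could be written roughly like this (file name, dataset names, and chunk sizes are made up; `BloscFilter` comes from H5Zblosc.jl, and its compressor configuration is left at the default here rather than explicitly lz4):

```julia
using HDF5, H5Zblosc

# Sketch: one ~1 s batch of four 300x4000 Float32 matrices,
# chunked and compressed with the Blosc filter.
h5open("batch.h5", "w") do f
    for i in 1:4
        m = rand(Float32, 300, 4000)   # stand-in for scanner data
        write_dataset(f, "matrix_$i", m;
                      chunk=(300, 500),
                      filters=BloscFilter())
    end
end
```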

@denglerchr
Contributor

The C++ part still has to be adapted anyway; I am currently working on a specification where I describe this data exchange. I am leaning towards Arrow tbh, but I will try a bit more with HDF5, as this was used in a similar project. Maybe PyCall and h5py would also be a solution.

@mkitti
Member

mkitti commented Jun 3, 2023

OK, you nerd-sniped me. Here is a demonstration of the Core driver:

julia> using HDF5, H5Zblosc, CRC32c

julia> checksum(dataset) = crc32c(copy(reinterpret(UInt8, dataset[:])))
checksum (generic function with 1 method)

julia> function create_file_inmemory(dataset = rand(1:10, 256, 256))
           @info "Dataset Checksum" checksum(dataset)
           
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
         
           # Create file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_create(name, HDF5.API.H5F_ACC_EXCL, HDF5.API.H5P_DEFAULT, fapl)
           h5f = HDF5.File(fid, name)
           write_dataset(h5f, "lz4_comp_dataset", dataset, chunk=(16,16), filters=BloscFilter())
           HDF5.API.h5f_flush(h5f, 1)
         
           # Get file image
           buf_len = HDF5.API.h5f_get_file_image(h5f, C_NULL, 0)
           inmemfile = Vector{UInt8}(undef, buf_len)
           HDF5.API.h5f_get_file_image(h5f, inmemfile, length(inmemfile))
           
           # Finish
           close(h5f)
           return inmemfile
       end
create_file_inmemory (generic function with 2 methods)

julia> function read_file_inmemory(inmemfile::Vector{UInt8})
           # Create File Access Property List
           fapl = HDF5.FileAccessProperties()
           fapl.driver = HDF5.Drivers.Core(; backing_store=false)
           HDF5.API.h5p_set_file_image(fapl, inmemfile, length(inmemfile))
        
           # Open the file in memory
           name = "inmemtest"
           fid = HDF5.API.h5f_open(name, HDF5.API.H5F_ACC_RDONLY, fapl)
           h5f = HDF5.File(fid, name)
           display(h5f)
           dataset = h5f["lz4_comp_dataset"][]
           
           # Finish
           close(h5f)
           @info "Dataset Checksum" checksum(dataset)
           return dataset
       end
read_file_inmemory (generic function with 1 method)

julia> inmemfile = create_file_inmemory();
┌ Info: Dataset Checksum
└   checksum(dataset) = 0x3ecfcea2

julia> read_file_inmemory(inmemfile);
🗂️ HDF5.File: (read-only) inmemtest
└─ 🔢 lz4_comp_dataset
┌ Info: Dataset Checksum
└   checksum(dataset) = 0x3ecfcea2

julia> write("ondisk.h5", inmemfile)
117208

julia> run(`h5ls -v ondisk.h5`)
Opened "ondisk.h5" with sec2 driver.
lz4_comp_dataset         Dataset {256/256, 256/256}
    Location:  1:800
    Links:     1
    Chunks:    {16, 16} 2048 bytes
    Storage:   524288 logical bytes, 100352 allocated bytes, 522.45% utilization
    Filter-0:  blosc-32001 OPT {2, 2, 8, 2048, 5, 1, 0}
    Type:      native long
Process(`h5ls -v ondisk.h5`, ProcessExited(0))

@mkitti
Member

mkitti commented Jun 3, 2023

#1077 should make reading and writing files from memory easier.

@denglerchr
Contributor

Wow, thanks so much, this would have taken me quite a while to figure out, if I managed at all! You are the best @mkitti!
It is exactly what we need, and I think this should also be what @ExpandingMan was looking for.
