Remote hdf5 file access #37

Closed
jrobinso opened this issue Jan 13, 2023 · 27 comments

@jrobinso

Hi, awesome project. Forgive me if this has been answered or if I've missed something.

I'm interested in reading objects (the root group and specific named datasets) from large hdf5 files remotely, that is, by URL, without reading the entire file into an array buffer. First, is this possible out of the box? I assume it is not, having spent some hours experimenting and looking at the source, but perhaps I have missed something.

Thanks for your attention.

@bmaranville
Member

Your conclusions are correct: it would be impossible to use jsfive this way, because it holds the entire file contents in an ArrayBuffer and reads that buffer synchronously.

What you are suggesting is possible, though not exactly "out of the box", with h5wasm, a sibling library to jsfive that uses the HDF5 C API compiled to WASM, along with another library, lazyFileLRU.

There is a demo of what you're asking for running at https://bmaranville.github.io/lazyFileLRU/ with source code at https://github.com/bmaranville/lazyFileLRU. There are caveats to this solution, though:

  • it must run in a Web Worker (as is done in the demo), because synchronous reads are required and fetch() on the main page is always async
  • HTTP Range requests must be enabled on the server where the files are being sourced (see the sketch below for a quick way to check)
  • certain headers are required on the server - see Enable ROS3 Driver h5wasm#12 (comment)
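
As a quick check of the Range-request caveat, a minimal sketch (the URL is a placeholder):

    // Verify the server honors HTTP Range requests before attempting lazy access.
    // A 206 Partial Content response means partial reads will work; a plain 200
    // means the server ignored the Range header and returned the whole file.
    const response = await fetch("https://example.com/data.h5", {
      headers: { Range: "bytes=0-1023" },
    });
    console.log(response.status, response.headers.get("Content-Range"));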

@jrobinso
Author

Thanks for confirming! I'm not sure we will be able to use the Web Worker solution, and we don't have control of the servers our users use. We do require Range header support.

I am going to attempt to modify jsfive to support this, at least for our very narrow use case. It looks like datasets are decoded in struct.unpack_from, which takes the buffer and an offset. My thought is to range-query for this buffer on demand and adjust the offset, probably just before calling unpack_from. I've already confirmed I can read the root group and other metadata I need by range-querying the first few kb of the file. I have some thoughts on a more general solution, but this is where I'll start.
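
To make that concrete, a rough sketch of the fetch-on-demand idea (the helper name is hypothetical, and caching/coalescing of nearby requests is omitted):

    // Hypothetical helper: fetch only the bytes needed before a call like
    // struct.unpack_from(fmt, buffer, offset), instead of holding the whole
    // file in memory.
    async function fetchRange(url, start, length) {
      const response = await fetch(url, {
        headers: { Range: `bytes=${start}-${start + length - 1}` },
      });
      if (response.status !== 206) {
        throw new Error("server did not honor the Range request");
      }
      return await response.arrayBuffer();
    }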

@bmaranville
Member

Ah great! I suspect you're going to need to use an async version of jsfive to accomplish what you want, since fetch() or XMLHttpRequest are always async outside of a web worker.

I did make such a version, which can be found in the https://github.com/usnistgov/jsfive/tree/async branch.

If you want to share your progress, I'd be interested in adding this feature to the jsfive library (though maybe in a separate jsfive-async package).
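
For anyone following along, usage of the async branch looks roughly like this (a sketch: the constructor is the same as sync jsfive, but lookups like get() return promises; exact constructor and accessor details may differ in the branch):

    // Sketch of async-jsfive usage, consistent with calls later in this thread.
    import * as hdf5 from "jsfive";

    const buffer = await (await fetch("data.h5")).arrayBuffer();
    const hdfFile = new hdf5.File(buffer, "data.h5");
    // in the async branch, lookups are awaited rather than synchronous
    const dataset = await hdfFile.get("/some/group/dataset");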

@jrobinso
Author

@bmaranville OK, thanks for the info and encouragement. I'll report back here if I make progress.

@jrobinso
Author

@bmaranville I have this working, in theory. You had done most of the work already in the async branch. I forked it and implemented my bit in the fork's async branch:

https://github.com/igvteam/jsfive-async/tree/async

I say it works in theory, but this use case might be fundamentally impossible in practice due to the design of an hdf5 file itself. The problem occurs when I try to load any dataset from the group '/replica10_chr1/spatial_position' in the test dataset here:

https://dl.dropboxusercontent.com/s/4ncekfdrlvkjjj6/spleen_1chr1rep.cndb?dl=0

Attempting to load any dataset in that group results in an explosion of seeks for tiny bits of information all over the file. This is true even if I load the dataset directly by absolute path like this:

const spDataset = await hdfFile.get('/replica10_chr1/spatial_position/1')

It's quite difficult to determine where this explosion is coming from, but it seems to be a b-tree. Async debugging is not what it could be. I think it's walking a b-tree to find the file offset of the dataset. If you have any thoughts, I would appreciate hearing them.

The single dataset "genomic_position" under the group /replica10_chr1/ is loaded in one seek and does not trigger this b-tree explosion.

[Screenshot attached: Screen Shot 2023-01-14 at 9.28.12 PM]

@bmaranville
Member

B-trees get used in a few places in an HDF5 file. In particular, if your dataset is "chunked", the chunks are not guaranteed to be contiguous, and their file locations are indexed in a B-tree.

Chunking is enabled automatically (and is required) for any dataset that has filters turned on, such as compression (I would guess the most common filter). Any dataset created as "resizable" will also be chunked. There may be other triggers as well.

@jrobinso
Author

@bmaranville Thanks again for your input. I understand chunking, but I don't think that is what's happening here. The dataset I am loading has a "contiguous" layout, and it loads very fast, in one seek. There are thousands of individual seeks before this point, however. After loading one dataset, I can load any of the others in this group (there are 10,000 in this group) in a single seek. That is why I was speculating that the b-tree is an index to the datasets' file positions. The datasets themselves in this file are contiguous; even if they were chunked, a few tens of seeks wouldn't be a major issue.

I created a method as a test to load a dataset directly given its file offset. It loads in a single seek. So a solution for our use case might be to build an index of the file offsets of all groups and datasets in an external file, then use that to address the objects. It would be great if that index could be inserted as a new dataset into the hdf5 file without disturbing the locations of the existing objects, but I doubt that is possible.

@jrobinso
Author

If you're interested, the test I am running is here: https://github.com/igvteam/jsfive-async/blob/async/test/testTraces.js. The b-tree explosion is triggered when the first of the 9,999 datasets in "spatial_position" is loaded, at line 84. The next dataset loaded from that group (line 96) loads in a single seek.

In the second unit test I load the same dataset directly by file offset (line 143; I created a new function for that). It loads in a single seek.

@bmaranville
Member

Wow, you aren't kidding - it seeks all over the 600 MB file just to load the dataset names (and the datasets' file offsets) in the 'spatial_position' group. The b-tree in question is holding the group metadata, I think. Right now jsfive is set up to read all the links when a group is initialized - otherwise there's no easy way to do a child lookup by name!

My best guess is that the datasets were populated incrementally, and the writer kept having to expand the group metadata into new regions near the end of the file after each write. Have you tried running h5repack, maybe with a bigger --metadata_block_size? (https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Repack) It's possible that just running repack will put all the metadata together, which would greatly improve your performance.

@jrobinso
Author

jrobinso commented Jan 16, 2023

Yes, it's walking the tree to create the links on the group. So I take it this isn't typical, or at least isn't unavoidable.

I've implemented a solution that will probably work for us if all else fails. I walk the tree in advance and build an index, basically dataset name -> offset associations, in an external file. If that file is present, an index is built from it and used instead of get_links(). My next step will be to see if this index can be appended to the hdf5 file without disturbing the offsets of the existing objects, to avoid maintaining an external file. Yes, it's a hack. If repacking works, that would be preferable.
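
For illustration, the external index has roughly this shape (the offsets are made up; each group's full name maps to its precomputed links, i.e. child name -> file offset, matching what get_links() would have returned):

    // Hypothetical index contents, serialized to JSON in the external file.
    const linkIndex = {
      "/replica10_chr1/spatial_position": {
        "1": 1048576,
        "2": 1064992
        // ... one entry per dataset in the group
      }
    };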

This jsfive code is easy to work with, BTW, nice job.

@jrobinso
Author

jrobinso commented Jan 16, 2023

Repacking didn't have any effect. I tried this:

 h5repack -i spleen_1chr1rep.hdf5 -o spleen_1chr1rep.repacked.hdf5 -M 10000000  

@jrobinso
Author

jrobinso commented Jan 16, 2023

A very dramatic speed increase with an index, even for local files. This is promising. Test case is here: https://github.com/jrobinso/jsfive-async/blob/async/test/testCNDB.js.

@bmaranville
Member

For indexing the chunks in HDF5 files, it looks like some others have attacked this problem in detail - see https://fsspec.github.io/kerchunk/index.html

@jrobinso
Author

Thanks, that's interesting, but it's not the problem here, unless I've missed something. The datasets themselves in this file are small, generally about 16 kb, and not chunked. Once the offset to a dataset is known, it loads in milliseconds. The problem is that there are 10,000 of them, so building the links for the offsets requires walking all over the file.

I encountered this problem around 10 years ago in another project, igv, which also had tens (actually hundreds) of thousands of small datasets. After lots of helpful back and forth with the HDF5 team, I was told that HDF5 is not designed for lots of small containers, rather a few large containers. I didn't pursue it further at the time. I think we are experiencing the same issue.

Indexing the containers is working great, better than I expected, and is a solution for our particular use case. So we're likely to go forward using this fork. I was going to ask you for suggestions on what to call this module. The working name is "jsfive-async", but it occurred to me you might want to reserve that for your own project. This is assuming you don't want to merge what I've done back; it is perhaps an esoteric use case.

I also created a Python project for creating the index: https://github.com/jrobinso/h5-indexer

@bmaranville
Member

Thanks for asking about the name - I was planning to someday release the async branch under the name jsfive-async, since it can't be effectively merged with the sync branch anymore, but the extra logic in your fork does seem a bit too specific to merge back into the general library.

@jrobinso
Author

It's somewhat specific, but it is also very minimal. The only change to the existing classes is this, in the init method for groups. I attach the linkIndex to Group but am not wedded to that. The essential idea is to provide the Group with "_links" from an external source. If jsfive supported that by some means, I would make my project dependent on it.

  async init(dataobjects) {
    if (Group.linkIndex && this.name in Group.linkIndex) {
      this._links = Group.linkIndex[this.name];
    } else {
      this._links = await dataobjects.get_links();
    }
    // ... rest of init unchanged
  }

One possible alternative to the global map above would be to allow the "_links" to be supplied externally, perhaps through the group constructor, then rewrite init as:

  async init(dataobjects, links) {
    this._links = links || await dataobjects.get_links();
    // ... rest of init unchanged
  }

This is for your consideration; I understand if you don't want to include it. However, I don't think our case is completely esoteric: anyone with file designs containing large numbers of containers (datasets, in our case) will have this issue. It's not too noticeable with a local file, because seeks are relatively cheap, but it is noticeable. My "local file" test case takes ~800 ms without the index, 15 ms with it.

Re the name, any suggestions? I will not use jsfive-async, but jsfive-????

@jrobinso
Author

@bmaranville Oh, and BTW, I merged my forked async branch with the main branch without an issue, or at least without any issue that I've noticed yet. Do you have some reason to think it is not mergeable? I'm asking because it's entirely possible I missed something important.

@bmaranville
Member

Thanks! I didn't appreciate how minimal the changes were to the library itself. I would be happy to merge the second version of your change above - where links is an optional second parameter to init.

As for merging, I meant more that future changes implemented in the sync branch will probably have to be manually merged into the async branch (or vice versa), since there will be enough difference between the two that automatic merging will often fail. I think most people will still want to use the sync version, so they will both have to exist.

@jrobinso
Author

OK, good to know. I think what I'll do is move most of my jsfive-async code to another project and leave just the minimal change, with the optional links parameter, in my "jsfive-async" fork. So it will be a dependency. Then you can merge the change at your convenience.

There is some packaging issue with compressed files: pako is not included in the "esm" bundle, so it's not found in a browser. I can solve that with rollup or some other packager (I generally use rollup because it minimally changes the source code), but how do you envision using the esm package in a browser with compressed files (i.e., ones that need pako)? filters.js is looking for it in "../node_modules/pako.js", but that path isn't valid in the dist bundle.
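
For reference, the rollup setup I have in mind is roughly this (a sketch; the entry and output paths are placeholders):

    // rollup.config.mjs - resolve bare imports like "pako" from node_modules
    // and bundle them into a single esm output file.
    import { nodeResolve } from "@rollup/plugin-node-resolve";
    import commonjs from "@rollup/plugin-commonjs";

    export default {
      input: "esm/index.mjs", // placeholder entry point
      plugins: [nodeResolve(), commonjs()],
      output: { file: "dist/bundle.mjs", format: "esm" },
    };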

@bmaranville
Member

I think the pako functions are bundled into jsfive/dist/esm/index.mjs by esbuild - are you using the code in jsfive/esm directly? (I should really rename the root esm folder to src - this project is from early in the ESM years.)

@jrobinso
Author

jrobinso commented Jan 18, 2023 via email

@jrobinso
Author

@bmaranville At the risk of extending a long thread: it turns out I don't need any changes to jsfive-async, necessarily. JavaScript being the wonderfully flexible language it is, I can simply override the init method at runtime, like this. So what began as potentially a lot of changes to jsfive is no changes at all.

So for now I'm going to declare a dependency on your async branch, then do this runtime override of the Group.init method from my project, to be named. If you decide in the future you want to support this sort of thing, it will be trivial to add it to jsfive-async.

        const indexFileContents = await indexReader.read()
        const indexFileJson = new TextDecoder().decode(indexFileContents)
        const index = JSON.parse(indexFileJson)

        Group.prototype.init = async function(dataobjects) {
            if(this.name in index) {
                this._links = index[this.name]
            } else {
                this._links = await dataobjects.get_links();
            }
            this._dataobjects = dataobjects;
            this._attrs = null;  // cached property
            this._keys = null;
        }

@bmaranville
Member

OK - that works for me right now. Are you able to work effectively with the async jsfive code as a dependency in its current form (as a branch on GitHub)? Would it be easier for you if it were packaged in some other way, e.g. as a separate npm package?

@jrobinso
Author

I can be dependent on a branch, in fact on a single commit. Thanks for the discussion, it was enlightening, and for the async jsfive, which makes this possible.
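
For the record, pinning a GitHub branch in package.json looks like this (the repo shown is the async branch discussed above; pinning a single commit works by swapping the branch name for a commit SHA):

    {
      "dependencies": {
        "jsfive": "github:usnistgov/jsfive#async"
      }
    }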

@jrobinso
Author

jrobinso commented Jan 19, 2023

@bmaranville There's a bit of a complication in using the async branch as a dependency. npm-installing the branch installs the source code under "jsfive". This is fine, but "filters.js" has the following dependency:

import * as pako from '../node_modules/pako/dist/pako.esm.mjs';

That of course is not there. So my previous answer was incorrect, or rather incomplete: yes, I can specify the branch as a dependency, but that's not quite enough.

In the meantime I have just built from the async branch myself and included the bundle in my source tree. This works fine. When there is a jsfive-async bundle to import from NPM or elsewhere, it should be a straight swap.

@bmaranville
Member

bmaranville commented Jan 20, 2023

I think the relative path for importing "pako" is a relic. I would like to change that line to:

import * as pako from 'pako/dist/pako.esm.mjs';

After the change, I am able to build distributions that work for me. If you can verify that this change allows you to build your application as well, I will push the change to both the async and master branches.

Edit:

import * as pako from 'pako';

seems to work just as well, in both node and browser builds.

@jrobinso
Author

I don't have a preference, as I am using the distribution build now. I would not call the form with the relative path a relic; AFAIK that is what you have to do for browsers if you want to use the code without transformations. But again, I'm using the built distribution, so the change won't affect me.
