Remote hdf5 file access #37

Closed
jrobinso opened this issue Jan 13, 2023 · 27 comments

@jrobinso

Hi, awesome project. Forgive me if this has been answered or if I've missed something.

I'm interested in reading objects (the root group and specific named datasets) from large hdf5 files remotely, that is, by URL, without reading the entire file into an array buffer. First, is this possible out of the box? I assume it is not, having spent some hours experimenting and looking at the source, but perhaps I have missed something.

Thanks for your attention.

@bmaranville
Member

Your conclusions are correct: it would be impossible to use jsfive this way, because it holds the entire file contents in an ArrayBuffer and reads that buffer synchronously.

What you are suggesting is possible, though not exactly "out of the box", with h5wasm, a sibling library to jsfive that uses the HDF5 C API compiled to WASM, along with another library, lazyFileLRU.

There is a demo of what you're asking for running at https://bmaranville.github.io/lazyFileLRU/ with source code at https://github.com/bmaranville/lazyFileLRU. There are caveats to this solution, though:

  • it must run in a Web Worker (as is done in the demo), because synchronous reads are required and fetch() on the main page is always async
  • HTTP Range requests must be enabled on the server where the files are being sourced (see the sketch below for a quick way to check)
  • certain headers are required on the server - see Enable ROS3 Driver h5wasm#12 (comment)
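
As a quick check of the Range-request caveat, a minimal sketch (the URL is a placeholder):

    // Verify the server honors HTTP Range requests before attempting lazy access.
    // A 206 Partial Content response means partial reads will work; a plain 200
    // means the server ignored the Range header and returned the whole file.
    const response = await fetch("https://example.com/data.h5", {
      headers: { Range: "bytes=0-1023" },
    });
    console.log(response.status, response.headers.get("Content-Range"));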

@jrobinso
Author

Thanks for confirming! I'm not sure we will be able to use the Web Worker solution, and we don't have control of the servers our users use. We do require Range header support.

I am going to attempt to modify jsfive to support this, at least for our very narrow use case. It looks like datasets are decoded in struct.unpack_from, which takes the buffer and an offset. My thought is to range-query for this buffer on demand and adjust the offset, probably just before calling unpack_from. I've already confirmed I can read the root group and other metadata I need by range-querying the first few kb of the file. I have some thoughts on a more general solution, but this is where I'll start.
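
To make that concrete, a rough sketch of the fetch-on-demand idea (the helper name is hypothetical, and caching/coalescing of nearby requests is omitted):

    // Hypothetical helper: fetch only the bytes needed before a call like
    // struct.unpack_from(fmt, buffer, offset), instead of holding the whole
    // file in memory.
    async function fetchRange(url, start, length) {
      const response = await fetch(url, {
        headers: { Range: `bytes=${start}-${start + length - 1}` },
      });
      if (response.status !== 206) {
        throw new Error("server did not honor the Range request");
      }
      return await response.arrayBuffer();
    }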

@bmaranville
Member

Ah great! I suspect you're going to need to use an async version of jsfive to accomplish what you want, since fetch() or XMLHttpRequest are always async outside of a web worker.

I did make such a version, which can be found in the https://github.com/usnistgov/jsfive/tree/async branch.

If you want to share your progress, I'd be interested in adding this feature to the jsfive library (though maybe in a separate jsfive-async package).
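
For anyone following along, usage of the async branch looks roughly like this (a sketch: the constructor is the same as sync jsfive, but lookups like get() return promises; exact constructor and accessor details may differ in the branch):

    // Sketch of async-jsfive usage, consistent with calls later in this thread.
    import * as hdf5 from "jsfive";

    const buffer = await (await fetch("data.h5")).arrayBuffer();
    const hdfFile = new hdf5.File(buffer, "data.h5");
    // in the async branch, lookups are awaited rather than synchronous
    const dataset = await hdfFile.get("/some/group/dataset");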

@jrobinso
Author

@bmaranville OK, thanks for the info and encouragement. I'll report back here if I make progress.

@jrobinso
Author

@bmaranville I have this working, in theory. You had done most of the work already in the async branch. I forked it and implemented my bit in the fork's async branch:

https://github.com/igvteam/jsfive-async/tree/async

I say it works in theory, but this use case might be fundamentally impossible in practice due to the design of an hdf5 file itself. The problem occurs when I try to load any dataset from the group '/replica10_chr1/spatial_position' in the test dataset here:

https://dl.dropboxusercontent.com/s/4ncekfdrlvkjjj6/spleen_1chr1rep.cndb?dl=0

Attempting to load any dataset in that group results in an explosion of seeks for tiny bits of information all over the file. This is true even if I load the dataset directly by absolute path like this:

const spDataset = await hdfFile.get('/replica10_chr1/spatial_position/1')

It's quite difficult to determine where this explosion is coming from, but it seems to be a b-tree. Async debugging is not what it could be. I think it's walking a b-tree to find the file offset of the dataset. If you have any thoughts, I would appreciate hearing them.

The single dataset "genomic_position" under the group /replica10_chr1/ is loaded in one seek and does not trigger this b-tree explosion.

[Screenshot attached: Screen Shot 2023-01-14 at 9.28.12 PM]

@bmaranville
Member

B-trees get used in a few places in an HDF5 file. In particular, if your dataset is "chunked", the chunks are not guaranteed to be contiguous, and their file locations are indexed in a B-tree.

Chunking is enabled automatically (and is required) for any dataset that has filters turned on, such as compression (I would guess the most common filter). Any dataset created as "resizable" will also be chunked. There may be other triggers as well.

@jrobinso
Author

@bmaranville Thanks again for your input. I understand chunking, but I don't think that is what's happening here. The dataset I am loading has a "contiguous" layout, and it loads very fast, in one seek. There are thousands of individual seeks before this point, however. After loading one dataset, I can load any of the others in this group (there are 10,000 in this group) in a single seek. That is why I was speculating that the b-tree is an index to the datasets' file positions. The datasets themselves in this file are contiguous; even if they were chunked, a few tens of seeks wouldn't be a major issue.

I created a method as a test to load a dataset directly given its file offset. It loads in a single seek. So a solution for our use case might be to build an index of the file offsets of all groups and datasets in an external file, then use that to address the objects. It would be great if that index could be inserted as a new dataset into the hdf5 file without disturbing the locations of the existing objects, but I doubt that is possible.

@jrobinso
Author

If you're interested, the test I am running is here: https://github.com/igvteam/jsfive-async/blob/async/test/testTraces.js. The b-tree explosion is triggered when the first of the 9,999 datasets in "spatial_position" is loaded, at line 84. The next dataset loaded from that group (line 96) loads in a single seek.

In the second unit test I load the same dataset directly by file offset (line 143; I created a new function for that). It loads in a single seek.

@bmaranville
Member

Wow, you aren't kidding - it seeks all over the 600 MB file just to load the dataset names (and the datasets' file offsets) in the 'spatial_position' group. The b-tree in question is holding the group metadata, I think. Right now jsfive is set up to read all the links when a group is initialized - otherwise there's no easy way to do a child lookup by name!

My best guess is that the datasets were populated incrementally, and the writer kept having to expand the group metadata into new regions near the end of the file after each write. Have you tried running h5repack, maybe with a bigger --metadata_block_size? (https://support.hdfgroup.org/HDF5/doc/RM/Tools.html#Tools-Repack) It's possible that just running repack will put all the metadata together, which would greatly improve your performance.

@jrobinso
Author

jrobinso commented Jan 16, 2023

Yes, it's walking the tree to create the links on the group. So I take it this isn't typical, or at least isn't unavoidable.

I've implemented a solution that will probably work for us if all else fails. I walk the tree in advance and build an index, basically dataset name -> offset associations, in an external file. If that file is present, an index is built from it and used instead of get_links(). My next step will be to see if this index can be appended to the hdf5 file without disturbing the offsets of the existing objects, to avoid maintaining an external file. Yes, it's a hack. If repacking works, that would be preferable.
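
For illustration, the external index has roughly this shape (the offsets are made up; each group's full name maps to its precomputed links, i.e. child name -> file offset, matching what get_links() would have returned):

    // Hypothetical index contents, serialized to JSON in the external file.
    const linkIndex = {
      "/replica10_chr1/spatial_position": {
        "1": 1048576,
        "2": 1064992
        // ... one entry per dataset in the group
      }
    };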

This jsfive code is easy to work with, BTW, nice job.

@jrobinso
Author

jrobinso commented Jan 16, 2023

Repacking didn't have any effect. I tried this:

 h5repack -i spleen_1chr1rep.hdf5 -o spleen_1chr1rep.repacked.hdf5 -M 10000000  

@jrobinso
Author

jrobinso commented Jan 16, 2023

A very dramatic speed increase with an index, even for local files. This is promising. Test case is here: https://github.com/jrobinso/jsfive-async/blob/async/test/testCNDB.js.

@bmaranville
Member

For indexing the chunks in HDF5 files, it looks like some others have attacked this problem in detail - see https://fsspec.github.io/kerchunk/index.html

@jrobinso
Author

Thanks, that's interesting, but it's not the problem here, unless I've missed something. The datasets themselves in this file are small, generally about 16 kb, and not chunked. Once the offset to a dataset is known, it loads in milliseconds. The problem is that there are 10,000 of them, so building the links for the offsets requires walking all over the file.

I encountered this problem around 10 years ago in another project, igv, which also had tens (actually hundreds) of thousands of small datasets. After lots of helpful back and forth with the HDF5 team, I was told that HDF5 is not designed for lots of small containers, rather a few large containers. I didn't pursue it further at the time. I think we are experiencing the same issue.

Indexing the containers is working great, better than I expected, and is a solution for our particular use case. So we're likely to go forward using this fork. I was going to ask you for suggestions on what to call this module. The working name is "jsfive-async", but it occurred to me you might want to reserve that for your own project. This is assuming you don't want to merge what I've done back; it is perhaps an esoteric use case.

I also created a Python project for creating the index: https://github.com/jrobinso/h5-indexer

@bmaranville
Member

Thanks for asking about the name - I was planning to someday release the async branch under the name jsfive-async, since it can't be effectively merged with the sync branch anymore, but the extra logic in your fork does seem a bit too specific to merge back into the general library.

@jrobinso
Author

It's somewhat specific, but it is also very minimal. The only change to the existing classes is this, in the init method for groups. I attach the linkIndex to Group but am not wedded to that. The essential idea is to provide the Group with "_links" from an external source. If jsfive supported that by some means, I would make my project dependent on it.

  async init(dataobjects) {
    if (Group.linkIndex && this.name in Group.linkIndex) {
      this._links = Group.linkIndex[this.name];
    } else {
      this._links = await dataobjects.get_links();
    }
    // ... rest of init unchanged
  }

One possible alternative to the global map above would be to allow the "_links" to be supplied externally, perhaps through the group constructor, then rewrite init as:

  async init(dataobjects, links) {
    this._links = links || await dataobjects.get_links();
    // ... rest of init unchanged
  }

This is for your consideration; I understand if you don't want to include it. However, I don't think our case is completely esoteric: anyone with file designs containing large numbers of containers (datasets, in our case) will have this issue. It's not too noticeable with a local file, because seeks are relatively cheap, but it is noticeable. My "local file" test case takes ~800 ms without the index, 15 ms with it.

Re the name, any suggestions? I will not use jsfive-async, but jsfive-????

@jrobinso
Author

@bmaranville Oh, and BTW, I merged my forked async branch with the main branch without an issue, or at least without any issue that I've noticed yet. Do you have some reason to think it is not mergeable? I'm asking because it's entirely possible I missed something important.

@bmaranville
Member

Thanks! I didn't appreciate how minimal the changes were to the library itself. I would be happy to merge the second version of your change above - where links is an optional second parameter to init.

As for merging, I meant more that future changes implemented in the sync branch will probably have to be manually merged into the async branch (or vice versa), since there will be enough difference between the two that automatic merging will often fail. I think most people will still want to use the sync version, so they will both have to exist.

@jrobinso
Author

OK, good to know. I think what I'll do is move most of my jsfive-async code to another project and leave just the minimal change, with the optional links parameter, in my "jsfive-async" fork. So it will be a dependency. Then you can merge the change at your convenience.

There is some packaging issue with compressed files: pako is not included in the "esm" bundle, so it's not found in a browser. I can solve that with rollup or some other packager (I generally use rollup because it minimally changes the source code), but how do you envision using the esm package in a browser with compressed files (i.e., ones that need pako)? filters.js is looking for it in "../node_modules/pako.js", but that path isn't valid in the dist bundle.
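
For reference, the rollup setup I have in mind is roughly this (a sketch; the entry and output paths are placeholders):

    // rollup.config.mjs - resolve bare imports like "pako" from node_modules
    // and bundle them into a single esm output file.
    import { nodeResolve } from "@rollup/plugin-node-resolve";
    import commonjs from "@rollup/plugin-commonjs";

    export default {
      input: "esm/index.mjs", // placeholder entry point
      plugins: [nodeResolve(), commonjs()],
      output: { file: "dist/bundle.mjs", format: "esm" },
    };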

@bmaranville
Member

I think the pako functions are bundled into jsfive/dist/esm/index.mjs by esbuild - are you using the code in jsfive/esm directly? (I should really rename the root esm folder to src - this project is from early in the ESM years.)

@jrobinso
Author

jrobinso commented Jan 18, 2023 via email

@jrobinso
Author

@bmaranville At the risk of extending a long thread: it turns out I don't need any changes to jsfive-async, necessarily. JavaScript being the wonderfully flexible language it is, I can simply override the init method at runtime, like this. So what began as potentially a lot of changes to jsfive is no changes at all.

So for now I'm going to declare a dependency on your async branch, then do this runtime override of the Group.init method from my project, to be named. If you decide in the future you want to support this sort of thing, it will be trivial to add it to jsfive-async.

        const indexFileContents = await indexReader.read()
        const indexFileJson = new TextDecoder().decode(indexFileContents)
        const index = JSON.parse(indexFileJson)

        Group.prototype.init = async function(dataobjects) {
            if(this.name in index) {
                this._links = index[this.name]
            } else {
                this._links = await dataobjects.get_links();
            }
            this._dataobjects = dataobjects;
            this._attrs = null;  // cached property
            this._keys = null;
        }

@bmaranville
Member

OK - that works for me right now. Are you able to work effectively with the async jsfive code as a dependency in its current form (as a branch on GitHub)? Would it be easier for you if it were packaged in some other way, e.g. as a separate npm package?

@jrobinso
Author

I can be dependent on a branch, in fact on a single commit. Thanks for the discussion, it was enlightening, and for the async jsfive, which makes this possible.
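
For the record, pinning a GitHub branch in package.json looks like this (the repo shown is the async branch discussed above; pinning a single commit works by swapping the branch name for a commit SHA):

    {
      "dependencies": {
        "jsfive": "github:usnistgov/jsfive#async"
      }
    }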

@jrobinso
Author

jrobinso commented Jan 19, 2023

@bmaranville There's a bit of a complication in using the async branch as a dependency. npm-installing the branch installs the source code under "jsfive". This is fine, but "filters.js" has the following dependency:

import * as pako from '../node_modules/pako/dist/pako.esm.mjs';

That of course is not there. So my previous answer was incorrect, or rather incomplete: yes, I can specify the branch as a dependency, but that's not quite enough.

In the meantime I have just built from the async branch myself and included the bundle in my source tree. This works fine. When there is a jsfive-async bundle to import from NPM or elsewhere, it should be a straight swap.

@bmaranville
Member

bmaranville commented Jan 20, 2023

I think the relative path for importing "pako" is a relic. I would like to change that line to:

import * as pako from 'pako/dist/pako.esm.mjs';

After the change, I am able to build distributions that work for me. If you can verify that this change allows you to build your application as well, I will push the change to both the async and master branches.

Edit:

import * as pako from 'pako';

seems to work just as well, in both node and browser builds.

@jrobinso
Author

I don't have a preference, as I am using the distribution build now. I would not call the form with the relative path a relic; AFAIK that is what you have to do for browsers if you want to use the code without transformations. But again, I'm using the built distribution, so the change won't affect me.
