Worker bundle #70

Open: 9 commits to be merged into base: main

bmaranville
Member

Add a new build target: a bundled interface that accesses h5wasm through a Web Worker proxy.

Motivation

  • access to WORKERFS, which allows random access to files without reading the entire file contents into memory
  • HDF5 reads effectively run in a different thread from the page

Differences from standard library

  • all access is async

Usage

import { save_to_workerfs, save_to_memfs, save_bytes_to_memfs, get_file_proxy } from 'h5wasm/worker/worker_proxy_bundle.js';

const loaded_files = [];
const workerfs_input = document.getElementById("save_to_workerfs");
const load_plugin_input = document.getElementById("load_plugin");
workerfs_input.addEventListener("change", async (event) => {
  const file = event.target.files[0];
  const filepath = await save_to_workerfs(file);
  loaded_files.push(filepath);
});
load_plugin_input.addEventListener("change", async (event) => {
  const file = event.target.files[0];
  const ab = await file.arrayBuffer();
  const bytes = new Uint8Array(ab);
  const filepath = await save_bytes_to_memfs(`/usr/local/hdf5/lib/plugin/${file.name}`, bytes);
  // console.log({filepath});
});

// ... load a local file called "water_224.h5" in file input
// ... load plugin libH5Zbshuf.so

const h5wasm_file_proxy = await get_file_proxy(loaded_files[0]); // loaded_files[0] === '/workerfs/water_224.h5'
const root_keys = await h5wasm_file_proxy.keys();
// ['entry_0000']
const entry = await h5wasm_file_proxy.get('entry_0000');
// GroupProxy {proxy: Proxy(Function), file_id: 72057594037927938n}
await entry.keys();
// ['0_measurement', '1_integration', '2_cormap', '3_time_average', '4_azimuthal_integration', 'BM29', 'program_name', 'start_time', 'title', 'water']
const dset = await entry.get('0_measurement/images');
await dset.metadata;
// {signed: true, type: 0, cset: -1, vlen: false, littleEndian: true, …}
await dset.shape;
// [10, 1043, 981]
let s = await dset.slice([[0,1]]);
// Int32Array(1023183) [2, 0, 2, 0, 2, 1, 2, 2, 0, 0, 3, 0, 2, 4, 4, 1, 2, 3, 0, 1, 3, 0, 0, 3, 2, 4, 2, 7, 1, 1, 3, 3, 3, 2, 2, 2, 2, 0, 1, 6, 1, 1, 1, 1, 1, 2, 3, 1, 1, 2, 1, 3, 2, 1, 1, 0, 4, 1, 1, 2, 4, 6, 1, 0, 1, 7, 0, 2, 3, 1, 3, 1, 4, 2, 3, 0, 4, 0, 2, 3, 4, 2, 2, 1, 3, 2, 2, 1, 3, 4, 1, 1, 3, 1, 2, 2, 3, 2, 1, 2, …]
console.time('slice'); s = await dset.slice([[1,2]]); console.timeEnd('slice');
// slice: 37.31884765625 ms
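
The length of the Int32Array above follows from the dataset shape: slicing [[0,1]] on a [10, 1043, 981] dataset keeps one frame of 1043 × 981 values. A minimal sketch of that arithmetic, assuming each range is a [start, stop) pair for one leading dimension and that dimensions without a range are taken in full (slice_element_count is a hypothetical helper, not part of h5wasm; steps and negative indices are ignored):

```typescript
// Hypothetical helper for illustration only; not an h5wasm function.
// Assumption: ranges[dim] is [start, stop) and missing ranges mean "whole dimension".
function slice_element_count(shape: number[], ranges: number[][]): number {
  return shape.reduce((count, extent, dim) => {
    const [start = 0, stop = extent] = ranges[dim] ?? [];
    // Clamp stop to the dimension extent and never count a negative span.
    return count * Math.max(0, Math.min(stop, extent) - start);
  }, 1);
}

console.log(slice_element_count([10, 1043, 981], [[0, 1]])); // 1 * 1043 * 981 = 1023183
```

At 4 bytes per Int32 value, one such frame is roughly 4 MB, which is the amount copied between the worker and the page for each slice in this example.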

type ACCESS_MODESTRING = keyof typeof ACCESS_MODES;

const worker = new DedicatedWorker(); // new Worker('./worker.js');
const remote = Comlink.wrap(worker) as Comlink.Remote<typeof api>;
Collaborator

Instead of exporting the api object, have you tried exporting its type?

export type WorkerApi = typeof api;

This might help remove the need for the two files, lib_worker and h5wasm.worker.ts? 🤷

Member Author

Ha! ...this is why it's great to have a TypeScript pro review your code...


async function save_to_workerfs(file: File) {
const { FS, WORKERFS, mount } = await workerfs_promise;
const { name: filename, size } = file;
Collaborator

In H5Web, I generate a random ID with nanoid and use it as the filename when writing to Emscripten's file system. This removes any issue/complexity with parsing paths or dealing with conflicting filenames.

Member Author

It's a good idea. nanoid has the right number of dependencies.
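
A minimal sketch of the random-ID idea discussed above, assuming the ID generator is injected so that nanoid (or anything else) can be plugged in. The names unique_fs_path and fallback_id are hypothetical and do not come from this PR or from H5Web; the fallback generator is only a stand-in and is not collision-resistant the way nanoid is.

```typescript
// Stand-in generator for this sketch only; in practice pass nanoid instead.
function fallback_id(): string {
  const time_part = Date.now().toString(36);
  const random_part = Math.floor(Math.random() * 1e9).toString(36);
  return `${time_part}-${random_part}`;
}

// Hypothetical helper: builds a path under mount_dir from a generated ID,
// so the user-supplied file name never has to be parsed or de-duplicated.
function unique_fs_path(
  mount_dir: string,
  make_id: () => string = fallback_id,
): string {
  const dir = mount_dir.replace(/\/+$/, ""); // drop trailing slashes
  return `${dir}/${make_id()}`;
}

console.log(unique_fs_path("/workerfs", () => "abc123")); // /workerfs/abc123
```

With this shape, the original file.name would only be kept for display, while every path handed to the Emscripten file system is generated.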

return await this.proxy.paths();
}

async get(name: string = "/") {
Collaborator

@axelboc axelboc Mar 7, 2024

OK, I think I understand: you basically have to reimplement your own Group#get method that returns proxied entity objects, because the objects returned by the original method lose all their methods.

The downside I see is that we're basically exposing a Group class from the worker that doesn't quite work as designed. Also, when we get an entity at a given path, we end up either with a "native" h5wasm entity (Dataset, Datatype, etc.) or with a GroupProxy, which might be confusing.

I have an idea for a slightly different approach:

  • Instead of the "native" h5wasm Group, Dataset, etc. classes, expose subclasses that forbid calling all unsupported methods, like Group#get.
  • Provide a functional utility, maybe get_entity(path), that implements the same logic as Group#get (including calling Module#get_type, Module#get_external_link, etc.) but returns instances of the proxied subclasses.
class WorkerGroup extends Group {
  public override get(path: string) {
    throw new Error('Method `get` not supported');
  }
}

Comlink.expose({
  ready: h5wasm.ready,
  WorkerGroup
});

export async function get_entity(fileId: string, entityPath: string): Promise<WorkerGroup | WorkerDataset | ...> {
  const module = await remote.ready;
  const kind = await module.get_type(fileId, entityPath);

  switch (kind) {
    case "Group":
      return new remote.WorkerGroup(fileId, entityPath);
    ...
    default:
      throw new Error(...);
  }
}

Collaborator

With this subclass approach, we can also override some methods to return Transferable values to avoid copying buffers and increase performance.
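
To illustrate what transferring buys over copying, here is a minimal sketch that uses structuredClone with a transfer list as a stand-in for the postMessage call made between worker and page. With Comlink, the return value would instead be wrapped with Comlink.transfer(value, [buffer]); send_by_transfer is a hypothetical name used only here.

```typescript
// Sketch only: structuredClone stands in for the worker-to-page message.
// Listing the buffer as transferable moves its memory instead of copying it.
function send_by_transfer(buffer: ArrayBuffer): ArrayBuffer {
  return structuredClone(buffer, { transfer: [buffer] });
}

const source_buffer = new ArrayBuffer(16);
new Int32Array(source_buffer).set([2, 0, 2, 0]);

const received = new Int32Array(send_by_transfer(source_buffer));

console.log(received.length);          // 4
console.log(source_buffer.byteLength); // 0, because the sender's buffer is now detached
```

The trade-off is that the sending side can no longer use the buffer after the call, which is fine for slice results that the worker does not need to keep.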

Member Author

I agree that it will be important for performance to refine this so it transfers large objects instead of copying them.

For the Group proxy though, I would hate to lose the ability to use Group.get... I feel like that's an important part of the current API, particularly when this is being used in an interactive session, where you might want to explore a Group or its children without having to reconstruct the whole HDF path every time (and append to it). I understand the reasons for including a get_entity(<absolute_hdf_path>) method, but can't we do both?

Collaborator

Fair enough, it's true that they are not incompatible approaches. I guess you could just call get_entity from GroupProxy#get and if the result is a WorkerGroup, wrap it in another GroupProxy. 👍
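
The combination described here could look roughly like the sketch below. Everything in it is a hypothetical stand-in: FAKE_FILE replaces the worker-side HDF5 file, get_entity takes only an absolute path (no file ID), and WorkerEntity is a plain object rather than a proxied class. The point is only the control flow: GroupProxy#get resolves a relative name against its own path, delegates to get_entity, and wraps groups in another GroupProxy.

```typescript
// Hypothetical stand-ins; none of these types come from h5wasm or this PR.
type EntityKind = "Group" | "Dataset";
interface WorkerEntity { kind: EntityKind; path: string; }

// Toy lookup table standing in for the file held by the worker.
const FAKE_FILE: Record<string, EntityKind> = {
  "/": "Group",
  "/entry_0000": "Group",
  "/entry_0000/0_measurement": "Group",
  "/entry_0000/0_measurement/images": "Dataset",
};

// Stand-in for the absolute-path lookup proposed in the review.
async function get_entity(abs_path: string): Promise<WorkerEntity> {
  const kind = FAKE_FILE[abs_path];
  if (kind === undefined) throw new Error(`No entity at ${abs_path}`);
  return { kind, path: abs_path };
}

// Absolute names pass through; relative names are appended to the base path.
function join_hdf_path(base: string, name: string): string {
  if (name.startsWith("/")) return name;
  return base.endsWith("/") ? base + name : `${base}/${name}`;
}

class GroupProxy {
  constructor(public readonly path: string) {}

  // Relative lookup built on get_entity; groups are re-wrapped so that
  // callers can keep navigating without rebuilding the full HDF path.
  async get(name: string): Promise<GroupProxy | WorkerEntity> {
    const entity = await get_entity(join_hdf_path(this.path, name));
    return entity.kind === "Group" ? new GroupProxy(entity.path) : entity;
  }
}
```

In this shape, interactive exploration (entry.get('0_measurement/images')) and absolute lookup (get_entity('/entry_0000/0_measurement/images')) share one code path.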

@bmaranville
Member Author

bmaranville commented Mar 7, 2024

This is a derivative product of h5wasm. @axelboc, what do you think about making it a separate (but dependent) package, e.g. https://github.com/h5wasm/h5wasm-worker-proxy or something like that?

It would make the packaging easier, and also simplify importing it into another project.

@axelboc
Collaborator

axelboc commented Mar 7, 2024

Good call!

@bmaranville
Member Author

Moved to https://github.com/h5wasm/h5wasm-worker ... maybe this will be the last move :)
