API lacks handling for async ML device errors on the context #477

bbernhar · 2023-11-06T23:33:22Z

What happens if a WebNN operation dispatched through MLContext encounters some internal error which causes the GPU device to get removed?

I would expect WebNN to specify how fatal (device) errors are handled so that a WebNN developer can respond appropriately. If we want to do more with MLContext (e.g. create buffers), I believe we'll need a more robust error mechanism, like WebGPU's [1].

[1] https://www.w3.org/TR/webgpu/#errors-and-debugging

bbernhar · 2024-05-30T16:04:41Z

@anssiko this issue is for non-interop and still needs a follow-up proposal (low priority atm).

zolkis · 2024-05-30T16:31:58Z

Or, should we add plain error event(s) to MLContext?
Was there a specific design reason for WebGPU's style of error handling, i.e. tracking the GPUDevice.lost promise rather than using error events?

> After a device is lost (described below), errors are no longer surfaced. (...) Additionally, no errors are generated by the device loss itself. Instead, the GPUDevice.lost promise resolves to indicate the device is lost.

inexorabletash · 2024-05-30T19:42:30Z

A promise has slightly better ergonomics because (1) the transition to "lost" only happens once for a device, and (2) it works even if your code runs after the state has changed; you aren't forced to add an event listener immediately.

https://www.w3.org/2001/tag/doc/promises-guide#when-to-use
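Point (2) can be illustrated with a minimal sketch. `FakeDevice`, `addLostListener`, `lose`, and `lateSubscriberDemo` are hypothetical names invented for this example; they are not part of the WebNN or WebGPU APIs. The sketch only assumes a one-time "lost" transition exposed both as a promise and as a plain callback list, and shows what a handler attached after the loss observes in each case.

```typescript
// Hypothetical stand-in for a device/context. Not the WebNN or WebGPU API.
type LostInfo = { message: string };

class FakeDevice {
  // Promise-style signal, loosely modeled on the shape of GPUDevice.lost.
  readonly lost: Promise<LostInfo>;
  private resolveLost!: (info: LostInfo) => void;
  // Event-style signal: a plain list of callbacks.
  private listeners: Array<(info: LostInfo) => void> = [];

  constructor() {
    // The executor runs synchronously, so resolveLost is set before use.
    this.lost = new Promise<LostInfo>((resolve) => {
      this.resolveLost = resolve;
    });
  }

  addLostListener(cb: (info: LostInfo) => void): void {
    this.listeners.push(cb);
  }

  // Simulates a fatal device error: settles the promise and notifies
  // only the listeners registered at this moment.
  lose(message: string): void {
    const info: LostInfo = { message };
    this.resolveLost(info);
    for (const cb of this.listeners) cb(info);
  }
}

async function lateSubscriberDemo(): Promise<{
  viaEvent: string | null;
  viaPromise: string;
}> {
  const device = new FakeDevice();
  device.lose("device removed"); // the loss happens before any handler exists

  let viaEvent: string | null = null;
  // Registered too late: the notification was already delivered.
  device.addLostListener((info) => {
    viaEvent = info.message;
  });

  // The promise is already settled, so awaiting it still yields the loss info.
  const info = await device.lost;
  return { viaEvent, viaPromise: info.message };
}
```

In this sketch the late listener never runs, while the awaited promise still reports the loss, which is the ergonomic difference described above. A promise also cannot settle twice, which matches point (1).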

anssiko · 2024-11-06T07:39:51Z

@bbernhar if you agree with the general direction of the proposal #778 please feel free to close this issue to focus our attention.

bbernhar · 2024-11-13T23:01:47Z

@anssiko SGTM to move the discussion there.

…reporting (#744)

* Specify behavior around context loss and error reporting. Based on @mingmingtasd's work in the Chromium prototype implementation. For #477
* Add link for 'settled'
* Add 'is lost' check to build()
* Add 'is not lost' dfn alias
* Add "can (not) build" defn, fix throwing vs. rejecting for methods
* Make build() and compute() explicitly invoke context lost steps
* reword to reduce some indenting
* restore missing algorithm class
* Specify various destroy() methods
* Lint/wording fix
* whitespace change
* another whitespace change
* Move destroy of dependencies to 'lost' steps and other feedback
* Abort steps / reject promises when context is lost
* Separate out lost/lose steps, make dispatch() use abort-when
* Incorporate more feedback, another lint rule
* Include tensor creation in MLContext/destroy() note
* Move async 'abort when'/'if aborted' into substeps
* Update index.bs (Co-authored-by: Reilly Grant)
* Add suggested wording by @reillyeon for freeing graph resources on context timeline

Co-authored-by: Reilly Grant

anssiko added the webgpu interop label Feb 6, 2024

a-sully mentioned this issue Jun 27, 2024

Inconsistency between MLGraphBuilder and MLBuffer construction #697

Open

inexorabletash added a commit to inexorabletash/webnn that referenced this issue Jul 29, 2024

Specify behavior around context loss and error reporting.

aa2a8b8

Based on @mingmingtasd's work in the Chromium prototype implementation. For webmachinelearning#477

inexorabletash mentioned this issue Jul 29, 2024

Specify destroy() methods and behavior around context loss and error reporting #744

Merged

a-sully mentioned this issue Nov 5, 2024

Proposal: Report non-fatal errors from the WebNN timeline #778

Open

anssiko added question and removed webgpu interop labels Nov 6, 2024

bbernhar closed this as completed Nov 13, 2024

a-sully mentioned this issue Nov 14, 2024

Editorial: Link to error handling proposal from the MLTensor explainer #785

Merged

anssiko added the bug label Jan 15, 2025
