Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API lacks handling for async ML device errors on the context #477

Closed
bbernhar opened this issue Nov 6, 2023 · 5 comments
Closed

API lacks handling for async ML device errors on the context #477

bbernhar opened this issue Nov 6, 2023 · 5 comments

Comments

@bbernhar
Copy link

bbernhar commented Nov 6, 2023

What happens if a WebNN operation dispatched through MLContext encounters some internal error which causes the GPU device to get removed?

I would expect WebNN to provide a spec into how fatal (device) errors are handled so the WebNN developer could respond appropriately. If we want to do more with MLContext (ex. create buffers), I believe we'll need a more robust error mechanism like WebGPU [1].

[1] https://www.w3.org/TR/webgpu/#errors-and-debugging

@bbernhar
Copy link
Author

@anssiko this issue is for non-interop and still needs a follow-up proposal (low priority atm).

@zolkis
Copy link
Collaborator

zolkis commented May 30, 2024

Or, should we add plain error event(s) to MLContext?
Was there a specific design reason for the WebGPU type of error handling, i.e. not using error events, but tracking a promise on lost()?

After a device is lost (described below), errors are no longer surfaced. (...) Additionally, no errors are generated by the device loss itself. Instead, the GPUDevice.lost promise resolves to indicate the device is lost.

@inexorabletash
Copy link
Member

A promise has slightly better ergonomics because (1) the transition to "lost" only happens once for a device, and (2) it works even if your code runs after the state has changed; you aren't forced to add an event listener immediately.

https://www.w3.org/2001/tag/doc/promises-guide#when-to-use

@anssiko
Copy link
Member

anssiko commented Nov 6, 2024

@bbernhar if you agree with the general direction of the proposal #778 please feel free to close this issue to focus our attention.

@bbernhar
Copy link
Author

@anssiko SGTM to move the discussion there.

@anssiko anssiko added the bug label Jan 15, 2025
fdwr pushed a commit that referenced this issue Jan 16, 2025
…reporting (#744)

* Specify behavior around context loss and error reporting.

Based on @mingmingtasd's work in the Chromium prototype
implementation.

For #477

* Add link for 'settled'

* Add 'is lost' check to build()

* Add 'is not lost' dfn alias

* Add "can (not) build" defn, fix throwing vs. rejecting for methods

* Make build() and compute() explicitly invoke context lost steps

* reword to reduce some indenting

* restore missing algorithm class

* Specify various destroy() methods

* Lint/wording fix

* whitespace change

* another whitespace change

* Move destroy of dependencies to 'lost' steps and other feedback

* Abort steps / reject promises when context is lost

* Separate out lost/lose steps, make dispatch() use abort-when

* Incorporate more feedback, another lint rule

* Include tensor creation in MLContext/destroy() note

* Move async 'abort when'/'if aborted' into substeps

* Update index.bs

Co-authored-by: Reilly Grant <[email protected]>

* Add suggested wording by @reillyeon for freeing graph resources on context timeline

---------

Co-authored-by: Reilly Grant <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants