Specify destroy() methods and behavior around context loss and error reporting #744

inexorabletash · 2024-07-29T19:36:08Z

Based on @mingmingtasd's work in the Chromium prototype implementation.

MLContext gains a destroy() method and a lost promise attribute. Calling destroy() releases resources for the context itself and any associated MLGraphs and MLTensors, and any outstanding build() requests for an associated MLGraphBuilder are aborted. The lost attribute then resolves, and any future use of the context fails. If the context is lost for other reasons, the same behavior occurs (releasing associated resources), and the lost promise provides an implementation-defined message explaining the reason.

MLGraph similarly gains a destroy() method. When the graph is destroyed, either by an explicit call or because the associated context is lost, any outstanding dispatch() requests using the graph are aborted.

This also changes the omnipresent "has built" tests on MLGraphBuilder methods into a "can build" test, which also checks that the builder's context is not lost.
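
The behavior described above can be sketched with a small TypeScript model. This is only an illustration under stated assumptions: ModelContext, ModelBuilder, lose() and canBuild() are hypothetical names and not WebNN interfaces, and a plain Error stands in for whatever exception the spec text ends up using.

```typescript
// Toy model of the proposed behavior; these classes are NOT the WebNN API.
interface LostInfo {
  message: string; // implementation-defined reason for the loss
}

class ModelContext {
  readonly lost: Promise<LostInfo>;
  private resolveLost: (info: LostInfo) => void = () => {};
  private lostFlag = false;

  constructor() {
    this.lost = new Promise<LostInfo>((resolve) => {
      this.resolveLost = resolve;
    });
  }

  get isLost(): boolean {
    return this.lostFlag;
  }

  // Explicit teardown requested by script.
  destroy(): void {
    this.lose("destroy() was called");
  }

  // Shared path for destroy() and for loss reported by the platform.
  lose(message: string): void {
    if (this.lostFlag) return; // the lost steps run at most once
    this.lostFlag = true;
    this.resolveLost({ message });
  }
}

class ModelBuilder {
  private readonly context: ModelContext;
  private hasBuilt = false;

  constructor(context: ModelContext) {
    this.context = context;
  }

  // "can build": build() has not run yet AND the context is not lost.
  canBuild(): boolean {
    return !this.hasBuilt && !this.context.isLost;
  }

  // Stand-in for any builder method that begins with the "can build" test.
  input(name: string): string {
    if (!this.canBuild()) throw new Error("InvalidStateError");
    return name;
  }

  build(): void {
    if (!this.canBuild()) throw new Error("InvalidStateError");
    this.hasBuilt = true;
  }
}
```

In this model a builder keeps working until either build() has run or its context is lost, which is the combined check the "can build" wording is meant to capture.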

For #477

inexorabletash · 2024-07-29T19:42:03Z

Some points for discussion:

- This uses the internal slot [[lost]] as both (1) the promise value and (2) the way to check whether a context "is lost". Maybe an explicit boolean internal slot would be clearer, albeit more verbose?
- In the implementation, the "lost" state is equivalent to the context's mojo remote being disconnected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present. I did add an explicit check in compute(); I think that'll happen in the implementation indirectly because the dispatch would fail, but an explicit check seemed like a good idea?
- The MLContext section of the spec could be reorganized a bit. Separate PR?

@mingmingtasd and @reillyeon could you do an initial review?

mingmingtasd · 2024-07-30T01:18:00Z

Thanks! @inexorabletash @reillyeon
I am also working on a Chromium CL to expose MLContext::destroy to make a context lost.

> In the implementation, the "lost" state is equivalent to the context's mojo remote being disconnected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present.

In my Chromium CL, I will check for context loss in all of the synchronous and asynchronous actions that depend on the context.

reillyeon · 2024-07-30T01:32:09Z

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

inexorabletash · 2024-07-30T17:35:24Z

> Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

Thoughts on how to do that and to what level of detail?

- In compute(), is that handled by execute graph returning an error, or do we need more steps?
- In build(), it looks like failures other than "does not support a requested feature" aren't covered.

One generic approach for both of those would be to replace "Queue an ML task with global to resolve promise with ..." with:

1. Queue an ML task with global to run these steps:
    1. If context is lost, then reject promise with an InvalidStateError.
    1. Resolve promise with ...

... which I think covers the script-observable behavior, but not that the async steps internally should fail.
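
A minimal sketch of that generic approach, assuming a hand-rolled task queue: mlTaskQueue, queueSettle, runQueuedTasks and the Outcome record (a stand-in for a promise whose state can be inspected synchronously) are all hypothetical names, not spec terms. It models only the script-observable part, which is the limitation noted above.

```typescript
// Hypothetical model of "queue an ML task"; none of these names come from the spec.
type Task = () => void;

interface Outcome<T> {
  state: "pending" | "fulfilled" | "rejected";
  value?: T;
  reason?: string;
}

const mlTaskQueue: Task[] = [];
const context = { isLost: false };

// Models the tail of build()/compute(): the parallel work has finished and a
// task is queued to settle the promise, re-checking the context first.
function queueSettle<T>(result: T): Outcome<T> {
  const outcome: Outcome<T> = { state: "pending" };
  mlTaskQueue.push(() => {
    if (context.isLost) {
      outcome.state = "rejected";
      outcome.reason = "InvalidStateError";
      return;
    }
    outcome.state = "fulfilled";
    outcome.value = result;
  });
  return outcome;
}

// Run every queued task in FIFO order.
function runQueuedTasks(): void {
  for (const task of mlTaskQueue.splice(0)) task();
}
```

If the context is lost between queueing and running the task, the outcome is a rejection even though the underlying work completed.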

fdwr · 2024-08-01T00:50:45Z

@RafaelCintron

reillyeon · 2024-08-02T18:37:47Z

There are two separate considerations: what happens to the promise returned by a method and what happens to an asynchronous operation itself. Operations like build(), compute() and dispatch() aren't abortable (or at least, the specification should not require that they are aborted, only that whether or not they are aborted is not visible to script). We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost. We could either specify this exactly the way the Chromium implementation works, by having each object hold a list of all promises and reject them, or we could add a note to the "in parallel" steps which says "when the context is lost or this is destroyed, reject promise and potentially abort these steps". The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.
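
The first option (each object holds a list of its pending promises and rejects them) might look roughly like the sketch below. It is not the Chromium code: PendingOp and TrackingContext are hypothetical names, and PendingOp is a simplified stand-in for a promise so that its state can be inspected synchronously.

```typescript
// Hypothetical model; PendingOp stands in for a promise with inspectable state.
class PendingOp<T> {
  state: "pending" | "fulfilled" | "rejected" = "pending";
  value: T | undefined = undefined;

  resolve(value: T): void {
    if (this.state !== "pending") return; // already settled, ignore
    this.state = "fulfilled";
    this.value = value;
  }

  reject(): void {
    if (this.state !== "pending") return;
    this.state = "rejected";
  }
}

class TrackingContext {
  // Every outstanding operation registers a reject callback here.
  private readonly pending = new Map<object, () => void>();

  begin<T>(): PendingOp<T> {
    const op = new PendingOp<T>();
    this.pending.set(op, () => op.reject());
    return op;
  }

  // Called when the underlying work finishes, whether or not it was aborted.
  complete<T>(op: PendingOp<T>, value: T): void {
    this.pending.delete(op);
    op.resolve(value);
  }

  // Synchronously reject everything still outstanding.
  destroy(): void {
    this.pending.forEach((reject) => reject());
    this.pending.clear();
  }
}
```

Because a settled PendingOp ignores later resolve() calls, the underlying work may still run to completion after destroy() without script being able to tell.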

huningxin · 2024-08-05T14:17:05Z

@inexorabletash

> In compute(), is that handled by execute graph returning an error, or do we need more steps?

I suppose the following step of execute graph could handle the device lost error:

Issue a compute request to graph.[[implementation]] given name and inputResources and wait for completion.

If that returns an error, then return an "OperationError" DOMException.

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

> In build(), it looks like failures other than "does not support a requested feature" aren't covered.

Or we could just say "If that returns an error, then queue an ML task with global to reject promise with an 'OperationError' DOMException" similar to compute()?

> One generic approach for both of those would be to replace "Queue an ML task with global to resolve promise with ..." with:
>
> Queue an ML task with global to run these steps:
>
> If context is lost, then reject promise with an InvalidStateError.
>
> Resolve promise with ...

The new steps may not run, because these steps may already have been aborted (by "abort these steps") if previous steps fail.

bbernhar · 2024-08-06T16:41:01Z

> We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost.

+1. The MLContext.destroy() steps could be specified as follows:

Script timeline:

1. Immediately reject all pending promises made off |this| context.
1. Issue steps for |this| context on the device/queue timeline.

Device/queue timeline:

1. Wait for async operations on the device to complete.
1. Then lose |this| context.

Note: an implementation is always free to abort pending async ops immediately (and release buffers).
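
To make the ordering in this proposal concrete, here is a small model with the device/queue timeline represented as a FIFO array. TimelineContext and its members are hypothetical names, and the log strings exist only to show the order of events under this proposal, not the final spec behavior.

```typescript
// Hypothetical model of the two-timeline proposal; names are illustrative only.
class TimelineContext {
  readonly log: string[] = [];
  private readonly deviceTimeline: Array<() => void> = [];
  private readonly pendingRejects: Array<() => void> = [];

  // Start an async operation: script holds a promise, the device holds queued work.
  dispatch(name: string): void {
    this.pendingRejects.push(() => {
      this.log.push(`${name}: promise rejected`);
    });
    this.deviceTimeline.push(() => {
      this.log.push(`${name}: device work completed`);
    });
  }

  destroy(): void {
    // Script timeline: reject every pending promise right away.
    for (const reject of this.pendingRejects.splice(0)) reject();
    // Device/queue timeline: losing the context is queued behind in-flight work.
    this.deviceTimeline.push(() => {
      this.log.push("context lost");
    });
  }

  // Drain the device timeline in FIFO order.
  runDeviceTimeline(): void {
    for (const step of this.deviceTimeline.splice(0)) step();
  }
}
```

An implementation that aborts pending work immediately, as the note allows, would simply skip the "device work completed" entries; script sees the same rejections either way.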

inexorabletash · 2024-08-07T19:47:26Z

> A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

I think this would be script-observable, i.e. the order in which the lost and compute() promises settle. So we should spec it explicitly.

mingmingtasd · 2024-08-19T05:59:03Z

> The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.

@reillyeon @huningxin @bbernhar I think the DirectML backend has considered this: an MLBuffer can be destroyed prior to Compute() and Dispatch(), but its resource will be kept alive until all the pending GPU work is done. For example, if you destroy an MLBuffer used by Dispatch before the Dispatch is done, the Dispatch can still complete, but you can't then call ReadBuffer to read back the results. That seems OK and as expected?

reillyeon · 2024-08-19T14:42:04Z

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.
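
The buffer semantics described here (destroy makes reads fail, pending work still completes, the resource is freed only afterwards) can be modelled with an in-flight counter. ModelBuffer and its methods are hypothetical names for illustration and do not describe the DirectML backend or the MLBuffer interface.

```typescript
// Hypothetical model; ModelBuffer is not the MLBuffer interface.
class ModelBuffer {
  private destroyed = false;
  private inFlight = 0; // dispatches still using this buffer
  released = false; // true once the backing resource is actually freed

  beginDispatch(): void {
    if (this.destroyed) throw new Error("buffer is destroyed");
    this.inFlight += 1;
  }

  // Pending work finishes normally even if destroy() was called meanwhile.
  endDispatch(): void {
    this.inFlight -= 1;
    this.maybeRelease();
  }

  destroy(): void {
    this.destroyed = true;
    this.maybeRelease();
  }

  // Reading back is the only operation that fails right after destroy().
  read(): string {
    if (this.destroyed) throw new Error("buffer is destroyed");
    return "contents";
  }

  private maybeRelease(): void {
    if (this.destroyed && this.inFlight === 0) this.released = true;
  }
}
```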

mingmingtasd · 2024-08-20T04:51:56Z

> I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.
>
> Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.

I have submitted a CL to expose MLGraph::destroy : https://chromium-review.googlesource.com/c/chromium/src/+/5799069

inexorabletash · 2025-01-08T20:39:01Z

(not actually ready for review, but I wanted to force an updated preview to be generated)

huningxin · 2025-01-12T03:23:36Z

index.bs

+    The <dfn method for=MLGraph>destroy()</dfn> method steps are:
+</summary>
+    1. If [=this=].{{MLGraph/[[isDestroyed]]}} is true, then abort these steps.
+    1. Set [=this=].{{MLGraph/[[isDestroyed]]}} to true.


MLGraph may own device (e.g. GPU/NPU) resources. The destroy steps may need to release them on the context timeline. Worth adding a note? (Like MLTensor.destroy() does)

Added in 0fc4151 but it's a very generic note, it doesn't mention the timeline. Should it?

We can be a bit more explicit and queue a task on the context timeline to "mark resources owned by this graph as freeable".
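
A rough model of the destroy() steps with that suggestion applied, assuming the context timeline is a simple FIFO array. GraphContext, ModelGraph and resourcesFreeable are hypothetical names; the two numbered comments correspond to the two steps in the diff hunk above.

```typescript
// Hypothetical model of the destroy() steps; names are illustrative only.
class GraphContext {
  readonly timeline: Array<() => void> = [];

  // Run everything enqueued on the context timeline, in order.
  runTimeline(): void {
    for (const step of this.timeline.splice(0)) step();
  }
}

class ModelGraph {
  private readonly context: GraphContext;
  private isDestroyed = false;
  resourcesFreeable = false;

  constructor(context: GraphContext) {
    this.context = context;
  }

  destroy(): void {
    if (this.isDestroyed) return; // 1. already destroyed: abort these steps
    this.isDestroyed = true; // 2. set [[isDestroyed]] to true
    // Enqueue on the context timeline instead of freeing synchronously.
    this.context.timeline.push(() => {
      this.resourcesFreeable = true;
    });
  }
}
```

Calling destroy() twice enqueues only one step, and the resources become freeable only when the context timeline reaches it.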

huningxin · 2025-01-13T03:13:36Z

index.bs

@@ -1080,16 +1097,17 @@ Reads back the {{MLTensor/[[data]]}} of an {{MLTensor}} from the {{MLContext}}.{
    1. If |tensor|.{{MLTensor/[[isDestroyed]]}} is true, then return [=a new promise=] [=rejected=] with a {{TypeError}}.
    1. If |tensor|.{{MLTensor/[[descriptor]]}}.{{MLTensorDescriptor/readable}} is false, then return [=a new promise=] [=rejected=] with a {{TypeError}}.
    1. Let |promise| be [=a new promise=].
-    1. Enqueue the following steps to |tensor|.{{MLGraph/[[context]]}}.{{MLContext/[[timeline]]}}:
+    1. Enqueue the following steps to |tensor|.{{MLGraph/[[context]]}}.{{MLContext/[[timeline]]}}, which [=/abort when=] [=this=] [=MLContext/is lost=]:


Is |tensor|.{{MLGraph/[[context]]}} a typo (an old issue)? Should it be |tensor|.{{MLTensor/[[context]]}} or just [=this=].{{MLContext/[[timeline]]}}? If it is a typo, the line 1096 also needs to be fixed. For example,

Suggested change

-    1. Enqueue the following steps to |tensor|.{{MLGraph/[[context]]}}.{{MLContext/[[timeline]]}}, which [=/abort when=] [=this=] [=MLContext/is lost=]:
+    1. Enqueue the following steps to [=this=].{{MLContext/[[timeline]]}}, which [=/abort when=] [=this=] [=MLContext/is lost=]:

Another question is whether it should check for context loss earlier, as the createTensor steps do, e.g.

1. If [=this=] [=MLContext/is lost=], then return [=a new promise=] [=rejected=] with an "{{InvalidStateError}}" {{DOMException}}.

Given that the context loss will destroy every associated tensor, I don't think that is needed. I'm not opposed to adding it explicitly though.

For the "type confusion" I added a lint tool test and caught several more places. It's not perfect, but it is progress!
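
The argument that an explicit lost check is redundant in readTensor() can be illustrated with a small model in which losing the context destroys every tensor it created. TensorContext, ModelTensor and validateRead() are hypothetical names; throwing from createTensor() and returning an error name from validateRead() are simplifications standing in for rejected promises.

```typescript
// Hypothetical model; not the MLContext or MLTensor interfaces.
interface ModelTensor {
  isDestroyed: boolean;
  readable: boolean;
}

class TensorContext {
  private readonly tensors: ModelTensor[] = [];
  private lostFlag = false;

  createTensor(readable: boolean): ModelTensor {
    // createTensor checks the context explicitly: there is no tensor yet to inspect.
    if (this.lostFlag) throw new Error("InvalidStateError");
    const tensor: ModelTensor = { isDestroyed: false, readable };
    this.tensors.push(tensor);
    return tensor;
  }

  // Context loss destroys every associated tensor.
  lose(): void {
    this.lostFlag = true;
    for (const tensor of this.tensors) tensor.isDestroyed = true;
  }

  // Returns the name of the validation error, or null if the read may proceed.
  validateRead(tensor: ModelTensor): string | null {
    if (tensor.isDestroyed) return "TypeError";
    if (!tensor.readable) return "TypeError";
    return null;
  }
}
```

In this model a read after context loss is already caught by the destroyed-tensor check, so an additional lost check would only change which error is reported.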

reillyeon · 2025-01-13T22:20:22Z

index.bs

+    The <dfn method for=MLGraph>destroy()</dfn> method steps are:
+</summary>
+    1. If [=this=].{{MLGraph/[[isDestroyed]]}} is true, then abort these steps.
+    1. Set [=this=].{{MLGraph/[[isDestroyed]]}} to true.


We can be a bit more explicit and queue a task on the context timeline to "mark resources owned by this graph as freeable".

huningxin

LGTM!

fdwr

Thanks. I scanned it, and it looks fine (not as close scrutiny given Reilly and Ningxin already looked). I'll delay merging a bit though because Rafael wanted to look at it tomorrow. ⏳

RafaelCintron

LGTM. @inexorabletash , thank you very much for putting this together.

mwyrzykowski · 2025-01-16T15:43:20Z

Also looks good from WebKit's side @inexorabletash - informal approval from discussion in today's call

Specify behavior around context loss and error reporting.

aa2a8b8

Based on @mingmingtasd's work in the Chromium prototype implementation. For webmachinelearning#477

reillyeon suggested changes Jul 29, 2024

inexorabletash added 4 commits July 29, 2024 14:48

Add link for 'settled'

7456eff

Add 'is lost' check to build()

01cf420

Add 'is not lost' dfn alias

1149c8d

Add "can (not) build" defn, fix throwing vs. rejecting for methods

d2ec3f4

reillyeon approved these changes Jul 29, 2024

mingmingtasd approved these changes Jul 30, 2024

Merge branch 'refs/heads/review' into context-lost

057280f

Merge branch 'refs/heads/review' into context-lost

e4fa318

inexorabletash marked this pull request as draft August 6, 2024 19:29

Merge branch 'refs/heads/review' into context-lost

68c5b97

inexorabletash added 3 commits August 7, 2024 13:05

Make build() and compute() explicitly invoke context lost steps

61ff07d

reword to reduce some indenting

164d3e4

restore missing algorithm class

8f5decf

inexorabletash added 3 commits November 4, 2024 16:28

Merge branch 'refs/heads/draft' into context-lost

e7a718d

Merge branch 'refs/heads/draft' into context-lost

62e0c74

Merge branch 'refs/heads/draft' into context-lost

ae81da2

inexorabletash added 2 commits January 7, 2025 17:14

Specify various destroy() methods

4f6e751

Lint/wording fix

de1d618

inexorabletash marked this pull request as ready for review January 8, 2025 20:37

whitespace change

ecb96d0

another whitespace change

55fc7ed

inexorabletash marked this pull request as draft January 9, 2025 17:18

inexorabletash changed the title ~~Specify behavior around context loss and error reporting.~~ Specify destroy() methods and behavior around context loss and error reporting Jan 9, 2025

reillyeon suggested changes Jan 9, 2025

inexorabletash added 2 commits January 10, 2025 10:00

Move destroy of dependencies to 'lost' steps and other feedback

e6d9f27

Abort steps / reject promises when context is lost

cf5a95c

reillyeon suggested changes Jan 10, 2025

Separate out lost/lose steps, make dispatch() use abort-when

a9b8643

huningxin reviewed Jan 13, 2025

inexorabletash added 3 commits January 13, 2025 09:50

Incorporate more feedback, another lint rule

0fc4151

Include tensor creation in MLContext/destroy() note

95c3b26

Move async 'abort when'/'if aborted' into substeps

88e8d58

reillyeon approved these changes Jan 13, 2025

inexorabletash and others added 2 commits January 13, 2025 17:12

Update index.bs

5c92d97

Co-authored-by: Reilly Grant <[email protected]>

Add suggested wording by @reillyeon for freeing graph resources on context timeline

f59890c

inexorabletash marked this pull request as ready for review January 14, 2025 01:18

huningxin approved these changes Jan 14, 2025

inexorabletash requested a review from fdwr January 14, 2025 18:17

fdwr requested a review from RafaelCintron January 15, 2025 02:08

fdwr approved these changes Jan 15, 2025

RafaelCintron approved these changes Jan 16, 2025

fdwr merged commit 3d430fa into webmachinelearning:main Jan 16, 2025
2 checks passed

inexorabletash deleted the context-lost branch January 16, 2025 15:04
