Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify destroy() methods and behavior around context loss and error reporting #744

Merged
merged 32 commits into from
Jan 16, 2025

Conversation

inexorabletash
Copy link
Member

@inexorabletash inexorabletash commented Jul 29, 2024

Based on @mingmingtasd's work in the Chromium prototype implementation.

MLContext gains a destroy() method and a lost promise attribute. Calling destroy() releases resources for the context itself and any associated MLGraphs and MLTensors, and any outstanding build() requests for an associated MLGraphBuilder are aborted. The lost attribute then resolves, and any future use of the context fails. If the context is lost for other reasons, the same behavior occurs (releasing associated resources), and the lost promise provides an implementation-defined message explaining the reason.

MLGraph similarly gains a destroy() attribute. When destroyed - either by an explicit call, or because the associated context is lost - then any outstanding dispatch() requests using the graph are aborted.

This also modifies the omnipresent "has built" tests on MLGraphBuilder methods to be a "can built" test which also checks that the builder's context is not lost.

For #477


Preview | Diff

Based on @mingmingtasd's work in the Chromium prototype
implementation.

For webmachinelearning#477
@inexorabletash
Copy link
Member Author

Some points for discussion:

  • This uses the internal slot [[lost]] as both (1) the promise value and (2) how to check if a context "is lost". Maybe an explicit boolean internal slot would be clearer, albeit more verbose?
  • In the implementation, the "lost" state is equivalent to the context's mojo remote being connected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present. I did add an explicit check in compute(); I think that'll happen in the implementation indirectly because the dispatch would fail, but an explicit check seemed like a good idea?
  • The MLContext section of the spec could be reorganized a bit. Separate PR?

@mingmingtasd and @reillyeon could you do an initial review?

index.bs Outdated Show resolved Hide resolved
index.bs Show resolved Hide resolved
@mingmingtasd
Copy link
Contributor

mingmingtasd commented Jul 30, 2024

Thanks! @inexorabletash @reillyeon
I am also working on a chromium CL to expose MLContext::destroy to make a context lost.

In the implementation, the "lost" state is equivalent to the context's mojo remote being connected. This is only checked explicitly when (1) creating an MLGraphBuilder and (2) creating an MLBuffer. This spec change does the former, but MLBuffer is not in the spec yet so that's not present.

I will check context lost for all of the synchronous and asynchronous actions depending on the context in my chromium CL.

@reillyeon
Copy link
Contributor

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

@inexorabletash
Copy link
Member Author

Something this PR doesn't do is specify the behavior around rejecting in-flight asynchronous operations.

Thoughts on how to do that and to what level of detail?

  • in compute(), is that handled by execute graph returning error, or do we need more steps?
  • in build() it looks like failure other than "does not support a requested feature" isn't covered.

One generic approach for both of those would be to replace: "Queue an ML task with global to resolve promise with ..." with:

  1. Queue an ML task with global to run these steps:
    1. If context is lost, then reject promise with an InvalidStateError.
    2. Resolve promise with ...

... which I think covers the script-observable behavior, but not that the async steps internally should fail.

@fdwr
Copy link
Collaborator

fdwr commented Aug 1, 2024

@RafaelCintron

@reillyeon
Copy link
Contributor

There are two separate considerations: what happens to the promise returned by a method and what happens to an asynchronous operation itself. Operations like build(), compute() and dispatch() aren't abortable (or at least, the specification should not require that they are aborted, only that whether or not they are aborted is not visible to script). We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost. We could either specify this exactly the way the Chromium implementation works, by having each object hold a list of all promises and reject them, or we could add a note to the "in parallel" steps which says "when the context is lost or this is destroyed, reject promise and potentially abort these steps". The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.

@huningxin
Copy link
Contributor

@inexorabletash

  • in compute(), is that handled by execute graph returning error, or do we need more steps?

I suppose the following step of execute graph could handle the device lost error:

Issue a compute request to graph.[[implementation]] given name and inputResources and wait for completion.

If that returns an error, then return an "OperationError" DOMException.

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

  • in build() it looks like failure other than "does not support a requested feature" isn't covered.

Or we could just say "If that returns an error, then queue an ML task with global to reject promise with an 'OperationError' DOMException" similar to compute()?

One generic approach for both of those would be to replace: "Queue an ML task with global to resolve promise with ..." with:

  1. Queue an ML task with global to run these steps:

    1. If context is lost, then reject promise with an InvalidStateError.
    2. Resolve promise with ...

The new steps may not run because these steps may already be aborted (by "abort these steps") if previous steps fail .

@bbernhar
Copy link

bbernhar commented Aug 6, 2024

We can however be explicit that the promises returned by any methods on an object are synchronously rejected when the destroy() method is called or the context is lost.

+1. MLContext.destroy() steps could be spec to the following:

Script timeline:

  1. Immediately reject all pending promises made off |this| context.
  2. Issue steps for |this| context on the device/queue timeline.

Device/queue timeline:

  1. Wait for async operations on the device to complete.
  2. Then Lose |this| context.

Note: impl. is always free to abort pending async ops immediately (and release buffers).

@inexorabletash inexorabletash marked this pull request as draft August 6, 2024 19:29
@inexorabletash
Copy link
Member Author

A question is if the error is device lost, should it run the "context-lost" steps? Like the Chromium prototype does in HandleComputationFailure(). Or we can just assume the "context-lost" steps are triggered by "When an MLContext context is no longer available to fulfill requests" asynchronously?

I think this would be script observable - i.e. what order do the lost and compute() promises settle. So we should spec it explicitly.

@mingmingtasd
Copy link
Contributor

The definition of "potentially abort" is the tricky one because it has to consider cases like destroying a single buffer that is part of a larger set of pending operations which should still be able to complete.

@reillyeon @huningxin @bbernhar I think the DirectML backend here has considered this, MLBuffer can be destroyed prior to Compute() and Dispatch but its resource will be kept alive anyway until all the pending GPU work done. For example, if you destroy a MLBuffer used by Dispatch before the Dispatch is done, Dispatch can still complete but you can't continue to call ReadBuffer to read back the results. It seems OK and as expected?

@reillyeon
Copy link
Contributor

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.

@mingmingtasd
Copy link
Contributor

I think we've settled on the idea that destroying an MLBuffer will only make readBuffer() fail. Pending compute tasks will still complete.

Destroying an MLContext (or context loss) is the equivalent of calling destroy() on all the builders, graphs and buffers created by the context. We haven't yet decided the semantics of destroying a graph but it will probably similarly allow pending compute tasks to complete.

I have submitted a CL to expose MLGraph::destroy : https://chromium-review.googlesource.com/c/chromium/src/+/5799069

@inexorabletash inexorabletash marked this pull request as ready for review January 8, 2025 20:37
@inexorabletash
Copy link
Member Author

(not actually ready for review, but I wanted to force an updated preview to be generated)

@inexorabletash inexorabletash marked this pull request as draft January 9, 2025 17:18
@inexorabletash inexorabletash changed the title Specify behavior around context loss and error reporting. Specify destroy() methods and behavior around context loss and error reporting Jan 9, 2025
index.bs Outdated Show resolved Hide resolved
index.bs Outdated Show resolved Hide resolved
index.bs Outdated Show resolved Hide resolved
index.bs Outdated Show resolved Hide resolved
index.bs Outdated Show resolved Hide resolved
The <dfn method for=MLGraph>destroy()</dfn> method steps are:
</summary>
1. If [=this=].{{MLGraph/[[isDestroyed]]}} is true, then abort these steps.
1. Set [=this=].{{MLGraph/[[isDestroyed]]}} to true.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLGraph may own device (e.g. GPU/NPU) resources. The destroy steps may need to release them on the context timeline. Worth adding a note? (Like MLTensor.destory() does)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in 0fc4151 but it's a very generic note, it doesn't mention the timeline. Should it?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can be a bit more explicit and queue a task on the context timeline to "mark resources owned by this graph as freeable".

index.bs Outdated Show resolved Hide resolved
index.bs Outdated
@@ -1080,16 +1097,17 @@ Reads back the {{MLTensor/[[data]]}} of an {{MLTensor}} from the {{MLContext}}.{
1. If |tensor|.{{MLTensor/[[isDestroyed]]}} is true, then return [=a new promise=] [=rejected=] with a {{TypeError}}.
1. If |tensor|.{{MLTensor/[[descriptor]]}}.{{MLTensorDescriptor/readable}} is false, then return [=a new promise=] [=rejected=] with a {{TypeError}}.
1. Let |promise| be [=a new promise=].
1. Enqueue the following steps to |tensor|.{{MLGraph/[[context]]}}.{{MLContext/[[timeline]]}}:
1. Enqueue the following steps to |tensor|.{{MLGraph/[[context]]}}.{{MLContext/[[timeline]]}}, which [=/abort when=] [=this=] [=MLContext/is lost=]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is |tensor|.{{MLGraph/[[context]]}} a typo (an old issue)? Should it be |tensor|.{{MLTensor/[[context]]}} or just [=this=].{{MLContext/[[timeline]]}}? If it is a typo, the line 1096 also needs to be fixed. For example,

Suggested change
1. Enqueue the following steps to |tensor|.{{MLGraph/[[context]]}}.{{MLContext/[[timeline]]}}, which [=/abort when=] [=this=] [=MLContext/is lost=]:
1. Enqueue the following steps to [=this=].{{MLContext/[[timeline]]}}, which [=/abort when=] [=this=] [=MLContext/is lost=]:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another question is whether it should check context lost earlier? Like createTensor steps do, e.g.

1. If [=this=] [=MLContext/is lost=], then return [=a new promise=] [=rejected=] with an "{{InvalidStateError}}" {{DOMException}}.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that the context loss will destroy every associated tensor, I don't think that is needed. I'm not opposed to adding it explicitly though.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the "type confusion" I added a lint tool test and caught several more places. It's not perfect, but it is progress!

index.bs Outdated Show resolved Hide resolved
The <dfn method for=MLGraph>destroy()</dfn> method steps are:
</summary>
1. If [=this=].{{MLGraph/[[isDestroyed]]}} is true, then abort these steps.
1. Set [=this=].{{MLGraph/[[isDestroyed]]}} to true.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can be a bit more explicit and queue a task on the context timeline to "mark resources owned by this graph as freeable".

index.bs Outdated Show resolved Hide resolved
@inexorabletash inexorabletash marked this pull request as ready for review January 14, 2025 01:18
Copy link
Contributor

@huningxin huningxin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@inexorabletash inexorabletash requested a review from fdwr January 14, 2025 18:17
@fdwr fdwr requested a review from RafaelCintron January 15, 2025 02:08
Copy link
Collaborator

@fdwr fdwr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. I scanned it, and it looks fine (not as close scrutiny given Reilly and Ningxin already looked). I'll delay merging a bit though because Rafael wanted to look at it tomorrow. ⏳

Copy link
Collaborator

@RafaelCintron RafaelCintron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. @inexorabletash , thank you very much for putting this together.

@fdwr fdwr merged commit 3d430fa into webmachinelearning:main Jan 16, 2025
2 checks passed
github-actions bot added a commit that referenced this pull request Jan 16, 2025
…reporting (#744)

SHA: 3d430fa
Reason: push, by fdwr

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@inexorabletash inexorabletash deleted the context-lost branch January 16, 2025 15:04
@mwyrzykowski
Copy link

Also looks good from WebKit's side @inexorabletash - informal approval from discussion in today's call

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants