
Add worker pool to WASM capability #15088

Merged: 6 commits into develop from add-wasm-workers on Nov 13, 2024
Conversation

Contributor

@cedric-cordenier commented Nov 4, 2024

  • Add a worker pool to the WASM capability: previously there was no upper bound on the number of WASM instances that could be spun up at a given time; the pool limits concurrency with a conservative cap of 3 workers (see the sketch after this list).
  • Incorporate some config optimizations from common
  • Shallow copy the request: since the binary can easily be 30 megabytes, this reduces the amount of copying we need to do.
  • Add a step-level timeout to replace the one provided by ExecuteSync in the engine, which has since been removed.
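
For illustration, a minimal, self-contained sketch of the worker-pool pattern described above; all names (worker, request, numWorkers) are placeholders, not the capability's actual code:

package main

import (
	"context"
	"fmt"
	"sync"
)

// request pairs a unit of work with a channel for its result.
type request struct {
	id       int
	response chan string
}

// worker drains the shared queue; only as many requests as there are
// workers can be executing at the same time.
func worker(ctx context.Context, wg *sync.WaitGroup, queue <-chan request) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-queue:
			// The WASM module would be instantiated and run here.
			req.response <- fmt.Sprintf("result for request %d", req.id)
		}
	}
}

func main() {
	const numWorkers = 3 // the conservative limit chosen in this PR

	ctx, cancel := context.WithCancel(context.Background())
	var wg sync.WaitGroup

	queue := make(chan request) // unbuffered: senders wait for a free worker
	for i := 0; i < numWorkers; i++ {
		wg.Add(1)
		go worker(ctx, &wg, queue)
	}

	resp := make(chan string, 1)
	queue <- request{id: 1, response: resp} // blocks until a worker picks it up
	fmt.Println(<-resp)

	cancel()
	wg.Wait()
}

An unbuffered queue means callers wait until a worker is free, which is what pushes backpressure up to the engine, as discussed in the review thread below.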

Contributor

github-actions bot commented Nov 4, 2024

I see you updated files related to core. Please run pnpm changeset in the root directory to add a changeset, and include at least one of the following tags in the changeset text:

  • #added For any new functionality added.
  • #breaking_change For any functionality that requires manual action for the node to boot.
  • #bugfix For bug fixes.
  • #changed For any change to the existing functionality.
  • #db_update For any feature that introduces updates to database schema.
  • #deprecation_notice For any upcoming deprecation of functionality.
  • #internal For changesets that need to be excluded from the final changelog.
  • #nops For any feature that is NOP facing and needs to be in the official Release Notes for the release.
  • #removed For any functionality/config that is removed.
  • #updated For any functionality that is updated.
  • #wip For any change that is not ready yet and external communication about it should be held off till it is feature complete.

🎖️ No JIRA issue number found in: PR title, commit message, or branch name. Please include the issue ID in one of these.

Contributor

github-actions bot commented Nov 4, 2024

AER Report: CI Core ran successfully ✅

AER Report: Operator UI CI ran successfully ✅

@cedric-cordenier force-pushed the add-wasm-workers branch 3 times, most recently from 4c3a71a to 3feac2f on November 4, 2024 16:41
- Also add step-level timeout to engine. This was removed when we moved
  away from ExecuteSync().
@cedric-cordenier marked this pull request as ready for review November 4, 2024 17:08
@cedric-cordenier requested review from a team as code owners November 4, 2024 17:08
@cedric-cordenier changed the title from Add WASM workers to Add worker pool to WASM capability on Nov 4, 2024
	Inputs: req.Inputs.CopyMap(),
	Config: req.Config.CopyMap(),

func (c *Compute) Execute(ctx context.Context, request capabilities.CapabilityRequest) (capabilities.CapabilityResponse, error) {
	ch, err := c.enqueueRequest(ctx, request)
Contributor

Do we need to worry about the capability timing out in the engine if this queue gets too long?

Contributor Author

We do -- this is why I added the step-level timeout; we'll wait for a maximum of 2 minutes (which is incredibly generous) before interrupting a step (and 10 minutes for the whole workflow).

Contributor

Sorry, I'll be more precise: I'm not convinced the timer in the engine should start until the step is actually running. Users shouldn't be penalized when other tasks block theirs. I'm worried that I can DoS the compute capability by making N compute steps that have infinite loops and intentionally time out.

For now, maybe this is ok, we can re-evaluate before we open compute for general use...

Contributor Author

I'm worried that I can DoS the compute capability by making N compute steps that have infinite loops and intentionally time out.

This shouldn't be possible because we apply a lower-level timeout to the individual WASM call; the default setting for this is 2s. I set the step-level timeout to be very large partly as an attempt to compensate for this.

What solution would you propose? We could ignore the engine timeout I suppose, but that feels dangerous IMO.
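
As a rough illustration of the two timeout layers being discussed (the function name and wiring below are only an assumption; the durations mirror the values mentioned in this thread: roughly 2s per WASM call nested inside a 2-minute step deadline):

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWASMCall stands in for a single module invocation; here it simply
// overruns its deadline to show the per-call timeout firing.
func runWASMCall(ctx context.Context) error {
	select {
	case <-time.After(5 * time.Second):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// Step-level timeout: bounds queueing plus execution for a whole step.
	stepCtx, cancelStep := context.WithTimeout(context.Background(), 2*time.Minute)
	defer cancelStep()

	// Per-call timeout: bounds one WASM invocation, so an infinite loop in
	// user code is interrupted long before the step deadline is reached.
	callCtx, cancelCall := context.WithTimeout(stepCtx, 2*time.Second)
	defer cancelCall()

	if err := runWASMCall(callCtx); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("WASM call interrupted by the 2s per-call timeout")
	}
}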

Contributor

I was thinking of ignoring the engine timeout for this capability. For now, we don't need to block on this. We can think it out more later.

agparadiso previously approved these changes Nov 4, 2024
@cedric-cordenier force-pushed the add-wasm-workers branch 2 times, most recently from a78335e to 823a138 on November 11, 2024 10:31
@cedric-cordenier force-pushed the add-wasm-workers branch 2 times, most recently from f1bb917 to ef12210 on November 13, 2024 10:55
log: lggr,
emitter: labeler,
registry: registry,
modules: newModuleCache(clockwork.NewRealClock(), 1*time.Minute, 10*time.Minute, 3),
transformer: NewTransformer(lggr, labeler),
outgoingConnectorHandler: handler,
idGenerator: idGenerator,
queue: make(chan request),
Contributor

Should the queue be non-blocking?

Contributor Author

I left this as blocking so that we backpressure onto the engine itself; if we don't succeed after 2 minutes we'll interrupt the request altogether.
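
A minimal, self-contained sketch of that blocking-enqueue behaviour; the types and the enqueueRequest signature here are hypothetical stand-ins, not the capability's real API:

package main

import (
	"context"
	"fmt"
	"time"
)

type request struct {
	payload  string
	response chan string
}

type compute struct {
	queue chan request
}

// enqueueRequest blocks until a worker accepts the request or the caller's
// context (which carries the engine's step timeout) is cancelled.
func (c *compute) enqueueRequest(ctx context.Context, payload string) (<-chan string, error) {
	resp := make(chan string, 1)
	select {
	case c.queue <- request{payload: payload, response: resp}:
		return resp, nil
	case <-ctx.Done():
		// Backpressure surfaces to the engine as a timeout or cancellation.
		return nil, fmt.Errorf("request not enqueued before deadline: %w", ctx.Err())
	}
}

func main() {
	c := &compute{queue: make(chan request)} // unbuffered, so enqueue blocks

	// No workers drain the queue in this sketch, so the enqueue times out
	// and the error flows back to the caller.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	if _, err := c.enqueueRequest(ctx, "step-1"); err != nil {
		fmt.Println(err)
	}
}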

@@ -270,25 +342,40 @@ func (c *Compute) createFetcher() func(ctx context.Context, req *wasmpb.FetchReq
}
}

const (
defaultNumWorkers = 3
Contributor

I'm worried about how low this is. We should be able to run a lot more. I don't get what takes so much memory. I'm not going to block the PR for now, since it'll fix some OOMs, but we need to get to the bottom of it...

Contributor Author

As discussed offline, we agree to revisit WASM performance generally once we've hit the external audit.

@cedric-cordenier added this pull request to the merge queue Nov 13, 2024
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Nov 13, 2024
@cedric-cordenier added this pull request to the merge queue Nov 13, 2024
Merged via the queue into develop with commit 1a9f8cc Nov 13, 2024
137 of 140 checks passed
@cedric-cordenier deleted the add-wasm-workers branch November 13, 2024 16:42