Add worker pool to WASM capability #15088
Conversation
cedric-cordenier commented Nov 4, 2024 (edited)
- Add a worker pool to the WASM capability; previously there was no upper bound on the number of WASM instances that could be spun up at any given time. This adds a worker pool to limit concurrency, with a conservative limit of 3 workers (see the sketch after this list).
- Incorporate some config optimizations from common
- Shallow copy the request: since the binary can easily be 30 megabytes, this reduces the amount of copying we need to do.
- Add a step-level timeout to replace the one provided by ExecuteSync in the engine, which has since been removed.
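As an illustration only, here is a minimal sketch of the bounded-worker pattern described above. The `request` type, `worker` loop, and `handleRequest` placeholder are simplified assumptions, not the PR's actual Compute implementation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// request is a stand-in for the capability request; the real code pairs a
// capabilities.CapabilityRequest with a response channel.
type request struct {
	id   int
	resp chan string
}

// worker drains the shared queue; only defaultNumWorkers goroutines run, so at
// most that many WASM instances exist at any one time.
func worker(ctx context.Context, queue <-chan request, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-queue:
			// Placeholder for instantiating and executing the WASM module.
			req.resp <- fmt.Sprintf("handled request %d", req.id)
		}
	}
}

func main() {
	const defaultNumWorkers = 3 // conservative limit mentioned in the PR

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	queue := make(chan request) // unbuffered: senders block until a worker is free
	var wg sync.WaitGroup
	for i := 0; i < defaultNumWorkers; i++ {
		wg.Add(1)
		go worker(ctx, queue, &wg)
	}

	// Enqueue a few requests; a send blocks whenever all workers are busy.
	for i := 0; i < 5; i++ {
		resp := make(chan string, 1)
		queue <- request{id: i, resp: resp}
		fmt.Println(<-resp)
	}

	cancel()
	wg.Wait()
}
```

Because the queue is unbuffered, a send blocks until one of the three workers is free, which is what bounds the number of in-flight WASM instances.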
Force-pushed from 4c3a71a to 3feac2f
- Also add step-level timeout to engine. This was removed when we moved away from ExecuteSync().
Force-pushed from 3feac2f to 09493e4
	Inputs: req.Inputs.CopyMap(),
	Config: req.Config.CopyMap(),

func (c *Compute) Execute(ctx context.Context, request capabilities.CapabilityRequest) (capabilities.CapabilityResponse, error) {
	ch, err := c.enqueueRequest(ctx, request)
Do we need to worry about the capability timing out in the engine if this queue gets too long?
We do -- this is why I added the step-level timeout; we'll wait for a maximum of 2 minutes (which is incredibly generous) before interrupting a step (and 10 minutes for the whole workflow).
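As a rough illustration of the timeout hierarchy described here (not the engine's actual code), the sketch below derives a per-step deadline from a workflow-level deadline via `context.WithTimeout`; the constant names and the `runStep` helper are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical names for illustration; values match the limits quoted above.
const (
	workflowTimeout = 10 * time.Minute // budget for the whole workflow
	stepTimeout     = 2 * time.Minute  // budget for a single step, queueing included
)

func runStep(ctx context.Context) error {
	// Placeholder for enqueueing the request and waiting on the result channel.
	select {
	case <-time.After(50 * time.Millisecond): // pretend work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	wfCtx, cancelWf := context.WithTimeout(context.Background(), workflowTimeout)
	defer cancelWf()

	// Each step gets its own deadline derived from the workflow context, so a
	// step stuck waiting for a worker is interrupted after stepTimeout at the latest.
	stepCtx, cancelStep := context.WithTimeout(wfCtx, stepTimeout)
	defer cancelStep()

	fmt.Println("step result:", runStep(stepCtx))
}
```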
Sorry, I'll be more precise: I'm not convinced the timer in the engine should start until the step is actually running. Users shouldn't be penalized just because other tasks can block theirs. I'm worried that I can DoS the compute capability by creating N compute steps with infinite loops that intentionally time out.
For now maybe this is OK; we can re-evaluate before we open compute for general use...
I'm worried that I can DoS the compute capability by making N compute steps that have infinite loops and intentionally time out.
This shouldn't be possible because we apply a lower-level timeout to the individual WASM call; the default setting for this is 2s. I set the step-level timeout to be very large partly as an attempt to compensate for this.
What solution would you propose? We could ignore the engine timeout I suppose, but that feels dangerous IMO.
I was thinking of ignoring the engine timeout for this capability. For now, we don't need to block on this. We can think it out more later.
Force-pushed from a78335e to 823a138
Force-pushed from f1bb917 to ef12210
Force-pushed from ef12210 to 0da0b30
	log:                      lggr,
	emitter:                  labeler,
	registry:                 registry,
	modules:                  newModuleCache(clockwork.NewRealClock(), 1*time.Minute, 10*time.Minute, 3),
	transformer:              NewTransformer(lggr, labeler),
	outgoingConnectorHandler: handler,
	idGenerator:              idGenerator,
	queue:                    make(chan request),
Should the queue be non-blocking?
I left this as a blocking (unbuffered) channel so that we backpressure onto the engine itself; if we don't succeed within 2 minutes we'll interrupt the request altogether.
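A minimal sketch of how a blocking send on an unbuffered channel combines with the caller's context, loosely modeled on the `enqueueRequest` call shown in the diff above; the `compute` and `request` shapes here are simplified assumptions, not the real types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// request pairs the work with a channel the worker replies on; a rough
// stand-in for the type used in this PR.
type request struct {
	payload string
	resp    chan string
}

type compute struct {
	queue chan request // unbuffered, as in `make(chan request)` above
}

// enqueueRequest blocks until a worker receives the request or the caller's
// context expires, so a long queue backpressures onto the engine's timeout.
func (c *compute) enqueueRequest(ctx context.Context, payload string) (<-chan string, error) {
	req := request{payload: payload, resp: make(chan string, 1)}
	select {
	case c.queue <- req:
		return req.resp, nil
	case <-ctx.Done():
		return nil, errors.New("timed out waiting for a free worker: " + ctx.Err().Error())
	}
}

func main() {
	c := &compute{queue: make(chan request)}

	// No worker is draining the queue here, so the enqueue fails once the
	// (deliberately short) deadline passes.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	if _, err := c.enqueueRequest(ctx, "example"); err != nil {
		fmt.Println(err)
	}
}
```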
@@ -270,25 +342,40 @@ func (c *Compute) createFetcher() func(ctx context.Context, req *wasmpb.FetchReq
	}
}

const (
	defaultNumWorkers = 3
I'm worried about how low this is. We should be able to run a lot more. I don't get what takes so much memory. I'm not going to block the PR for now, since it'll fix some OOMs, but we need to get to the bottom of it...
As discussed offline, we agree to revisit WASM performance generally once we've hit the external audit.