Add worker pool to WASM capability #15088
Conversation
cedric-cordenier commented Nov 4, 2024 (edited)
- Add a worker pool to the WASM capability; previously there was no upper bound on the number of WASM instances that could be spun up at any given time. This adds a worker pool to limit concurrency, with a conservative limit of 3 workers (see the sketch after this list).
- Incorporate some config optimizations from common
- Shallow copy the request: since the binary can easily be 30 megabytes, this reduces the amount of copying we need to do.
- Add a step-level timeout to replace the one provided by ExecuteSync in the engine, which has since been removed.
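As an illustration only, here is a minimal sketch of the bounded-worker pattern described above. The `request` type, `worker` loop, and `handleRequest` placeholder are simplified assumptions, not the PR's actual Compute implementation.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// request is a stand-in for the capability request; the real code pairs a
// capabilities.CapabilityRequest with a response channel.
type request struct {
	id   int
	resp chan string
}

// worker drains the shared queue; only defaultNumWorkers goroutines run, so at
// most that many WASM instances exist at any one time.
func worker(ctx context.Context, queue <-chan request, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-ctx.Done():
			return
		case req := <-queue:
			// Placeholder for instantiating and executing the WASM module.
			req.resp <- fmt.Sprintf("handled request %d", req.id)
		}
	}
}

func main() {
	const defaultNumWorkers = 3 // conservative limit mentioned in the PR

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	queue := make(chan request) // unbuffered: senders block until a worker is free
	var wg sync.WaitGroup
	for i := 0; i < defaultNumWorkers; i++ {
		wg.Add(1)
		go worker(ctx, queue, &wg)
	}

	// Enqueue a few requests; a send blocks whenever all workers are busy.
	for i := 0; i < 5; i++ {
		resp := make(chan string, 1)
		queue <- request{id: i, resp: resp}
		fmt.Println(<-resp)
	}

	cancel()
	wg.Wait()
}
```

Because the queue is unbuffered, a send blocks until one of the three workers is free, which is what bounds the number of in-flight WASM instances.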
Force-pushed from 4c3a71a to 3feac2f
- Also add step-level timeout to engine. This was removed when we moved away from ExecuteSync().
Force-pushed from 3feac2f to 09493e4
	Inputs: req.Inputs.CopyMap(),
	Config: req.Config.CopyMap(),

func (c *Compute) Execute(ctx context.Context, request capabilities.CapabilityRequest) (capabilities.CapabilityResponse, error) {
	ch, err := c.enqueueRequest(ctx, request)
Do we need to worry about the capability timing out in the engine if this queue gets too long?
We do -- this is why I added the step-level timeout; we'll wait for a maximum of 2 minutes (which is incredibly generous) before interrupting a step (and 10 minutes for the whole workflow).
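As a rough illustration of the timeout hierarchy described here (not the engine's actual code), the sketch below derives a per-step deadline from a workflow-level deadline via `context.WithTimeout`; the constant names and the `runStep` helper are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Hypothetical names for illustration; values match the limits quoted above.
const (
	workflowTimeout = 10 * time.Minute // budget for the whole workflow
	stepTimeout     = 2 * time.Minute  // budget for a single step, queueing included
)

func runStep(ctx context.Context) error {
	// Placeholder for enqueueing the request and waiting on the result channel.
	select {
	case <-time.After(50 * time.Millisecond): // pretend work
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	wfCtx, cancelWf := context.WithTimeout(context.Background(), workflowTimeout)
	defer cancelWf()

	// Each step gets its own deadline derived from the workflow context, so a
	// step stuck waiting for a worker is interrupted after stepTimeout at the latest.
	stepCtx, cancelStep := context.WithTimeout(wfCtx, stepTimeout)
	defer cancelStep()

	fmt.Println("step result:", runStep(stepCtx))
}
```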
Sorry, I'll be more precise: I'm not convinced the timer in the engine should start until the step is actually running. Users shouldn't be penalized just because other tasks can block theirs. I'm worried that I can DoS the compute capability by creating N compute steps with infinite loops that intentionally time out.
For now maybe this is OK; we can re-evaluate before we open compute for general use...
I'm worried that I can DoS the compute capability by making N compute steps that have infinite loops and intentionally time out.
This shouldn't be possible because we apply a lower-level timeout to the individual WASM call; the default setting for this is 2s. I set the step-level timeout to be very large partly as an attempt to compensate for this.
What solution would you propose? We could ignore the engine timeout I suppose, but that feels dangerous IMO.
I was thinking of ignoring the engine timeout for this capability. For now, we don't need to block on this. We can think it out more later.
Force-pushed from a78335e to 823a138
Force-pushed from f1bb917 to ef12210
Force-pushed from ef12210 to 0da0b30
	log:                      lggr,
	emitter:                  labeler,
	registry:                 registry,
	modules:                  newModuleCache(clockwork.NewRealClock(), 1*time.Minute, 10*time.Minute, 3),
	transformer:              NewTransformer(lggr, labeler),
	outgoingConnectorHandler: handler,
	idGenerator:              idGenerator,
	queue:                    make(chan request),
Should the queue be non-blocking?
I left this as a blocking (unbuffered) channel so that we backpressure onto the engine itself; if we don't succeed within 2 minutes we'll interrupt the request altogether.
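A minimal sketch of how a blocking send on an unbuffered channel combines with the caller's context, loosely modeled on the `enqueueRequest` call shown in the diff above; the `compute` and `request` shapes here are simplified assumptions, not the real types.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// request pairs the work with a channel the worker replies on; a rough
// stand-in for the type used in this PR.
type request struct {
	payload string
	resp    chan string
}

type compute struct {
	queue chan request // unbuffered, as in `make(chan request)` above
}

// enqueueRequest blocks until a worker receives the request or the caller's
// context expires, so a long queue backpressures onto the engine's timeout.
func (c *compute) enqueueRequest(ctx context.Context, payload string) (<-chan string, error) {
	req := request{payload: payload, resp: make(chan string, 1)}
	select {
	case c.queue <- req:
		return req.resp, nil
	case <-ctx.Done():
		return nil, errors.New("timed out waiting for a free worker: " + ctx.Err().Error())
	}
}

func main() {
	c := &compute{queue: make(chan request)}

	// No worker is draining the queue here, so the enqueue fails once the
	// (deliberately short) deadline passes.
	ctx, cancel := context.WithTimeout(context.Background(), 100*time.Millisecond)
	defer cancel()

	if _, err := c.enqueueRequest(ctx, "example"); err != nil {
		fmt.Println(err)
	}
}
```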
@@ -270,25 +342,40 @@ func (c *Compute) createFetcher() func(ctx context.Context, req *wasmpb.FetchReq
	}
}

const (
	defaultNumWorkers = 3
I'm worried about how low this is. We should be able to run a lot more. I don't get what takes so much memory. I'm not going to block the PR for now, since it'll fix some OOMs, but we need to get to the bottom of it...
As discussed offline, we agree to revisit WASM performance generally once we've hit the external audit.