proposal: providers (or clients) should be allowed to set a grace period for the deployments (leases) #160
andy108369
started this conversation in
General
-
I think the Alternative proposal (client-defined) would be ideal if the timeout (the amount of time when the lease is down because it cannot redeploy as the worker node is down) could be configured by the clients themselves in their SDL (say |
-
Rationale
Worker nodes sometimes go down because of a disk, network, or similar failure, and also during node upgrades or planned maintenance.
Providers often do not have enough spare capacity (CPU/RAM) to reschedule the affected deployments (leases), which triggers the monitorMaxRetries counter (see issue 14) and causes the lease to be terminated. The counter is aggressive: it kills deployments in just 8 minutes ⚠️, giving the provider owner no chance to recover the failed worker node.
Some deployments use persistent storage, so closing the lease means their data is lost forever.
It is often preferable for the deployment to return to normal once the provider has fixed the failed node.
Providers need around 1-3 days on average (depending on whether it is a weekend, holiday, etc.) to fix a node or to complete an upgrade or planned maintenance.
Proposal (provider-defined)
Providers should have an option to set a grace period for deployments (it could be --deployment-grace-period) instead of relying only on monitorMaxRetries.
The currently aggressive monitorMaxRetries is good for a new lease, but once the deployment is up, the grace period should kick in instead.
Providers could also use a provider attribute for this, say deployment_grace_period, and the value could perhaps also be made available over the provider's API on the :8443/status path.
Alternative proposal (client-defined)
This could also be a customer-defined variable: instead of having providers set the grace period, clients would set it in the SDL deployment manifest. The attribute could be the same deployment_grace_period or similar, depending on what fits the context best.
Personally, I think the alternative proposal (client-defined) is better, since it makes the expectations clear and different deployments may have different requirements.
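To make the intended behaviour of both proposals concrete, here is a minimal Python sketch of the decision the provider would make. It is only an illustration of the idea, not provider code: LeaseState, effective_grace_period, and should_close_lease are hypothetical names, the retry limit of 10 is a placeholder rather than the real monitorMaxRetries value, and letting a client-defined value take precedence over a provider-defined one is an assumption made here for the example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass
class LeaseState:
    """Hypothetical summary of what a provider knows about one lease."""
    ever_healthy: bool    # True once the deployment has come up at least once
    failed_retries: int   # consecutive failed monitor checks
    down_for: timedelta   # how long the deployment has been unavailable


def effective_grace_period(provider: Optional[timedelta],
                           client: Optional[timedelta]) -> Optional[timedelta]:
    """Assumption: a client value from the SDL wins over the provider setting."""
    return client if client is not None else provider


def should_close_lease(state: LeaseState,
                       monitor_max_retries: int,
                       grace_period: Optional[timedelta]) -> bool:
    """Decide whether the lease should be closed.

    New leases keep the aggressive retry counter. A deployment that has
    already been running is only closed once the grace period has elapsed.
    """
    if state.ever_healthy and grace_period is not None:
        return state.down_for >= grace_period
    return state.failed_retries >= monitor_max_retries


# Example: a running deployment whose node has been down for six hours.
grace = effective_grace_period(provider=timedelta(days=1), client=timedelta(days=3))
node_down = LeaseState(ever_healthy=True, failed_retries=50, down_for=timedelta(hours=6))
print(should_close_lease(node_down, monitor_max_retries=10, grace_period=grace))  # False
```

With no grace period configured, the function falls back to the retry counter, so the current behaviour stays the default.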