We should consider retrying failed task placements a few seconds after the initial failure, to confirm that the failure really is a product of limited resources and not a sub-second race condition in the scheduler.
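As a rough sketch of what that could look like (this assumes the AWS SDK's RunTask call; the cluster and task definition names are placeholders, not Watchbot's actual configuration):

```typescript
// Hypothetical sketch: retry a task placement a couple of times, a few seconds apart,
// when the failure reason points at resource contention rather than a bad task.
import { ECSClient, RunTaskCommand } from "@aws-sdk/client-ecs";

const ecs = new ECSClient({});
const delay = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function placeTask(cluster: string, taskDefinition: string): Promise<boolean> {
  const attemptDelaysMs = [0, 5000, 15000]; // initial attempt + two retries a few seconds apart

  for (const wait of attemptDelaysMs) {
    if (wait > 0) await delay(wait);

    const result = await ecs.send(new RunTaskCommand({ cluster, taskDefinition }));

    // RunTask reports placement problems via the `failures` array rather than throwing.
    const failures = result.failures ?? [];
    if (failures.length === 0) return true;

    // Reasons like "RESOURCE:MEMORY" or "RESOURCE:CPU" suggest the cluster is simply full
    // (or we hit a sub-second race in the scheduler); anything else gets surfaced immediately.
    const resourceConstrained = failures.every((f) => f.reason?.startsWith("RESOURCE"));
    if (!resourceConstrained) {
      throw new Error(`Task placement failed: ${failures.map((f) => f.reason).join(", ")}`);
    }
  }

  return false; // still resource-constrained after retries; let the caller decide what to do
}
```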
Watchbot's SQS-based try and retry system kinda sorta does this already. Is there an advantage to making a failed placement a special case and not just letting the usual retry + backoff routines handle it?
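For context, the "usual retry + backoff" is roughly the pattern below. This is not Watchbot's actual code, just a generic SQS consumer sketch where a failed message's visibility timeout grows with each receive:

```typescript
// Generic sketch of SQS-level retry + backoff: when processing fails, extend the
// message's visibility timeout so it reappears on the queue later, with the delay
// growing on each receive attempt.
import { SQSClient, ChangeMessageVisibilityCommand, Message } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

async function backOff(queueUrl: string, message: Message) {
  // ApproximateReceiveCount is only present if the message was received with that
  // attribute requested; default to 1 if it is missing.
  const receives = Number(message.Attributes?.ApproximateReceiveCount ?? "1");
  const timeout = Math.min(2 ** receives * 30, 12 * 60 * 60); // cap at SQS's 12-hour maximum

  await sqs.send(
    new ChangeMessageVisibilityCommand({
      QueueUrl: queueUrl,
      ReceiptHandle: message.ReceiptHandle,
      VisibilityTimeout: timeout,
    })
  );
}
```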
@rclark I don't think we want failed task placements to wind up in the dead letter queue. Failed task placements represent a structural limitation of the scheduler, and should be retried as close to the scheduler as possible (ideally inside the scheduler, per chat with David Myers). These failures don't represent chronic failures of a particular payload, which is what the dead letter queue should be signaling.
The dead letter queue isn't supposed to represent chronically malformed or rejected payloads -- the idea is that SQS should never ever drop your job until it has been completed successfully. If the scheduler can't place a task for some number of attempts, then yeah -- there's some other limitation at play, but we definitely don't want the application to lose track of the work that it was supposed to get done.
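For reference, a message only lands in the dead letter queue after it has been received more times than the redrive policy's maxReceiveCount without being deleted. Retries handled inside the scheduler never consume one of those receives, so the work stays on the primary queue. A rough sketch of that wiring (the queue URL and ARN below are placeholders, not Watchbot's):

```typescript
// Hypothetical sketch of the redrive wiring under discussion: after maxReceiveCount
// failed processing attempts the message is moved aside to the dead letter queue,
// not dropped, so the application never silently loses track of work.
import { SQSClient, SetQueueAttributesCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({});

async function attachDeadLetterQueue(queueUrl: string, deadLetterQueueArn: string) {
  await sqs.send(
    new SetQueueAttributesCommand({
      QueueUrl: queueUrl,
      Attributes: {
        RedrivePolicy: JSON.stringify({
          deadLetterTargetArn: deadLetterQueueArn,
          maxReceiveCount: "10",
        }),
      },
    })
  );
}
```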
cc/ @brendanmcfarland @rclark