From 83dfcad9fcfa802b38bc4e97587e25218822814b Mon Sep 17 00:00:00 2001 From: Mariano Gappa Date: Fri, 4 Oct 2024 15:47:15 +0100 Subject: [PATCH] feat: Add time.Sleep to mitigate race condition. (#1923) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The ShuffleQueue scheduler strategy has an infrequent race condition, as explained by the comment: ``` // A race condition is possible when the last active table asynchronously // queues a relation. The table finishes (calling `.Done()`) a moment // before the queue receives the `.Push()`. At this point, the queue is // empty and there are no active workers. // // A moment later, the queue receives the `.Push()` and queues a new task. // // This is a very infrequent case according to tests, but it happens. ``` After many attempts at a more elegant solution, I finally yielded: ``` time.Sleep(10 * time.Millisecond) ``` Looks ugly, but after running the tests 300 times (so around 3000 syncs), it works 🤷 ``` ✓ cloudquery/plugin-sdk main* $ go test ./scheduler -count=100 -run TestScheduler ⏱ 15:04:12 ok github.com/cloudquery/plugin-sdk/v4/scheduler 143.523s ✓ cloudquery/plugin-sdk main* $ go test ./scheduler -count=100 -run TestScheduler ⏱ 15:06:56 ok github.com/cloudquery/plugin-sdk/v4/scheduler 142.796s ✓ cloudquery/plugin-sdk main* $ go test ./scheduler -count=100 -run TestScheduler ⏱ 15:09:22 ok github.com/cloudquery/plugin-sdk/v4/scheduler 144.304s ``` --- scheduler/queue/active_work_signal.go | 11 +++++++++++ 1 file changed, 11 insertions(+) diff --git a/scheduler/queue/active_work_signal.go b/scheduler/queue/active_work_signal.go index 106a9eb11c..00ac14f2c7 100644 --- a/scheduler/queue/active_work_signal.go +++ b/scheduler/queue/active_work_signal.go @@ -3,6 +3,7 @@ package queue import ( "sync" "sync/atomic" + "time" ) // activeWorkSignal is a thread-safe coordinator for awaiting a worker pool @@ -58,6 +59,16 @@ func (s *activeWorkSignal) IsIdle() bool { // Wait blocks until the count of active workers changes. func (s *activeWorkSignal) Wait() { + // A race condition is possible when the last active table asynchronously + // queues a relation. The table finishes (calling `.Done()`) a moment + // before the queue receives the `.Push()`. At this point, the queue is + // empty and there are no active workers. + // + // A moment later, the queue receives the `.Push()` and queues a new task. + // + // This is a very infrequent case according to tests, but it happens. + time.Sleep(10 * time.Millisecond) + if s.activeWorkUnitCount.Load() <= 0 { return }