Refactor WorkerThread runloop; avoid pathological starvation of pollers #4247

Open

armanbilge wants to merge 5 commits into series/3.6.x

Conversation

armanbilge (Member)

Fixes #4228.

At a high level, a WorkerThread is always in one of the following three states:

  1. Working (primarily on the local queue).
  2. Looking for work (external or stolen).
  3. Parked.

Previously, (3) was a separate parkLoop(), but (1) and (2) were entangled in the top-level runloop. Now, (2) is refactored into a separate lookForWork() loop. Thus, the top-level runloop is only responsible for tracking "ticks", such that polling runs every 64 ticks.

[diagram: worker-runloop-fsm (the worker thread run loop state machine)]
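
For orientation, here is a minimal, self-contained sketch of that three-state shape. Every name in it (WorkerSketch, PollingInterval, poll, steal, the queue types) is an illustrative placeholder, not the actual WorkerThread internals.

import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicBoolean
import java.util.concurrent.locks.LockSupport
import scala.collection.mutable

// Sketch of the worker's three states: working, looking for work, parked.
final class WorkerSketch(
    localQueue: mutable.Queue[Runnable],            // this worker's local queue
    externalQueue: ConcurrentLinkedQueue[Runnable], // queue shared with outside submitters
    done: AtomicBoolean                             // runtime shutdown flag
) extends Thread {

  private[this] val PollingInterval = 64

  // Stand-in for timer/selector polling.
  private[this] def poll(): Unit = ()

  // Stand-in for stealing from a sibling worker; null means nothing was stolen.
  private[this] def steal(): Runnable = null

  // States (2) and (3): look for external or stolen work, parking when none is
  // found. Returns once something lands in the local queue or the runtime shuts down.
  private[this] def lookForWork(): Unit = {
    while (!done.get()) {
      val external = externalQueue.poll()
      if (external ne null) { localQueue.enqueue(external); return }
      val stolen = steal()
      if (stolen ne null) { localQueue.enqueue(stolen); return }
      // State (3): nothing to do, park until a scheduling event unparks us.
      LockSupport.park(this)
    }
  }

  // Top-level runloop: state (1), working on the local queue, plus the tick
  // bookkeeping so that polling runs once every PollingInterval ticks.
  override def run(): Unit = {
    var ticks = 0
    while (!done.get()) {
      if (ticks % PollingInterval == 0) poll()
      if (localQueue.isEmpty) lookForWork()
      if (localQueue.nonEmpty) localQueue.dequeue().run()
      ticks += 1
    }
  }
}

In the PR itself lookForWork() is a small @switch-based state machine (see the diff excerpt below), but the division of responsibilities follows this shape.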

Comment on lines +355 to +361
def lookForWork(): Unit = {
  var state = 0
  while (!done.get()) {
    (state: @switch) match {
      case 0 =>
        // Check the external queue after a failed dequeue from the local
        // queue (due to the local queue being empty).
armanbilge (Member, Author) commented on Jan 21, 2025

Actually, we could lift case 0 out of the loop. It's always guaranteed to run exactly once, and never again. The loop just transitions between cases 1 and 2.
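
For illustration, a rough sketch of that suggestion, with hypothetical placeholder members (done, checkExternalQueue, tryStealing, park) rather than the real WorkerThread fields:

import java.util.concurrent.atomic.AtomicBoolean
import scala.annotation.switch

object LiftedLookForWorkSketch {
  private val done = new AtomicBoolean(false)

  private def checkExternalQueue(): Unit = () // former case 0, runs exactly once
  private def tryStealing(): Boolean = false  // former case 1
  private def park(): Unit = ()               // former case 2

  def lookForWork(): Unit = {
    // The one-time external-queue check is lifted out of the loop ...
    checkExternalQueue()
    // ... so the loop only has to alternate between the two remaining states.
    var state = 1
    while (!done.get()) {
      (state: @switch) match {
        case 1 =>
          // try to find work by stealing; park on the next iteration if none was found
          if (tryStealing()) return else state = 2
        case 2 =>
          // park, then go back to looking for work
          park()
          state = 1
      }
    }
  }
}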

armanbilge linked an issue on Jan 21, 2025 that may be closed by this pull request
armanbilge (Member, Author)

I think the failures are legit 😕 something must be broken.

armanbilge (Member, Author) commented on Jan 23, 2025

Aha, all the CI failures are on JDK 21, and this is the test it's hanging on.

if (javaMajorVersion >= 21)
  "block in-place on virtual threads" in real {
    val loomExec = classOf[Executors]
      .getDeclaredMethod("newVirtualThreadPerTaskExecutor")
      .invoke(null)
      .asInstanceOf[ExecutorService]
    val loomEc = ExecutionContext.fromExecutor(loomExec)
    IO.blocking {
      classOf[Thread]
        .getDeclaredMethod("isVirtual")
        .invoke(Thread.currentThread())
        .asInstanceOf[Boolean]
    }.evalOn(loomEc)
  }

Edit: but ... we are getting this weirdness on the new test I added in this PR.

java.lang.InterruptedException
        at java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1100)
        at java.base/java.util.concurrent.CountDownLatch.await(CountDownLatch.java:230)
        at cats.effect.IOPlatformSpecification.$anonfun$platformSpecs$184(IOPlatformSpecification.scala:601)
        at cats.effect.unsafe.WorkerThread.lookForWork$1(WorkerThread.scala:391)
        at cats.effect.unsafe.WorkerThread.run(WorkerThread.scala:819)

try {
  latch.await() // wait until next task is in external queue
} catch {
  case _: InterruptedException => // ignore, runtime is shutting down
}
armanbilge (Member, Author)

The stray InterruptedException is apparently fatal and was nuking all the runtimes, causing remaining tests to hang. We only noticed this on JDK 21 CI runners because the last test in the suite happens to be JDK 21+ only.

djspiewak (Member)

Haven't reviewed the code yet, but I ran some benchmarks to sanity check. Looks like a very slight regression (note the parTraverse results in particular), but considering it fixes a number of fairness issues, I don't consider that to be an impediment.

Before

[info] Benchmark                                             (cpuTokens)   (size)   Mode  Cnt     Score   Error    Units
[info] ParallelBenchmark.parTraverse                               10000     1000  thrpt   10   755.287 ± 1.087    ops/s
[info] ParallelBenchmark.traverse                                  10000     1000  thrpt   10    68.832 ± 0.112    ops/s
[info] WorkStealingBenchmark.alloc                                   N/A  1000000  thrpt   10    11.866 ± 0.089  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark          N/A  1000000  thrpt   10    24.042 ± 2.950  ops/min
[info] WorkStealingBenchmark.runnableScheduling                      N/A  1000000  thrpt   10  2541.428 ± 5.020  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal           N/A  1000000  thrpt   10  1874.609 ± 6.710  ops/min
[info] WorkStealingBenchmark.scheduling                              N/A  1000000  thrpt   10    25.641 ± 3.496  ops/min

After

[info] Benchmark                                             (cpuTokens)   (size)   Mode  Cnt     Score   Error    Units
[info] ParallelBenchmark.parTraverse                               10000     1000  thrpt   10   732.683 ± 1.429    ops/s
[info] ParallelBenchmark.traverse                                  10000     1000  thrpt   10    68.975 ± 0.100    ops/s
[info] WorkStealingBenchmark.alloc                                   N/A  1000000  thrpt   10    11.892 ± 0.135  ops/min
[info] WorkStealingBenchmark.manyThreadsSchedulingBenchmark          N/A  1000000  thrpt   10    30.569 ± 3.926  ops/min
[info] WorkStealingBenchmark.runnableScheduling                      N/A  1000000  thrpt   10  2576.645 ± 5.654  ops/min
[info] WorkStealingBenchmark.runnableSchedulingScalaGlobal           N/A  1000000  thrpt   10  1905.347 ± 2.478  ops/min
[info] WorkStealingBenchmark.scheduling                              N/A  1000000  thrpt   10    25.254 ± 4.653  ops/min

Successfully merging this pull request may close these issues:

External queue can starve timers/pollers