Note that parallelism with STARPU_NWORKER_PER_CUDA needs asynchronism or threads
starpu-runtime · Jul 18, 2024 · 5cbea78
1 parent c5330b8
Showing 1 changed file with 5 additions and 0 deletions.
diff --git a/doc/doxygen/chapters/starpu_installation/environment_variables.doxy b/doc/doxygen/chapters/starpu_installation/environment_variables.doxy
@@ -316,6 +316,11 @@ create as many CUDA workers as there are GPU devices.
 Specify the number of workers per CUDA device, and thus the number of kernels
 which will be concurrently running on the devices, i.e. the number of CUDA
 streams. Default value is 1.
+
+For parallelism to be really achieved, one also needs to make CUDA codelets
+asynchronous (it is recommended for single-worker performance too anyway,
+see ::STARPU_CUDA_ASYNC in \ref CUDA-specificOptimizations), or to set \ref
+STARPU_CUDA_THREAD_PER_WORKER to 1.
 </dd>
 
 <dt>STARPU_CUDA_THREAD_PER_WORKER</dt>
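
As an illustration of the second option described in the added paragraph (several workers per CUDA device, each driven by its own thread), the two variables could be set from the application itself. This is only a minimal sketch: `request_cuda_worker_parallelism` is a hypothetical helper, not part of StarPU, and it assumes that the variables are read when `starpu_init()` runs, so it would have to be called before that. Exporting the same variables in the shell before launching the program should be equivalent.

```c
#define _POSIX_C_SOURCE 200112L /* for setenv() */
#include <stdio.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of StarPU): ask for `nworkers` workers per
 * CUDA device and one driver thread per worker.
 *
 * Assumption: StarPU reads these environment variables during starpu_init(),
 * so this function must be called before starpu_init().
 *
 * Returns 0 on success, -1 on invalid input or if setenv() fails.
 */
int request_cuda_worker_parallelism(unsigned nworkers)
{
    char value[16];

    if (nworkers == 0)
        return -1;

    /* Format the worker count as a decimal string for the environment. */
    snprintf(value, sizeof(value), "%u", nworkers);

    /* Last argument 1: overwrite any value already set by the user. */
    if (setenv("STARPU_NWORKER_PER_CUDA", value, 1) != 0)
        return -1;

    /* One thread per CUDA worker, the alternative to asynchronous codelets. */
    if (setenv("STARPU_CUDA_THREAD_PER_WORKER", "1", 1) != 0)
        return -1;

    return 0;
}
```

A caller would typically invoke `request_cuda_worker_parallelism(4)` at the very start of `main`, check the return value, and only then initialize StarPU. Passing 0 as the last argument of `setenv()` instead would let a value exported by the user take precedence.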