
Ideation on making Pthread more scalable #4645

Open
shivammonaka opened this issue Apr 15, 2024 · 7 comments

Comments

@shivammonaka
Contributor

Hello,

I'm currently working on improving the scalability of the OpenBLAS pthread flow. At present, I've observed that even when a BLAS call requires only 8 threads for execution on a 64-core machine, it still locks all available resources via level3_lock in level3_thread.c. These resources are only released after the call completes, resulting in poor CPU utilization (approximately 12.5%).

My goal is to maximize CPU resource utilization, ideally reaching close to 100%. To achieve this, I have a theoretical concept in mind and would greatly appreciate community suggestions and insights.

The Idea:
Instead of the mutex lock in level3_thread.c, I propose a locking mechanism with a condition wait. This would allow more BLAS calls to proceed until all CPUs are fully utilized. When a BLAS operation completes, its CPUs can be released, signaling the waiting threads to re-check resource availability. Resource allocation and deallocation can be managed through a thread-safe mechanism.
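
To make this concrete, here is a minimal sketch of such a condition-wait mechanism (illustrative only, not actual OpenBLAS code; NUM_CORES and the function names are placeholders): a global count of free cores guarded by a mutex, with a condition variable so a caller blocks only until enough cores become free.

```c
/* Illustrative sketch only, not OpenBLAS source: a global count of free
 * cores guarded by a mutex, with a condition variable so a caller blocks
 * only until enough cores become available. */
#include <pthread.h>

#define NUM_CORES 64   /* placeholder for the detected core count */

static pthread_mutex_t core_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cores_freed = PTHREAD_COND_INITIALIZER;
static int free_cores = NUM_CORES;

/* Block until `needed` cores are available, then reserve them. */
static void acquire_cores(int needed)
{
    pthread_mutex_lock(&core_lock);
    while (free_cores < needed)
        pthread_cond_wait(&cores_freed, &core_lock);
    free_cores -= needed;
    pthread_mutex_unlock(&core_lock);
}

/* Return the cores and wake all waiters so they re-check availability. */
static void release_cores(int used)
{
    pthread_mutex_lock(&core_lock);
    free_cores += used;
    pthread_cond_broadcast(&cores_freed);
    pthread_mutex_unlock(&core_lock);
}
```

A level-3 call that needs 8 threads would then call acquire_cores(8) where it currently takes level3_lock, run, and call release_cores(8) afterwards; on a 64-core machine, eight such calls could be in flight at once instead of one.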

I'm seeking feedback on the feasibility and effectiveness of this approach. Are there any potential oversights or inaccuracies in my understanding? I'm open to any insights or suggestions for further improvement.

@brada4
Contributor

brada4 commented Apr 15, 2024

Your data set is too small for all 64 caches.

@shivammonaka
Contributor Author

shivammonaka commented Apr 15, 2024

Hi @brada4, I understand your point, but if the dataset is so small that it requires only a few cores (for example, 8), then why should we lock all the resources and leave them under-utilized? We could allow multiple calls to execute at the same time.

You meant cores and not caches, right? If you meant caches, can you please elaborate on your point?

@brada4
Contributor

brada4 commented Apr 15, 2024

There are some badly modelled areas: if input + temp + output fits in one cache, one core is optimal; anywhere above that it switches to threads on all cores, and there is an observable glitch for some size range after that, until huge data gets a linear speedup. More CPUs means a larger pessimal range. Better heuristics are welcome.

@shivammonaka
Contributor Author

Hi @brada4

I agree with your point, but I have a different concern. Let's imagine this scenario:

Scenario: I have 10 BLAS calls, and suppose OpenBLAS decided that nthreads = 8 is enough for each of them based on their size (this does happen when the scale m*n*k is around 10^6).
Now, what happens in current OpenBLAS is: only one of the BLAS calls (say call 1) gets through the level3_lock in level3_thread.c and starts executing, while the others wait to acquire the lock. The lock taken by the first call is only released when its execution completes. Similarly, call 2 acquires the lock only after call 1 has finished, and calls 3-10 keep waiting on the lock. In this scenario, only 8 of the 64 cores are utilized at any time; the other 56 cores sit idle when they could be handling the other BLAS calls concurrently.
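
In code terms, the current behaviour is roughly the following (a simplified illustration, not the actual level3_thread.c source; run_gemm_with_n_threads is a placeholder):

```c
/* Simplified illustration of the current behaviour, not the actual
 * level3_thread.c source. One process-wide lock is held for the whole
 * call, so concurrent level-3 calls serialize even when each one only
 * needs a few threads. */
#include <pthread.h>

static pthread_mutex_t level3_lock = PTHREAD_MUTEX_INITIALIZER;

void run_gemm_with_n_threads(int nthreads);   /* placeholder worker */

void gemm_driver(void)
{
    pthread_mutex_lock(&level3_lock);    /* calls 2..10 block here       */
    run_gemm_with_n_threads(8);          /* only 8 of 64 cores are busy  */
    pthread_mutex_unlock(&level3_lock);  /* next queued call may proceed */
}
```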

@brada4
Contributor

brada4 commented Apr 16, 2024

I get your idea; you are talking about a scoreboard, not a lock.
Currently OpenBLAS uses either one thread or all of them. If it could be tuned to gradually raise the thread count based on size (relative to cache line/L2/L3?), you could reserve, say, 4 threads for one call and leave the unused slots for another call, which might use one, and so on.
I think the first step would be to make the thread pool behave better (smaller) in corner cases, so you actually get those free threads.
Even that would do nicely, keeping some cores in max turbo for the smaller task.
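
A rough sketch of the "gradually raise threads with size" idea (an assumed heuristic, not anything OpenBLAS currently implements; PER_THREAD_WORK is an illustrative tunable) could look like this:

```c
/* Assumed heuristic, not existing OpenBLAS code: scale the thread count
 * with the amount of work instead of jumping from 1 thread to all of
 * them. PER_THREAD_WORK is an illustrative tunable (roughly the amount
 * of work one core handles well). */
#define PER_THREAD_WORK (1UL << 20)

static int choose_nthreads(unsigned long m, unsigned long n,
                           unsigned long k, int max_threads)
{
    unsigned long work = m * n * k;            /* total multiply-adds */
    int nthreads = (int)(work / PER_THREAD_WORK);

    if (nthreads < 1) nthreads = 1;
    if (nthreads > max_threads) nthreads = max_threads;
    return nthreads;
}
```

Combined with a scoreboard like the one sketched earlier in the thread, a call sized for 4 threads would reserve 4 slots and leave the remaining slots free for other calls.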

@martin-frbg
Collaborator

I've been trying to address this too, but at the current stage OpenBLAS simply uses a "sensible" maximum number of threads (based on what is accepted as the maximum workload for a single thread before switching to multithreading) instead of throwing all cores at any problem. I have (obviously) not addressed this level3_lock issue in my experiments so far, and I get the impression that you are more experienced with pthread locking algorithms than me anyway. Given that most of the core level3 code has not changed since the early days, one probably needs to look out for state variables that are safe only as long as only a single BLAS call is active - but those will probably show up readily enough, if they exist.
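
As a hypothetical illustration of the kind of hazard meant here (names not taken from the OpenBLAS source):

```c
/* Hypothetical example of a state variable that is only safe while a
 * single level-3 call is active. With two concurrent calls, both write
 * the shared variable and one of them later partitions with the wrong
 * thread count. Names are illustrative, not from the OpenBLAS source. */
static int current_nthreads;                 /* shared, unsynchronized */

void level3_driver(int nthreads_for_this_call)
{
    current_nthreads = nthreads_for_this_call;  /* call A writes 8     */
    /* ... later partitioning code reads current_nthreads; if call B   */
    /* wrote 4 in the meantime, call A now partitions for 4 threads.   */
}
```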

@shivammonaka
Contributor Author

@martin-frbg
I've been running some experiments of my own. I've gotten a good performance improvement in most cases, and I'm fixing some corner cases where I'm seeing spikes. I will update soon. Thanks for answering my queries.
