
Ideation on making Pthread more scalable #4645

Open
shivammonaka opened this issue Apr 15, 2024 · 7 comments

Comments

@shivammonaka
Contributor

Hello,

I'm currently working on improving the scalability of the OpenBLAS pthread flow. At present, I've observed that even when a BLAS call requires only 8 threads for execution on a 64-core machine, it still locks all available resources via level3_lock in level3_thread.c. These resources are only released after the call completes, resulting in poor CPU utilization (approximately 12.5%).

My goal is to maximize CPU resource utilization, ideally reaching close to 100%. To achieve this, I have a theoretical concept in mind and would greatly appreciate community suggestions and insights.

The Idea:
Instead of the mutex lock in level3_thread.c, I propose a locking mechanism with a condition wait. This would allow more BLAS calls to proceed until all CPUs are fully utilized. When a BLAS operation completes, its CPUs can be released, signaling the waiting threads to re-check resource availability. Resource allocation and deallocation can be managed through a thread-safe mechanism.
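
To make this concrete, here is a minimal sketch of such a condition-wait mechanism (illustrative only, not actual OpenBLAS code; NUM_CORES and the function names are placeholders): a global count of free cores guarded by a mutex, with a condition variable so a caller blocks only until enough cores become free.

```c
/* Illustrative sketch only, not OpenBLAS source: a global count of free
 * cores guarded by a mutex, with a condition variable so a caller blocks
 * only until enough cores become available. */
#include <pthread.h>

#define NUM_CORES 64   /* placeholder for the detected core count */

static pthread_mutex_t core_lock   = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cores_freed = PTHREAD_COND_INITIALIZER;
static int free_cores = NUM_CORES;

/* Block until `needed` cores are available, then reserve them. */
static void acquire_cores(int needed)
{
    pthread_mutex_lock(&core_lock);
    while (free_cores < needed)
        pthread_cond_wait(&cores_freed, &core_lock);
    free_cores -= needed;
    pthread_mutex_unlock(&core_lock);
}

/* Return the cores and wake all waiters so they re-check availability. */
static void release_cores(int used)
{
    pthread_mutex_lock(&core_lock);
    free_cores += used;
    pthread_cond_broadcast(&cores_freed);
    pthread_mutex_unlock(&core_lock);
}
```

A level-3 call that needs 8 threads would then call acquire_cores(8) where it currently takes level3_lock, run, and call release_cores(8) afterwards; on a 64-core machine, eight such calls could be in flight at once instead of one.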

I'm seeking feedback on the feasibility and effectiveness of this approach. Are there any potential oversights or inaccuracies in my understanding? I'm open to any insights or suggestions for further improvement.

@brada4
Contributor

brada4 commented Apr 15, 2024

Your data set is too small for all 64 caches.

@shivammonaka
Contributor Author

shivammonaka commented Apr 15, 2024

Hi @brada4, I understand your point, but if the dataset is so small that it requires only a few cores (for example, 8), then why should we lock all the resources and leave them under-utilized? We could allow multiple calls to execute at the same time.

You meant cores and not caches, right? If you meant caches, can you please elaborate on your point?

@brada4
Contributor

brada4 commented Apr 15, 2024

There are some badly modelled areas: if input + temp + output fits in one cache, one core is optimal; anywhere above that it switches to threads on all cores, and there is an observable glitch for some size range after that, until huge data gets a linear speedup. More CPUs means a larger pessimal range. Better heuristics are welcome.

@shivammonaka
Contributor Author

Hi @brada4

I agree with your point, but I have a different concern. Let's imagine this scenario:

Scenario: I have 10 BLAS calls, and suppose OpenBLAS decided that nthreads = 8 is enough for each of them based on their size (this does happen when the scale m*n*k is around 10^6).
Now, what happens in current OpenBLAS is: only one of the BLAS calls (say call 1) gets through the level3_lock in level3_thread.c and starts executing, while the others wait to acquire the lock. The lock taken by the first call is only released when its execution completes. Similarly, call 2 acquires the lock only after call 1 has finished, and calls 3-10 keep waiting on the lock. In this scenario, only 8 of the 64 cores are utilized at any time; the other 56 cores sit idle when they could be handling the other BLAS calls concurrently.
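
In code terms, the current behaviour is roughly the following (a simplified illustration, not the actual level3_thread.c source; run_gemm_with_n_threads is a placeholder):

```c
/* Simplified illustration of the current behaviour, not the actual
 * level3_thread.c source. One process-wide lock is held for the whole
 * call, so concurrent level-3 calls serialize even when each one only
 * needs a few threads. */
#include <pthread.h>

static pthread_mutex_t level3_lock = PTHREAD_MUTEX_INITIALIZER;

void run_gemm_with_n_threads(int nthreads);   /* placeholder worker */

void gemm_driver(void)
{
    pthread_mutex_lock(&level3_lock);    /* calls 2..10 block here       */
    run_gemm_with_n_threads(8);          /* only 8 of 64 cores are busy  */
    pthread_mutex_unlock(&level3_lock);  /* next queued call may proceed */
}
```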

@brada4
Contributor

brada4 commented Apr 16, 2024

I get your idea; you are talking about a scoreboard, not a lock.
Currently OpenBLAS uses either one thread or all of them. If it could be tuned to gradually raise the thread count based on size (relative to cache line/L2/L3?), you could reserve, say, 4 threads for one call and leave the unused slots for another call, which might use one, and so on.
I think the first step would be to make the thread pool behave better (smaller) in corner cases, so you actually get those free threads.
Even that would do nicely, keeping some cores in max turbo for the smaller task.
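
A rough sketch of the "gradually raise threads with size" idea (an assumed heuristic, not anything OpenBLAS currently implements; PER_THREAD_WORK is an illustrative tunable) could look like this:

```c
/* Assumed heuristic, not existing OpenBLAS code: scale the thread count
 * with the amount of work instead of jumping from 1 thread to all of
 * them. PER_THREAD_WORK is an illustrative tunable (roughly the amount
 * of work one core handles well). */
#define PER_THREAD_WORK (1UL << 20)

static int choose_nthreads(unsigned long m, unsigned long n,
                           unsigned long k, int max_threads)
{
    unsigned long work = m * n * k;            /* total multiply-adds */
    int nthreads = (int)(work / PER_THREAD_WORK);

    if (nthreads < 1) nthreads = 1;
    if (nthreads > max_threads) nthreads = max_threads;
    return nthreads;
}
```

Combined with a scoreboard like the one sketched earlier in the thread, a call sized for 4 threads would reserve 4 slots and leave the remaining slots free for other calls.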

@martin-frbg
Collaborator

I've been trying to address this too, but at the current stage OpenBLAS simply uses a "sensible" maximum number of threads (based on what is accepted as the maximum workload for a single thread before switching to multithreading) instead of throwing all cores at any problem. I have (obviously) not addressed this level3_lock issue in my experiments so far, and I get the impression that you are more experienced with pthread locking algorithms than me anyway. Given that most of the core level3 code has not changed since the early days, one probably needs to look out for state variables that are safe only as long as only a single BLAS call is active - but those will probably show up readily enough, if they exist.
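
As a hypothetical illustration of the kind of hazard meant here (names not taken from the OpenBLAS source):

```c
/* Hypothetical example of a state variable that is only safe while a
 * single level-3 call is active. With two concurrent calls, both write
 * the shared variable and one of them later partitions with the wrong
 * thread count. Names are illustrative, not from the OpenBLAS source. */
static int current_nthreads;                 /* shared, unsynchronized */

void level3_driver(int nthreads_for_this_call)
{
    current_nthreads = nthreads_for_this_call;  /* call A writes 8     */
    /* ... later partitioning code reads current_nthreads; if call B   */
    /* wrote 4 in the meantime, call A now partitions for 4 threads.   */
}
```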

@shivammonaka
Contributor Author

@martin-frbg
I've been running some experiments of my own. I've gotten a good performance improvement in most cases, and I'm fixing some corner cases where I'm seeing spikes. I will update soon. Thanks for answering my queries.
