Parameter adjustment of [CZ]GEMM_DEFAULT_[PQ] for Neoverse V1 #4742

Open
tetsuzo-usui opened this issue Jun 7, 2024 · 1 comment

@tetsuzo-usui

Hello.
In the previous pull request #4381, the P and Q parameters of [SD]GEMM were increased to make better use of the L2 cache of Neoverse V1, but the complex [CZ]GEMM parameters were left unchanged. I tried adjusting them in the hope of a similar performance improvement.

[CZ]GEMM_DEFAULT_P is set to 120, since the [SD]GEMM_DEFAULT_P value of 240 real elements corresponds to 120 complex elements. [CZ]GEMM_DEFAULT_Q is set to 320 for double precision and 640 for single precision, which gives blocking similar to that of the real routines in terms of cache usage. The graph below shows the performance improvement with 64 threads:

[Graph: OpenBLAS_ZGEMM_PQparam]
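
As a rough check of the blocking arithmetic, the sketch below computes the size of a P x Q block of packed data for the proposed complex parameters. It assumes, as in the usual GotoBLAS-style blocking, that P and Q bound the dimensions of one packed block; the helper name `packed_block_bytes` is hypothetical and is not part of OpenBLAS.

```python
# Minimal sketch: footprint of one P x Q packed block, in bytes.
# Assumption: the block holds P * Q matrix elements, and a complex
# element is stored as two real scalars (real and imaginary parts).

def packed_block_bytes(p, q, scalar_bytes, is_complex):
    """Return the size in bytes of a P x Q block of matrix elements."""
    scalars_per_element = 2 if is_complex else 1
    return p * q * scalars_per_element * scalar_bytes

# Proposed values from this issue: P = 120, Q = 640 (single), 320 (double).
proposed = {
    "CGEMM": packed_block_bytes(120, 640, 4, True),
    "ZGEMM": packed_block_bytes(120, 320, 8, True),
}

# 120 complex elements occupy the same number of scalars as 240 real ones.
complex_p_in_scalars = 120 * 2

if __name__ == "__main__":
    print(f"P = 120 complex elements = {complex_p_in_scalars} real scalars")
    for name, nbytes in proposed.items():
        print(f"{name}: {nbytes} bytes ({nbytes // 1024} KiB)")
```

Under these assumptions both precisions end up with the same block footprint, because halving Q for double precision offsets the doubled scalar size.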

However, performance degradation of xTRMM has been observed as a side effect of this parameter change.

[Graph: OpenBLAS_ZTRMM_PQparam]

This is analyzed as follows:
In the TRMM calculation, the work is ultimately carried out by two internal kernels, GEMM_KERNEL and TRMM_KERNEL. The change in blocking reduces the amount of computation performed by GEMM_KERNEL and increases the amount performed by the less efficient TRMM_KERNEL. As a result, the overall calculation becomes less efficient.
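
To illustrate the effect, here is a deliberately simplified model and not the actual OpenBLAS TRMM driver. It assumes an n x n triangular matrix split into square blocks of a given size, with the diagonal blocks handled by the TRMM kernel and the off-diagonal blocks by the GEMM kernel. The function name and the block sizes used are hypothetical.

```python
# Simplified model (an assumption, not OpenBLAS code): fraction of the
# triangular matrix elements that fall inside diagonal blocks, which in
# this model are the part processed by the TRMM kernel.

def diagonal_block_fraction(n, block):
    """Fraction of triangle elements lying in the diagonal blocks."""
    if n <= 0 or block <= 0 or n % block != 0:
        raise ValueError("n must be a positive multiple of block")
    total = n * (n + 1) // 2                    # elements in the triangle
    per_diag_block = block * (block + 1) // 2   # triangle inside one block
    diag = (n // block) * per_diag_block
    return diag / total

if __name__ == "__main__":
    n = 1280  # hypothetical matrix size
    for block in (128, 320):  # hypothetical smaller and larger blocking
        frac = diagonal_block_fraction(n, block)
        print(f"block={block}: {frac:.1%} of the triangle in diagonal blocks")
```

In this model the diagonal share is (block + 1) / (n + 1), so it grows roughly in proportion to the block size, which is consistent with a larger blocking shifting work from GEMM_KERNEL to TRMM_KERNEL.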

If you agree that improving the performance of TRMM_KERNEL can be treated as a separate issue to be addressed in the future, the [CZ]GEMM_DEFAULT_[PQ] parameters can be changed in advance.
In that case, please let me know and I will prepare the parameter change.

@martin-frbg
Collaborator

I'm a bit worried that the performance loss in TRMM appears to be proportionally greater than the gain achieved in GEMM, as far as I can make out from your graphs. Or am I mistaken?
