Hello.
In the previous pull request #4381, the P and Q parameters of [SD]GEMM were increased to make better use of the L2 cache of Neoverse V1, but the complex [CZ]GEMM parameters remained unchanged. I tried adjusting them in the hope of a similar performance improvement.
[CZ]GEMM_DEFAULT_P is set to 120, since the [SD]GEMM_DEFAULT_P value of 240 real elements corresponds to 120 complex elements. [CZ]GEMM_DEFAULT_Q is set to 320 for double precision and 640 for single precision, which gives blocking similar to that of the real routines in terms of cache usage. The attached performance graph for 64 threads shows the improvement.
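The arithmetic behind these values can be checked with a short sketch. It assumes that the cache-resident packed block is P x Q elements, which is the usual reading of GEMM_P and GEMM_Q in a Goto-style blocked GEMM; the helper name `packed_block_bytes` is hypothetical and is not part of OpenBLAS.

```python
# Hypothetical helper, not an OpenBLAS API: size in bytes of a P x Q block,
# assuming one packed block of P x Q elements is what must fit in cache.
def packed_block_bytes(p, q, element_bytes):
    return p * q * element_bytes

COMPLEX_SINGLE_BYTES = 8    # two 4-byte floats per element (CGEMM)
COMPLEX_DOUBLE_BYTES = 16   # two 8-byte doubles per element (ZGEMM)

# 240 real values hold the same amount of data as 120 complex values
# of the same precision.
complex_p = 240 // 2

cgemm_bytes = packed_block_bytes(complex_p, 640, COMPLEX_SINGLE_BYTES)
zgemm_bytes = packed_block_bytes(complex_p, 320, COMPLEX_DOUBLE_BYTES)

print(complex_p)            # 120
print(cgemm_bytes // 1024)  # 600 (KiB)
print(zgemm_bytes // 1024)  # 600 (KiB)
```

Under that assumption, halving Q for double precision keeps the block footprint identical for CGEMM and ZGEMM.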
However, performance degradation of xTRMM has been observed as a side effect of this parameter change.
This is analyzed as follows:
In the TRMM calculation, the work is ultimately carried out by two internal kernels, GEMM_KERNEL and TRMM_KERNEL. The change in blocking reduces the share of the computation performed by GEMM_KERNEL and increases the share performed by the less efficient TRMM_KERNEL. As a result, the overall calculation becomes less efficient.
If you agree that improving the performance of TRMM_KERNEL can be treated as a separate issue to be addressed in the future, the [CZ]GEMM_DEFAULT_[PQ] parameters can be changed in advance.
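A toy model illustrates why larger blocks shift work toward TRMM_KERNEL. It assumes that an n x n triangular matrix is cut into square blocks, that blocks on the diagonal go to TRMM_KERNEL, and that the remaining blocks go to GEMM_KERNEL. This is a simplification for illustration, not the actual OpenBLAS driver logic, and the function name `trmm_kernel_fraction` is hypothetical.

```python
# Toy model, not OpenBLAS code: fraction of the triangle's entries that lie
# in diagonal blocks of size `block` (assumed to be handled by TRMM_KERNEL).
def trmm_kernel_fraction(n, block):
    total = n * (n + 1) // 2              # entries in the whole triangle
    diagonal = 0
    for start in range(0, n, block):
        b = min(block, n - start)         # last block may be smaller
        diagonal += b * (b + 1) // 2      # triangular part of this block
    return diagonal / total

# Larger blocks put a larger share of the triangle on the diagonal.
for block in (60, 120, 240):
    print(block, trmm_kernel_fraction(1920, block))
```

When `block` divides `n`, the fraction simplifies to (block + 1) / (n + 1), so in this model the TRMM_KERNEL share grows roughly linearly with the block size.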
In that case, please let me know and I will fix the parameters.
I'm a bit worried that the performance loss in TRMM appears to be proportionally greater than the gain achieved in GEMM, as far as I can make out from your graphs. Or am I mistaken?