Description
I'm trying to use `cblas_domatcopy` to transpose large row-major matrices, and I'm finding that it is slower than a simple loop of `cblas_dcopy` calls parallelized with OpenMP (with the number of threads set to the number of logical cores; otherwise the OpenMP loop is much slower).
`cblas_domatcopy` appears to be especially slow when the input has more columns than rows. Relatedly, in a `dcopy` loop there is also a large timing difference depending on whether the copies go by rows of the input or by rows of the output, so I'm guessing that perhaps `omatcopy` always follows the same order. (Code is provided at the end of this post.)
Timings in seconds on an Intel 12700H, average of 7 runs:
- Input size: 100,000 x 5,000
  - OpenBLAS `cblas_domatcopy`: 3.12
  - OpenMP `dcopy` loop: 2.38
  - MKL `MKL_Domatcopy`: 1.26
- Input size: 5,000 x 100,000
  - OpenBLAS `cblas_domatcopy`: 3.74
  - OpenMP `dcopy` loop: 1.23
  - MKL `MKL_Domatcopy`: 1.27
Timings in seconds on an AMD Ryzen 7840HS, average of 7 runs:
- Input size: 100,000 x 5,000
  - OpenBLAS `cblas_domatcopy`: 0.922
  - OpenMP `dcopy` loop: 0.586
  - MKL `MKL_Domatcopy`: 0.560
- Input size: 5,000 x 100,000
  - OpenBLAS `cblas_domatcopy`: 1.12
  - OpenMP `dcopy` loop: 0.402
  - MKL `MKL_Domatcopy`: 0.516
OpenBLAS version: 0.3.26, OpenMP variant.
Code that I'm using for the OpenMP `dcopy` loop:
```c
void transpose_mat(const double *A, const int nrows, const int ncols, double *B, int nthreads)
{
    if (nrows >= ncols)
    {
        /* Copy by rows of the input: contiguous reads from A, strided writes to B */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int row = 0; row < nrows; row++)
            cblas_dcopy(ncols, A + (size_t)row*(size_t)ncols, 1, B + row, nrows);
    }
    else
    {
        /* Copy by rows of the output: strided reads from A, contiguous writes to B */
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        for (int col = 0; col < ncols; col++)
            cblas_dcopy(nrows, A + col, ncols, B + (size_t)col*(size_t)nrows, 1);
    }
}
```