OpenBLAS on windows substantially slower than other BLAS flavours #160

Open

h-vetinari opened this issue Jun 14, 2024 · 18 comments

Comments

@h-vetinari
Member

In the context of the BLAS variant testing for scipy, I noticed that the OpenBLAS runs were much slower. I did some basic timing comparisons based on what pytest reports as the overall runtime of the scipy test suite (the timing also depends quite a bit on whether the agent has AVX512F/AVX512CD or not, and this is random across Azure's fleet, so I'm taking the average across implementation & CPU type):

| Flavour  | Avg. (AVX2) | #runs | Avg. (AVX512) | #runs |
|----------|-------------|-------|---------------|-------|
| blis     | 00:26:10.0  | 2     | 00:18:28.2    | 6     |
| mkl      | 00:24:35.4  | 5     | 00:18:56.4    | 5     |
| netlib   | 00:25:44.7  | 3     | 00:19:03.0    | 5     |
| openblas | 01:34:15.0  | 2     | 01:24:06.8    | 4     |

Overall, OpenBLAS ends up being 3-5x slower than all the other BLAS/LAPACK implementations, which to me is indicative of something going very wrong somewhere.

CC @martin-frbg @rgommers

@martin-frbg

All these numbers look a little strange to me - why would MKL be only marginally better than the entirely unoptimized reference implementation? And I'd expect even the baseline OpenBLAS - an MSVC build that can only make use of the generic C sources - to be slightly faster than the reference as well.

@h-vetinari
Member Author

h-vetinari commented Jun 14, 2024

why would MKL be only marginally better than the entirely unoptimized reference implementation?

The answer is almost certainly that the cumulative runtime of blas/lapack calls, as a portion of the total runtime of the entire scipy test suite, gets dominated by random variability of the CI agent (memory and CPU contention, caches, etc.).

Under this hypothesis, only when something takes way longer on average in the blas/lapack interface does it actually make a difference.

(The difference between CI agents is likely also determined by other aspects than just the available CPU instructions - numpy does runtime dispatching, but none of our blas/lapack builds have been compiled to take advantage of avx instructions, for example.)
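(To take the CI noise out of the picture entirely, one could also time a single BLAS-heavy call per flavour in isolation. A minimal sketch, with arbitrary matrix size and repeat count:)

```python
import time

import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(0)
a = rng.standard_normal((2000, 2000))
b = rng.standard_normal((2000, 2000))

times = []
for _ in range(5):
    t0 = time.perf_counter()
    dgemm(1.0, a, b)
    times.append(time.perf_counter() - t0)

# the minimum of a few repeats is the least noisy estimate on a shared CI agent
print(f"best dgemm time: {min(times):.3f} s")
```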

@rgommers

Sounds like the issue in OpenMathLib/OpenBLAS#4582. Was the openblas package in conda-forge rebuilt with the fix included?

@h-vetinari
Member Author

Sounds like the issue in OpenMathLib/OpenBLAS#4582. Was the openblas package in conda-forge rebuilt with the fix included?

That issue was fixed in OpenMathLib/OpenBLAS#4587 which landed in 0.3.27 (which is the version we're using across conda-forge already).
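(For completeness, the OpenBLAS version that actually gets loaded at runtime can be checked with threadpoolctl; a sketch, assuming that package is installed in the test environment:)

```python
import numpy  # noqa: F401  (ensures numpy's BLAS library is loaded)
from threadpoolctl import threadpool_info

for lib in threadpool_info():
    if lib.get("internal_api") == "openblas":
        # expect version 0.3.27 here if the fixed build is really picked up
        print(lib.get("filepath"), lib.get("version"), lib.get("num_threads"))
```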

@rgommers

Hmm, that needs investigating then. The timings make it almost certain to be a similar deadlock or lock contention issue. It would take a 10x-100x slowdown in BLAS/LAPACK calls to explain the test suite being that slow.

There was a similar issue at scipy/scipy#20585 (comment).

@martin-frbg

There have been no other changes to the Windows thread server code in 0.3.27 since then (as far as I am aware right now), so I think the best option for testing would be to swap in the old version of blas_server_win32.c from OpenMathLib/OpenBLAS@66904f8 (this will probably need adding back the global declaration of int blas_omp_threads_local = 1; but hopefully no other changes). Unfortunately I do not have local access to a Windows machine where I am right now.

@martin-frbg

(though I would assume that any residual problem in mseminatore's PRs should have come up when testing the fix in PR4587)

@martin-frbg

The test case from OpenMathLib/OpenBLAS#4582 passes for me without any apparent delays, so I don't think it's that.

@martin-frbg

Also seems to me the "similar" issue is/was plagued by some kind of infighting between duplicate libraries?

@h-vetinari
Member Author

Also seems to me the "similar" issue is/was plagued by some kind of infighting between duplicate libraries?

That's an interesting conjecture, though I don't see how that could happen in conda-forge, where we generally take care to unvendor things so there's only one copy. Especially for numpy and scipy, where we keep a close eye on the builds.
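(If one wanted to rule it out anyway, here is a rough sketch, assuming psutil is available, that lists every openblas DLL mapped into the python process on Windows:)

```python
import psutil
import numpy  # noqa: F401  (force numpy's BLAS library to be loaded)
import scipy.linalg  # noqa: F401

dlls = {m.path for m in psutil.Process().memory_maps()
        if "openblas" in m.path.lower()}
for path in sorted(dlls):
    print(path)
if len(dlls) > 1:
    print("warning: more than one openblas DLL is loaded")
```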

@isuruf
Member

isuruf commented Jun 22, 2024

Can you post a link to the logs?

@h-vetinari
Member Author

Can you post a link to the logs?

Here you go (plus any other still remaining runs of TEST: 1.13.x + blas variants on the scipy feedstock, before azure deletes them after a month).

@martin-frbg

First impression is that it appears to be linalg/tests/test_basic.py::TestLstsq::test_random_complex_exact that is taking unusually long, and several runs with what look like markedly shorter times simply aborted due to some missing dependencies in the python framework.
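(A standalone timing of complex least-squares calls, not the actual scipy test, could confirm whether those LAPACK drivers alone reproduce the delay; the sizes below are arbitrary:)

```python
import time

import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
a = rng.standard_normal((200, 200)) + 1j * rng.standard_normal((200, 200))
b = rng.standard_normal((200, 3)) + 1j * rng.standard_normal((200, 3))

for driver in ("gelsd", "gelss", "gelsy"):
    t0 = time.perf_counter()
    lstsq(a, b, lapack_driver=driver)
    print(driver, f"{time.perf_counter() - t0:.3f} s")
```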

@h-vetinari
Member Author

and several runs with what look like markedly shorter times simply aborted due to some missing dependencies in the python framework.

In those PRs you cannot just look at the overall runtime, because we're trying to forcefully make a distinction between CPUs with/without AVX512F/AVX512CD, and the run aborts if the CPU architecture expectation isn't met (because some past & present failures had different behaviours, and azure doesn't provide a way to influence that). However, if that happens, then there's zero runtime of the test suite.

On windows, roughly everything over 60min actually ran the tests, everything under 50min didn't, and in between it depends on circumstances (fast or slow agent).

@h-vetinari
Member Author

First impression is that it appears to be [...] that is taking unusually long

It's quite unlikely that a single test (or anything less than a widespread/systemic issue) can blow out the test times like that, especially since many tests are completely unrelated to blas. Tests also have a 20min timeout, and none of them hits it.
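(One way to check this directly would be pytest's slowest-test report for the linalg subset; a sketch, with an arbitrary choice of subset and count:)

```python
import pytest

# report the 50 slowest tests of the installed scipy.linalg test suite
pytest.main(["--pyargs", "scipy.linalg", "-q", "--durations=50"])
```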

@martin-frbg

umm, did you ever get around to testing with the pre-0.3.26 blas_server_win32.c as per #160 (comment)?

@h-vetinari
Member Author

Thanks for the reminder. I didn't understand the ask at the time, but I've dug a bit into the git history and I think/hope I got it right; I'm currently building an openblas version with this in #162. We might already see a difference in the blas testing on this feedstock (compare "Good news #2" from here), and if that's not conclusive, we can publish those builds to a separate label to test them in SciPy.

@martin-frbg

Thank you - meanwhile it looks like I might have to revert that PR anyway, due to the sunpy thread safety issue that came up on numpy :(
