poor performance of exp() on 32 bit #10425
Comments
I can confirm very similar numbers on both Windows and Linux. This is probably an openlibm issue on 32 bit, wouldn't be the first one. Linux numbers:
|
Comparing with the system libm would be interesting. |
I think that'll need someone with a real 32 bit Linux system to try out. I don't think system libm is usable on Windows (not even positive where it lives - inside msvcrt I think?), and my 32 bit Linux builds are multilib compiled from a 64 bit OS. |
Machine is old & slow, but here you go ...
$ ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+3727 (2015-03-08 23:04 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 768401c* (0 days old master)
|__/                   |  i686-redhat-linux
julia> versioninfo()
Julia Version 0.4.0-dev+3727
Commit 768401c* (2015-03-08 23:04 UTC)
Platform Info:
System: Linux (i686-redhat-linux)
CPU: Genuine Intel(R) CPU T2250 @ 1.73GHz
WORD_SIZE: 32
BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Banias)
LAPACK: libopenblas
LIBM: libopenlibm
LLVM: libLLVM-3.3
julia> mysysexp(x::Float64) = ccall((:exp, "libm"), Float64, (Float64,), x)
mysysexp (generic function with 1 method)
julia> @vectorize_1arg Float64 mysysexp
mysysexp (generic function with 4 methods)
julia> r0 = rand(Float64,(5000, 5000));
julia> r = r0; @time exp(r);
elapsed time: 2.727944529 seconds (190 MB allocated)
julia> r = r0; @time exp(r);
elapsed time: 2.676581093 seconds (190 MB allocated)
julia> r = r0; @time mysysexp(r);
elapsed time: 3.142749421 seconds (190 MB allocated)
julia> r = r0; @time mysysexp(r);
elapsed time: 3.012173802 seconds (190 MB allocated)
|
Interesting, thanks. What about for the large values? |
Oh yeah, ...
julia> r = 10000 * r0; @time exp(r);
elapsed time: 6.087031901 seconds (190 MB allocated)
julia> r = 10000 * r0; @time exp(r);
elapsed time: 6.066508487 seconds (190 MB allocated)
julia> r = 10000 * r0; @time mysysexp(r);
elapsed time: 10.517670321 seconds (190 MB allocated)
julia> r = 10000 * r0; @time mysysexp(r);
elapsed time: 10.469830777 seconds (190 MB allocated)
|
Thanks! I don't know much about the internals of how different libm's implement |
I don't feel terribly worried about 32-bit. Is there a particular application motivating this - or just an observation, in case we can do something better? |
We surely should. I was just curious. Should we move this issue to openlibm? |
Sounds like that's the next step on its journey. (FWIW, I just moved the program over to a 64-bit linux machine--I have the luxury of SSH--but unfortunately for Reasons Windows is definitely more convenient.) |
One point to keep in mind is that
One thing I don't understand with openlibm is what determines whether src/e_exp.c or i387/e_exp.S is used? |
Both are getting linked. |
Going back to the original issue, where we are computing PDFs, the same is true for the underflow branch (where the exponential evaluates to floating-point 0). |
Having looked again, my guess is that on i686 we're calling the x87 assembly code, which doesn't have an early branch for underflow or overflow. |
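To make the "early branch" idea concrete, here is a minimal sketch in current Julia syntax. It is not openlibm's or Julia's actual code: `guarded_exp`, `O_THRESHOLD` and `U_THRESHOLD` are hypothetical names, and the two constants are the Float64 overflow and underflow cutoffs as I recall them from fdlibm-style C implementations of exp, so treat them as an assumption. The point is only that inputs whose result is already known to be Inf or 0.0 can return before any expensive work is done.

```julia
# Hypothetical illustration of an early-exit guard around exp.
# Thresholds are assumed (fdlibm-style Float64 cutoffs), not taken from the thread.
const O_THRESHOLD = 7.09782712893383973096e+02   # above this, exp(x) overflows to Inf
const U_THRESHOLD = -7.45133219101941108420e+02  # below this, exp(x) underflows to 0.0

function guarded_exp(x::Float64)
    # Early branches: the answer is known without evaluating exp at all.
    x > O_THRESHOLD && return Inf
    x < U_THRESHOLD && return 0.0
    # NaN fails both comparisons and falls through to exp, which returns NaN.
    return exp(x)
end
```

With a guard like this, an input array dominated by overflowing or underflowing values mostly takes the cheap path, which would be consistent with large-valued inputs being faster on the platforms that use the C code path.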
We are using a native julia |
was there any benchmarking on 32 bit? |
No, only with the two different julia versions. |
then there's no evidence this is fixed. the new implementation could easily still be slow on 32 bit |
Do you have access to a 32 bit machine so you can test it? |
On the same hardware with an x64 build with AVX2 and a generic i686 build, the performance appears to scale similarly on the two builds. The x64 build is faster, but I think that's likely because of the use of FMA. AVX2 on i686 is a pretty weird combination (though supported by the hardware), so I think the difference is OK and this can be closed. |
Oh, and I should say that it's ~2x faster with AVX2. |
Besides being slow on its own, note that exp() on Win32 gets slower for large-valued inputs, whereas on Win64 and Linux it is faster for large-valued inputs. In all cases, r0 = rand(5000, 5000).
Win32:
Win64:
(Note that my Win64 installation is affected by #10249, so it's hard to test there.)
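For anyone wanting to repeat the comparison, the following is a rough sketch in current Julia syntax, not the code used in this issue. The helper name `time_map` and the array size are made up for illustration; the thread itself used the old vectorized `exp(r)` form on a 5000x5000 matrix, which newer Julia versions would write as `exp.(r)`.

```julia
# Hypothetical timing helper: apply f elementwise and return elapsed seconds.
function time_map(f, xs::Vector{Float64})
    out = similar(xs)
    map!(f, out, xs)                  # warm-up run so compilation is not timed
    return @elapsed map!(f, out, xs)  # timed run
end

small = rand(10^6)        # inputs in [0, 1), results are all finite
large = 10000 .* small    # mostly inputs far above the overflow cutoff

t_small = time_map(exp, small)
t_large = time_map(exp, large)
println("small inputs: ", t_small, " s; large inputs: ", t_large, " s")
```

Comparing `t_small` and `t_large` on a 32-bit and a 64-bit build of the same Julia version should show whether the large-input slowdown described here is still present.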