poor performance of exp() on 32 bit #10425

Closed
pao opened this issue Mar 6, 2015 · 22 comments
Labels
performance Must go faster system:32-bit Affects only 32-bit systems

Comments

@pao
Member

pao commented Mar 6, 2015

On Win32, exp() is slow to begin with, and it gets slower still for large-valued inputs. On Win64 and Linux, by contrast, large-valued inputs are faster. In all cases, r0 = rand(5000, 5000).

Win32:

julia> r = r0; @time exp(r);
elapsed time: 1.138791707 seconds (190 MB allocated)

julia> r = 10000*r0; @time exp(r);
elapsed time: 3.855262381 seconds (190 MB allocated)

Win64:

julia> @time exp(r0);
elapsed time: 0.463... seconds # what I read before Julia crashes due to #10259

julia> r=10000*r0; @time exp(r0);
elapsed time: 0.235... seconds # what I read before Julia crashes due to #10259

(Note that my Win64 installation is affected by #10249, so it's hard to test there.)
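
For anyone re-running this comparison on a current Julia, here is a minimal sketch of the same measurement. It assumes Julia 1.x, where the elementwise form is written `exp.(r)` (on a square matrix, `exp(r)` is now the matrix exponential). The helper name `time_exp` and the smaller 1000x1000 size are choices made for this sketch, not part of the original report.

```julia
# Elementwise exp benchmark, assuming Julia 1.x broadcasting syntax.
r0 = rand(1000, 1000)   # smaller than the 5000x5000 in the report, to keep the run short

# Hypothetical helper: run once to warm up compilation, then time a second run.
time_exp(r) = (exp.(r); @elapsed exp.(r))

t_small = time_exp(r0)            # inputs in [0, 1)
t_large = time_exp(10000 .* r0)   # inputs in [0, 10000)

println("small inputs: ", t_small, " s")
println("large inputs: ", t_large, " s")
```

Absolute timings will differ from the numbers above; the point of interest is the ratio between the two runs on a 32-bit versus a 64-bit build.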

@tkelman
Contributor

tkelman commented Mar 7, 2015

I can confirm very similar numbers on both Windows and Linux. This is probably an openlibm issue on 32 bit, wouldn't be the first one.

Linux numbers:

julia> versioninfo()
Julia Version 0.4.0-dev+3666
Commit 400fa31* (2015-03-03 22:51 UTC)
Platform Info:
  System: Linux (i686-linux-gnu)
  CPU: Intel(R) Core(TM)2 Duo CPU     E8400  @ 3.00GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Penryn)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> r0 = rand(5000,5000);

julia> r=r0; @time exp(r);
elapsed time: 1.534604899 seconds (190 MB allocated)

julia> r=10000*r0; @time exp(r);
elapsed time: 3.664688047 seconds (190 MB allocated)

julia> exit()
tkelman@ygdesk:~/Julia/julia-linux32$ cd ../julia
tkelman@ygdesk:~/Julia/julia$ ./julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+3690 (2015-03-06 17:50 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 753390b* (0 days old master)
|__/                   |  x86_64-linux-gnu

julia> r0 = rand(5000,5000);

julia> r=r0; @time exp(r);
elapsed time: 0.952396283 seconds (191 MB allocated)

julia> r=10000*r0; @time exp(r);
elapsed time: 0.350571556 seconds (190 MB allocated)

@tkelman tkelman removed the system:windows Affects only Windows label Mar 7, 2015
@tkelman tkelman changed the title poor performance of exp() on Win32 poor performance of exp() on 32 bit Mar 7, 2015
@nalimilan
Member

Comparing with the system libm would be interesting.

@tkelman
Contributor

tkelman commented Mar 8, 2015

I think that'll need someone with a real 32 bit Linux system to try out. I don't think system libm is usable on Windows (not even positive where it lives - inside msvcrt I think?), and my 32 bit Linux builds are multilib compiled from a 64 bit OS.

@rickhg12hs
Contributor

Machine is old & slow, but here you go ...

$ ./julia 
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" for help.
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.4.0-dev+3727 (2015-03-08 23:04 UTC)
 _/ |\__'_|_|_|\__'_|  |  Commit 768401c* (0 days old master)
|__/                   |  i686-redhat-linux

julia> versioninfo()
Julia Version 0.4.0-dev+3727
Commit 768401c* (2015-03-08 23:04 UTC)
Platform Info:
  System: Linux (i686-redhat-linux)
  CPU: Genuine Intel(R) CPU           T2250  @ 1.73GHz
  WORD_SIZE: 32
  BLAS: libopenblas (DYNAMIC_ARCH NO_AFFINITY Banias)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

julia> mysysexp(x::Float64) = ccall((:exp, "libm"), Float64, (Float64,), x)
mysysexp (generic function with 1 method)

julia> @vectorize_1arg Float64 mysysexp
mysysexp (generic function with 4 methods)

julia> r0 = rand(Float64,(5000, 5000));

julia> r = r0; @time exp(r);
elapsed time: 2.727944529 seconds (190 MB allocated)

julia> r = r0; @time exp(r);
elapsed time: 2.676581093 seconds (190 MB allocated)

julia> r = r0; @time mysysexp(r);
elapsed time: 3.142749421 seconds (190 MB allocated)

julia> r = r0; @time mysysexp(r);
elapsed time: 3.012173802 seconds (190 MB allocated)
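
The `@vectorize_1arg` macro used above was removed in later Julia versions; dot broadcasting replaces it. A rough sketch of the same comparison on Julia 1.x follows. It assumes `Base.libm_name` names the openlibm bundled with Julia, so as written it compares Julia's `exp` against openlibm rather than the system libm; the names `LIBM_NAME` and `libm_exp` are made up for this sketch.

```julia
# Assumption: Base.libm_name is the openlibm shipped with Julia. For a true
# system-libm comparison, swap in the platform's library name (for example
# "libm.so.6" on glibc-based Linux).
const LIBM_NAME = Base.libm_name

# Scalar wrapper around the C library's exp.
libm_exp(x::Float64) = ccall((:exp, LIBM_NAME), Float64, (Float64,), x)

r0 = rand(1000, 1000)

# Dot syntax broadcasts the scalar function over the array.
libm_exp.(r0); exp.(r0)             # warm up compilation
t_julia = @elapsed exp.(r0)
t_libm  = @elapsed libm_exp.(r0)
println("Julia exp: ", t_julia, " s, libm exp: ", t_libm, " s")
```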

@tkelman
Contributor

tkelman commented Mar 9, 2015

Interesting, thanks. What about for the large values?

@rickhg12hs
Contributor

Oh yeah, ...

julia> r = 10000 * r0; @time exp(r);
elapsed time: 6.087031901 seconds (190 MB allocated)

julia> r = 10000 * r0; @time exp(r);
elapsed time: 6.066508487 seconds (190 MB allocated)

julia> r = 10000 * r0; @time mysysexp(r);
elapsed time: 10.517670321 seconds (190 MB allocated)

julia> r = 10000 * r0; @time mysysexp(r);
elapsed time: 10.469830777 seconds (190 MB allocated)

@tkelman
Contributor

tkelman commented Mar 9, 2015

Thanks! I don't know much about how different libm implementations handle exp internally (paging @simonbyrne?), but the timing trends for openlibm look consistent with glibc, and actually a little better. Something is allowing the 64 bit version to be quite a bit faster and to show the opposite timing trend versus input values. Different allowed instruction sets, I suppose?

@ViralBShah
Member

I don't feel terribly worried about 32-bit. Is there a particular application motivating this - or just an observation, in case we can do something better?

@tkelman
Contributor

tkelman commented Mar 9, 2015

Probably because win64 Julia is completely broken for @pao at this time - #10249

Can we at least look into this? There are still 32-bit bugs in openlibm.

@ViralBShah
Member

We surely should. I was just curious. Should we move this issue to openlibm?

@pao
Member Author

pao commented Mar 9, 2015

Sounds like that's the next step on its journey. (FWIW, I just moved the program over to a 64-bit Linux machine, since I have the luxury of SSH, but unfortunately, for Reasons, Windows is definitely more convenient.)

@simonbyrne
Contributor

One point to keep in mind is that Float64 arguments greater than 710 will overflow, so this test is mostly just detecting how fast that branch occurs.
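
To make the overflow point concrete, here is a small sketch (assuming Julia 1.x, where `floatmax` is available). The variable names are illustrative only.

```julia
# Largest Float64 argument for which exp stays finite.
threshold = log(floatmax(Float64))   # roughly 709.78

finite_below = isfinite(exp(709.0))  # still representable
overflowed   = exp(710.0)            # Inf

# Entries of 10000 * rand(...) are uniform on [0, 10000), so the expected
# share that overflows is about 1 - threshold / 10000.
overflow_share = 1 - threshold / 10000

println("threshold = ", threshold, ", overflow share = ", overflow_share)
```

So in the `10000*r0` runs, roughly nine out of ten entries take the overflow path.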

One thing I don't understand with openlibm is what determines whether src/e_exp.c or i387/e_exp.S is used?

@ViralBShah
Member

Both are getting linked.

@pao
Member Author

pao commented Mar 9, 2015

> One point to keep in mind is that Float64 arguments greater than 710 will overflow, so this test is mostly just detecting how fast that branch occurs.

Going back to the original issue, where we are computing PDFs, the same is true for the underflow branch (where the exponential evaluates to floating-point 0).
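
A sketch of that underflow case, using an unnormalized Gaussian kernel as a stand-in for the PDF computation mentioned here (the actual PDFs in the original program are not shown in this thread, and `gauss_kernel` is a name invented for this example):

```julia
# Unnormalized Gaussian kernel, as it would appear inside a normal PDF.
gauss_kernel(x) = exp(-x^2 / 2)

near_mode = gauss_kernel(1.0)    # ordinary nonzero value
far_tail  = gauss_kernel(40.0)   # exp(-800.0) underflows to 0.0

# Log of the smallest positive (subnormal) Float64; arguments well below
# this value round to zero.
underflow_threshold = log(nextfloat(0.0))   # roughly -744.44

println("far tail = ", far_tail, ", underflow threshold = ", underflow_threshold)
```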

@simonbyrne
Contributor

Having looked again, my guess is that on i686 we're calling the x87 assembly code, which doesn't have an early branch for underflow or overflow.

@KristofferC
Member

We are using a native Julia exp function now. Benchmarking shows no difference in speed on win32 and win64. We have extensive benchmarks for exp already. Comparing benchmarks between 32 and 64 bit seems like a different issue.

@tkelman
Contributor

tkelman commented Jan 26, 2017

was there any benchmarking on 32 bit?

@KristofferC
Member

No, only with the two different julia versions.

@tkelman
Contributor

tkelman commented Jan 26, 2017

then there's no evidence this is fixed. the new implementation could easily still be slow on 32 bit

@tkelman tkelman reopened this Jan 26, 2017
@KristofferC
Member

Do you have access to a 32 bit machine so you can test it?

@yuyichao
Contributor

On the same hardware, with an x64 build with AVX2 and a generic i686 build, the performance appears to scale similarly on the two builds. The x64 build is faster, but I think that's likely because of the use of fma. AVX2 on i686 is a pretty weird combination (though supported by the hardware), so I think the difference is OK and this can be closed.

@yuyichao
Contributor

Oh, and I should say that it's ~2x faster with AVX2.
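
For readers unfamiliar with the fma mentioned above: a fused multiply-add computes `x*y + z` with a single rounding, and on hardware that supports it this is one instruction instead of two. The sketch below only illustrates the operation in Julia; it is not taken from Julia's exp implementation, and the variable names are made up.

```julia
# Separate multiply and subtract: 0.1 * 10.0 rounds to exactly 1.0 first,
# so the rounding error of 0.1 is lost.
unfused = 0.1 * 10.0 - 1.0

# Fused multiply-add: the product is kept exact until the final rounding,
# so the representation error of 0.1 survives.
fused = fma(0.1, 10.0, -1.0)

# muladd lets the compiler pick whichever form is faster on the target CPU.
either = muladd(0.1, 10.0, -1.0)

println("unfused = ", unfused, ", fused = ", fused, ", muladd = ", either)
```

A generic i686 build cannot assume fma instructions, which is consistent with the explanation that the x64 AVX2 build gains its speed there.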

8 participants