Crash when compiling with ACFL and '-O3 -mcpu=native' flags #28
Comments
Thanks Antoine. We'd need to dig into this to understand it more, but right away I can say that Fiat isn't the problem here. The problem is that the iterative algorithm used to compute the points and weights required for Gaussian quadrature in the Legendre transform failed to converge. That's what This could be annoying to debug because it means some arithmetic error is happening elsewhere. My guess is that something is going wrong in this file with those compile options: https://github.com/ecmwf-ifs/ectrans/blob/main/src/trans/internal/cpledn_mod.F90. Are you able to compile with floating-point exception trapping enabled? E.g., with the Cray compiler this is off by default, and so FPEs can manifest as other kinds of errors. You have to add
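For readers unfamiliar with this kind of failure, here is a minimal sketch of a Newton iteration for Gauss-Legendre points and weights. It is not the ecTrans code in cpledn_mod.F90: the function names, the tolerance, and the iteration cap are all assumptions chosen for illustration. It shows why a bad arithmetic result (for example a NaN from miscompiled floating-point code) would surface as a "failed to converge" error rather than as an obvious crash at the faulty instruction.

```python
import math


def legendre_and_derivative(n, x):
    """Return (P_n(x), P_n'(x)) using the three-term recurrence (hypothetical helper)."""
    if n == 0:
        return 1.0, 0.0
    p_prev, p = 1.0, x  # P_0 and P_1
    for k in range(2, n + 1):
        p_prev, p = p, ((2 * k - 1) * x * p - (k - 1) * p_prev) / k
    # Standard identity, valid for |x| < 1 (all roots are interior).
    dp = n * (x * p - p_prev) / (x * x - 1.0)
    return p, dp


def gauss_legendre(n, tol=1e-14, max_iter=20):
    """Compute n Gauss-Legendre nodes and weights by Newton iteration.

    tol and max_iter are illustrative values, not those used by ecTrans.
    """
    nodes, weights = [], []
    for i in range(1, n + 1):
        # Classical initial guess for the i-th root of P_n.
        x = math.cos(math.pi * (i - 0.25) / (n + 0.5))
        for _ in range(max_iter):
            p, dp = legendre_and_derivative(n, x)
            dx = p / dp
            x -= dx
            # A NaN never satisfies this test, so corrupted arithmetic
            # falls through to the non-convergence error below.
            if abs(dx) <= tol:
                break
        else:
            raise ArithmeticError(f"Newton iteration did not converge for root {i}")
        _, dp = legendre_and_derivative(n, x)
        nodes.append(x)
        weights.append(2.0 / ((1.0 - x * x) * dp * dp))
    return nodes, weights


if __name__ == "__main__":
    xs, ws = gauss_legendre(5)
    print(sum(ws))  # the weights should sum to approximately 2
```

With correct arithmetic the loop converges in a handful of steps, so hitting the iteration cap usually points to an upstream numerical problem rather than to the quadrature routine itself.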
I suppose you are mentioning the I just launched a run forcing this option, and the output does not change:
Right, so that's not the problem. Firstly, it's no surprise that you see this problem with both double and single precision, as this part of the code is always double precision regardless of the working precision. Secondly, I can suggest something to make the code run, but it won't necessarily be doing the right thing. Can you change this value to If that doesn't work, is it a problem if you just have to disable
here is the log:
Obviously it's wrong with max error at -0.999E+03 =)
You can reproduce on the A64FX; you just need to set up ACFL 23.04.1. It's free to set up and use nowadays. I'll dump the arrays if I find some time :)
Hello, FYI, I tried with the latest release of ACFL (24.04). I could not reproduce the issue with that version of the ARM compiler. I'll let you try it out and close the issue if this works for you. Best regards.
Hi Antoine, at the moment we do not have access to any machines with the ARM compiler (or at least, I don't personally). So we don't currently have a way to test this. I wish I had a Raspberry Pi right now... I think for now we can just close this issue.
If it's the ARM compiler, it's free, and you can install it in your $HOME. If it's an ARM CPU, that's another matter :) If you trust me enough, you can certainly close the issue ;)
@samhatfield
Hello,
I am playing with ecTrans on the Graviton3 system. Compiling with ACFL (Arm Compiler for Linux = armclang/armflang) led the app to crash when using some performance flags. I confirmed that the issue also occurs on other systems with SVE, but not on systems without it. Below is a table summarizing my experiments.
The hardware I tested on:
The software stack consists of:
And the command run is

mpiexec -n 1 ./ectrans-benchmark-dp --meminfo --norms -n 20 -f 5 -l 40 --vordiv

Note that similar behavior occurs with single precision.

As we can see, the non-SVE AmpereQ8030 system seems unaffected by this issue, whereas both SVE systems exhibit similar behavior. We can also observe that removing the -mcpu=native flag leads to a successful run.

Typical output when crashing looks like this (here from a run on Graviton3 using the double-precision benchmark):
The build uses the following parameters (excerpt from this full script: https://gist.github.com/antoine-morvan/611c4d779fd704279bb0b938598fb597):
Then this benchmark causes the run to fail:
Looking at the backtrace it feels like the problem originates from fiat, but I did not investigate further.
Also, ARM is aware of this issue.
Feel free to ask more details.
Best.