Program received signal SIGSEGV: Segmentation fault - invalid memory reference. #9

Closed
asterismo opened this issue May 10, 2018 · 33 comments

Comments

@asterismo

I changed NMAX to 20000 objects and recompiled. When I ran the integrator, I got a segfault.

I pasted the output in the Los Molinos Observatory pastebin:

https://pastebin.oalm.gub.uy/view/94773e01

The error is at the bottom of the paste.

@texadactyl
Contributor

texadactyl commented May 10, 2018

I cannot reproduce the reported segmentation fault. However, I have fixed some bugs and got rid of the warnings. You can find my work under the issue "donationware".

I am using Xubuntu 17.10 on a Celeron 1.8 GHz motherboard.

@asterismo
Author

I'm using Debian 8 Jessie on a 1st Gen Core i5 (2.4 GHz) in my personal laptop, and I also tried on an Intel Xeon CPU E3-1225 v3 @ 3.20GHz. It throws the same error on both machines. I will try your code, thanks!

@texadactyl
Contributor

texadactyl commented May 14, 2018

I discovered character-handling anomalies (not in the Science code). I had different symptoms with NMAX=20,000. Cannot claim that I found them all.

@asterismo
Author

I got the same error:

./mercury6
Integrating massive bodies and particles up to the same epoch.
Beginning the main integration.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7F6CA8F86407
#1 0x7F6CA8F86A1E
#2 0x7F6CA84A30DF
#3 0x414EF4 in mco_x2ov_
#4 0x415775 in mio_ce_
#5 0x419C89 in mal_hvar_
#6 0x424DAD in MAIN__ at mercury6_2.for:?
Segmentation fault

@texadactyl
Contributor

Could you try adding a -g parameter to FFLAGS in the Makefile to see if we could get line numbers in the traceback?

FFLAGS=-g -O2 -Wline-truncation -Wsurprising -Werror

@texadactyl
Contributor

Are you using the exec_all.sh script that I created after the make step? See file 20180507_texadactyl.txt.

@texadactyl
Contributor

My last results (should still be in the log folder):

===== Thu May 10 13:04:57 CDT 2018 ==============================================================
Begin mercury6 (basic integration) .....

Integrating massive bodies and particles up to the same epoch.
Beginning the main integration.
Date: 2119 8 6.8 dE/E: -5.74125E-13 dL/L: -3.01314E-13
Date: 2229 2 12.0 dE/E: -3.80231E-13 dL/L: -2.23733E-13
Date: 2338 8 24.2 dE/E: -2.39739E-13 dL/L: -2.57743E-13
Date: 2448 3 1.5 dE/E: -5.57036E-13 dL/L: -4.18236E-13
Date: 2557 9 10.4 dE/E: -1.39440E-12 dL/L: -8.00990E-13
Date: 2667 3 22.5 dE/E: -1.55674E-12 dL/L: -9.02654E-13
Date: 2776 9 27.8 dE/E: -2.84598E-13 dL/L: -4.58864E-13
Date: 2886 4 6.7 dE/E: -3.24691E-13 dL/L: -5.87001E-13
Date: 2995 10 18.5 dE/E: 1.11851E-12 dL/L: -1.20047E-13
Date: 3105 4 26.1 dE/E: 1.33508E-12 dL/L: -2.04062E-13
Date: 3214 11 1.1 dE/E: 2.18526E-12 dL/L: 9.32068E-14
Date: 3324 5 8.6 dE/E: 2.61544E-12 dL/L: 2.91202E-13
Date: 3433 11 17.8 dE/E: 2.35533E-12 dL/L: 1.08282E-13
Date: 3543 5 27.1 dE/E: 1.56611E-12 dL/L: -2.68590E-13
Date: 3652 12 2.6 dE/E: 2.16965E-12 dL/L: -1.24092E-13
Date: 3762 6 10.9 dE/E: 1.91726E-12 dL/L: -2.17666E-13
Date: 3871 12 24.0 dE/E: 2.18246E-12 dL/L: -3.00946E-13
etc.

@texadactyl
Contributor

texadactyl commented May 14, 2018

I set NMAX back to 20000, then:

elkins@biostar:~/projects/mercury-master$ simple_clean.sh
elkins@biostar:~/projects/mercury-master$ make clean all
rm -f ./bin/element6 ./bin/close6 ./bin/mercury6 ./src/flog_*.log
gfortran -g -O2 -Wline-truncation -Wsurprising -Werror -o ./bin/close6 ./src/close6.for 2>&1 | tee ./src/flog_close6.log
gfortran -g -O2 -Wline-truncation -Wsurprising -Werror -o ./bin/element6 ./src/element6.for 2>&1 | tee ./src/flog_element6.log
gfortran -g -O2 -Wline-truncation -Wsurprising -Werror -o ./bin/mercury6 ./src/mercury6_2.for 2>&1 | tee ./src/flog_mercury6.log
elkins@biostar:~/projects/mercury-master$ exec_all.sh

===== Mon May 14 10:02:18 CDT 2018 ==============================================================
Begin mercury6 (basic integration) .....

Integrating massive bodies and particles up to the same epoch.
Beginning the main integration.
Date: 2119 8 6.8 dE/E: -5.74125E-13 dL/L: -3.01314E-13
Date: 2229 2 12.0 dE/E: -3.80231E-13 dL/L: -2.23733E-13
Date: 2338 8 24.2 dE/E: -2.39739E-13 dL/L: -2.57743E-13
Date: 2448 3 1.5 dE/E: -5.57036E-13 dL/L: -4.18236E-13
Date: 2557 9 10.4 dE/E: -1.39440E-12 dL/L: -8.00990E-13
etc.

@asterismo
Author

Nope, I spotted the issue. I also had to edit the CMAX parameter and recompile.

This is the info.out. Now it is running fine:

tail -f info.out
Integration details
-------------------

Initial energy: -3.32264E-08 solar masses AU^2 day^-2
Initial angular momentum: 6.08216E-05 solar masses AU^2 day^-1

Integrating massive bodies and particles up to the same epoch.

Beginning the main integration.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

@texadactyl
Contributor

When you had the crash, what were the parameters set to? Cut and paste or just tell me. I'll try to see if there is a relatively simple fix and put in some diagnostics. Crashing is dumb.

@texadactyl
Contributor

Actually, that "warning" should be a termination message. When it appears, various vector and array initializations are bypassed, leaving random values. See the if-statement at line 1663 in mercury6_2.for:
if (nclo.gt.CMAX) then

@asterismo
Author

param.in
param.in.txt

@asterismo
Author

small.in
small.in.txt

@texadactyl
Contributor

So, you modified param.in and small.in in the exec subfolder.
How about the src subfolder before you compiled? mercury.inc (NMAX=20000, right?)

@asterismo
Author

Steps to reproduce the initial problem.

  1. Clone and compile.
  2. Execute; expect a crash and a warning about NMAX telling you to recompile.
  3. Change NMAX from 2000 to 20000 and recompile; it crashes with "...received signal SIGSEGV: Segmentation fault - invalid memory reference."
  4. Change CMAX (close encounters) from 50 to 5000 and recompile.
    Then it works.

@texadactyl
Contributor

Compiled with -g option? Did you see a traceback with line numbers this time?

@asterismo
Author

I executed ./compile from the git repo and got plenty of warnings.
Now I executed make in your personal mercury-master, and it compiled without warnings.
I'm using Debian 8 Jessie

@texadactyl
Contributor

Okay, with your small.in and param.in, I have mercury6 in a loop.
top shows mercury6 eating nearly 100% of a CPU core.
I was waiting to see if it would crash finally.
It seems to be writing to info.out over & over again:

WARNING: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

Again, this is not a "WARNING" situation, in my opinion. This should be a fatal error.

@asterismo
Author

Yes, with your version I'm still waiting for it to start... the full-of-warnings git version seems to do the trick.

@texadactyl
Contributor

I suspect that this old Fortran IV/77 code has never really been fully diagnosed. Ideally, it will someday be converted to Python and use the numpy libraries for vector and matrix calls without hand-coding.

But, in the meantime, Fortran diagnostic code can be added.

@texadactyl
Contributor

Finally, a traceback with line numbers!

Begin mercury6 (basic integration) .....

Integrating massive bodies and particles up to the same epoch.
Beginning the main integration.

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0 0x7fb31e67d16a in ???
#1 0x7fb31e67c393 in ???
#2 0x7fb31dd4c13f in ???
#3 0x56300c09836c in mco_x2ov_
at ./src/mercury6_2.for:2878
#4 0x56300c098b87 in mio_ce_
at ./src/mercury6_2.for:5155
#5 0x56300c09d054 in mal_hvar_
at ./src/mercury6_2.for:401
#6 0x56300c0a6f3e in MAIN__
at ./src/mercury6_2.for:170
#7 0x56300c085ffe in main
at ./src/mercury6_2.for:217
exec_all.subsh: line 11: 15109 Segmentation fault (core dumped) $BIN/mercury6

@texadactyl
Contributor

texadactyl commented May 14, 2018

Memory is trashed. Doesn't matter which version is used.

The crash is caused by calculations in subroutine mco_x2ov, line 2878:
v2 = u*u + v*v + w*w

Variables u, v, and w are parameters passed in. E.g. ending on line 5155 in the middle of a do-loop from 1 to nclo in subroutine mio_ce:
call mco_x2ov (rcen,rmax,m(1),0.d0,jxvclo(1,k),jxvclo(2,k),
% jxvclo(3,k),jxvclo(4,k),jxvclo(5,k),jxvclo(6,k),fr,theta,phi,
% fv,vtheta,vphi)

but k exceeds CMAX!

See subroutine mce_stat starting at line 1662.
nclo keeps getting incremented regardless of the array bounds.

See local data arrays of dimension CMAX starting at line 296. nclo, as used all over the place, is passed in as an array-element counter. nclo must never exceed CMAX.

Failure to control nclo causes the array bounds to be exceeded upon reference.
Therefore, illegal memory reference.
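
To illustrate the failure mode, here is a minimal free-form Fortran sketch. It is not Mercury's code: only the names cmax, nclo and dclo echo the thread, and the sizes (5 slots, 8 detected encounters) are made up for the demonstration. It shows a counter that keeps growing while stores are skipped, and why the counter has to be clamped before it is used as a loop bound.

```fortran
! Hypothetical sketch, not taken from mercury6_2.for.
program cmax_guard_demo
  implicit none
  integer, parameter :: cmax = 5        ! capacity of the encounter array (assumed)
  integer, parameter :: nfound = 8      ! encounters "detected" this step (assumed)
  double precision :: dclo(cmax)        ! one slot per stored encounter
  integer :: nclo, nstored, k

  nclo = 0
  dclo = 0.0d0
  do k = 1, nfound
    nclo = nclo + 1                          ! counter grows without limit
    if (nclo <= cmax) dclo(nclo) = dble(k)   ! store only while a slot exists
  end do

  ! A loop "do k = 1, nclo" here would touch dclo(6) to dclo(8),
  ! which do not exist: that is the out-of-bounds reference.
  nstored = min(nclo, cmax)                  ! clamp before using nclo as a bound

  print '(a,i0,a,i0)', 'counted ', nclo, ', stored ', nstored
  do k = 1, nstored
    print '(a,i0,a,f4.1)', 'dclo(', k, ') = ', dclo(k)
  end do
end program cmax_guard_demo
```

Clamping is only one way to keep the loop legal; it silently drops the extra encounters, so it trades the crash for lost data.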

@texadactyl
Contributor

Suggested solution: Once CMAX is breached in line 1663, put out an ERROR message and exit to the O/S.

Make sense to you?
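
A minimal sketch of that kind of hard stop, in free-form Fortran. The subroutine name check_cmax and the values 50 and 51 are hypothetical, and the message goes to the console only; writing it to info.out as well would need the unit number that Mercury uses, which is not shown here.

```fortran
! Hypothetical sketch, not the actual patch.
program fatal_cmax_demo
  implicit none
  integer, parameter :: cmax = 50       ! assumed limit for the demonstration
  integer :: nclo

  nclo = 50
  call check_cmax(nclo, cmax)           ! within bounds: returns normally
  print '(a)', 'nclo within CMAX, continuing'

  nclo = 51
  call check_cmax(nclo, cmax)           ! over the limit: prints the error and stops
  print '(a)', 'never reached'

contains

  ! Stop the program as soon as the encounter counter exceeds the limit.
  subroutine check_cmax(n, limit)
    integer, intent(in) :: n, limit
    if (n > limit) then
      write (*, '(a)') ' ERROR: Total number of current close encounters exceeds CMAX.'
      write (*, '(a)') ' Modify mercury.inc and recompile Mercury.'
      stop 1                            ! non-zero stop code signals failure
    end if
  end subroutine check_cmax

end program fatal_cmax_demo
```

With gfortran the stop code is normally passed back as the process exit status, so a wrapper script can detect the failure instead of waiting on a looping process.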

@asterismo
Author

I'm afraid that is too technical for my knowledge, but if you say so... go ahead.

@texadactyl
Contributor

Sorry, I wasn't trying to be obtuse. You are too modest!

@texadactyl
Contributor

OK, the new code, with your small.in and param.in, displays this and exits peacefully:

Begin mercury6 (basic integration) .....

Integrating massive bodies and particles up to the same epoch.
Beginning the main integration.

ERROR: Total number of current close encounters exceeds CMAX.
Modify mercury.inc and recompile Mercury.

That error message also appears in info.out. I just copied it to the console to wake up the sleeping scientist.

@asterismo
Author

Is that valid advice? If I keep increasing the CMAX value, will it eventually run?

@texadactyl
Contributor

That depends on how many close encounters your small.in population causes. Your case had a lot of asteroids (4667?) compared to the default, which was 2. On the other hand, CMAX cannot be arbitrarily large or it won't fit into RAM. This code needs to give some better advice.

I have not read all of John Chambers' code.
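
For a rough feel of the RAM side, here is a minimal free-form Fortran sketch. It assumes one array shaped like jxvclo(6,CMAX) holding 8-byte reals; that shape is inferred from the call quoted earlier in this thread, not read from mercury.inc, and the three CMAX values are just examples. The code has several CMAX-sized arrays, so this is a lower bound per array, not a total.

```fortran
! Hypothetical estimate, not derived from mercury.inc.
program cmax_memory_estimate
  implicit none
  integer, parameter :: i8 = selected_int_kind(18)   ! 64-bit integers
  integer(i8), parameter :: bytes_per_real8 = 8_i8   ! assumed element size
  integer(i8), parameter :: rows = 6_i8              ! x, y, z, vx, vy, vz (assumed)
  integer(i8) :: cmax_values(3), bytes
  integer :: i

  cmax_values = [50_i8, 5000_i8, 5000000_i8]         ! example limits only
  do i = 1, size(cmax_values)
    bytes = rows * cmax_values(i) * bytes_per_real8  ! size of one 6 x CMAX array
    print '(a,i0,a,i0,a)', 'CMAX = ', cmax_values(i), ': ', bytes, ' bytes per array'
  end do
end program cmax_memory_estimate
```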

@smirik
Owner

smirik commented May 14, 2018

As far as I know, mercury6 was initially designed to work with no more than 1000 objects. If you increase the number of objects, it might cause segfault errors.

Please also take into account that if you increase the number of objects, you usually decrease the accuracy of the calculations.

So, I would like to advise:

  1. Decrease the number of asteroids to 1000 or fewer.
  2. Test the software with different numbers of bodies and compare the results.

@smirik
Owner

smirik commented May 16, 2018

Fix was applied in #11.

@smirik smirik closed this as completed May 16, 2018
@asterismo
Author

Thanks, I will take that into account.

@texadactyl
Contributor

@asterismo, did you solve your issue by a combination of decreasing the number of objects and/or increasing CMAX?

@Aneeskhan673

properly specified in dipoleconfig.f
Dipole configurations for this process not
properly specified in dipoleconfig.f
Dipole configurations for this process not
properly specified in dipoleconfig.f

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
free(): corrupted unsorted chunks
Dipole configurations for this process not
properly specified in dipoleconfig.f

Program received signal SIGABRT: Process abort signal.
Thanks for using LHAPDF 6.2.3. Please make sure to cite the paper:
Eur.Phys.J. C75 (2015) 3, 132 (http://arxiv.org/abs/1412.7420)

Backtrace for this error:
#0 0x7f7803a8ed21 in ???
#1 0x7f7803a8def5 in ???
#0 0x7f7803a8ed21 in ???
#1 0x7f7803a8def5 in ???
#2 0x7f780375620f in ???
#2 0x7f780375620f in ???
#3 0x7f780375618b in ???
#3 0x7f78037599f6 in ???
#4 0x7f7803735858 in ???
#4 0x7f7803759bdf in ???
#5 0x7f78037a03ed in ???
#5 0x7f7803a90f14 in ???
#6 0x7f78037a847b in ???
#6 0x559f7b0087e3 in ???
#7 0x559f7b008e66 in ???
#7 0x7f78037aa1c1 in ???
#8 0x559f7b0fba15 in ???
#8 0x559f802784f0 in ???
#9 0x559f7af809ab in ???
#9 0x7f780375a43e in ???
#10 0x7f7803f7c78d in ???
#10 0x7f7803759b8c in ???
#11 0x7f7803759bdf in ???
#12 0x7f7803a90f14 in ???
#13 0x559f7b0087e3 in ???
#14 0x559f7b008e66 in ???
#15 0x559f7b0fba15 in ???
#16 0x559f7af809ab in ???
#17 0x7f7803f7c78d in ???
#18 0x7f78036f0608 in start_thread
#11 0x7f78036f0608 in start_thread
at /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_create.c:477
at /build/glibc-eX1tMB/glibc-2.31/nptl/pthread_create.c:477
#19 0x7f7803832292 in ???
#12 0x7f7803832292 in ???
#20 0xffffffffffffffff in ???
#13 0xffffffffffffffff in ???
Aborted (core dumped)
Why did this error occur?

4 participants