Try to debug DGO issue with Int64 #292

Merged
merged 2 commits into ralna:master from ci-debug-int64
Jun 7, 2024

amontoison
Member

@amontoison amontoison commented Jun 7, 2024

 C sparse matrix indexing

 tests options for all-in-one storage format


 DGO solver, problem:  (n = 3)

At line 416 of file ../src/hash/hash.F90
Fortran runtime error: Unit number in I/O statement too large

@amontoison amontoison merged commit bb9ed5f into ralna:master Jun 7, 2024
11 of 21 checks passed
@amontoison amontoison deleted the ci-debug-int64 branch June 7, 2024 23:50
@amontoison
Member Author

amontoison commented Jun 7, 2024

@nimgould I was able to connect to the virtual machine and reproduce the error with dgo.

You just need to uncomment the following lines here, and then you can click on the yellow build of macos-13/gcc-v11/Int64.
After the compilation, you will see an ssh command for connecting to the CI machine, printed every 5 seconds.

@nimgould
Contributor

nimgould commented Jun 8, 2024

OK, I am on ... but now what? How do I test an individual package? Am I supposed to use a meson command? I don't know what it is or they are ... I'll need to keep editing files, recompiling them and then running the dgo test. Sorry, I need help to proceed. Oh, and I see that the shell has no emacs, so I'll be pretty helpless.

@nimgould
Contributor

nimgould commented Jun 8, 2024

Sorry, just read the README, now I see how to do this. Still no usable editor, though. And the issue on macOS is to do with ssids, not dgo. Indeed, none of the failures are now for dgo. Sheesh, this action system is so maddening!

I've now re-commented the ssh workflow out.

@nimgould
Contributor

nimgould commented Jun 8, 2024

I am trying to see what is going wrong with nvfortran. I tried this locally:

CC=nvc CXX=nvc++ FC=nvfortran meson setup builddir/pc64.lnx.nvf_64 -Dc_std=none -Dcpp_std=none -Dgalahad_int64=true
meson compile -C builddir/pc64.lnx.nvf_64

... which is ok until

[519/1348] Compiling Fortran object li...on-generated_single_cutest_dummy.f90.o
FAILED: libgalahad_single_64.so.p/meson-generated_single_cutest_dummy.f90.o libgalahad_single_64.so.p/galahad_cutest_single_64.mod
nvfortran -Ilibgalahad_single_64.so.p -I. -I../.. -Iinclude -I../../include -I../../src/dum/include -I../../src/metis/include -Isrc/ampl -I../../src/ampl -I/usr/lib/x86_64-linux-gnu/openmpi/include -I/usr/lib/x86_64-linux-gnu/openmpi/include/openmpi -I/usr/lib/x86_64-linux-gnu/openmpi/lib -O3 -mp -fPIC -DSINGLE -DDUMMY_MKL_PARDISO -DDUMMY_PARDISO -DDUMMY_PASTIXF -DDUMMY_SPMF -DDUMMY_WSMP -DDUMMY_HSL -DINTEGER_64 -module libgalahad_single_64.so.p -o libgalahad_single_64.so.p/meson-generated_single_cutest_dummy.f90.o -c libgalahad_single_64.so.p/single_cutest_dummy.f90
NVFORTRAN-S-0034-Syntax error at or near / (libgalahad_single_64.so.p/single_cutest_dummy.f90: 644)
NVFORTRAN-S-0034-Syntax error at or near / (libgalahad_single_64.so.p/single_cutest_dummy.f90: 646)
....etc

On examining the generated libgalahad_single_64.so.p/single_cutest_dummy.f90 file, I see that from line 644 onwards it has inserted the cutest_routines.h header file verbatim, i.e.,
/* \file cutest_routines.h */

/*
 * assign names for each CUTEst routine using the C pre-processor.
 * possibilities are (currently) single (r4 and double (r8, default) reals
 * Nick Gould for CUTEst
 * initial version, 2023-11-11
 * this version 2024-04-05
 */

/*
 * assign names for each single precision CUTEst routine using
 * the C pre-processor
 * Nick Gould for CUTEst
 * initial version, 2023-11-11
 * this version 2024-01-16
 */

Poor old Fortran can make no sense of this, and it doesn't happen with other compilers (they leave the cpp header files alone).

Any ideas?

@nimgould
Contributor

nimgould commented Jun 8, 2024

I commented out the nvidia tests, as the compiler clearly has issues and isn't ready for proper deployment; it was unable to resolve generic interfaces in many places (all the other compilers had no issues).

@amontoison
Member Author

Sorry, just read the README, now I see how to do this. Still no usable editor, though. And the issue on macOS is to do with ssids, not dgo. Indeed, none of the failures are now for dgo. Sheesh, this action system is so maddening!

I've now re-commented the ssh workflow out.

I think the best solution is to add several print statements to isolate the issue.
But it can wait until next week... :)

@amontoison
Member Author

amontoison commented Jun 8, 2024

@nimgould I wonder whether the issue is simply that the WRITE statement in Fortran is platform-dependent. I suspect that the unit number (channel) can be a 4-byte or 8-byte integer only on Linux, while other platforms require a 4-byte integer.
That could explain why we have an error at line 416 of hash.F90 (control%out is an 8-byte integer).
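
To make the size question concrete, here is a minimal C sketch (an illustration only, not GALAHAD code) of the range check that a message such as "Unit number in I/O statement too large" suggests: a 64-bit value is tested against the range of a 4-byte signed integer. The helper name `fits_in_int32` is hypothetical, and the assumption that the runtime compares the unit against the 4-byte range is a plausible reading of the error message rather than something confirmed in this thread.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if a 64-bit unit number is
   representable as a 4-byte signed integer. */
bool fits_in_int32(int64_t unit)
{
    return unit >= INT32_MIN && unit <= INT32_MAX;
}
```

Under this reading, a sensible unit such as 6 passes the check whatever its storage size, so only a value that is genuinely out of range (for example, one that was never initialised) would trigger the error.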

@nimgould
Contributor

nimgould commented Jun 9, 2024

That is possible, I suppose, but then why doesn't the compiler object that the variable is the wrong type for the write statement? Moreover, this would be true for all write statements (in both HSL and GALAHAD), and we don't see warnings from any other runs. I will output the variables before the write to check.

@nimgould
Contributor

nimgould commented Jun 9, 2024

Ah ha, bug splatted. It was simply that in the C interface, I had commented out the copy of the hash control components from C to Fortran, so they took random values!
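
A minimal sketch of this kind of bug, assuming two hypothetical structures that stand in for the C-side and Fortran-side hash control types (the type names, field names and the function `copy_hash_control_in` are illustrative and are not taken from the GALAHAD sources):

```c
#include <stdint.h>

/* Hypothetical stand-in for the C-side control structure. */
struct hash_control_c {
    int64_t error;        /* unit for error messages */
    int64_t out;          /* unit for general output */
    int64_t print_level;  /* amount of output requested */
};

/* Hypothetical stand-in for the Fortran-side control type. */
struct hash_control_f {
    int64_t error;
    int64_t out;
    int64_t print_level;
};

/* Copy every component from the C structure to the Fortran-side one.
   If these assignments are commented out, the destination keeps
   whatever values it happened to hold before the call. */
void copy_hash_control_in(const struct hash_control_c *c,
                          struct hash_control_f *f)
{
    f->error = c->error;
    f->out = c->out;
    f->print_level = c->print_level;
}
```

With the copy in place, a unit such as `out = 6` set on the C side reaches the Fortran side intact; without it, `out` may hold an arbitrary value, which is consistent with the "Unit number in I/O statement too large" error reported at the start of the thread.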

Of the two remaining failures, both are timeouts. The Windows one looks like it needs a bit more time, not sure about the Mac one, though. I cannot reproduce here, as the same Mac test seems to work

@nimgould
Contributor

nimgould commented Jun 9, 2024

OK, doubling the timeout cured the Windows issue. Unfortunately, now one of the Ubuntu intel ones is failing (odd that it didn't before, and all that has changed is the timeout!) when testing the Julia. I can see why that might be, and can put in a precaution. The other timeout failure, on the Mac, produces no output from the test (for sbls), so I can't say what is happening.

@nimgould
Contributor

nimgould commented Jun 9, 2024

"Precaution" works, but now there is another timeout for the Windows 64-bit build. Will this cycle of inconsistent runtimes ever cease ... I'll double the timeout and try again ...

@nimgould
Contributor

nimgould commented Jun 9, 2024

I give up ... the more I increase the timeout period, the more runs time out

@nimgould
Contributor

nimgould commented Jun 9, 2024

Is there something wrong with these Windows virtual machines? Timeout for nlst_single after 120 seconds, while for the Mac and Ubuntu the run is 0.4 seconds

@nimgould
Contributor

nimgould commented Jun 9, 2024

And now, not changing a thing, the times dropped to 1 second, and the tests passed. So, only the Mac issue to sort out.

@amontoison
Member Author

I give up ... the more I increase the timeout period, the more runs time out

If we have a timeout, it means that we have an infinite recursion during the test.
Are some tests with random values?

@nimgould
Contributor

No, this is all deterministic. Times vary considerably during both compilation and runs

@nimgould
Contributor

Sometimes it times out, other times it doesn't, with run times differing by a factor of 10.
