Unit tests for prif_stop and prif_error_stop make fragile non-portable assumptions #137

Open
bonachea opened this issue Sep 12, 2024 · 1 comment

Comments

@bonachea
Member

Currently, the approach taken to unit testing prif_stop and prif_error_stop is to unconditionally invoke ./build/run-fpm.sh from within the fpm-built Caffeine unit test and inspect the resulting process exit code.

I consider this entire approach to be very fragile for multiple reasons:

  1. Assumes the Caffeine test executable is run from the source/build directory
  2. Assumes fpm (and possibly the compiler) are available on the compute node
  3. Assumes fpm is capable of launching parallel jobs at all
  4. Assumes parallel jobs can be launched at all (by any command) from the compute node
  5. Currently appears to have EVERY image launch the subjob
  6. Relies on process exit code propagation, which can be unreliable in loosely coupled distributed systems

I expect one or more of the above assumptions to be violated on some systems (completely breaking the Caffeine unit test) once we incorporate distributed conduits and non-trivial job spawners.

As such, we'll eventually need a "kill switch" to disable this practice, or better yet a more robust approach to exit testing that doesn't rely on programmatically invoking fpm to spawn a sub-job.
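
To illustrate the "kill switch" idea, here is a minimal shell sketch of an opt-out guard. The variable name `SKIP_EXIT_TESTS` and the function `should_run_exit_tests` are hypothetical, chosen only for illustration; they are not existing Caffeine or fpm options, and the real tests would need an equivalent check on the Fortran side.

```shell
# Hypothetical guard: SKIP_EXIT_TESTS is an illustrative name only,
# not an existing Caffeine or fpm setting.
should_run_exit_tests() {
  # Succeeds (returns 0) unless SKIP_EXIT_TESTS is set to exactly 1.
  [ "${SKIP_EXIT_TESTS:-0}" != "1" ]
}

if should_run_exit_tests; then
  echo "running prif_stop/prif_error_stop exit tests"
else
  echo "skipping prif_stop/prif_error_stop exit tests"
fi
```

A guard like this would let a site where subjob launch is unavailable skip only the exit tests while still running the rest of the suite.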

@bonachea
Member Author

Direct evidence that subjob invocations of fpm are not being performed once per test, as intended, but rather once per image per test (problem 5 listed above):

{pcp-d-10} env CC=gcc CXX=c++ FC=gfortran GASNET_PSHM_NODES=1 ./build/run-fpm.sh test | tail -n 50
Project is up to date
        sums integer(c_int64_t) scalars with no optional arguments present
        multiplies default real scalars with all optional arguments present
        multiplies real(c_double) scalars with all optional arguments present
        performs a collective .and. operation across logical scalars
        sums default complex scalars with a stat-variable present
        sums complex(c_double) scalars with a stat-variable present
        sums default integer elements of a 2D array across images
    The prif_co_sum subroutine
        sums default integer scalars with no optional arguments present
        sums default integer scalars with all arguments present
        sums integer(c_int64_t) scalars with stat argument present
        sums default integer 1D arrays with no optional arguments present
        sums default integer 15D arrays with stat argument present
        sums default real scalars with result_image argument present
        sums double precision 2D arrays with no optional arguments present
        sums default complex scalars with stat argument present
        sums double precision 1D complex arrays with no optional arguments present
    A program that executes the prif_error_stop function
        exits with a non-zero exitstat when the program omits the stop code
        prints a character stop code and exits with a non-zero exitstat
        prints an integer stop code and exits with exitstat equal to the stop code
    prif_image_index
        returns 1 for the simplest case
        returns 1 when given the lower bounds
        returns 0 with invalid subscripts
        returns the expected answer for a more complicated case
    The prif_num_images function result
        is a valid number of images when invoked with no arguments
    PRIF RMA
        can send a value to another image
        can send a value with indirect interface
        can get a value from another image
        can get a value with indirect interface
    A program that executes the prif_stop function
        exits with a zero exitstat when the program omits the stop code
        prints an integer stop code and exits with exitstat equal to the stop code
        prints a character stop code and exits with a non-zero exitstat
    Teams
        can be created, changed to, and allocate coarrays
    The prif_this_image_no_coarray function result
        is the proper member of the set {1,2,...,num_images()} when invoked as this_image()

A total of 57 test cases

All Passed
Took 8.07435 seconds

A total of 57 test cases containing a total of 82 assertions

           0
{pcp-d-10} env CC=gcc CXX=c++ FC=gfortran GASNET_PSHM_NODES=8 ./build/run-fpm.sh test | tail -n 50 
Project is up to date
A total of 57 test cases

All Passed
Took 13.3721 seconds

All Passed
A total of 57 test cases containing a total of 82 assertions

All Passed
All Passed
All Passed
All Passed
All Passed
All Passed
Took 13.3721 seconds

A total of 57 test cases containing a total of 82 assertions

Took 13.3721 seconds

Took 13.3722 seconds

A total of 57 test cases containing a total of 82 assertions

A total of 57 test cases containing a total of 82 assertions
Took 13.3721 seconds


A total of 57 test cases containing a total of 82 assertions

Took 13.3721 seconds

Took 13.3722 seconds

Took 13.3721 seconds

A total of 57 test cases containing a total of 82 assertions

A total of 57 test cases containing a total of 82 assertions

A total of 57 test cases containing a total of 82 assertions

           0
           0
           0
           0
           0
           0
           0
           0

Note that when the number of images is increased from 1 to 8, the fpm "summary" output grows by a factor of 8.

This is non-scalable and will definitely fail when using a real job scheduler that enforces process parallelism limits.
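
One way to check this quickly is to count the summary lines in a captured log. The sketch below assumes each summary prints exactly one "All Passed" line, as in the logs above (a failing run would print something else); `count_summaries` is a hypothetical helper name, and the sample log is a shortened stand-in for real output.

```shell
# Count how many test summaries appear in a captured log file.
# Assumes each passing summary prints exactly one "All Passed" line.
count_summaries() {
  grep -c 'All Passed' "$1"
}

# Self-contained demo on a shortened sample shaped like the 8-image log.
log=$(mktemp)
cat > "$log" <<'EOF'
All Passed
Took 13.3721 seconds
All Passed
All Passed
EOF
count_summaries "$log"   # prints 3
rm -f "$log"
```

Run against the full 8-image output shown above, this count would be 8, versus 1 for the single-image run.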
