Two-stage compile has 0 file reported at the second stage #371

Open
jasonjunweilyu opened this issue Dec 10, 2024 · 7 comments

@jasonjunweilyu
Collaborator

When two-stage compilation is enabled, the LFRic gungho build currently fails at the link stage with the error /scratch/hc46/hc46_gitlab/lfric_fab/gungho_model-mpif90-ifort/build_output/_prebuild/physics_mappings_alg_mod.cb6a16b9a.o not found.

The following messages are logged during the compile_fortran step:

Starting two-stage compile: mod files, multiple passes
...
Finalising two-stage compile: object files, single pass
...
stage 2 compiled 0 files

A number of compile commands are actually executed between "Finalising two-stage compile: object files, single pass" and "stage 2 compiled 0 files", so I am wondering whether there is a bug in updating the build tree.

The build log is here: https://git.nci.org.au/bom/ngm/lfric/lfric_atm-fab/-/jobs/80826

jasonjunweilyu added the bug label on Dec 10, 2024
jasonjunweilyu changed the title from "Two-stage compile has 0 file compiled at the second stage" to "Two-stage compile has 0 file reported at the second stage" on Dec 10, 2024
@hiker
Collaborator

hiker commented Dec 10, 2024

The number of files ** reported in 'stage 2 compiled ** files' is incorrect; I have a fix. But compilation in two stages works for me (the mechanism works, it is just printing the wrong number).

Looking at the log file, I can't see a problem - the file is compiled twice (first only for interfaces, then fully), used in the right way in the linking command. My feeling is that this crash is unrelated? Don't we have an issue with the CI that if two of them are running at the same time they'll overwrite each other's data?

@hiker
Collaborator

hiker commented Dec 12, 2024

There seem to be two additional problems with two-stage compilation (besides reporting the wrong number):

  • Compilation errors in stage 2 are not picked up and appear to be just ignored.
  • The compilation errors triggered in stage 2 seem to be related to finding module files. I believe this happens when two files are compiled at about the same time and one of them needs the mod file from the other. Since the new mod file is being written at that moment, the file using it can't compile and aborts.

The first one needs some debugging (there seems to be error handling??). For the second one, my current idea is to redirect the module files of stage 2 to a different (temporary) directory, so the mod files from stage 1 are not overwritten.

@hiker
Collaborator

hiker commented Dec 12, 2024

The error messages do indeed say that it can't read a module file ('Error in reading the compiled module file'), which would fit the above suspicion.

It also looks as if the exception for this error is somehow lost. The return code from compilation is confirmed to be 1, meaning an exception will be raised from Tool.run.

I've added the following logging:

            try:
                logger.debug(f'CompileFortran compiling {analysed_file.fpath}')
                compile_file(analysed_file.fpath, flags,
                             output_fpath=obj_file_prebuild,
                             mp_common_args=mp_common_args)
                logger.debug(f"No error for {analysed_file.fpath} ---------------------")
            except Exception as err:
                logger.debug(f'CompileFortran compiling {analysed_file.fpath} ERROR {err.value}')
                return Exception(f"Error compiling {analysed_file.fpath}:\n"
                                 f"{err}"), None

And grepping for the filename, I see:

CompileFortran compiling /home/903/jxh903/fab-workspace/gungho_model-mpif90-ifort/build_output/algorithm/physics/physics_mappings_alg_mod.f90
run_command: mpif90 -warn all -gen-interfaces nosource -O2 -fp-model=strict -stand f08 -c -qopenmp -warn all -gen-interfaces nosource -O2 -fp-model=strict -stand f08 -g -traceback -module /home/903/jxh903/fab-workspace/gungho_model-mpif90-ifort/build_output physics_mappings_alg_mod.f90 -o /home/903/jxh903/fab-workspace/gungho_model-mpif90-ifort/build_output/_prebuild/physics_mappings_alg_mod.9eead9a44.o
Running 1  <---------------RETURN CODE is 1

STDERR IS: 
'physics_mappings_alg_mod.f90(10): error #7005: Error in reading the compiled module file.   [GALERKIN_PROJECTION_ALGORITHM_MOD]
...
physics_mappings_alg_mod.f90(159): catastrophic error: Too many errors, exiting
compilation aborted for physics_mappings_alg_mod.f90 (code 1)

That's it. I see neither the "No error" nor the "CompileFortran compiling ... ERROR" message at all??? Debug prints do indeed confirm that Tool.run raises the exception.

@jasonjunweilyu
Collaborator Author

jasonjunweilyu commented Dec 12, 2024

I tried adding error capturing to the compile_file function of the compile_fortran step, which calls the compiler's compile_file function, as shown below. Judging by the latest lfric_baf build, this does not seem to work: https://git.nci.org.au/bom/ngm/lfric/lfric_atm-fab/-/jobs/81152. I don't know why the error is suppressed. Should we add the error capturing to compiler.compile_file instead?

    try:
        compiler.compile_file(input_file=analysed_file, output_file=output_fpath,
                              openmp=config.openmp,
                              add_flags=flags,
                              syntax_only=mp_common_args.syntax_only)
    except Exception as err:
        return Exception(f"Error compiling {analysed_file.fpath}:\n"
                         f"{err}"), None
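
Both snippets in this thread return the Exception as a value instead of raising it, so the error only surfaces if the caller inspects the returned results. The following is a minimal sketch of that caller-side check, not fab's actual code: `compile_one`, `check_results`, and the `WorkerResult` type are hypothetical names made up for illustration, and the compiler is passed in as a plain callable.

```python
from typing import Callable, Iterable, List, Tuple, Union

# Hypothetical result type: the first element is either the name of the
# produced object file or the Exception describing why compilation failed.
WorkerResult = Tuple[Union[str, Exception], None]


def compile_one(source: str, compiler: Callable[[str], str]) -> WorkerResult:
    """Hypothetical worker: returns the error as a value instead of raising."""
    try:
        return compiler(source), None
    except Exception as err:
        return Exception(f"Error compiling {source}:\n{err}"), None


def check_results(results: Iterable[WorkerResult]) -> List[str]:
    """Return all object file names, or raise if any worker returned an error."""
    results = list(results)
    errors = [res for res, _ in results if isinstance(res, Exception)]
    if errors:
        joined = "\n\n".join(str(err) for err in errors)
        raise RuntimeError(f"{len(errors)} compilation error(s):\n{joined}")
    return [res for res, _ in results]
```

If a stage collects worker results but never runs a check like `check_results` on them, a failed compile would pass silently until the link step, which would be consistent with the symptom described here.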

@hiker
Collaborator

hiker commented Dec 13, 2024

OK, I found the error; it is indeed a bug in fab. Why my debug logging messages did not show up ... no idea. My best guess is that Python logging apparently uses syslog, and that has a limit of 2b messages (and I added quite a bit of logging: command line parameters, compiler output). Maybe it was just coincidence that the messages I added exceeded the buffer length (and then got chopped off).

@hiker
Collaborator

hiker commented Dec 13, 2024

Unfortunately, my solution for ifort doesn't work 😢 Parallel compilation for stage 2 with ifort still crashes now and again because it is reading an incomplete .mod file. I added a scratch directory as the module output path in stage 2, and an explicit include path to the original stage 1 directory, e.g.:

mpif90 ... -I my_fab_work/build_output -module my_fab_work/build_output/modules_second_stage create_wthetamask_lbc_kernel_mod.f90 -o my_fab_work/build_output/_prebuild/create_wthetamask_lbc_kernel_mod.43e4c407d.o

So it adds build_output using -I, then build_output/modules_second_stage as -module (to store the newly created modules in stage 2).
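
A minimal sketch of how those two flags could be assembled, assuming the Intel-style `-module` flag shown in the command above. The function name `stage2_module_flags` and the `scratch_name` parameter are hypothetical and not part of fab.

```python
from pathlib import Path
from typing import List


def stage2_module_flags(build_output: Path,
                        scratch_name: str = "modules_second_stage") -> List[str]:
    """Hypothetical helper: read stage 1 modules via -I, write stage 2
    modules into a separate scratch directory via -module."""
    scratch = build_output / scratch_name
    return ["-I", str(build_output), "-module", str(scratch)]
```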

The problem seems to be that ifort always searches the -module path first.

Here is a simple reproducer, where the directories gfortran and ifort each contain mod1.mod compiled with the corresponding compiler:

ifort -c -I ./ifort/ -module gfortran/ ./mod2.f90 
./mod2.f90(2): error #7013: This module file was not generated by any release of this compiler.   [MOD1]
        use mod1, only: mod1_a
------------^

Removing the -module flag fixes the failure:

$ ifort -c -I ./ifort/  ./mod2.f90 
$ 

@hiker
Collaborator

hiker commented Dec 13, 2024

I'll try to ask Intel. For now, the best solution is to let each compile process in stage 2 write into its own directory, so that no compilation process will ever read a module that another process is still writing to its -module path. That means a lot of directories, each with one file (using the source code filename as the unique directory name).
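
The per-file directory idea could look roughly like this. It is a sketch under stated assumptions, not the fix that went into fab: `per_file_module_flags` and `root_name` are hypothetical names, and the directory is derived from the source file's stem, which assumes file names are unique across the source tree.

```python
from pathlib import Path
from typing import List


def per_file_module_flags(source: Path, build_output: Path,
                          root_name: str = "modules_second_stage") -> List[str]:
    """Hypothetical helper: give each stage 2 compile its own -module
    directory, while still reading stage 1 modules from build_output."""
    # One directory per source file, named after the file (assumed unique).
    module_dir = build_output / root_name / source.stem
    module_dir.mkdir(parents=True, exist_ok=True)
    return ["-I", str(build_output), "-module", str(module_dir)]
```

Since every process gets a different `-module` directory, the directory ifort searches first only ever contains that process's own output, and all modules from other files are read from the stage 1 directory.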
