Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

(Closes #2716) Add support for module-inlining calls to polymorphic kernels/routines #2732

Draft
wants to merge 47 commits into
base: master
Choose a base branch
from

Conversation

arporter
Copy link
Member

@arporter arporter commented Oct 3, 2024

No description provided.

@arporter arporter self-assigned this Oct 3, 2024
@arporter arporter marked this pull request as draft October 3, 2024 09:12
@arporter arporter added in progress NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH labels Oct 3, 2024
Copy link

codecov bot commented Oct 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 99.88%. Comparing base (a33a99b) to head (8b025db).

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #2732    +/-   ##
========================================
  Coverage   99.88%   99.88%            
========================================
  Files         356      356            
  Lines       49479    49581   +102     
========================================
+ Hits        49421    49523   +102     
  Misses         58       58            

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

@arporter
Copy link
Member Author

arporter commented Oct 3, 2024

As well as addressing the coverage, I need to extend the LFRic integration tests to use this new functionality.

@arporter
Copy link
Member Author

arporter commented Oct 7, 2024

Added the KernelModuleInlineTrans back into the LFRic transformation script and get a crash when trying to add the interface symbol into the symbol table.

@arporter
Copy link
Member Author

arporter commented Oct 8, 2024

The OMP offload LFRic integration test failed with an ICE:

17:47:36 Pre-process and compile inventory/id_r32_field_array_pair_mod.F90
inventory/id_r32_field_array_pair_mod.F90:
NVFORTRAN-S-0000-Internal compiler error. flowgraph: node is zero       3  (lfric_xios_setup_mod_psy.f90: 85)
NVFORTRAN-F-0000-Internal compiler error. Invalid key for hash       0  (lfric_xios_setup_mod_psy.f90: 85)
NVFORTRAN/x86-64 Linux 24.5-1: compilation aborted

It's seems odd that there's a _psy.f90 for xios setup??

EDIT: L85 of that file corresponds to the end of the (module-inlined?) nodal_coordinates_code kernel (subroutine) which is called from within an offloaded region. However, the routine itself does not contain an offload directive?

@arporter
Copy link
Member Author

arporter commented Oct 8, 2024

Do the build manually for OpenACC (which doesn't give an ICE) and things are looking better:
image
(note: I've marked MATMUL as being available on the GPU now.)
I think @sergisiso has seen that module-inlining a routine is sufficient for the compiler to do the right thing, even if it's not had e.g. acc routine added to it?

@arporter
Copy link
Member Author

arporter commented Oct 8, 2024

I need to figure out why we get an inlined routine without an offload directive added to it. The build log says:

PSy name = 'lfric_xios_setup_mod_psy'
Transforming invoke 'invoke_0_nodal_xyz_coordinates_kernel_type' ...
Failed to annotate 'nodal_xyz_coordinates_code' with GPU-enabled directive due to:
Transformation Error: Kernel 'nodal_xyz_coordinates_code' accesses the symbol 'chi2xyz: RoutineSymbol<NoType, pure=unknown, elemental=unknown>' which is imported. If this symbol represents data then it must first be converted to a Kernel argument using the KernelImportsToArguments transformation.
Failed to annotate 'nodal_xyz_coordinates_code' with GPU-enabled directive due to:
Transformation Error: Kernel 'nodal_xyz_coordinates_code' accesses the symbol 'chi2xyz: RoutineSymbol<NoType, pure=unknown, elemental=unknown>' which is imported. If this symbol represents data then it must first be converted to a Kernel argument using the KernelImportsToArguments transformation.
Transforming invoke 'invoke_1_nodal_coordinates_kernel_type' ...
Failed to module-inline kernel 'nodal_coordinates_code' due to:
Transformation Error: Cannot inline subroutine 'nodal_coordinates_code' because another, different, subroutine with the same name already exists and versioning of module-inlined subroutines is not implemented yet.
Successfully offloaded loop with ['nodal_coordinates_code']
Successfully offloaded loop with ['nodal_coordinates_code']
Transforming invoke 'invoke_2_nodal_coordinates_kernel_type' ...
Failed to module-inline kernel 'nodal_coordinates_code' due to:
Transformation Error: Cannot inline subroutine 'nodal_coordinates_code' because another, different, subroutine with the same name already exists and versioning of module-inlined subroutines is not implemented yet.
Successfully offloaded loop with ['nodal_coordinates_code']
Successfully offloaded loop with ['nodal_coordinates_code']

@arporter
Copy link
Member Author

arporter commented Oct 9, 2024

Part of the problem was that the optimisation script wasn't checking that it hadn't already transformed a given kernel for a given invoke. Fixing that removes the 'failed to inline' messages but we still end up with an inlined kernel that doesn't have a directive added to it.

@arporter
Copy link
Member Author

arporter commented Oct 9, 2024

The problem was that, having successfully module-inlined the kernel routine, we proceed to apply the annotation transformation to the original Kern and that does not update the Routine that has been inlined. It probably should? For now, I can search for the newly inlined routine and apply the transformation to that and we get the expected code.

@arporter
Copy link
Member Author

arporter commented Oct 9, 2024

Now get a compilation failure:

NVFORTRAN-S-0155-Ambiguous interfaces for generic procedure matrix_vector_code (
algorithm/norm_alg_mod_psy.f90: 338)

This seems to be because we have successfully inlined this interface and its routines but subsequent PSy-layer subroutines are still importing it from a module. In turn, this is because I now check whether or not we've already transformed a kernel of a given name. Essentially, we need multiple LFRicKern objects to all point to the same bit of PSyIR.

@arporter
Copy link
Member Author

arporter commented Oct 9, 2024

Kern.get_kernel_schedule() (which is to be replaced/removed if/when we migrate Kern to subclass Call) currently caches the PSyIR of the kernel. KernelModuleInlineTrans creates a copy of this PSyIR and then inserts it into the PSy-layer. Therefore, get_kernel_schedule() should now return that copy of the PSyIR. In fact, that would happen automatically if I undid my changes to KernelModuleInlineTrans so that it doesn't copy the PSyIR. I can't remember why I made that change.

I think I made that change because, without copying, a routine gets removed from its original Container but that breaks any Interface that refers to it. This becomes a problem in the (unlikely) event that a routine is called directly as well as via an interface.

@arporter
Copy link
Member Author

arporter commented Oct 10, 2024

If we have an Algorithm layer that contains two invokes that each call the same kernel then we attempt to apply KernelModuleInlineTrans to each kernel call. If the first one succeeds then we immediately (in global.py) proceed to add e.g. acc routine to it. That done, we move on to the next invoke where we try to do the same thing. We then find that the body of the routine we want to inline does not match the one we've already inlined because we've added acc routine. This results in:

  1. The second kernel is not flagged as being module inlined and thus is handled by _rename_and_write.
  2. The result of _rename_and_write is a renamed kernel for which the PSy layer does not contain a symbol.

This shouldn't happen. However, we also want to mark all kernels of the same name as being module-inlined and pointing to a single implementation. Since global.py works invoke by invoke, it's not simple (or natural) to alter it so that it works kernel by kernel.

@arporter
Copy link
Member Author

Thinking about this a bit more, I belatedly realise that if a given PSy layer routine calls the same kernel more than once then either all of them must be module inlined or none of them (unless we attempt to rename the inlined version and that gets complicated). I think the reason that LFRic is still not working is that we fail to inline a second instance of the same kernel (because the first instance has been transformed) but then we do proceed to transform it. Therefore, rename_and_write() is called in order to generate a new version of that kernel on disk. I need a reproducer for this.

@arporter
Copy link
Member Author

arporter commented Oct 11, 2024

I've realised that Kern.module_inline sets this flag for all Kernels with that name in the current InvokeSchedule. This seems sensible but is causing me problems because we end up with a Kernel marked as 'module inlined' but which actually isn't because the transformation on it failed and so it was written to file and renamed instead.

The solution might therefore be as simple as not attempting to transform a Kernel that is already marked as module inlined.

@arporter
Copy link
Member Author

The trouble with the LFRic example is that we have two, separate invoke calls, each with the same Kernel. The first Kernel gets module inlined but has to be modified to have some imports made local to it. That then means we refuse to inline the second Kernel because it is no longer identical. The question is, why does the search for the source of the second kernel not return the module-inlined source of the first one?

@arporter
Copy link
Member Author

With this branch and latest gpu script in lfric/examples:
image

@arporter
Copy link
Member Author

Adding RANDOM_NUMBER intrinsic to available_on_device - build still worked OK on GPU but profile didn't look much different. Need to update the Sankey diagram and see whether I've missed anything.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in progress NG-ARCH Issues relevant to the GPU parallelisation of LFRic and other models expected to be used in NG-ARCH
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants