6 September 2024
Ilektra Christidi edited this page Oct 9, 2024
- HiREP WP can be pushed to the bottom (only if we have time at the end)
- Square -> unitary -> symplectic matrices (moving to the latter with an extra constraint/symmetry)
- 4D structured, isotropic grid with periodic BCs. At each point there's a fermion; on each link between points there's a 3x3 matrix (a gluon-like object)
- Using Hamiltonian MC (in Bayesian stats)
- 2 fields: gauge field (~gluon), fermion field. It’s a BSM (Beyond the Standard Model) theory Sp(2N). They can both be optimised.
- Fermion field is well optimised for SU(N), want the same for Sp(N). Do it first on CPU to compare w/ HiREP (currently GRID is 10-100x slower), then move to GPU.
- Need to do the whole thing for gluons - it wasn't optimised because it is a small part of the runtime for the SM. It should be the same for BSM; the main devs are working on it and we should optimise it too (the host-device copies). Makis has done some work in that direction.
- GRID is optimised for machines with arbitrary vector lengths => depends heavily on templates. Arrays are represented as NxN but linearised in the back end. Concerns are fully separated. Seamless for GPU as well.
- There's only one (or 2?) kernel, `lambda_apply` (or something similar) - everything is done by passing lambdas to this kernel.
- Focus on single-node performance (== single GPU); the MPI implementation has already been optimised.
- Exploit Sp(2N) matrix symmetry to minimise memory footprint and transfer (see chapter in proposal).
- We should not mess with data structure in GRID, they’re highly optimised. Add another template if needed. Need to first understand the data structure.
- In GRID code:
  - `Grid/qcd/action/fermion` (we care about Wilson fermions; would add an optimised kernel here). Also see `instantiation` - don't touch, but might need to write one for new implementations of kernels. And (lesser) `Grid/tensor` (e.g. `tensor_ta`).
  - The `AutoView` object (`grid/lattice/lattice_view`) does host->device memory transfer (don't modify!).
  - In `action/gauge` (also Wilson) -> calls `Staple` (-> the one to make faster) and `Ta`. Look in + optimise `qcd/utils/WilsonLoops`. (Note on notation: spacetime dimensions are `mu` and `nu`, 0->3.) `Xxx_view` is a kind of unified-memory pointer (a view of a memory address on host or device).
- There's doxygen documentation; not very descriptive but up to date. If adding comments, use doxygen format
- Weaker coupling (i.e. larger beta) moves towards the continuum limit (smaller lattice spacing)
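For reference, in the standard Wilson-action convention (an assumption here; the exact normalisation for Sp(2N) should be checked against the code) the lattice coupling is

```latex
\beta = \frac{2N_c}{g_0^2}
```

so a larger $\beta$ means a smaller bare coupling $g_0$, and asymptotic freedom then drives the lattice spacing towards zero, i.e. towards the continuum.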
- Make a fork of the repo, then one big PR. The main branch is `develop` (`main` is used only for releases). There are tests + CI (Team City), up to date and reliable, but CI runs only on the main repo and compilation takes hours => develop, compile + test manually on Tursa, then CI will check when opening the big PR to the main repo
- To run on spent CPU credits on Tursa: `qos=low`
- Team involvement: Ed Bennet will oversee + answer questions, but no coding. Peter Boyle approves all PRs; he could have a look earlier but is very busy. Frederic Bonnet can help with physics but is still learning the code.
- Alessandro Lupo did the Sp(2N) implementation (not the best C++, but he could answer questions)