Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CPU optimisation: deriv::Staple and optimising views #11

Open
ilectra opened this issue Nov 25, 2024 · 7 comments
Open

CPU optimisation: deriv::Staple and optimising views #11

ilectra opened this issue Nov 25, 2024 · 7 comments
Assignees

Comments

@ilectra
Copy link
Collaborator

ilectra commented Nov 25, 2024

Continuing #7 - see for more details.

@qiUip
Copy link
Collaborator

qiUip commented Nov 27, 2024

Initial improvement consists of moving the creation of the U vector inside Staple (in Grid/qcd/utils/WilsonLoops.h:306) to the calling function deriv (in Grid/qcd/action/gauge/WilsonGaugeAction.h:66) and only creating it once instead of once for each direction. Committed in 4742921 (which also contains the patch from Ed to cleanly compile without SU3 features).

Brief inspection showed that the results are identical and the profiler shows some noticeable improvements in the 'deriv' runtime (looking at the flame graph). Some numbers for the gains to follow shortly. Also, we need to discuss how to merge changes within our fork so that they will be compatible with the upstream (e.g. do we use the version with the patch applied or upstream development branch?).

@qiUip
Copy link
Collaborator

qiUip commented Nov 28, 2024

Some initial observations from the profiler:

  1. The time spent on PeekIndex in the original implementation is around 1.1% in each Staple call (for which there are 4 in our example), taking around 8.7 seconds each out of a ~800s run. The improved implementation reduces PeekIndex in each Staple to 0.4%, saving around 6s in each call (24s in total, which is approx 3% reduction in runtime) which while not massive is still significant.

  2. In the original implementation, the PokeIndex function was executed with OpenMP (on a single thread), while in the improved implementation it was not (the profiler didn't show an OpenMP call before it). As threading / parallelisation is not "explicitly" done in Grid, but rather seems to depends on the data types, we should investigate further how to do this correctly. This would be particularly important on GPUs as we want to avoid having large transfers from CPU to GPU if this can be done directly on the GPU. It illustrates that we must better understand the data movements in the code and what Peek and Poke actually do.

@qiUip
Copy link
Collaborator

qiUip commented Nov 29, 2024

Looking at the makeup of the time spent in PeekIndex throughout the code, it's about an order of magnitude higher for Gauge actions (WilsonGaugeActions.h - 10.7s 1omp / 1.11 16omp) than Fermions actions (TwoFlavour.h - 1.2s 1omp / 0.11 16omp).

This confirms that OpenMP is working around the PeekIndex after it was moved, which is good! It then leads me to suspect that in general data movements within the Gauge side of things are not done 'correctly' and there are redundant copies / creation of data etc. going around. This will be the focus of the investigation for the near future.

@qiUip
Copy link
Collaborator

qiUip commented Dec 2, 2024

From our discussions so far, I'm not sure where we should be looking at to optimise further. In Staple, the only thing I can think of is potentially what Ed discussed with splitting up the GagueField matrix to exploit its symmetry. Other than that, it's places like Ta(...) or update_field(...) where we could look in more detail, but in the tests we are looking at these are relatively small parts anyway.

@ilectra
Copy link
Collaborator Author

ilectra commented Dec 16, 2024

@qiUip can you open a PR with what you have so far? We can attempt to merge it upstream...

@qiUip
Copy link
Collaborator

qiUip commented Dec 16, 2024

Sure, let me copy it to a new, correctly named, branch and create a PR.

@qiUip
Copy link
Collaborator

qiUip commented Dec 16, 2024

Addressed in PR #16

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants