Extreme efficiency enhancements #151

rwsmith7531 · 2024-05-03T13:33:34Z

Description

Pairwise intermolecular energy computation vectorization (except for compute system total energy)
Cell list overhaul and switch to cell neighbor lists
Add option to use cell neighbor lists for CBMC neighbor-finding
Refactored and vectorized CBMC routines (mainly just those used for Widom insertions)
Vectorized RNG for CBMC insertion trial positions
BOVINE
BOVINE-Cavs
Refactored and vectorized Ewald summation reciprocal part
Widom insertion energy computation redundancy elimination
Double-precision fast inverse square root function in energy_routines.f90
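
For readers unfamiliar with the last item, here is a minimal sketch of the general double-precision fast inverse square root technique, written in Python for illustration. It is not the routine in energy_routines.f90: the magic constant is the widely cited 64-bit analogue of 0x5F3759DF, and the function names and the number of Newton steps are assumptions.

```python
import math
import struct

# Widely cited 64-bit magic constant; the constant used in Cassandra may differ.
MAGIC_F64 = 0x5FE6EB50C7B537A9


def fast_inv_sqrt(x: float, newton_steps: int = 2) -> float:
    """Approximate 1/sqrt(x) for a positive, finite, normal double x."""
    # Reinterpret the IEEE-754 double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # The integer shift-and-subtract yields an initial guess with a few
    # percent relative error.
    bits = MAGIC_F64 - (bits >> 1)
    (y,) = struct.unpack("<d", struct.pack("<Q", bits))
    # Each Newton-Raphson step roughly squares the relative error.
    half_x = 0.5 * x
    for _ in range(newton_steps):
        y = y * (1.5 - half_x * y * y)
    return y


def relative_error(x: float, newton_steps: int = 2) -> float:
    """Relative error of the approximation against math.sqrt."""
    exact = 1.0 / math.sqrt(x)
    return abs(fast_inv_sqrt(x, newton_steps) - exact) / exact
```

The trade-off is speed against accuracy: fewer Newton steps are cheaper but less accurate, so the step count has to match the precision the energy calculation needs.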

Describe your changes in detail

The CBMC regrowth moves involve the computation of intramolecular and intermolecular nonbonded energies, which are used to bias the dihedral selection. The intramolecular nonbonded portion is computed in single precision, with its parameters stored in ppvdw_table2_sp. The intermolecular portion is computed in double precision; it could also be computed in single precision, and may be in the future. This change affects any CBMC move for a multifragment molecule (i.e., anything that uses fragment_placement, such as GEMC, NVT, NPT).

Related Issue

This project only accepts pull requests related to open issues
If suggesting a new feature or change, please discuss it in an issue first
If fixing a bug, there should be an issue describing it with steps to reproduce
Please include a reference to the issue.

How Has This Been Tested?

Please describe in detail how you tested your changes.
Include details, and the tests you ran to see how your change affects other areas of the code, etc.

Backward Compatibility

Please state whether any changes in the pull request break backward compatibility for inputs, and - if yes - explain what has been changed and why.

Post Submission Checklist

Please check the fields below as they are completed

Suitable new documentation files and/or updates to the existing docs are included.
One or more example input decks are included.
Suitable tests were added to the test suite
My name is in the contributor list at /Documentation/source/reference/acknowledgements.rst

Further Information, Files, and Links

Add any additional information here; attach relevant text or image files and URLs to external sites, publications, etc.

…or list. Members of gathered overlap cells and cell neighbor lists are now filtered by proximity. The CBMC cell list option would now be more appropriately called a cell neighbor list method, since the possible neighbors for a cell are now gathered and filtered by proximity. CBMC cells are now the same size as overlap cells; the gathering algorithm just searches more cells to capture all possible neighbors. Trial insertions of the first fragment in CBMC are now heavily vectorized. CBMC dihedral trials are not yet vectorized, but applying vectorization and bitcell overlap detection to them should be fairly straightforward.

Dimension padding currently assumes a vector size no greater than 256 bits (the size of AVX2 vector registers). If we want Cassandra to support AVX-512, changes need to be made to accommodate it, since wider vectors would violate the alignment assumptions made in some ifort compiler directives. Although intermolecular CBMC energy estimation is vectorized when used with CBMC cell neighbor lists, it can apparently still sometimes be slightly slower than directly computing the energy, most likely because of slower memory access to the very large precomputed energy table. I left it as an option because it may be faster for more expensive force fields. Some cheap WRITE statements used for debugging are still present in the code and should probably be removed to avoid excessive verbosity, especially to STDOUT.

Repeating an old simulation (from before this commit) using the same seeds and simulation options will not give identical results, even with a single thread, because CBMC insertion trial positions are now calculated from the random numbers differently than before; for example, using rranf() - 0.5 instead of 0.5 - rranf() as the fractional COM coordinate. Restricted insertion trial coordinates are now generated within the inner volume the first time, rather than being generated anywhere in the box and then regenerated if they fall outside the inner volume, as was done previously, and this process is now vectorized. Widom insertions will no longer ever be restricted, even if the inserted species is designated with restricted GCMC insertions. It's likely this was never a problem for anyone, but this fix should make sure it won't be a problem in the future. If restricted Widom insertions are ever allowed, additional changes will be needed for them to be done properly.
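
The "gather, then filter by proximity" idea behind a cell neighbor list can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions (a cubic periodic box with coordinates in [0, box), a cutoff no larger than half the box length, and hypothetical function names), not the data layout or code used in Cassandra. Cells are smaller than the cutoff, so the gather step searches several cells in each direction.

```python
import math
from itertools import product


def cell_index(point, box, n_cells):
    """Integer cell coordinates of a point in a cubic periodic box."""
    return tuple(int((c % box) / box * n_cells) % n_cells for c in point)


def build_cells(coords, box, n_cells):
    """Bin atom indices into an n_cells**3 grid."""
    cells = {}
    for idx, point in enumerate(coords):
        cells.setdefault(cell_index(point, box, n_cells), []).append(idx)
    return cells


def min_image_dist_sq(a, b, box):
    """Squared minimum-image distance between two points."""
    d2 = 0.0
    for ca, cb in zip(a, b):
        d = ca - cb
        d -= box * round(d / box)
        d2 += d * d
    return d2


def neighbors(point, coords, cells, box, n_cells, rcut):
    """Gather atoms from all cells within reach, then filter by distance.

    Assumes rcut <= box / 2 so that the minimum image is unambiguous.
    """
    reach = math.ceil(rcut / (box / n_cells))  # cells to search per direction
    home = cell_index(point, box, n_cells)
    visited = set()  # guards against visiting a cell twice via wrap-around
    found = []
    for offset in product(range(-reach, reach + 1), repeat=3):
        key = tuple((h + o) % n_cells for h, o in zip(home, offset))
        if key in visited:
            continue
        visited.add(key)
        for idx in cells.get(key, ()):
            if min_image_dist_sq(point, coords[idx], box) <= rcut * rcut:
                found.append(idx)
    return sorted(found)
```

Precomputing the filtered candidates per cell, rather than per trial position as here, is what would turn this into a cell neighbor list proper; the sketch only shows the gather and filter steps.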

… to RB form where possible. All OPLS dihedrals are now internally converted to RB torsions because RB torsions are much faster to compute. CHARMM-style dihedrals are converted to RB torsions when possible (I don't think I've seen one that can't be converted to RB, but they might exist) and are left as CHARMM style otherwise. All dihedrals formatted as RB torsions (whether explicitly input or internally converted) that are stacked on the same or reverse 4-atom sequence are collapsed into a single RB torsion by adding together the coefficients of the stacked RB torsions.

RB torsions are implemented in the protein convention (based on phi) in Cassandra, like the other dihedral types. This differs from GROMACS, which uses the polymer convention (based on psi, which is phi - pi). Because cos(psi) = -cos(phi), converting from one convention to the other (in either direction) only requires flipping the sign of the coefficients of the odd-powered terms of the series.

I also commented out the code that reads parameters for AMBER-style dihedrals because Cassandra has no code to compute their energies and they aren't converted to another style either. Dihedral styles may now be specified in all caps or all lowercase in the mcf files, to make things more user-friendly. I also renamed get_internal_coords.f90 to internal_coordinate_routines.f90 and made Internal_Coordinate_Routines a module, since the file previously just contained a collection of subroutines (one of which is named Get_Internal_Coords) not encompassed by a module.
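
A minimal Python sketch of the algebra involved, for illustration only. It assumes the OPLS functional form a0 + a1(1 + cos phi) + a2(1 - cos 2phi) + a3(1 + cos 3phi) and an RB series sum_n C_n cos(phi)^n with six coefficients; the function names are hypothetical and the code is not taken from Cassandra. The convention switch multiplies each C_n by (-1)^n, which follows from cos(phi - pi) = -cos(phi).

```python
import math


def opls_energy(phi, a0, a1, a2, a3):
    """Assumed OPLS form, evaluated directly with three cosine calls."""
    return (a0 + a1 * (1.0 + math.cos(phi))
            + a2 * (1.0 - math.cos(2.0 * phi))
            + a3 * (1.0 + math.cos(3.0 * phi)))


def opls_to_rb(a0, a1, a2, a3):
    """RB coefficients C_0..C_5 (protein convention) for the OPLS form above.

    Uses cos(2x) = 2cos(x)^2 - 1 and cos(3x) = 4cos(x)^3 - 3cos(x).
    """
    return [a0 + a1 + 2.0 * a2 + a3, a1 - 3.0 * a3, -2.0 * a2, 4.0 * a3, 0.0, 0.0]


def rb_energy(phi, coeffs):
    """Evaluate sum_n C_n cos(phi)^n with one cosine call and Horner's rule."""
    x = math.cos(phi)
    energy = 0.0
    for c in reversed(coeffs):
        energy = energy * x + c
    return energy


def stack_rb(*torsions):
    """Collapse RB torsions stacked on one 4-atom sequence by adding coefficients."""
    return [sum(cs) for cs in zip(*torsions)]


def switch_convention(coeffs):
    """Protein <-> polymer convention: multiply C_n by (-1)**n."""
    return [c * (-1) ** n for n, c in enumerate(coeffs)]
```

The single cosine evaluation plus a short polynomial is the reason the RB form is cheaper than evaluating several multiple-angle cosines per dihedral.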

…ore vectorization. Also designate several procedures as ELEMENTAL for ease of use and optimization. Add optional argument l_skip_dihed_vec to Compute_Molecule_Dihedral_Energy that specifies which dihedrals to skip when computing energy. Allow different species to use different CBMC kappa values.

Add a way to specify a minimum ideal_bitcell_length, which overrides the ideal_bitcell_length computed by the default method if it is greater than the computed value, but not if it is smaller, since the computed value is the minimum required for the algorithm to work properly. The user-defined minimum is an option because it can be beneficial to lower the resolution of the bitcell grid so that it occupies less memory, allows faster memory access, and probably has a lower cache miss rate. This will result in the bitcell overlap method catching fewer overlaps (any real overlaps it misses will be caught by the cell list overlap detection instead), but allowing the bitcells to be checked faster can be worth it (tested with min_ideal_bitcell_length = 0.2).

New function Excess_Molecule_Intrafragment_Energy was added, and optional excess_flag_o and/or minimg_flag_o arguments were added to a few subroutines. These cause the subroutines to instead compute the "excess" energy (the energy minus what it would be if computed with the minimum image sum style, as during fragment library generation) and the minimum image energy (essentially forcing the subroutine to act as if the sum style is minimum image, even if it isn't), so you only get the intramolecular parts you need for Widom insertions. Interfragment intramolecular energy is now optionally output by Build_Molecule as E_interfrag, though the logic in Fragment_Placement causes it to do so only during Widom insertions. That should probably be changed if/when the new intramolecular energy accounting done in Widom insertions is applied to other CBMC moves.

Widom insertions now include no intramolecular energy except for what is computed by Excess_Molecule_Intrafragment_Energy and the interfragment intramolecular energy, since any remaining parts would have been used to generate the fragment libraries and would have to be subtracted back out if included, which is inefficient. This method should be more robust than what was previously done, and should probably be applied to other CBMC moves as well. Use of the undamped shifted force method for coulombic interactions in CBMC trial energies is only partially implemented.

…ments. Improve vectorization of reciprocal Ewald energy calculation for Widom insertions. Add vectorized random number generation subroutines, which are used in Build_Molecule. Fix bug causing problems when writing fragment mcf file with RB torsions. Change file unit numbers to not be problematic for Widom insertion simulations with more than 10 species. Stop wasting time setting and applying bitcell overlap mask where mask bits are known to be permanently zero. Allow user to specify the use of shifted force electrostatics for CBMC trial energy calculation.

… and optimize overlap voxel grid setup. Add a minor optimization to vectorized random number generation. Correct stack memory inflation due to certain array bounds increasing by 8 every Widom insertion frame. Add some code to help visualize bitcell overlap detection masks and grids outside Cassandra; this will need to be removed eventually.

Cavity biasing is implemented. This is the version of Cassandra used for the simulations in Ryan Smith's dissertation chapter 4 and the test particle insertion enhancement paper unless they are later rerun with a faster version. BOVINE overlap checking code is made more concise with forced inlining. Atom ID pair overlap radius optimization histogram creation portion of widom_insert is made robust to some atomic overlap not being detected due to floating point rounding, which hasn't been a problem but possibly could have been if the algorithm were not made robust. It is also now parallelized with OMP WORKSHARE without leaving and re-entering the parallel region.

…matrix basis. Previously, Cassandra's trajectory reader would not work for polyatomic molecules when the trajectory coordinates were PBC-wrapped by atom, rather than wrapped by molecule center of mass or left unwrapped, because Cassandra wraps molecules by center of mass, which requires molecules to be intact. This commit allows the trajectory reader to repair partially wrapped molecules. It also optimizes parts of the trajectory reader, including more vectorization. The LAMMPS trajectory conversion script now accepts wrapped coordinates, not just unwrapped coordinates. The subroutine Load_Next_Frame and other (non-XTC) trajectory reader procedures are now included in a module, Trajectory_Reader_Routines, rather than having Load_Next_Frame be a non-module subroutine containing the other (non-XTC) trajectory reader procedures. Box cell matrices are now automatically converted to the upper triangular form used by LAMMPS, since it allows better optimization. Coordinates loaded from a trajectory file, checkpoint file, or configuration file are automatically converted to the new basis if the basis is changed.
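
One common way to repair a molecule whose atoms were wrapped individually is to rebuild it atom by atom with the minimum image convention and then wrap the intact molecule by its center of mass. The Python sketch below illustrates that idea only; it assumes an orthorhombic box spanning [0, L) in each dimension, atoms ordered so that consecutive atoms are less than half a box length apart, and hypothetical function names. The approach taken in Trajectory_Reader_Routines may differ.

```python
import math


def make_whole(atoms, box):
    """Undo per-atom wrapping by placing each atom at the periodic image
    nearest to the previous (already repaired) atom."""
    whole = [tuple(atoms[0])]
    for atom in atoms[1:]:
        prev = whole[-1]
        repaired = []
        for a, p, length in zip(atom, prev, box):
            d = a - p
            d -= length * round(d / length)  # minimum-image displacement
            repaired.append(p + d)
        whole.append(tuple(repaired))
    return whole


def wrap_by_com(atoms, masses, box):
    """Shift an intact molecule so that its center of mass lies in [0, L)."""
    total = sum(masses)
    com = [sum(m * a[k] for m, a in zip(masses, atoms)) / total for k in range(3)]
    shift = [-length * math.floor(c / length) for c, length in zip(com, box)]
    return [tuple(a[k] + shift[k] for k in range(3)) for a in atoms]
```

After `wrap_by_com`, individual atoms may lie outside the box, which is expected: only the center of mass is constrained, so bond lengths stay intact.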

Improve vectorization and multithreading and improve mathematical formulation of Ewald summation. Also overhaul data structures for Ewald data and molecule pair energy and replace large array copying with memory allocation transfers or remove them entirely (if they're unnecessary). The arbitrary limit on the number of kspace vectors was removed and replaced with a much larger limit based on implementation limitations that are unlikely to be met for sane systems. If the new limits ever become too low for a sane application, the limits may be increased by updating the integer components of a kspace vector to be encoded in a 64-bit integer instead of a 32-bit integer. For triclinic and non-cubic, orthogonal boxes, the range within which to check kspace vectors is automatically computed based on the face distances of a box in reciprocal space for which the cell matrix is the transpose of the inverse of the cell matrix for the real box. Previously, the range to check in reciprocal space was hardcoded for triclinic and non-cubic, orthogonal boxes.

Also improve cavity biasing random position generation and use 32-bit integers to encode cavity voxel coordinates when the voxel grid is small enough. This commit also adds a "compatibility mode" that is enabled by default in this commit. When compatibility_mode is true, several changes are made to the CBMC routines to try to emulate their old implementation. Although the new implementation is correct, it generates and uses random numbers differently, causing many tests to fail at the moment. Also improve trajectory reader parallelization and efficiency. Add special system Ewald reciprocal energy routine for simulations using the trajectory reader. Trajectory reader simulations (sim type pregen) don't need sin_mol and cos_mol, so they are not allocated.
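
The commit message does not say how voxel coordinates are encoded. As a purely hypothetical illustration of why grid size matters, the Python sketch below flattens (ix, iy, iz) into a single linear index, which fits in a signed 32-bit integer only when the grid has at most 2^31 voxels. All names here are assumptions, and the scheme used for cavity_locs_int32 may be entirely different.

```python
INT32_MAX = 2**31 - 1


def fits_int32(nx, ny, nz):
    """True if the largest zero-based linear voxel index fits in a signed int32."""
    return nx * ny * nz - 1 <= INT32_MAX


def encode_voxel(ix, iy, iz, nx, ny):
    """Flatten zero-based voxel coordinates into one linear index."""
    return ix + nx * (iy + ny * iz)


def decode_voxel(code, nx, ny):
    """Recover (ix, iy, iz) from a linear index produced by encode_voxel."""
    iz, rem = divmod(code, nx * ny)
    iy, ix = divmod(rem, nx)
    return ix, iy, iz
```

Storing one 32-bit value per cavity voxel instead of a 64-bit value halves the memory traffic for the location array, which is the kind of saving a size-dependent encoding is presumably after.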

…t padding and vectorization accordingly. This commit also adds lossless compression for cavity_locs and cavity_locs_int32, which store cavity voxel locations. Target architecture optimization flags were added to gfortran Makefiles. This commit also reduces stack usage when creating atompair_nrg_table_reduced, which would previously sometimes cause Cassandra to run out of stack space unless the stack size limit is increased from the default, depending on the default limit and memory requirements. For the Intel compiler, Cassandra derives memory padding parameters from the -align arraynbyte compiler option. For the gfortran compiler, Cassandra derives this from the -m option, such as -mavx2 or -msse4.2. With gfortran, the -m option should always be included even if it is redundant with -march since Cassandra uses it to determine memory padding and in rare cases vector size.
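
To illustrate what a padding parameter derived from the target instruction set might look like, the Python sketch below rounds an array dimension up to a whole number of vector lanes. The register widths are standard (128-bit SSE, 256-bit AVX2, 512-bit AVX-512); the flag-to-width table, the function name, and the rounding rule are assumptions and do not reproduce how Cassandra derives its padding.

```python
# Vector register width in bytes for a few x86 instruction-set flags.
VECTOR_BYTES = {"sse4.2": 16, "avx2": 32, "avx512f": 64}


def padded_length(n, flag, elem_bytes=8):
    """Round n up to a multiple of the number of elements per vector register.

    elem_bytes=8 corresponds to double precision, 4 to single precision.
    """
    lanes = VECTOR_BYTES[flag] // elem_bytes
    return -(-n // lanes) * lanes  # ceiling division, then scale back up
```

For example, `padded_length(10, "avx2")` gives 12, because an AVX2 register holds four doubles, so every padded row begins on a vector boundary as long as the array itself is suitably aligned.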

rwsmith7531 added 30 commits July 21, 2022 16:14

Implement adaptive rcut_low for cell list overlap detection with lj.

f3e08e2

Disable charge correction to adaptive rminsq when charge interactions are disabled.

896f6f5

Move precalculation of rcut_lowsq.

904fab1

Correctly compute pair minimum qq.

31a559f

Add more detailed Widom insertion output.

4a15546

Merge branch 'type_pair_rmin' into timed_adaptive_rmin

2c0b799

Initialize t_cpu to zero prior to parallel section in widom_insert

e943784

Let type max charge be negative and let type minimum charge be positive.

2c8ec51

Merge branch 'widom_species_timing' into timed_adaptive_rmin

f4fb8a9

Merge branch 'type_pair_rmin' into timed_adaptive_rmin

44da93a

Add cell lists for neighbor-finding.

415042c

Merge branch 'cbmc_cell_list' into cbmc_cell_list_merge

3a38beb

Fix CBMC cell list bugs.

62b7e98

Merge branch 'cbmc_cell_list' into cbmc_cell_list_merge

575b111

Estimate appropriate value for Umax to use.

0955297

Fix CBMC dihedral sampling.

16445ae

Previously, the CBMC Fragment_Placement subroutine would sometimes choose a dihedral trial with trial overlap despite its weight being zero. This was fixed by changing the "<=" operator to "<" and flagging cbmc_overlap if none of the trials are picked.
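
The effect of the comparison operator can be seen in a small Python sketch of weighted trial selection. It is illustrative only: the function name, the argument layout, and the direction of the comparison are assumptions, and the actual loop in Fragment_Placement may be structured differently.

```python
def select_trial(weights, rand):
    """Pick a trial index with probability proportional to its weight.

    `rand` is a uniform random number in [0, 1). Returns (index, overlap);
    index is None and overlap is True when no trial can be selected.
    """
    total = sum(weights)
    if total <= 0.0:
        return None, True  # every trial overlaps
    target = rand * total
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        # Strict "<": a zero-weight trial adds nothing to `cumulative`, so it
        # can never be the first index to exceed `target`. With "<=", the
        # case target == cumulative would select it.
        if target < cumulative:
            return i, False
    return None, True  # nothing picked (round-off): flag as overlap
```

With weights `[0.0, 1.0, 1.0]` and `rand = 0.0`, a `<=` comparison would return the zero-weight trial 0, whereas the strict comparison returns trial 1.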

Merge branch 'fix_dihedral_selection' into Eij_max_estimation

a2a2dfc

Flag overlapping cbmc dihedral trials as overlap.

efd9d41

Update test examples that were failing because they were originally generated with the bug.

d03ee55

Update examples corresponding to the corrected tests.

e0ad108

Add support for reading .xtc trajectory files.

15f50d4

Merge branch 'fix_dihedral_selection' into Eij_max_estimation

5713224

Add atom pair energy table for intermolecular CBMC trial energies.

6eabb4f

Fix bugs with atompair_nrg_table.

a96cd3e

Add atompair rminsq table feature with limited rminsq resolution.

650a294

Move some memory allocation from stack to heap to avoid stack buffer overflow.

26eca17

Enable custom tolerance list for atompair rminsq table creation.

a58cf8d

This results in creation of rminsq file for each tolerance. Also output information regarding maximum individual widom_var for insertions that would have been excluded if the rminsq table had been used.

Merge branch 'read_xtc' into atompair_rmin_xtc

8c277a2

Combine xtc reading capability with atompair rminsq table feature.

Include libgmxfort and libxdrfile and update Makefiles.

2a55824

Add ability to choose whether some large private arrays are kept on the stack or the heap.

e17516b

rwsmith7531 added 27 commits August 9, 2023 12:56

Include example input for adaptive and specific overlap radii and energy_table usage in documentation.

75790c3

Add GCMC and GEMC tests using CBMC energy table.

fb7dcd1

Change CMake settings to keep full RPATH when installing gmxfort.

0a1233f

Correct executable name in Makefile.gfortran.openMP

fc96649

Clarify some parts of the documentation.

652ca79

Attempt to fix linking on MacOS by setting CMake policy CMP0042 NEW.

adbaaee

Attempt to set CMake policy CMP0042 NEW.

f7b8d93

Replace '=' with ',' when specifying linker flag -rpath to be compatible with MacOS.

98de8a4

Merge branch 'atompair_rmin_xtc' into nonwidom_vectorization

98c6e1e

This updates the Makefiles and improves the linking of the xtc reader libraries. It also adds more tests.

Move all 'none' type dihedrals to the end of dihedral_list.

0c7d121

Change 'pentuple angle' to 'quintuple angle' in explanation comment.

355cf26

Merge branch 'RB_torsions' into bit_cell_overlap

098477d

Fix minor timing bug and make changes necessary to compile with gfortran.

8bee4d9

Remove some unused subroutines in energy_routines.f90.

2727efe

Rename load_next_frame.f90 to trajectory_reader_routines.f90

db25d46

Compare read cell matrix this_length against orig_length instead of length to detect box size/shape change.

b7ba1ff

Make further enhancements to Ewald summations.

b5bd87c

Merge branch 'master' into extreme_efficiency_enhancements

201a106

rwsmith7531 requested review from ejmaginn and emarinri May 3, 2024 13:36
