Extreme efficiency enhancements #151

Draft: wants to merge 97 commits into base: master

rwsmith7531 (Contributor) commented May 3, 2024

Description

Pairwise intermolecular energy computation vectorization (except for compute system total energy)
Cell list overhaul and switch to cell neighbor lists
Add option to use cell neighbor lists for CBMC neighbor-finding
Refactored and vectorized CBMC routines (mainly just those used for Widom insertions)
Vectorized RNG for CBMC insertion trial positions
BOVINE
BOVINE-Cavs
Refactored and vectorized Ewald summation reciprocal part.
Widom insertion energy computation redundancy elimination
Double-precision fast inverse square root function in energy_routines.f90
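
For readers unfamiliar with the last item, the following is a minimal Python sketch of the general double-precision fast inverse square root technique, not the Fortran code in energy_routines.f90. The magic constant is the widely quoted 64-bit value and the number of Newton steps is an assumption; the PR does not state which constant or how many refinement steps it uses.

```python
import struct

# Widely quoted 64-bit magic constant (assumption; not taken from the PR).
MAGIC = 0x5FE6EB50C7B537A9


def fast_inv_sqrt(x: float, newton_steps: int = 2) -> float:
    """Approximate 1/sqrt(x) for a positive, finite, normal double x."""
    # Reinterpret the IEEE-754 double as an unsigned 64-bit integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    # Halving the bit pattern and subtracting it from the constant gives
    # a first guess that is accurate to within a few percent.
    bits = MAGIC - (bits >> 1)
    (y,) = struct.unpack("<d", struct.pack("<Q", bits))
    # Each Newton-Raphson step roughly squares the relative error.
    half_x = 0.5 * x
    for _ in range(newton_steps):
        y = y * (1.5 - half_x * y * y)
    return y
```

In a compiled language the same idea avoids a division and a square root call inside the pairwise energy loop, at the cost of a tunable loss of accuracy.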

Describe your changes in detail

  • The CBMC regrowth moves involve the computation of intramolecular nonbonded and intermolecular nonbonded energies that are used for biasing the dihedral selection. The intramolecular nonbonded energy portion is computed in single precision. The parameters for this computation are stored in ppvdw_table2_sp. The intermolecular portion is computed in double precision; it could also be computed in single precision, and that may be done in the future. This change affects any CBMC move for a multifragment molecule (i.e., anything that uses fragment_placement, such as in GEMC, NVT, or NPT simulations).

Related Issue

This project only accepts pull requests related to open issues
If suggesting a new feature or change, please discuss it in an issue first
If fixing a bug, there should be an issue describing it with steps to reproduce
Please include a reference to the issue.

How Has This Been Tested?

Please describe in detail how you tested your changes.
Include details, and the tests you ran to see how your change affects other areas of the code, etc.

Backward Compatibility

Please state whether any changes in the pull request break backward compatibility for inputs, and - if yes - explain what has been
changed and why.

Post Submission Checklist

Please check the fields below as they are completed

  • Suitable new documentation files and/or updates to the existing docs are included.
  • One or more example input decks are included.
  • Suitable tests were added to the test suite
  • My name is in the contributor list at /Documentation/source/reference/acknowledgements.rst

Further Information, Files, and Links

Any additional information here, attach relevant text or image files and URLs to external sites, publications, etc.

Previously, the CBMC Fragment_Placement subroutine would sometimes
choose a dihedral trial with trial overlap despite its weight being
zero.  This was fixed by changing the "<=" operator to "<" and
flagging cbmc_overlap if none of the trials are picked.
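
A minimal Python sketch of why the comparison operator matters, assuming the trial is picked by walking a running sum of trial weights; the function name, the `inclusive` switch, and the argument layout are hypothetical and do not mirror the Fortran in Fragment_Placement.

```python
def select_trial(weights, r, inclusive=False):
    """Pick a trial index with probability proportional to its weight.

    weights   : non-negative trial weights (zero marks an overlapping trial)
    r         : uniform random number in [0, 1)
    inclusive : True emulates the old "<=" test, for illustration only
    Returns the selected index, or None if no trial is picked.
    """
    target = r * sum(weights)
    cumulative = 0.0
    for i, w in enumerate(weights):
        cumulative += w
        picked = target <= cumulative if inclusive else target < cumulative
        if picked:
            return i
    # No trial picked (for example, every weight is zero): the caller
    # would flag cbmc_overlap in this case.
    return None
```

With weights `[0.0, 0.3, 0.7]` and `r = 0.0`, the inclusive test returns index 0 even though its weight is zero, while the strict test skips it and returns index 1.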
This results in creation of rminsq file for each tolerance.
Also output information regarding maximum individual widom_var for
insertions that would have been excluded if the rminsq table had
been used.
Combine xtc reading capability with atompair rminsq table feature.
This updates the Makefiles and improves the linking of the xtc reader libraries.
It also adds more tests.
…or list.

Members of gathered overlap cells and cell neighbor lists are now filtered by proximity.
CBMC cell list option would now be more appropriately called a cell neighbor list method, since
the possible neighbors for a cell are now gathered and filtered by proximity.  CBMC cells are now
the same size as overlap cells; the gathering algorithm just searches more cells to capture all possible
neighbors.  Trial insertions of the first fragment in CBMC are now largely vectorized.  CBMC dihedral trials are not yet vectorized,
but applying vectorization and bitcell overlap detection to dihedral trials should be fairly straightforward.
Dimension padding currently assumes vector size no greater than 256 bits (the size of AVX2 vector registers), and if
we want Cassandra to support AVX-512, changes need to be made to accommodate that since it would violate the alignment
assumptions made in some ifort compiler directives.  While intermolecular CBMC energy estimation is vectorized when used
with CBMC cell neighbor lists, it can apparently sometimes still be slightly slower than directly computing the energy,
most likely due to slower memory access for the very large, precomputed energy table.  I still left it as an option though because
for more expensive force fields, it may be faster.  Some cheap WRITE statements used for debugging are still present in the code
and should probably be removed to avoid excessive verbosity, especially to STDOUT.
Repeating an old simulation (from before this commit) using the same seeds and simulation options will not give identical
results, even with a single thread, because CBMC insertion trial positions are now calculated from the random numbers
differently than before; for example, the fractional COM coordinate is now rranf() - 0.5 instead of 0.5 - rranf().
Restricted insertion trial coordinates are now generated within the inner volume the first time, rather than
being generated anywhere in the box and regenerated within the inner volume if they fall outside it, as
was done previously, and this process is now vectorized.  Widom insertions will no longer ever be restricted, even if the inserted
species is designated with restricted GCMC insertions. It's likely this was never a problem for anyone, but this fix should make sure
it won't be a problem in the future.  If restricted Widom insertions are ever allowed in the future, additional changes will need
to be made for it to be done properly.
… to RB form where possible.

All OPLS dihedrals are internally converted to RB torsions now because RB torsions are much faster to compute.
CHARMM style dihedrals are converted to RB torsions when it is possible to do so (I don't think I've seen one that
isn't possible to convert to RB but they might exist), and they are left as CHARMM style if it isn't possible to convert to RB format.
All dihedrals formatted as RB torsions (whether explicitly input or internally converted) that are stacked on the same or reverse
4-atom sequence as each other are collapsed into a single RB torsion by adding together the coefficients of the stacked RB torsions.
RB torsions are implemented in the protein convention (based on phi) in Cassandra, like the other dihedral types are.
This differs from how they are implemented in GROMACS, which uses the polymer convention (based on psi, which is phi - pi).
To convert from one convention to the other (either direction), multiply each coefficient C_n by (-1)^n, that is, flip the
sign of the coefficients of the odd-powered terms of the series, since cos(phi - pi) = -cos(phi).
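
The conversion and stacking described above can be sketched in Python as follows. The OPLS form assumed here is E = a0 + a1(1 + cos phi) + a2(1 - cos 2phi) + a3(1 + cos 3phi), and the RB form is a polynomial in cos(phi) with six coefficients; both forms and all function names are assumptions for illustration, not Cassandra's actual routines.

```python
import math


def opls_to_rb(a0, a1, a2, a3):
    """Expand the assumed OPLS form in powers of cos(phi).

    Uses cos(2p) = 2c^2 - 1 and cos(3p) = 4c^3 - 3c with c = cos(phi).
    """
    return [a0 + a1 + 2.0 * a2 + a3,  # C0
            a1 - 3.0 * a3,            # C1
            -2.0 * a2,                # C2
            4.0 * a3,                 # C3
            0.0, 0.0]                 # C4, C5


def stack_rb(*torsions):
    """Collapse RB torsions on the same four atoms by summing coefficients."""
    return [sum(cs) for cs in zip(*torsions)]


def rb_energy(coeffs, phi):
    """Evaluate sum_n C_n cos^n(phi) with Horner's rule (one cosine call)."""
    x = math.cos(phi)
    energy = 0.0
    for c in reversed(coeffs):
        energy = energy * x + c
    return energy


def opls_energy(a0, a1, a2, a3, phi):
    """Direct evaluation of the assumed OPLS form (three cosine calls)."""
    return (a0 + a1 * (1.0 + math.cos(phi))
            + a2 * (1.0 - math.cos(2.0 * phi))
            + a3 * (1.0 + math.cos(3.0 * phi)))
```

Two stacked OPLS dihedrals then cost a single polynomial evaluation: `rb_energy(stack_rb(opls_to_rb(...), opls_to_rb(...)), phi)`.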

I also commented out the code that reads parameters for AMBER-style dihedrals because Cassandra has no
code to compute the energies of AMBER-style dihedrals and they aren't converted to another style either.
Dihedral styles are now allowed to be specified in all-caps or all lowercase in the mcf files, to make things more user-friendly.
I also renamed get_internal_coords.f90 to internal_coordinate_routines.f90 and made Internal_Coordinate_Routines a module, since
the file previously just contained a collection of subroutines, one of which is named Get_Internal_Coords, not encompassed by a module.
…ore vectorization.

Also designate several procedures as ELEMENTAL for ease of use and optimization.
Add optional argument l_skip_dihed_vec to Compute_Molecule_Dihedral_Energy that specifies
which dihedrals to skip computing energy for.
Allow different species to use different CBMC kappa values.  Add a way to specify a minimum
ideal_bitcell_length, which overrides the ideal_bitcell_length computed by the default method
if it is greater than the computed value, but not if it is smaller, since
the computed value is the minimum required for the algorithm to work properly.  The
user-defined minimum is an option because it can be beneficial to lower the resolution of the
bitcell grid so that it occupies less memory, allows faster memory access, and probably has a
lower cache miss rate.  This results in the bitcell overlap method catching fewer overlaps
(real overlaps that it misses are caught by the cell list overlap detection instead), but
allowing the bitcells to be checked faster can be worth it (tested with min_ideal_bitcell_length = 0.2).
New function Excess_Molecule_Intrafragment_Energy was added, and optional excess_flag_o and/or minimg_flag_o arguments
were added to a few subroutines to cause them to instead compute the "excess" energy (energy minus what it would be if
computed with the minimum image sum style as during fragment library generation) and minimum image energy
(essentially forces the subroutine to act as if the sum style is minimum image, even if it isn't) so you only get
the intramolecular parts you need for Widom insertions.
Interfragment intramolecular energy is now optionally output by Build_Molecule as E_interfrag, though the logic in
Fragment_Placement causes it to only do so during Widom insertions.  That should probably be changed if/when the new
intramolecular energy accounting done in Widom insertions is applied to other CBMC moves.  Widom insertions now include
no intramolecular energy except for what is computed by Excess_Molecule_Intrafragment_Energy and the interfragment
intramolecular energy, since any remaining parts would have been used to generate the fragment libraries and would
have to be subtracted back out if included, which is inefficient.  This method should be more robust than what was
previously done, and should probably be applied to other CBMC moves as well.
Use of undamped shifted force method for coulombic interactions in CBMC trial energies is only partially implemented.
…ments.

Improve vectorization of reciprocal ewald energy calculation for Widom insertions.
Add vectorized random number generation subroutines, which are used in Build_Molecule.
Fix bug causing problems when writing fragment mcf file with RB torsions.
Change file unit numbers to not be problematic for Widom insertion simulations with more than 10 species.
Stop wasting time setting and applying bitcell overlap mask where mask bits are known to be permanently zero.
Allow user to specify the use of shifted force electrostatics for cbmc trial energy calculation.
… and optimize overlap voxel grid setup.

Add a minor optimization to vectorized random number generation.
Correct stack memory inflation due to certain array bounds increasing by 8 every Widom insertion frame.
Add some code to help visualize bitcell overlap detection masks and grids outside Cassandra;
this will need to be removed eventually.
Cavity biasing is implemented. This is the version of Cassandra used for the simulations
in Ryan Smith's dissertation chapter 4 and the test particle insertion enhancement paper
unless they are later rerun with a faster version. BOVINE overlap checking code is made
more concise with forced inlining. The atom ID pair overlap radius optimization histogram
creation portion of widom_insert is now robust to atomic overlaps going
undetected due to floating point rounding, which has not been a problem so far but
could have become one. It is also now parallelized with
OMP WORKSHARE without leaving and re-entering the parallel region.
…matrix basis.

Previously, Cassandra's trajectory reader would not work for polyatomic molecules when the trajectory
coordinates were PBC-wrapped by atom, rather than by molecule center of mass or not at all (unwrapped),
because Cassandra wraps molecules by center of mass, which requires molecules to be intact.
This commit allows the trajectory reader to repair partially wrapped molecules.
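
One way to picture the repair step is the Python sketch below. It assumes an orthogonal box centered on the origin and atoms ordered so that each atom lies within half a box length of the previous one; the function names are hypothetical, and the PR's actual algorithm (including triclinic handling) is not shown in this description.

```python
def repair_molecule(coords, box):
    """Make a molecule whole again after per-atom PBC wrapping.

    coords : list of [x, y, z], one entry per atom
    box    : (Lx, Ly, Lz) edge lengths of an orthogonal box
    """
    repaired = [list(coords[0])]
    for atom in coords[1:]:
        prev = repaired[-1]
        fixed = []
        for d in range(3):
            # Use the periodic image of this atom nearest the previous atom.
            delta = atom[d] - prev[d]
            delta -= box[d] * round(delta / box[d])
            fixed.append(prev[d] + delta)
        repaired.append(fixed)
    return repaired


def wrap_by_com(coords, masses, box):
    """Shift an intact molecule so its center of mass lies inside the box."""
    total = sum(masses)
    com = [sum(m * c[d] for m, c in zip(masses, coords)) / total
           for d in range(3)]
    shift = [-box[d] * round(com[d] / box[d]) for d in range(3)]
    return [[c[d] + shift[d] for d in range(3)] for c in coords]
```

Repairing first and wrapping by center of mass second keeps bond lengths intact, which is the property that center-of-mass wrapping relies on.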
It also optimizes parts of the trajectory reader, including more vectorization.
The LAMMPS trajectory conversion script now accepts wrapped coordinates, not just
unwrapped coordinates.
The subroutine Load_Next_Frame and other (non-XTC) trajectory reader procedures are now
included in a module, Trajectory_Reader_Routines, rather than having Load_Next_Frame be
a non-module subroutine containing the other (non-XTC) trajectory reader procedures.
Box cell matrices are now automatically converted to the upper triangular form used by LAMMPS,
since it allows better optimization. Coordinates loaded from a trajectory file, checkpoint file,
or configuration file are automatically converted to the new basis if the basis is changed.
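
As an illustration of the basis change, the Python sketch below rotates a general right-handed cell (edge vectors a, b, c) into upper triangular form using the standard formulas from the LAMMPS triclinic documentation, and carries a point over by preserving its fractional coordinates. The function names are hypothetical; Cassandra's own routine is not shown in this description.

```python
import math


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def to_upper_triangular(a, b, c):
    """Return the 3x3 cell matrix (columns = edge vectors) with a along x
    and b in the xy plane. Assumes a, b, c form a right-handed basis."""
    lx = math.sqrt(dot(a, a))
    xy = dot(a, b) / lx
    ly = math.sqrt(dot(b, b) - xy * xy)
    xz = dot(a, c) / lx
    yz = (dot(b, c) - xy * xz) / ly
    lz = math.sqrt(dot(c, c) - xz * xz - yz * yz)
    return [[lx, xy, xz],
            [0.0, ly, yz],
            [0.0, 0.0, lz]]


def convert_point(r, a, b, c, h_new):
    """Map a Cartesian point from the old basis to the new one by keeping
    its fractional coordinates fixed."""
    vol = dot(a, cross(b, c))
    s = [dot(r, cross(b, c)) / vol,
         dot(r, cross(c, a)) / vol,
         dot(r, cross(a, b)) / vol]
    return [sum(h_new[i][j] * s[j] for j in range(3)) for i in range(3)]
```

Edge lengths and the angles between edges are unchanged by this rotation, so energies are unaffected while the zeros below the diagonal simplify minimum-image and wrapping arithmetic.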
Improve vectorization and multithreading and improve mathematical formulation of Ewald summation.
Also overhaul data structures for Ewald data and molecule pair energy and replace
large array copying with memory allocation transfers or remove them entirely (if they're unnecessary).
The arbitrary limit on the number of kspace vectors was removed and replaced with a much larger limit
based on implementation limitations that are unlikely to be met for sane systems.
If the new limits ever become too low for a sane application, the limits may be increased by updating
the integer components of a kspace vector to be encoded in a 64-bit integer instead of a 32-bit integer.
For triclinic and non-cubic, orthogonal boxes, the range within which to check kspace vectors is automatically
computed based on the face distances of a box in reciprocal space for which the cell matrix is
the transpose of the inverse of the cell matrix for the real box.
Previously, the range to check in reciprocal space was hardcoded for triclinic and non-cubic, orthogonal boxes.
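
A minimal Python sketch of one way such a range could be derived, under stated assumptions: k = 2*pi*(n1 a* + n2 b* + n3 c*), where a*, b*, c* are the edge vectors of the reciprocal box, and every k with |k| <= k_cut must be covered. The bound n_i <= k_cut / (2*pi*d_i), with d_i the face distance of the reciprocal box, follows from those assumptions; it is not necessarily the exact formula used in the PR, and all names are hypothetical.

```python
import math


def dot(u, v):
    return sum(p * q for p, q in zip(u, v))


def cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]


def reciprocal_vectors(a, b, c):
    """Edge vectors of the reciprocal box, i.e. the columns of
    transpose(inverse(H)) where H has a, b, c as columns (no 2*pi)."""
    vol = dot(a, cross(b, c))
    return [[x / vol for x in cross(b, c)],
            [x / vol for x in cross(c, a)],
            [x / vol for x in cross(a, b)]]


def face_distances(u, v, w):
    """Perpendicular distance between each pair of opposite box faces."""
    vol = abs(dot(u, cross(v, w)))
    norm = lambda x: math.sqrt(dot(x, x))
    return [vol / norm(cross(v, w)),
            vol / norm(cross(w, u)),
            vol / norm(cross(u, v))]


def kspace_index_ranges(a, b, c, k_cut):
    """Largest |n_i| that can satisfy |k| <= k_cut along each index."""
    d = face_distances(*reciprocal_vectors(a, b, c))
    return [math.floor(k_cut / (2.0 * math.pi * di)) for di in d]
```

For a cubic box of edge 10 and `k_cut = 2.0`, each index runs from -3 to 3; for a tilted box the three limits generally differ, which is what a hardcoded range cannot capture.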
Also improve cavity biasing random position generation and use 32-bit integers to encode
cavity voxel coordinates when voxel grid is small enough.

This commit also adds a "compatibility mode" that is enabled by default in this commit.
When compatibility_mode is true, several changes are made to the CBMC routines to try to
emulate their old implementation. Although the new implementation is correct, it generates and
uses random numbers differently, causing many tests to fail at the moment.

Also improve trajectory reader parallelization and efficiency.

Add special system Ewald reciprocal energy routine for simulations using the trajectory reader.
Trajectory reader simulations (sim type pregen) don't need sin_mol and cos_mol, so they are not allocated.
…t padding and vectorization accordingly.

This commit also adds lossless compression for cavity_locs and cavity_locs_int32, which store cavity voxel locations.
Target architecture optimization flags were added to gfortran Makefiles.
This commit also reduces stack usage when creating atompair_nrg_table_reduced, which would previously sometimes cause
Cassandra to run out of stack space unless the stack size limit is increased from the default, depending on the default limit
and memory requirements.

For the Intel compiler, Cassandra derives memory padding parameters from the -align arraynbyte compiler option.
For the gfortran compiler, Cassandra derives this from the -m option, such as -mavx2 or -msse4.2.
With gfortran, the -m option should always be included even if it is redundant with -march since Cassandra
uses it to determine memory padding and in rare cases vector size.
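
To make the padding idea concrete, here is a minimal Python sketch that rounds an array dimension up to a whole number of vector registers. The register widths for the listed x86 instruction sets are standard (128-bit SSE, 256-bit AVX/AVX2, 512-bit AVX-512), but the lookup table, the function name, and the way Cassandra actually consumes the flag are assumptions for illustration.

```python
# Vector register width in bytes for a few gfortran -m options.
VECTOR_BYTES = {
    "-msse4.2": 16,
    "-mavx": 32,
    "-mavx2": 32,
    "-mavx512f": 64,
}


def padded_length(n, flag, elem_bytes=8):
    """Round n elements up to a multiple of the vector lane count.

    elem_bytes defaults to 8 (double precision); use 4 for single precision.
    """
    lanes = VECTOR_BYTES[flag] // elem_bytes
    # Ceiling division written with integer arithmetic only.
    return -(-n // lanes) * lanes
```

For example, a dimension of 10 doubles pads to 12 under `-mavx2` and to 16 under `-mavx512f`, which shows why alignment assumptions written for 256-bit vectors would need revisiting for AVX-512.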