
PoseStack

Jeff Flatten edited this page Jun 6, 2024 · 14 revisions


A PoseStack represents a batch of molecular systems. It is optimized for compactness so that it can be processed efficiently on the GPU. This class is the focus of the work that tmol performs.

A PoseStack is generated from tmol's other molecular-system data structure, Canonical Form, via the function tmol.pose_stack_from_canonical_form. Most often, users will instead use the convenience functions for loading structures from PDBs, from RosettaFold, or from OpenFold.

Each pose in a PoseStack is composed of some number of Blocks: substructures that are instances of pre-defined types defined in the database. Each pose in the stack can hold as many blocks as desired, and these blocks can be bonded together into one or more chains.

Note

tmol will be more efficient in computing scores for systems that are approximately the same size than for systems that have very different sizes.
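One way to build intuition for this note is to estimate how much of a batch would be padding if every pose were padded to the size of the largest pose in the stack. That padding scheme is an assumption made for illustration, and `padding_fraction` is a hypothetical helper, not part of tmol.

```python
def padding_fraction(blocks_per_pose):
    """Fraction of block slots that would be empty padding.

    ASSUMPTION: every pose is padded to the block count of the largest
    pose in the batch. This is an illustrative model, not tmol code.
    """
    max_blocks = max(blocks_per_pose)
    total_slots = max_blocks * len(blocks_per_pose)
    used_slots = sum(blocks_per_pose)
    return 1.0 - used_slots / total_slots


# Three poses of nearly equal size waste very little space ...
similar = padding_fraction([98, 100, 102])
# ... while mixing small and large poses leaves most slots empty.
mixed = padding_fraction([10, 100, 400])

print(f"similar sizes: {similar:.1%} padding")
print(f"mixed sizes:   {mixed:.1%} padding")
```

Under that assumption, grouping systems of similar size into the same PoseStack keeps the wasted work small.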

The PoseStack includes the following data members:

  • A PackedBlockTypes object, used for storing any data on block types that the GPU may need in a compact form.
  • A tensor that gives the type of each block in the PoseStack.
  • An atom coordinates tensor that contains the coordinates for all blocks in a compact (gap-less) format.
  • Block offsets into the coordinate tensor that tell you where to find the beginning of the atoms for a specific block.
  • A tensor describing connections between blocks.
  • A tensor describing how many chemical bonds separate a pair of blocks, used for efficiently determining bond separation for interatomic energy calculations.
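The layout described by the list above can be sketched with plain Python lists standing in for tensors. This is a toy illustration for a single pose: the function names are hypothetical, and the separation count is simplified to the number of inter-block connections on the shortest path rather than the number of chemical bonds.

```python
from collections import deque


def pack_blocks(blocks):
    """Concatenate per-block atom coordinates into one gap-less list.

    Returns (coords, offsets), where offsets[i] is the index of the
    first atom of block i in coords.
    """
    coords, offsets = [], []
    for atoms in blocks:
        offsets.append(len(coords))
        coords.extend(atoms)
    return coords, offsets


def block_atoms(coords, offsets, i):
    """Recover the atoms of block i from the packed list."""
    start = offsets[i]
    end = offsets[i + 1] if i + 1 < len(offsets) else len(coords)
    return coords[start:end]


def block_separation(n_blocks, connections):
    """All-pairs separation between blocks via breadth-first search.

    SIMPLIFICATION: counts inter-block connections, not chemical bonds.
    Unreachable pairs are left as None.
    """
    neighbors = [[] for _ in range(n_blocks)]
    for a, b in connections:
        neighbors[a].append(b)
        neighbors[b].append(a)
    sep = [[None] * n_blocks for _ in range(n_blocks)]
    for start in range(n_blocks):
        sep[start][start] = 0
        queue = deque([start])
        while queue:
            cur = queue.popleft()
            for nxt in neighbors[cur]:
                if sep[start][nxt] is None:
                    sep[start][nxt] = sep[start][cur] + 1
                    queue.append(nxt)
    return sep


# Three blocks with 2, 3, and 1 atoms, connected in a chain 0-1-2.
blocks = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(2.0, 0.0, 0.0), (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)],
    [(5.0, 0.0, 0.0)],
]
coords, offsets = pack_blocks(blocks)
sep = block_separation(3, [(0, 1), (1, 2)])
```

Storing only the start offsets keeps the coordinate list free of gaps, while precomputing the separation table avoids walking the connection graph for every pair of blocks during scoring.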

A Note on Hydrogen Atoms

tmol, like Rosetta before it, creates an all-atom representation of a molecular system. For this reason, tmol's representation of several residue types differs from the representations used by most of the popular ML models (such as AlphaFold/RosettaFold/OpenFold/ESMFold). In particular, tmol models hydrogens explicitly and is thus aware of chemical differences that other representations gloss over. For example, cysteine can either form or not form a disulfide bond with another cysteine. In those ML models, there is no representational difference between cysteine in these two states; in tmol, the disulfide-bonded cysteine has an extra inter-residue connection between its SG sulfur and the other cysteine, and the non-disulfide-bonded cysteine has a sulfhydryl hydrogen in its place. Thus tmol represents these two states with two different residue types. Similarly, tmol differentiates between the two tautomers of histidine: one tautomer is protonated at the NE2 nitrogen in the imidazole ring; the other ("HIS_D") is protonated at the ND1 nitrogen.

tmol will build hydrogens for you if you do not have their coordinates. This calculation is deterministic. For aliphatic hydrogens and most polar hydrogens, it is trivially deterministic; for some polar hydrogens, however, there are degrees of freedom that go beyond heavy-atom coordinates. tmol places hydroxyl, phenolic, and sulfhydryl hydrogens with a dihedral angle of 180 degrees (almost certainly not the optimal location for these atoms), and it will always choose the NE2-protonated histidine tautomer. The hydrogen-placement step is differentiable, so if you include the tmol energy in your ML model's loss and you only provide heavy-atom coordinates, then the energetic contribution of the automatically built hydrogen atoms will feed into the positional derivatives of the heavy atoms that define their geometries.
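To illustrate what placing a hydrogen "with a dihedral angle of 180" means geometrically, here is a minimal sketch of the standard internal-to-Cartesian construction: given three heavy atoms, a bond length, a bond angle, and a dihedral, it computes the position of a fourth atom. This is not tmol's implementation; the function names are hypothetical, and the bond length and angle below are illustrative values rather than tmol's parameters.

```python
import math


def _sub(u, v):
    return (u[0] - v[0], u[1] - v[1], u[2] - v[2])


def _dot(u, v):
    return u[0] * v[0] + u[1] * v[1] + u[2] * v[2]


def _cross(u, v):
    return (u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0])


def _unit(u):
    norm = math.sqrt(_dot(u, u))
    return (u[0] / norm, u[1] / norm, u[2] / norm)


def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Place atom d bonded to c, with angle b-c-d and dihedral a-b-c-d."""
    theta = math.radians(bond_angle_deg)
    phi = math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))                 # axis of the b-c bond
    n = _unit(_cross(_sub(b, a), bc))      # normal to the a-b-c plane
    m = _cross(n, bc)                      # in-plane axis, on the side of a
    local = (-bond_length * math.cos(theta),
             bond_length * math.sin(theta) * math.cos(phi),
             bond_length * math.sin(theta) * math.sin(phi))
    return tuple(c[i] + local[0] * bc[i] + local[1] * m[i] + local[2] * n[i]
                 for i in range(3))


def measure_dihedral_deg(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees."""
    b1, b2, b3 = _sub(b, a), _sub(c, b), _sub(d, c)
    n1, n2 = _cross(b1, b2), _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))


# A serine-like hydroxyl: place HG from CA, CB, OG (toy coordinates).
ca, cb, og = (0.0, 1.4, 0.0), (0.0, 0.0, 0.0), (1.5, 0.0, 0.0)
hg = place_atom(ca, cb, og, bond_length=0.96, bond_angle_deg=109.5,
                dihedral_deg=180.0)
```

Because every step is built from smooth arithmetic on the heavy-atom coordinates, a construction of this kind can be differentiated, which is consistent with the statement above that derivatives flow from the built hydrogens back to the heavy atoms.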

A Note on Cropping

tmol includes explicit representations of terminal atoms, using different residue types for amino acids (and eventually other polymer subunits) that are in the middle of a peptide chain than for those at the ends. tmol will also define chemical bonds between sequential residues that are part of the same chain. In several modeling tasks, however, the first and last residues of a chain may be absent, or sequential residues may not actually have chemical bonds between them. For example, in modeling a protein/protein interface, it might be most computationally efficient to represent only a small number of residues on either side of the interface. The first residue might not be the N terminus, and residues i and i+1 might be separated by many cropped-out residues. In such cases, adding the formal positive charge to the first residue's amino group might create unrealistic electrostatic interactions, and declaring a chemical bond between i and i+1 might put large forces on these residues to try to correct the "bad" covalent geometry. tmol's PoseStack construction process allows control over which residues are treated like regular polymeric positions and which are "exceptions to the rule" through the variable res_not_connected. More on this variable below.
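As a rough sketch of the idea, the flags for a cropped chain could be derived from gaps in the original residue numbering. The helper `connection_flags` and the layout it returns (one pair of booleans per residue, for the lower and upper neighbor) are assumptions made for illustration; consult the description of res_not_connected for the layout tmol actually expects.

```python
def connection_flags(residue_numbers, starts_at_nterm=False,
                     ends_at_cterm=False):
    """Return one (lower_open, upper_open) pair per residue.

    HYPOTHETICAL helper, not tmol API. A flag is True when the residue
    should NOT be treated as bonded to a neighbor (or as a true
    terminus) on that side, because residues were cropped out there.
    """
    flags = []
    last = len(residue_numbers) - 1
    for i, number in enumerate(residue_numbers):
        if i == 0:
            # The first residue is only a real N terminus if the caller says so.
            lower_open = not starts_at_nterm
        else:
            lower_open = residue_numbers[i - 1] != number - 1
        if i == last:
            upper_open = not ends_at_cterm
        else:
            upper_open = residue_numbers[i + 1] != number + 1
        flags.append((lower_open, upper_open))
    return flags


# Two fragments cut out of a longer chain: residues 45-47 and 112-113.
flags = connection_flags([45, 46, 47, 112, 113])
for number, pair in zip([45, 46, 47, 112, 113], flags):
    print(number, pair)
```

In this toy example, residue 47 and residue 112 are adjacent in the cropped list but are flagged as unconnected, so no bond would be declared across the gap, and residue 45 is not given N-terminal treatment.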
