PoseStack
A PoseStack represents a set (a batch) of molecular systems. PoseStack is optimized for compactness so that it can be processed efficiently on the GPU. This class is the focus of the work that tmol performs.
A PoseStack is generated from tmol's other molecular system data structure, Canonical Form, via the function tmol.pose_stack_from_canonical_form. Most often, users will instead use convenience functions for loading from PDBs, from RosettaFold, or from OpenFold.
Each pose in a PoseStack is composed of some number of Blocks - substructures that are instances of pre-defined types defined in the database.
The PoseStack tracks the block types, the coordinates of atoms, and the bonds that join a pose's blocks together. Each pose in the stack can hold as many blocks as desired, and these blocks can be bonded together into one or more chains.
Note: tmol will be more efficient in computing scores for systems that are approximately the same size than for systems that have very different sizes.
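To illustrate why similarly sized systems batch more efficiently, here is a minimal sketch. It assumes that a stack stores its poses in one rectangular array padded to the size of the largest pose. The padded layout, the -1 sentinel, and the function names are illustrative assumptions, not tmol's API.

```python
from typing import List

PAD = -1  # hypothetical sentinel marking an unused block slot


def pad_block_types(poses: List[List[int]]) -> List[List[int]]:
    """Pad each pose's block-type indices to the length of the longest pose."""
    max_n_blocks = max(len(p) for p in poses)
    return [p + [PAD] * (max_n_blocks - len(p)) for p in poses]


def wasted_fraction(poses: List[List[int]]) -> float:
    """Fraction of slots in the padded array that hold padding, not blocks."""
    padded = pad_block_types(poses)
    total = sum(len(row) for row in padded)
    used = sum(len(p) for p in poses)
    return (total - used) / total


# Three poses of nearly equal size versus three poses of very different size.
similar = [[0] * 100, [0] * 98, [0] * 102]
mixed = [[0] * 10, [0] * 50, [0] * 300]

print(f"similar sizes: {wasted_fraction(similar):.1%} padding")
print(f"mixed sizes:   {wasted_fraction(mixed):.1%} padding")
```

Under this assumption, the mixed batch spends most of its slots on padding, while the batch of similar sizes spends almost none.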
The PoseStack includes a data structure, PackedBlockTypes, that serves to accumulate ("pack") all block-type-specific information into tensors so that this data can be passed efficiently to C++ kernels. This class is built from the collection of RefinedResidueType objects used by the PoseStack.
Other classes, such as the EnergyTerms, may accumulate any block-type-specific data for that term into tensors and cache those tensors inside the PackedBlockTypes object. Creating the data stored in the PackedBlockTypes object can be somewhat slow; thus, it is most efficient to share a single PackedBlockTypes object between multiple PoseStacks.
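The benefit of sharing one object can be sketched with a small build-once cache. The class below is a hypothetical stand-in written for this illustration; its name, methods, and the toy "energy term" are assumptions and do not reflect the real PackedBlockTypes interface.

```python
from typing import Callable, Dict, List


class PackedTypesSketch:
    """Hypothetical stand-in for a shared store of per-block-type data."""

    def __init__(self, block_type_names: List[str]) -> None:
        self.block_type_names = list(block_type_names)
        self._cache: Dict[str, List[int]] = {}
        self.n_builds = 0  # counts how often an expensive build actually ran

    def get_or_build(
        self, key: str, build: Callable[[List[str]], List[int]]
    ) -> List[int]:
        """Return cached data for `key`, building it on first use only."""
        if key not in self._cache:
            self._cache[key] = build(self.block_type_names)
            self.n_builds += 1
        return self._cache[key]


def name_lengths(names: List[str]) -> List[int]:
    """Toy per-term setup step: one value per block type."""
    return [len(n) for n in names]


shared = PackedTypesSketch(["ALA", "CYS", "HIS_D"])

# Two stacks that hold the same `shared` object trigger one build between them.
first = shared.get_or_build("toy_term", name_lengths)
second = shared.get_or_build("toy_term", name_lengths)
print(shared.n_builds)
```

If each stack built its own store instead, the setup step would run once per stack rather than once overall.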
The coordinates of a PoseStack can be modified after construction; however, all other data members must be left unaltered. If you want to modify the residue type information for an existing PoseStack, you should construct a new PoseStack object instead.
tmol, and Rosetta before it, creates an all-atom representation of a molecular system. For this reason, tmol's representation of several residue types differs from the representations used by most of the popular ML models (such as AlphaFold/RosettaFold/OpenFold/ESMFold). In particular, tmol models hydrogens explicitly and is thus aware of chemical differences that other representations gloss over. For example, cysteine can either form or not form a disulfide bond with another cysteine. In those ML models, there is no representational difference between cysteine in these two states; in tmol, the disulfide-bonded cysteine has an extra inter-residue connection between the SG sulfur and the other cysteine, and the non-disulfide-bonded cysteine has a sulfhydryl hydrogen in its place. Thus tmol represents these two states with two different residue types. In another case, tmol differentiates between the two tautomers of histidine; one tautomer protonates the NE2 nitrogen in the imidazole ring, the other tautomer ("HIS_D") protonates the ND1 nitrogen.
tmol will build hydrogens for you if you do not have their coordinates. This calculation is deterministic. For aliphatic hydrogens and most polar hydrogens, it is trivially deterministic; for some polar hydrogens, however, there are degrees of freedom that go beyond heavy-atom coordinates. tmol places hydroxyl, phenolic, and sulfhydryl hydrogens with a dihedral angle of 180 degrees (almost certainly not the optimal location for these atoms), and it will always choose the NE2-protonated histidine tautomer. The hydrogen-placement step is differentiable, so if you include the tmol energy in your ML model's loss and you only provide heavy-atom coordinates, then the energetic contribution of the automatically built hydrogen atoms will feed into the positional derivatives of the heavy atoms that define their geometries.
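As a worked illustration of placing a hydrogen from heavy-atom coordinates, the sketch below builds a fourth atom from three reference atoms, a bond length, a bond angle, and a dihedral. It is a generic internal-coordinate construction, not tmol's implementation. The serine-like coordinates, the 0.96 Å bond length, and the 109.5 degree angle are illustrative assumptions.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec, b: Vec) -> Vec:
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def unit(a: Vec) -> Vec:
    norm = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    return (a[0] / norm, a[1] / norm, a[2] / norm)


def place_atom(
    a: Vec, b: Vec, c: Vec, bond: float, angle_deg: float, dihedral_deg: float
) -> Vec:
    """Place atom d with |c-d| = bond, angle b-c-d, and dihedral a-b-c-d."""
    theta = math.radians(angle_deg)
    phi = math.radians(dihedral_deg)
    bc = unit(sub(c, b))            # axis along the b->c bond
    n = unit(cross(sub(b, a), bc))  # normal to the a-b-c plane
    m = cross(n, bc)                # in-plane axis pointing toward a's side
    x = -bond * math.cos(theta)
    y = bond * math.sin(theta) * math.cos(phi)
    z = bond * math.sin(theta) * math.sin(phi)
    return (
        c[0] + x * bc[0] + y * m[0] + z * n[0],
        c[1] + x * bc[1] + y * m[1] + z * n[1],
        c[2] + x * bc[2] + y * m[2] + z * n[2],
    )


# Illustrative heavy-atom coordinates for a hydroxyl group (CA, CB, OG).
ca: Vec = (-0.5, 1.4, 0.0)
cb: Vec = (0.0, 0.0, 0.0)
og: Vec = (1.4, 0.0, 0.0)

# A dihedral of 180 degrees puts the hydrogen trans to CA, in the CA-CB-OG plane.
hg = place_atom(ca, cb, og, bond=0.96, angle_deg=109.5, dihedral_deg=180.0)
print(hg)
```

Because the construction uses only smooth arithmetic on the heavy-atom coordinates, it could be differentiated when written with an autodiff tensor library. That matches the property described above.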
tmol includes explicit representations of termini atoms, using different residue types for amino acids (and eventually other polymer subunits) that are in the middle of a peptide chain than for those at the ends. Also, tmol will define chemical bonds between sequential residues that are part of the same chain. In several modeling tasks, however, the true first and last residues of a chain may be absent, or sequential residues will not actually have chemical bonds between them. For example, in modeling a protein/protein interface, it might be most computationally efficient to represent only a small number of residues on either side of the interface. The first residue might not be the N terminus, and residues i and i+1 might be separated by many cropped-out residues; in such cases, adding the formal positive charge on the first residue's amino group might create unrealistic electrostatic interactions, and declaring a chemical bond between i and i+1 might put large forces on these residues to try to correct the "bad" covalent geometry. tmol's PoseStack construction process allows control over which residues are treated like regular polymeric positions and which are "exceptions to the rule" through the variable res_not_connected. More on this variable below.
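As a rough sketch of the bookkeeping involved, the example below marks the exceptional positions in a cropped chain. It assumes one pair of boolean flags per residue, the first meaning "no real lower neighbor" and the second meaning "no real upper neighbor". That layout and the helper name are assumptions made for illustration only, and the actual form of res_not_connected may differ.

```python
from typing import List


def sketch_not_connected_flags(
    n_res: int,
    gaps_after: List[int],
    first_is_cropped: bool,
    last_is_cropped: bool,
) -> List[List[bool]]:
    """Build an n_res x 2 table of flags (assumed layout, see the lead-in).

    flags[i][0] is True when residue i has no real lower neighbor.
    flags[i][1] is True when residue i has no real upper neighbor.
    """
    flags = [[False, False] for _ in range(n_res)]
    if first_is_cropped:
        flags[0][0] = True   # first residue is not the true N terminus
    if last_is_cropped:
        flags[-1][1] = True  # last residue is not the true C terminus
    for i in gaps_after:
        flags[i][1] = True      # residue i is not bonded to i+1
        flags[i + 1][0] = True  # residue i+1 is not bonded to i
    return flags


# Five residues cropped from an interface: residues were removed between
# index 2 and index 3, and both ends of the fragment are cropped.
flags = sketch_not_connected_flags(
    n_res=5, gaps_after=[2], first_is_cropped=True, last_is_cropped=True
)
for i, row in enumerate(flags):
    print(i, row)
```

In this sketch, residues 0 and 4 would be spared terminus treatment, and no bond would be declared between residues 2 and 3.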