Skip to content

PoseStack

Jeff Flatten edited this page Jun 5, 2024 · 14 revisions

PoseStack

A PoseStack represents a set (a batch) of molecular systems.

PoseStacks are generated from a Canonical Form via the function tmol.pose_stack_from_canonical_form. Most often users will instead use convenience functions for loading in from PDBs, from RosettaFold, or from OpenFold.

Each pose in a PoseStack is composed of some number of Blocks - substructures that are instances of pre-defined types defined in the database.

The PoseStack tracks the types, coordinates, and the bonds that join a pose's blocks together. Each "Pose" in the stack can hold as many blocks as desired.

Note

tmol will be more efficient in computing scores for systems that are approximately the same size than for systems that have very different sizes) and these blocks can be bonded together into one or more chains.

PackedBlockTypes

The PoseStack includes a datastructure, PackedBlockTypes that serves to accumulate ("pack") all block type specific information into Tensors so that this data can be efficiently passed and made available to C++ kernels.

This class is built from the collection of RefinedResidueType objects used by the PoseStack.

It is also used by classes such as the EnergyTerm classes to accumulate any block-type specific data for that term into tensors.

The creation of the data stored in the PackedBlockType object can be somewhat slow; thus, it is most efficient to share a single PackedBlockTypes object between multiple PoseStacks.

The coordinates of a PoseStack can be modified after construction; however, all other data members must be left unaltered. If you want to modify the residue type information for an existing PoseStack, you should construct a new PoseStack object instead.

A Note on Hydrogen Atoms

tmol, and Rosetta before it, creates an all-atom representation of a molecular system. For this reason, there are several differences between tmol's representation of several residue types and the representations for most of the popular ML models (such as AlphaFold/RosettaFold/OpenFold/ESMFold). In particular, tmol models hydrogens explicitly and is thus aware of chemical differences that other representations gloss over. For example, cysteine can either form or not form a disulfide bond with another cysteine. In other ML models, there is no representational difference between cysteine in these two states; in tmol, the disulfide-bonded cysteine has an extra inter-residue connection between the SG sulfur and the other cysteine, and the non-disulfide-bonded cysteine has a sulfhydryl hydrogen in its place. Thus tmol represents these two states with two different residue types. In another case, tmol differentiates between the two tautomers of histidine; one tautomer protonates the NE2 nitrogen in the imidazole ring, the other tautomer ("HIS_D") protonates the ND1 nitrogen.

tmol will build hydrogens for you if you do not have their coordinates. This calculation is deterministic. For aliphatic hydrogens and most polar hydrogens, it is trivially deterministic; for some polar hydrogens, however, there are degrees of freedom that go beyond heavy-atom coordinates. tmol places hydroxyl, phenolic, and sulfhydryl hydrogens with a dihedral angle of 180 -- almost certainly not the optimal location for these atoms -- and it will always choose the NE2-protonated histidine tautomer. The hydrogen-placement step is differentiable, so if you include the tmol energy in your ML model's loss and you only provide heavy-atom coordinates, then the energetic contribution of the automatically-built hydrogen atoms will feed into the positional derivatives of the heavy-atoms that define their geometries.

A Note on Cropping

tmol includes explicit representations of termini atoms, using different residue types for amino acids (and eventually other polymer subunits) that are in the middle of a peptide chain than at the ends. Also, tmol will define chemical bonds between sequential residues that are part of the same chain. In several modeling tasks, the first and last residues of a chain may be absent or sequential residues will not actually have chemical bonds between them. For example, in modeling a protein/protein interface, it might be most computationally efficient to only represent a small number of residues on either side of the interface. The first residue might not be the N terminus, and residues i and i+1 might be separated by many cropped-out residues between them; in such cases, adding the formal positive charge on the first residue's amino group might create unrealistic electrostatic interactions, and declaring a chemical bond between i and i+1 might put large forces on these residues to try and correct the "bad" covalent geometry. tmol's PoseStack construction process allows control over which residues are treated like regular polymeric positions and which are "exceptions to the rule" through the variable res_not_connected. More on this variable below.

Clone this wiki locally