Skip to content

PoseStack

Jeff Flatten edited this page Jun 5, 2024 · 14 revisions

PoseStack

A PoseStack represents a set (a batch) of molecular systems. PoseStacks maybe be loaded in from PDBs, or may come in from other sources such as RosettaFold.

Each pose in a PoseStack is composed of some number of Blocks

Blocks

tmol breaks poses down into sub-structures called 'blocks' (a generalization of the concept of 'residue' from Rosetta). Each block is one of several "block types". Block types comprise a set of atoms, the properties of those atoms, the chemical bonds between those atoms, and the inter-block chemical bonds that will join blocks together. The molecular system records the block type for each block, the coordinates of each atom, and how each block is connected (or not connected) to other blocks.

Each "Pose" in the stack can hold as many residues as desired

Note

tmol will be more efficient in computing scores for systems that are approximately the same size than for systems that have very different sizes) and these blocks can be bonded together into one or more chains.

PoseStack contains a class that holds the residue-type data for its residues. Most users will not interact directly with this class, but it is good to be aware that it is there. This class, PackedBlockTypes, is built from a collection of RefinedResidueType objects that are constructed either from tmol's default set of residue types, described in a .yaml file, or that are created programmatically by enterprising developers. Most users will be content to interact with the PackedBlockTypes object returned by tmol.default_packed_block_types() if they want to think about this class at all. The PackedBlockTypes object will be used to cache energy-term-specific data that is needed during energy evaluation, and the creation of this data can be somewhat slow; thus, it is most efficient to share a single PackedBlockTypes object between multiple PoseStacks.

The coordinates of a PoseStack can be modified after construction; however, all other data members must be left unaltered. If you want to modify the residue type information for an existing PoseStack, you should construct a new PoseStack object instead.

A Note on Hydrogen Atoms

tmol, and Rosetta before it, creates an all-atom representation of a molecular system. For this reason, there are several differences between tmol's representation of several residue types and the representations for most of the popular ML models (such as AlphaFold/RosettaFold/OpenFold/ESMFold). In particular, tmol models hydrogens explicitly and is thus aware of chemical differences that other representations gloss over. For example, cysteine can either form or not form a disulfide bond with another cysteine. In other ML models, there is no representational difference between cysteine in these two states; in tmol, the disulfide-bonded cysteine has an extra inter-residue connection between the SG sulfur and the other cysteine, and the non-disulfide-bonded cysteine has a sulfhydryl hydrogen in its place. Thus tmol represents these two states with two different residue types. In another case, tmol differentiates between the two tautomers of histidine; one tautomer protonates the NE2 nitrogen in the imidazole ring, the other tautomer ("HIS_D") protonates the ND1 nitrogen.

tmol will build hydrogens for you if you do not have their coordinates. This calculation is deterministic. For aliphatic hydrogens and most polar hydrogens, it is trivially deterministic; for some polar hydrogens, however, there are degrees of freedom that go beyond heavy-atom coordinates. tmol places hydroxyl, phenolic, and sulfhydryl hydrogens with a dihedral angle of 180 -- almost certainly not the optimal location for these atoms -- and it will always choose the NE2-protonated histidine tautomer. The hydrogen-placement step is differentiable, so if you include the tmol energy in your ML model's loss and you only provide heavy-atom coordinates, then the energetic contribution of the automatically-built hydrogen atoms will feed into the positional derivatives of the heavy-atoms that define their geometries.

A Note on Cropping

tmol includes explicit representations of termini atoms, using different residue types for amino acids (and eventually other polymer subunits) that are in the middle of a peptide chain than at the ends. Also, tmol will define chemical bonds between sequential residues that are part of the same chain. In several modeling tasks, the first and last residues of a chain may be absent or sequential residues will not actually have chemical bonds between them. For example, in modeling a protein/protein interface, it might be most computationally efficient to only represent a small number of residues on either side of the interface. The first residue might not be the N terminus, and residues i and i+1 might be separated by many cropped-out residues between them; in such cases, adding the formal positive charge on the first residue's amino group might create unrealistic electrostatic interactions, and declaring a chemical bond between i and i+1 might put large forces on these residues to try and correct the "bad" covalent geometry. tmol's PoseStack construction process allows control over which residues are treated like regular polymeric positions and which are "exceptions to the rule" through the variable res_not_connected. More on this variable below.

Clone this wiki locally