CanonicalForm
The CanonicalForm is a data format for representing sets of molecular structures. The CanonicalForm acts as an intermediate representation between the PoseStack and the source data from a variety of input sources, including PDBs, RosettaFold2, and OpenFold.
Because `tmol` represents structures at a higher chemical granularity than most other ML packages, it has to resolve the chemical structure of the molecules coming into it from other sources. Even reading in a PDB file requires this chemical type resolution.
The chemical type resolution step for creating a PoseStack requires 3 objects:
- A `CanonicalOrdering` object. This represents a mapping of residues to integers and, for each residue, a mapping of its atom names to unique integers.
- A `PackedBlockTypes` object. This contains information about the block (residue) types for a PoseStack.
- A set of three or more tensors, typically bundled together in a dictionary referred to as the "canonical form".
CanonicalForm is not a type, but rather a dictionary containing several required and several optional values.
The canonical form dictionary must contain:
- "chain_id": a tensor of `torch.int32` of size [n_poses x max_n_residues], specifying the chain identifier for each residue in each pose,
- "res_types": a tensor of `torch.int32` of size [n_poses x max_n_residues], specifying the integer representation of each residue's three-letter code in line with the ordering specified by the `CanonicalOrdering` object, where masked-out residues are indicated by a sentinel value of -1, and
- "coords": a tensor of `torch.float32` of size [n_poses x max_n_residues x max_n_atoms_per_residue x 3], where the position in the third dimension indicates which atom is being described, and where atoms that are not being given to `tmol` should have their coordinates given as NaN (`numpy.nan`). The coordinates must be floating point, since an integer tensor cannot hold NaN.
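
As a minimal sketch of the three required entries, the snippet below assembles a toy canonical-form dictionary with plain PyTorch. The sizes and the residue-type integers (0, 1, 2) are placeholders chosen for illustration; real values have to follow whatever `CanonicalOrdering` is in use, and no `tmol` functions are called here.

```python
import torch

# Toy sizes, chosen only for illustration.
n_poses, max_n_res, max_n_atoms = 2, 4, 3

# Chain identifier for each residue in each pose.
chain_id = torch.tensor([[0, 0, 1, 1], [0, 0, 0, 0]], dtype=torch.int32)

# Residue-type integers. The values 0..2 are hypothetical placeholders;
# -1 is the sentinel for a masked-out residue (pose 1 has only three residues).
res_types = torch.tensor([[0, 1, 2, 0], [1, 1, 2, -1]], dtype=torch.int32)

# Start with every atom marked "not provided" (NaN), then fill in known atoms.
coords = torch.full(
    (n_poses, max_n_res, max_n_atoms, 3), float("nan"), dtype=torch.float32
)
coords[0, 0, 0] = torch.tensor([1.0, 2.0, 3.0])  # pose 0, residue 0, atom slot 0

canonical_form = {"chain_id": chain_id, "res_types": res_types, "coords": coords}
```

Initializing `coords` to NaN and then filling in only the known atoms means that any atom slot left untouched is automatically flagged as not provided.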
In addition, the canonical form may also contain:
- "disulfides": a tensor of `torch.int64` of size [n_dslf x 3] which lists disulfides as tuples of (pose_index, res1_index, res2_index). In many modeling problems, the indices of disulfide-bonded residues are known up front and can be given to `tmol` to avoid the step of detecting disulfide bonds based on distance. There are two reasons to skip this step: 1) a model might not place two disulfide-bonded residues close enough together for `tmol` to declare them disulfide-bonded, in which case `tmol` will be of no help in pushing these residues closer together, and 2) this step takes place on the CPU.
- "find_additional_disulfides": a boolean that controls whether or not the disulfide-detection step should be performed.
- "res_not_connected": a tensor of `torch.bool` of size [n_poses x max_n_residues x 2]. This tensor is used to indicate that a given (polymeric) residue is not connected to its previous (position 0) or next (position 1) residue; for termini residues, a value of `True` will cause the residue to not be built with its down (position 0) or up (position 1) termini-variant types. The purpose is to allow the user to include a subset of the residues in a protein, where a series of "gap" residues can be omitted between i and i+1 without those two residues being treated as if they were chemically bonded. This keeps the Ramachandran term from scoring nonsense dihedral angles and keeps the cart-bonded term from scoring nonsense bond lengths and angles.
- "return_chain_ind": a boolean that, when `True`, alters the return type of this function so that it will be a tuple with the first element being the `PoseStack` and the second element being a tensor of `torch.int32` holding the chain identifiers for the re-indexed residues of the `PoseStack`. Two things should be noted:
  1. `PoseStack` does not keep track of a chain identifier; chain is essentially an emergent property of the chemical bonds. However, `PoseStack` can be used to represent disconnected segments of a single chain, in which case it seems that the chain identifier cannot be perfectly recovered from the set of chemical bonds. At the moment, if you wish to keep track of the chain identifier for a particular residue, it must be stored separately from the `PoseStack`.
  2. Keeping track of the chain identifier is made more challenging by the fact that `PoseStack` construction will excise any residues with a residue type of `-1` (i.e. gap residues), and all residues appearing after those gap residues are given "new indices" (that is, they will appear earlier in the list of non-gap residues). For convenience, this function returns the chain_id after the gap residues have been removed.
- "return_atom_mapping": a boolean that, when `True`, alters the return type of this function so that it will be a tuple with the first element being the `PoseStack` and the last two elements being tensors `t1` and `t2` that describe the mapping for atoms in the canonical-form tensor to their PoseStack index. This could be used to update the coordinates in a PoseStack without rebuilding it (as long as the chemical identity is meant to be unchanged), or perhaps to remap derivatives to or from pose stack ordering. If requested, the atom mapping will be the last two values returned by this function, as two tensors:

      ps, t1, t2 = pose_stack_from_canonical_form(
          ..., return_atom_mapping=True
      )
      can_ord_coords[t1[:, 0], t1[:, 1], t1[:, 2]] = ps.coords[t2[:, 0], t2[:, 1]]

where `t1` is a tensor of size [nats x 3], with nats the number of mapped atoms, in which
- position [i, 0] is the pose index,
- position [i, 1] is the residue index, and
- position [i, 2] is the canonical-ordering atom index

and `t2` is a tensor of size [nats x 2] in which
- position [i, 0] is the pose index, and
- position [i, 1] is the pose-ordered atom index
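
To make the indexing concrete, here is a small sketch of the atom mapping applied in both directions using plain PyTorch tensors. Everything in it is made up for illustration: `ps_coords` merely stands in for `ps.coords` from a real `PoseStack` (assumed here to be laid out as [n_poses x n_atoms x 3], in line with the two-column `t2`), and `t1` and `t2` are written by hand rather than returned by `tmol`.

```python
import torch

# Toy shapes: 1 pose, 2 residues, 3 canonical atom slots per residue,
# and 4 atoms present in the stand-in PoseStack coordinates.
can_ord_coords = torch.full((1, 2, 3, 3), float("nan"))
ps_coords = torch.arange(12, dtype=torch.float32).reshape(1, 4, 3)

# t1 rows: (pose index, residue index, canonical-ordering atom index).
t1 = torch.tensor([[0, 0, 0], [0, 0, 2], [0, 1, 0], [0, 1, 1]])
# t2 rows: (pose index, pose-ordered atom index) for the same atoms, row for row.
t2 = torch.tensor([[0, 0], [0, 1], [0, 2], [0, 3]])

# PoseStack ordering -> canonical ordering.
can_ord_coords[t1[:, 0], t1[:, 1], t1[:, 2]] = ps_coords[t2[:, 0], t2[:, 1]]

# Canonical ordering -> PoseStack ordering, e.g. to push updated coordinates back.
new_ps_coords = torch.zeros_like(ps_coords)
new_ps_coords[t2[:, 0], t2[:, 1]] = can_ord_coords[t1[:, 0], t1[:, 1], t1[:, 2]]
```

Canonical atom slots that do not appear in `t1` are left as NaN, matching the convention for atoms that are not provided.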