CanonicalForm

Jeff Flatten edited this page Jun 6, 2024 · 7 revisions

The CanonicalForm is a data format for representing sets of molecular structures. It acts as an intermediate representation between the PoseStack and the source data, which can come from a variety of inputs, including PDB files, RoseTTAFold2, and OpenFold.

Because tmol represents structures with a higher chemical granularity than most other ML packages, it has to resolve the chemical structure of the molecules it receives from other sources. Even reading in a PDB file requires this chemical type resolution.

The chemical type resolution step for creating a PoseStack requires 3 objects:

  1. A CanonicalOrdering object - This represents a mapping of residues to integers and, for each residue, a mapping of its atom names to unique integers.
  2. A PackedBlockTypes object - This is an object containing information about the block (residue) types for a PoseStack.
  3. A set of three or more tensors typically bundled together in a dictionary referred to as the "canonical form".

CanonicalForm Dictionary

CanonicalForm is not a type, but rather a dictionary containing several required and several optional values.

The canonical form dictionary must contain:

  1. "chain_id": a tensor of torch.int32 of size [n_poses x max_n_residues], specifying the chain identifier for each residue in each pose.
  2. "res_types": a tensor of torch.int32 of size [n_poses x max_n_residues], specifying the integer representation of each residue's three-letter code in line with the ordering specified by the CanonicalOrdering object, where masked-out residues are indicated by a sentinel value of -1, and
  3. "coords": a tensor of torch.float32 of size [n_poses x max_n_residues x max_n_atoms_per_residue x 3], where the position in the third dimension indicates which atom is being described, and where atoms that are not being given to tmol should have their coordinates set to NaN (for example numpy.nan).
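
The layout of the three required entries can be sketched in plain Python, assuming nested lists stand in for the tensors. The helper name `pad_poses`, the residue-type integers (0 and 5), the atom indices, and the chain id of 0 used for padded positions are all illustrative assumptions; real values come from the CanonicalOrdering object, and this is not tmol's own loading code.

```python
import math

GAP = -1                                  # sentinel for masked-out residues
NAN3 = [math.nan, math.nan, math.nan]     # coordinates of an absent atom


def pad_poses(poses, max_n_atoms):
    """Pad ragged poses into rectangular canonical-form entries.

    poses: list of poses; each pose is a list of
    (chain_id, res_type, atoms) tuples, where atoms maps a
    canonical atom index to an [x, y, z] list.
    """
    max_n_res = max(len(pose) for pose in poses)
    chain_id, res_types, coords = [], [], []
    for pose in poses:
        chain_row, type_row, xyz_row = [], [], []
        for i in range(max_n_res):
            if i < len(pose):
                chain, rtype, atoms = pose[i]
            else:
                # Padding position: gap residue with no atoms.
                # Chain id 0 here is an arbitrary placeholder (assumption).
                chain, rtype, atoms = 0, GAP, {}
            chain_row.append(chain)
            type_row.append(rtype)
            # Atoms that are not supplied get NaN coordinates.
            xyz_row.append(
                [list(atoms.get(a, NAN3)) for a in range(max_n_atoms)]
            )
        chain_id.append(chain_row)
        res_types.append(type_row)
        coords.append(xyz_row)
    return {"chain_id": chain_id, "res_types": res_types, "coords": coords}


# Two poses of different length; residue types and atom indices are made up.
poses = [
    [(0, 0, {0: [1.0, 2.0, 3.0]}), (0, 5, {1: [4.0, 5.0, 6.0]})],
    [(0, 0, {0: [0.0, 0.0, 0.0]})],
]
canonical_form = pad_poses(poses, max_n_atoms=3)
```

Before being handed to tmol, each entry would be converted to a tensor with the dtype listed above, for instance with `torch.tensor(canonical_form["res_types"], dtype=torch.int32)`.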

In addition, the canonical form may also contain:

  1. "disulfides": a tensor of torch.int64 of size [n_dslf x 3] which lists disulfides as tuples of (pose_index, res1_index, res2_index). In many modeling problems, the indices of disulfide-bonded residues are known up front and can be given to tmol to avoid the step of detecting disulfide bonds based on distance. There are two reasons to skip this step: 1) a model might not place two disulfide-bonded residues close enough together for tmol to declare them disulfide-bonded, in which case tmol will be of no help in pushing these residues closer together, and 2) this step takes place on the CPU.

  2. "find_additional_disulfides": a boolean that controls whether the disulfide-detection step should be performed.

  3. "res_not_connected": a tensor of torch.bool of size [n_poses x max_n_residues x 2]. This tensor is used to indicate that a given (polymeric) residue is not connected to its previous (position 0) or next (position 1) residue; for terminal residues, a value of True will cause the residue not to be built with its down (position 0) or up (position 1) terminus-variant type. The purpose is to allow the user to include a subset of the residues in a protein, where a series of "gap" residues can be omitted between i and i+1 without those two residues being treated as if they were chemically bonded. This keeps the Ramachandran term from scoring nonsense dihedral angles and keeps the cart-bonded term from scoring nonsense bond lengths and angles.

  4. "return_chain_ind": a boolean that, when True, alters the return type of pose_stack_from_canonical_form so that it is a tuple with the first element being the PoseStack and the second element being a tensor of torch.int32 holding the chain identifiers of the re-indexed residues of the PoseStack. Two things should be noted. 1. PoseStack does not keep track of a chain identifier; chain is essentially an emergent property of the chemical bonds. However, PoseStack can be used to represent disconnected segments of a single chain, in which case the chain identifier cannot be perfectly recovered from the set of chemical bonds. At the moment, if you wish to keep track of the chain identifier for a particular residue, it must be stored separately from the PoseStack. 2. Keeping track of the chain identifier is made more challenging by the fact that PoseStack construction excises any residues with a residue type of -1 (i.e. gap residues), and all residues appearing after those gap residues are given new indices (that is, they appear earlier in the list of non-gap residues). For convenience, this function returns the chain_id after the gap residues have been removed.

  5. "return_atom_mapping": a boolean that, when True, alters the return type of this function so that it is a tuple with the first element being the PoseStack and the last two elements being tensors t1 and t2 that describe the mapping from atoms in the canonical-form tensor to their PoseStack index. This mapping can be used to update the coordinates in a PoseStack without rebuilding it (as long as the chemical identity is meant to be unchanged), or to remap derivatives to or from pose-stack ordering. If requested, the atom mapping will be the last two values returned by this function, as two tensors:

            ps, t1, t2 = pose_stack_from_canonical_form(
                ...,
                return_atom_mapping=True
            )
            can_ord_coords[
                t1[:, 0], t1[:, 1], t1[:, 2]
            ] = ps.coords[
                t2[:, 0], t2[:, 1]
            ]
    

    where t1 is a tensor of size [nats x 3] (nats being the number of mapped atoms) where

    • position [i, 0] is the pose index
    • position [i, 1] is the residue index, and
    • position [i, 2] is the canonical-ordering atom index

    and t2 is a tensor of size [nats x 2] where

    • position [i, 0] is the pose index, and
    • position [i, 1] is the pose-ordered atom index
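
The role of t1 and t2 can be illustrated with a small plain-Python sketch in which nested lists stand in for the tensors. The helper `build_atom_mapping` is hypothetical, and its packing rule (keep every non-NaN atom of every non-gap residue, in canonical order) is a toy assumption that only shows how the two index lists pair up; tmol's actual pose-ordered atom indices are determined by its block types.

```python
import math


def build_atom_mapping(res_types, coords):
    """Toy packing: returns (t1, t2, packed).

    t1 rows are (pose, residue, canonical atom index);
    t2 rows are (pose, pose-ordered atom index);
    packed[pose] is the flat list of kept atom coordinates.
    """
    t1, t2, packed = [], [], []
    for p, (types_p, coords_p) in enumerate(zip(res_types, coords)):
        pose_atoms = []
        for r, (rtype, atoms) in enumerate(zip(types_p, coords_p)):
            if rtype == -1:
                continue  # gap residues are excised
            for a, xyz in enumerate(atoms):
                if any(math.isnan(v) for v in xyz):
                    continue  # atom was not provided
                t1.append((p, r, a))
                t2.append((p, len(pose_atoms)))
                pose_atoms.append(list(xyz))
        packed.append(pose_atoms)
    return t1, t2, packed


nan3 = [math.nan] * 3
res_types = [[0, -1, 5]]              # one pose; the middle residue is a gap
can_ord_coords = [[
    [[1.0, 0.0, 0.0], nan3],
    [nan3, nan3],
    [[2.0, 0.0, 0.0], [3.0, 0.0, 0.0]],
]]
t1, t2, ps_coords = build_atom_mapping(res_types, can_ord_coords)

# Pretend the pose-ordered coordinates were updated (e.g. by minimization).
ps_coords[0][1] = [2.5, 0.0, 0.0]

# Loop equivalent of the tensor assignment shown above:
# can_ord_coords[t1[:, 0], t1[:, 1], t1[:, 2]] = ps.coords[t2[:, 0], t2[:, 1]]
for (p, r, a), (p2, k) in zip(t1, t2):
    can_ord_coords[p][r][a] = list(ps_coords[p2][k])
```

Row i of t1 and row i of t2 always refer to the same atom, which is why the copy can run in either direction: swapping the two sides of the assignment writes canonical-form coordinates into pose-stack order instead.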