Andrew Leaver-Fay edited this page Mar 27, 2024 · 11 revisions

tmol

tmol, short for TensorMol, is a faithful reimplementation of the Rosetta molecular modeling energy function ("beta_nov2016_cart") in PyTorch with custom kernels written in C++ and CUDA. Given the coordinates of one or more proteins, tmol can compute both energies and derivatives. tmol can also perform gradient-based minimization on those structures. Thus, ML models that produce cartesian coordinates for proteins can include biophysical features in their loss during training or refine their output structures using Rosetta's experimentally validated energy function.

Overview

tmol, like Rosetta before it, is residue-centric*. Molecules are represented as collections of residues. Each residue is one of several "residue types." Residue types comprise a set of atoms, the properties of those atoms, the chemical bonds between those atoms, and the inter-residue chemical bonds that will join residues together. The molecular system records the residue type for each residue, the coordinates of each atom, and how each residue is connected (or not connected) to other residues.

Scoring is computed on a single-residue and residue-pair basis. Scores can be reported either for entire structures or broken down per residue pair.

*tmol generalizes "residue," which denotes certain molecules (e.g. glucose), into "blocks," which may be whole residues or may be smaller parts of those residues. This future-proofing will allow us to eventually optimize separately the several, essentially independent hydroxyl groups on a single glucose residue. For now, "block" and "residue" are interchangeable.

API

There are three main tasks that most tmol users need to perform:

  • the construction of PoseStack objects for representing molecular systems,
  • the scoring of a PoseStack, and
  • the minimization of a PoseStack

PoseStack

A PoseStack represents a set (a batch) of molecular systems. Each "Pose" in the stack can hold as many residues as desired (with the warning that tmol will be more efficient in computing scores for systems that are approximately the same size than for systems that have very different sizes) and these residues can be bonded together into one or more chains. Users will create PoseStack objects in order to then compute energies.

A PoseStack holds an object that stores the residue-type data for its residues. Most users will not interact directly with this object, but it is good to be aware that it is there. Its class, PackedBlockTypes, is built from a collection of RefinedResidueType objects that are constructed either from tmol's default set of residue types, described in a .yaml file, or that are created programmatically by enterprising developers. Most users will be content to interact with the PackedBlockTypes object returned by tmol.default_packed_block_types() if they want to think about this class at all. The PackedBlockTypes object will be used to cache energy-term-specific data that is needed during energy evaluation, and the creation of this data can be somewhat slow; thus, it is most efficient to share a single PackedBlockTypes object between multiple PoseStacks.

The coordinates of a PoseStack can be modified after construction; however, all other data members must be left unaltered. If you want to modify the residue type information for an existing PoseStack, you should construct a new PoseStack object instead.

Creating a PoseStack

The main function for creating a PoseStack is tmol.pose_stack_from_canonical_form, but two convenience functions, tmol.pose_stack_from_openfold and tmol.pose_stack_from_rosettafold2, exist to convert from two popular representations of proteins. Both of these convenience functions first create a "canonical form" for a structure and then pass that canonical form into tmol.pose_stack_from_canonical_form. More on the "canonical form" later.

Constructing a PoseStack from an OpenFold dictionary

Given a model that returns an OpenFold-style dictionary, a PoseStack can be constructed using

    output = openfold_model.infer(sequences)
    pose_stack = tmol.pose_stack_from_openfold(output)

Constructing a PoseStack from a RosettaFold2 dictionary

Given a model that returns a RosettaFold2-style dictionary, a PoseStack can be constructed using

    seq, xyz, chainlens = rosettafold2_model.infer(sequence)
    pose_stack = tmol.pose_stack_from_rosettafold2(seq[0], xyz[0], chainlens[0])

A Note on Hydrogen Atoms

tmol, and Rosetta before it, creates an all-atom representation of a molecular system. For this reason, there are several differences between tmol's representation of several residue types and the representations for most of the popular ML models (such as AlphaFold/RosettaFold/OpenFold/ESMFold). In particular, tmol models hydrogens explicitly and is thus aware of chemical differences that other representations gloss over. For example, cysteine can either form or not form a disulfide bond with another cysteine. In other ML models, there is no representational difference between cysteine in these two states; in tmol, the disulfide-bonded cysteine has an extra inter-residue connection between the SG sulfur and the other cysteine, and the non-disulfide-bonded cysteine has a sulfhydryl hydrogen in its place. Thus tmol represents these two states with two different residue types. In another case, tmol differentiates between the two tautomers of histidine; one tautomer protonates the NE2 nitrogen in the imidazole ring, the other tautomer ("HIS_D") protonates the ND1 nitrogen.

tmol will build hydrogens for you if you do not have their coordinates. This calculation is deterministic. For aliphatic hydrogens and most polar hydrogens, it is trivially deterministic; for some polar hydrogens, however, there are degrees of freedom that go beyond heavy-atom coordinates. tmol places hydroxyl, phenolic, and sulfhydryl hydrogens with a dihedral angle of 180 degrees -- almost certainly not the optimal location for these atoms -- and it will always choose the NE2-protonated histidine tautomer. The hydrogen-placement step is differentiable, so if you include the tmol energy in your ML model's loss and you only provide heavy-atom coordinates, then the energetic contribution of the automatically built hydrogen atoms will feed into the positional derivatives of the heavy atoms that define their geometries.
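The differentiability point can be illustrated with a small, self-contained sketch. This is not tmol's hydrogen builder: `place_atom` is a hypothetical helper that builds a fourth atom from three parent atoms, and the coordinates, bond length, bond angle, and the stand-in "energy" are illustrative assumptions. It only shows that when a hydrogen is placed as a differentiable function of heavy-atom coordinates, an energy evaluated on that hydrogen yields derivatives for the heavy atoms.

```python
import math
import torch


def place_atom(a, b, c, bond_len, bond_angle_deg, dihedral_deg):
    """Place atom d bonded to c (hypothetical helper, not part of tmol).

    d satisfies |d - c| = bond_len, angle(b, c, d) = bond_angle_deg,
    and dihedral(a, b, c, d) = dihedral_deg.
    """
    theta = math.radians(bond_angle_deg)
    phi = math.radians(dihedral_deg)
    bc = (c - b) / torch.linalg.norm(c - b)  # unit vector along the b->c bond
    n = torch.cross(b - a, bc, dim=-1)  # normal to the a-b-c plane
    n = n / torch.linalg.norm(n)
    m = torch.cross(n, bc, dim=-1)  # in-plane unit vector on the same side as a
    return (
        c
        - bond_len * math.cos(theta) * bc
        + bond_len * math.sin(theta) * math.cos(phi) * m
        + bond_len * math.sin(theta) * math.sin(phi) * n
    )


# Made-up heavy-atom coordinates for a serine-like CA, CB, OG triple.
heavy = torch.tensor(
    [[-0.5, 1.0, 0.0], [0.0, 0.0, 0.0], [1.4, 0.0, 0.0]], requires_grad=True
)
ca, cb, og = heavy[0], heavy[1], heavy[2]

# Build the hydroxyl hydrogen at a dihedral of 180 degrees (illustrative geometry).
hg = place_atom(ca, cb, og, bond_len=0.96, bond_angle_deg=109.5, dihedral_deg=180.0)

# Stand-in energy that depends only on the built hydrogen's position.
acceptor = torch.tensor([3.0, -1.0, 0.5])
energy = torch.sum((hg - acceptor) ** 2)
energy.backward()

# heavy.grad now holds derivatives of the energy with respect to CA, CB, and OG.
```

Only heavy-atom coordinates carry `requires_grad=True` here; the hydrogen position is never an independent variable, which mirrors the situation described above where only heavy atoms are provided.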

A Note on Cropping

tmol includes explicit representations of termini atoms, using different residue types for amino acids (and eventually other polymer subunits) that are in the middle of a peptide chain than for those at the ends. Also, tmol will define chemical bonds between sequential residues that are part of the same chain. In several modeling tasks, the first and last residues of a chain may be absent or sequential residues will not actually have chemical bonds between them. For example, in modeling a protein/protein interface, it might be most computationally efficient to only represent a small number of residues on either side of the interface. The first residue might not be the N terminus, and residues i and i+1 might be separated by many cropped-out residues; in such cases, adding the formal positive charge on the first residue's amino group might create unrealistic electrostatic interactions, and declaring a chemical bond between i and i+1 might put large forces on these residues to try to correct the "bad" covalent geometry. tmol's PoseStack construction process allows control over which residues are treated like regular polymeric positions and which are "exceptions to the rule" through the variable res_not_connected. More on this variable below.
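As a rough illustration of the bookkeeping involved, the sketch below derives per-residue "not connected" flags for a cropped, single-chain structure from the residues' original numbering. The helper `res_not_connected_flags` and its arguments are hypothetical and not part of tmol; it assumes that flag 0 refers to the previous residue and flag 1 to the next, matching the res_not_connected layout described later on this page.

```python
def res_not_connected_flags(kept_resnums, first_resnum, last_resnum):
    """Return [no_prev, no_next] flags for each kept residue of one chain.

    kept_resnums: original residue numbers of the residues kept after
        cropping, in chain order.
    first_resnum, last_resnum: original numbers of the chain's true
        first and last residues.
    """
    flags = []
    last_index = len(kept_resnums) - 1
    for i, num in enumerate(kept_resnums):
        if i == 0:
            # Only the true first residue should be treated as a terminus.
            no_prev = num != first_resnum
        else:
            # A jump in numbering means residues were cropped out before this one.
            no_prev = kept_resnums[i - 1] != num - 1
        if i == last_index:
            # Only the true last residue should be treated as a terminus.
            no_next = num != last_resnum
        else:
            no_next = kept_resnums[i + 1] != num + 1
        flags.append([no_prev, no_next])
    return flags


# Residues 1-3 and 40-42 kept from a chain originally numbered 1..100.
flags = res_not_connected_flags([1, 2, 3, 40, 41, 42], first_resnum=1, last_resnum=100)
```

In this example, residue 3 is flagged as not connected to its next residue, residue 40 as not connected to its previous residue, and residue 42 as not being a true terminus. The nested list could then be converted to a boolean tensor (for example with `torch.tensor(flags, dtype=torch.bool)`) and given a leading pose dimension.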

The Canonical Form and Class CanonicalOrdering

Because tmol represents structures with a higher chemical granularity than most other ML packages, it has to resolve the chemical structure of the molecules coming in to it from other sources. Even the process of reading in a PDB file requires this chemical type resolution. The input to tmol's PoseStack construction function is

  1. a CanonicalOrdering object,
  2. a PackedBlockTypes object, and
  3. a set of three or more tensors typically bundled together in a dictionary referred to as the "canonical form"

The canonical form dictionary must contain:

  1. "chain_id": a tensor of torch.int32 of size [n_poses x max_n_residues], specifying the chain identifier for each residue in each pose.
  2. "res_types": a tensor of torch.int32 of size [n_poses x max_n_residues], specifying the integer representation of each residue's three-letter code in line with the ordering specified by the CanonicalOrdering object, where masked-out residues are indicated by a sentinel value of -1, and
  3. "coords": a tensor of torch.float32 of size [n_poses x max_n_residues x max_n_atoms_per_residue x 3], where the position in the third dimension is used to indicate which atom is being described, and where atoms that are not being given to tmol should have their coordinates given as NaN (e.g. numpy.nan)
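The three required entries can be sketched as padded tensors, as below. This is a shape-and-dtype illustration only: the residue-type indices, the number of atoms per residue, and the atom slots that are filled in are placeholders, since the real values depend on the CanonicalOrdering object. Coordinates are stored as torch.float32 here because an integer tensor cannot hold NaN, and the chain_id value used for masked-out residues is a guess.

```python
import torch

# Hypothetical sizes; max_n_atoms depends on the CanonicalOrdering in use.
n_poses, max_n_res, max_n_atoms = 2, 5, 30
lengths = [5, 3]  # pose 0 has 5 residues, pose 1 has 3

# Every residue is placed in chain 0 (padding value for masked residues is a guess).
chain_id = torch.zeros((n_poses, max_n_res), dtype=torch.int32)

# Start with every residue masked out (-1) and every atom missing (NaN).
res_types = torch.full((n_poses, max_n_res), -1, dtype=torch.int32)
coords = torch.full(
    (n_poses, max_n_res, max_n_atoms, 3), float("nan"), dtype=torch.float32
)

for pose, n_res in enumerate(lengths):
    # Placeholder residue-type index; real indices come from the CanonicalOrdering.
    res_types[pose, :n_res] = 0
    # Placeholder coordinates for the first four atom slots of each real residue.
    coords[pose, :n_res, :4] = torch.arange(
        n_res * 4 * 3, dtype=torch.float32
    ).reshape(n_res, 4, 3)

canonical_form = {"chain_id": chain_id, "res_types": res_types, "coords": coords}
```

Initializing everything as "absent" and then filling in the real residues keeps the padding for shorter poses consistent: their trailing residues keep the -1 sentinel and NaN coordinates.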

In addition, the canonical form may also contain:

  1. "disulfides": a tensor of torch.int64 of size [n_dslf x 3] which lists disulfides as tuples of (pose_index, res1_index, res2_index). In many modeling problems, the indices of disulfide-bonded residues are known up front and can be given to tmol to avoid the step of detecting disulfide bonds based on distance. There are two reasons to skip this step: 1) it is possible that a model might not place two disulfide-bonded residues close enough together for tmol to declare them to be disulfide-bonded, and thus tmol will be of no help in pushing these residues closer together and 2) this step takes place on the CPU.

  2. "find_additional_disulfides": a boolean that controls whether or not the disulfide-detection step should be performed

  3. "res_not_connected": a tensor of torch.bool of size [n_poses x max_n_residues x 2]. This tensor is used to indicate that a given (polymeric) residue is not connected to its previous (position 0) or next (position 1) residue; for termini residues, a value of True will cause the residue to not be built with its down (position 0) or up (position 1) termini-variant types. The purpose is to allow the user to include a subset of the residues in a protein where a series of "gap" residues can be omitted between i and i+1 without those two residues being treated as if they are chemically bonded. This will keep the Ramachandran term from scoring nonsense dihedral angles and will keep the cart-bonded term from scoring nonsense bond lengths and angles.

  4. "return_chain_ind": a boolean that when True alters the return type of this function so that it will be a tuple with the first element being the PoseStack and the second element being a tensor of torch.int32 for the re-indexed residues of the PoseStack. There are two things that should be noted. 1. PoseStack does not keep track of a chain identifier; chain is essentially an emergent property of the chemical bonds. However, PoseStack can be used to represent disconnected segments of a single chain, in which case, it seems that chain identifier cannot be perfectly recovered from the set of chemical bonds. At the moment, if you wish to keep track of the chain identifier for a particular residue, that must be stored separately from the PoseStack. 2. Keeping track of chain identifier is made more challenging by the fact that PoseStack construction will excise out any residues with a residue type of -1 (i.e. gap residues), and all residues appearing after those gap residues are given "new indices" (that is, they will appear earlier in the list of non-gap residues). For convenience, this function returns the chain_id after the gap residues have been removed.

  5. "return_atom_mapping": a boolean that, when True, alters the return type of this function so that it will be a tuple with the first element being the PoseStack and the last two elements being tensors t1 and t2 that describe the mapping for atoms in the canonical-form tensor to their PoseStack index; this could be used to update the coordinates in a PoseStack without rebuilding it (as long as the chemical identity is meant to be unchanged) or perhaps to remap derivatives to or from pose stack ordering. If requested, the atom mapping will be the last two values returned by this function, as two tensors:

            ps, t1, t2 = pose_stack_from_canonical_form(
                ...,
                return_atom_mapping=True
            )
            can_ord_coords[
                t1[:, 0], t1[:, 1], t1[:, 2]
            ] = ps.coords[
                t2[:, 0], t2[:, 1]
            ]
    

    where t1 is a tensor of size [n_atoms x 3] where

    • position [i, 0] is the pose index
    • position [i, 1] is the residue index, and
    • position [i, 2] is the canonical-ordering atom index

    and t2 is a tensor of size [n_atoms x 2] where

    • position [i, 0] is the pose index, and
    • position [i, 1] is the pose-ordered atom index
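The indexing pattern for the atom mapping can be tried out on toy tensors. In the sketch below, `can_coords` and `pose_coords` are small stand-ins for the canonical-form coordinates and for the PoseStack coordinates, and the index values in `t1` and `t2` are invented for illustration; real mappings would come from pose_stack_from_canonical_form with return_atom_mapping=True.

```python
import torch

# Canonical-form coordinates: 1 pose, 2 residues, 3 atom slots per residue.
# Slots that were never provided stay NaN.
can_coords = torch.full((1, 2, 3, 3), float("nan"))
can_coords[0, 0, 0] = torch.tensor([1.0, 0.0, 0.0])
can_coords[0, 0, 2] = torch.tensor([2.0, 0.0, 0.0])
can_coords[0, 1, 1] = torch.tensor([3.0, 0.0, 0.0])

# Invented mapping: rows of t1 are (pose, residue, canonical atom index),
# rows of t2 are (pose, pose-ordered atom index) for the same atoms.
t1 = torch.tensor([[0, 0, 0], [0, 0, 2], [0, 1, 1]], dtype=torch.int64)
t2 = torch.tensor([[0, 0], [0, 1], [0, 2]], dtype=torch.int64)

# Canonical ordering -> pose ordering, e.g. to refresh coordinates in place.
pose_coords = torch.zeros((1, 3, 3))
pose_coords[t2[:, 0], t2[:, 1]] = can_coords[t1[:, 0], t1[:, 1], t1[:, 2]]

# Pose ordering -> canonical ordering, e.g. to send derivatives back.
roundtrip = torch.full_like(can_coords, float("nan"))
roundtrip[t1[:, 0], t1[:, 1], t1[:, 2]] = pose_coords[t2[:, 0], t2[:, 1]]
```

Because row i of t1 and row i of t2 refer to the same atom, the same pair of index expressions serves both directions; only the side of the assignment changes.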

A Note on Atom Names and Residue Type Resolution

(Describe tmol's residue-type-resolution logic)

Note that tmol.pose_stack_from_rosettafold2 has to strip out the "H" atom from the N-terminal residue as that atom truly ought to be named "1H."

Scoring a PoseStack

The steps of scoring a PoseStack are

  1. Creating a ScoreFunction
  2. Rendering a WholePoseScoringModule or a BlockPairScoringModule for the given PoseStack
  3. Using the rendered module's __call__ method

    pose_stack = ...
    sfxn = tmol.beta2016_score_function(pose_stack.device)
    wpsm = sfxn.render_whole_pose_scoring_module(pose_stack)
    per_pose_weighted_energy = wpsm(pose_stack.coords)
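Since the rendered module is called on a coordinate tensor, its output can in principle be combined with other loss terms and differentiated with respect to the coordinates. The sketch below shows that pattern without using tmol: `toy_scoring_module` is a hypothetical stand-in that returns one energy per pose, and the coordinates, reference structure, and `energy_weight` are made up. The output shape of the real scoring module is not assumed here.

```python
import torch


def toy_scoring_module(coords):
    """Hypothetical stand-in for a rendered scoring module (not tmol).

    Maps coords of shape [n_poses, n_atoms, 3] to one energy per pose by
    penalizing deviation of consecutive-atom distances from 1.5.
    """
    diffs = coords[:, 1:] - coords[:, :-1]
    dists = torch.linalg.norm(diffs, dim=-1)
    return ((dists - 1.5) ** 2).sum(dim=-1)


# Made-up "predicted" coordinates for two poses of three atoms each.
predicted = torch.tensor(
    [
        [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.5, 0.0, 0.0]],
        [[0.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 3.5, 0.0]],
    ],
    requires_grad=True,
)
reference = predicted.detach() + 0.1  # made-up reference structure

per_pose_energy = toy_scoring_module(predicted)
coord_loss = ((predicted - reference) ** 2).mean()

energy_weight = 0.5  # illustrative weighting between the two terms
loss = coord_loss + energy_weight * per_pose_energy.mean()
loss.backward()

# predicted.grad now combines the coordinate-loss and energy derivatives.
```

In a real training loop, `predicted` would be the output of the model rather than a leaf tensor, and the gradient would continue to flow back into the model's parameters.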

Performing Cartesian Minimization on a PoseStack

Development Guide

Environment

tmol depends on multiple external libraries, notably pytorch and the pydata software stack, and utilizes conda to manage installation of these dependencies.

A development and test environment can be bootstrapped via the dev_setup script. This script requires a functional conda installation and, by default, initializes a conda environment named tmol. It is recommended that you use direnv to ensure that the tmol environment is activated.

Development Workflow

Before diving in below, remember that this workflow is only mandatory for master. Development branches can, and should, be organized with your personal best practices.

The primary goal of our shared development workflow is to maintain a stable, high-quality master branch. To aid in this potentially Sisyphean task, we utilize a set of simple, inviolable core principles:

Pull-request Flow

All changes to master should be performed via pull request flow, with a PR serving as a core point of development, discussion, testing and review. We close pull requests via squash or rebase, so that master contains a tidy, linear project history.

A pull request should land as an "atomic" unit of work, representing a single set of related changes. A larger feature may span multiple pull requests; however, each pull request should stand alone. If a request appears to be growing "too large" to review, utilize a stacked pull request to partition the work.

Automated Testing

We maintain an automated test suite executed via buildkite. The test suite must always be passing for master, and is available for any open branch via pull request.

Code Review

Review resources:

Documentation