Home
tmol, short for TensorMol, is a faithful reimplementation of the Rosetta molecular modeling energy function ("beta_nov2016_cart") in PyTorch with custom kernels written in C++ and CUDA. Given the coordinates of one or more proteins, tmol can compute both energies and derivatives. tmol can also perform gradient-based minimization on those structures. Thus, ML models that produce Cartesian coordinates for proteins can include biophysical features in their loss during training or refine their output structures using Rosetta's experimentally validated energy function.
tmol, like Rosetta before it, is residue-centric*. Molecules are represented as collections of residues. Each residue is one of several "residue types." Residue types comprise a set of atoms, the properties of those atoms, the chemical bonds between those atoms, and the inter-residue chemical bonds that will join residues together. The molecular system records the residue type for each residue, the coordinates of each atom, and how each residue is connected (or not connected) to other residues.
Scoring is computed on a single-residue and residue-pair basis. Scores can be reported either for entire structures or broken down on a per-residue-pair basis.
*tmol generalizes "residue," which denotes certain molecules (e.g. glucose), into "blocks," which may be whole residues or may be smaller parts of those residues. This future-proofing will allow us to eventually optimize separately the several, essentially independent hydroxyl groups on a single glucose residue. For now, "block" and "residue" are interchangeable.
There are three main tasks that most tmol users need to perform:
- the construction of PoseStack objects for representing molecular systems,
- the scoring of a PoseStack, and
- the minimization of a PoseStack
A PoseStack represents a set (a batch) of molecular systems. Each "Pose" in the stack can hold as many residues as desired (with the caveat that tmol will be more efficient in computing scores for systems that are approximately the same size than for systems that have very different sizes), and these residues can be bonded together into one or more chains. Users will create PoseStack objects in order to then compute energies.
A PoseStack holds an object that stores the residue-type data for its residues. Most users will not interact directly with this object, but it is good to be aware that it is there. Its class, PackedBlockTypes, is built from a collection of RefinedResidueType objects that are either constructed from tmol's default set of residue types, described in a .yaml file, or created programmatically by enterprising developers. Most users will be content to use the PackedBlockTypes object returned by tmol.default_packed_block_types() if they want to think about this class at all. The PackedBlockTypes object is used to cache energy-term-specific data that is needed during energy evaluation, and the creation of this data can be somewhat slow; thus, it is most efficient to share a single PackedBlockTypes object between multiple PoseStacks.
The coordinates of a PoseStack can be modified after construction; however, all other data members must be left unaltered. If you want to modify the residue type information for an existing PoseStack, you should construct a new PoseStack object instead.
The main function for creating a PoseStack is tmol.pose_stack_from_canonical_form, but two convenience functions, tmol.pose_stack_from_openfold and tmol.pose_stack_from_rosettafold2, exist to convert from two popular representations of proteins. Both of these convenience functions first create a "canonical form" for a structure and then pass that canonical form into tmol.pose_stack_from_canonical_form. More on the "canonical form" later.
Given a model that returns an OpenFold-style dictionary, a PoseStack can be constructed using
output = openfold_model.infer(sequences)
pose_stack = tmol.pose_stack_from_openfold(output)
Given a model that returns a RosettaFold2-style dictionary, a PoseStack can be constructed using
seq, xyz, chainlens = rosettafold2_model.infer(sequence)
pose_stack = tmol.pose_stack_from_rosettafold2(seq[0], xyz[0], chainlens[0])
tmol, and Rosetta before it, creates an all-atom representation of a molecular system. For this reason, there are several differences between tmol's representation of several residue types and the representations for most of the popular ML models (such as AlphaFold/RosettaFold/OpenFold/ESMFold). In particular, tmol models hydrogens explicitly and is thus aware of chemical differences that other representations gloss over. For example, cysteine can either form or not form a disulfide bond with another cysteine. In other ML models, there is no representational difference between cysteine in these two states; in tmol, the disulfide-bonded cysteine has an extra inter-residue connection between the SG sulfur and the other cysteine, and the non-disulfide-bonded cysteine has a sulfhydryl hydrogen in its place. Thus tmol represents these two states with two different residue types. In another case, tmol differentiates between the two tautomers of histidine; one tautomer protonates the NE2 nitrogen in the imidazole ring, the other tautomer ("HIS_D") protonates the ND1 nitrogen.
tmol will build hydrogens for you if you do not have their coordinates. This calculation is deterministic. For aliphatic hydrogens and most polar hydrogens, it is trivially deterministic; for some polar hydrogens, however, there are degrees of freedom that go beyond heavy-atom coordinates. tmol places hydroxyl, phenolic, and sulfhydryl hydrogens with a dihedral angle of 180 degrees (almost certainly not the optimal location for these atoms), and it will always choose the NE2-protonated histidine tautomer. The hydrogen-placement step is differentiable, so if you include the tmol energy in your ML model's loss and you only provide heavy-atom coordinates, then the energetic contribution of the automatically-built hydrogen atoms will feed into the positional derivatives of the heavy atoms that define their geometries.
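To make the 180-degree placement concrete, here is a minimal pure-Python sketch of building one atom from a bond length, a bond angle, and a dihedral relative to three existing atoms. It is not tmol's implementation: the function names, the serine-like coordinates, the 0.96 Å bond length, the 109.5-degree angle, and the choice of CA-CB-OG as the reference atoms are all illustrative assumptions.

```python
import math

def _sub(u, v):
    return [u[i] - v[i] for i in range(3)]

def _dot(u, v):
    return sum(u[i] * v[i] for i in range(3))

def _cross(u, v):
    return [u[1] * v[2] - u[2] * v[1],
            u[2] * v[0] - u[0] * v[2],
            u[0] * v[1] - u[1] * v[0]]

def _unit(u):
    norm = math.sqrt(_dot(u, u))
    return [x / norm for x in u]

def place_atom(a, b, c, bond_length, bond_angle_deg, dihedral_deg):
    """Return d with |c-d| = bond_length, angle(b, c, d) = bond_angle_deg,
    and dihedral(a, b, c, d) = dihedral_deg."""
    theta = math.radians(bond_angle_deg)
    phi = math.radians(dihedral_deg)
    bc = _unit(_sub(c, b))
    n = _unit(_cross(_sub(b, a), bc))  # normal to the a-b-c plane
    m = _cross(n, bc)                  # perpendicular to bc, on the same side as a
    x = -bond_length * math.cos(theta)
    y = bond_length * math.sin(theta) * math.cos(phi)
    z = bond_length * math.sin(theta) * math.sin(phi)
    return [c[i] + x * bc[i] + y * m[i] + z * n[i] for i in range(3)]

def measure_dihedral_deg(a, b, c, d):
    """Signed dihedral a-b-c-d in degrees."""
    b1 = _unit(_sub(c, b))
    b0 = _sub(a, b)
    b2 = _sub(d, c)
    v = [b0[i] - _dot(b0, b1) * b1[i] for i in range(3)]
    w = [b2[i] - _dot(b2, b1) * b1[i] for i in range(3)]
    return math.degrees(math.atan2(_dot(_cross(b1, v), w), _dot(v, w)))

# Illustrative heavy-atom coordinates (not taken from a real structure).
ca = [0.0, 0.0, 0.0]
cb = [1.53, 0.0, 0.0]
og = [2.0, 1.35, 0.0]

# Hydroxyl hydrogen placed trans (180 degrees) to CA about the CB-OG bond.
hg = place_atom(ca, cb, og, bond_length=0.96, bond_angle_deg=109.5, dihedral_deg=180.0)
```

Every step is a smooth function of the three heavy-atom positions, which is why gradients on the built hydrogen can flow back to the atoms that define it.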
tmol includes explicit representations of termini atoms, using different residue types for amino acids (and eventually other polymer subunits) that are in the middle of a peptide chain than at the ends. Also, tmol will define chemical bonds between sequential residues that are part of the same chain. In several modeling tasks, the first and last residues of a chain may be absent or sequential residues will not actually have chemical bonds between them. For example, in modeling a protein/protein interface, it might be most computationally efficient to only represent a small number of residues on either side of the interface. The first residue might not be the N terminus, and residues i and i+1 might be separated by many cropped-out residues; in such cases, adding the formal positive charge on the first residue's amino group might create unrealistic electrostatic interactions, and declaring a chemical bond between i and i+1 might put large forces on these residues to try to correct the "bad" covalent geometry. tmol's PoseStack construction process allows control over which residues are treated like regular polymeric positions and which are "exceptions to the rule" through the variable res_not_connected. More on this variable below.
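As a rough illustration of the bookkeeping involved, the sketch below derives per-residue "not connected to previous / next" flags for a cropped chain from the original sequence numbers of the residues that were kept. The helper name and its arguments are hypothetical and are not part of tmol. In practice the resulting flags would be packed into the boolean res_not_connected tensor of size [n_poses x max_n_residues x 2] described later on this page.

```python
def res_not_connected_flags(kept_resnums, chain_first, chain_last):
    """For one cropped chain, return [no_prev, no_next] for each kept residue.

    kept_resnums: sorted original sequence numbers of the residues kept.
    chain_first / chain_last: sequence numbers of the chain's true termini.
    """
    flags = []
    last = len(kept_resnums) - 1
    for i, num in enumerate(kept_resnums):
        if i == 0:
            # Do not build an N-terminal variant unless this is the real N terminus.
            no_prev = num != chain_first
        else:
            # A jump in numbering means residues were cropped out before this one.
            no_prev = kept_resnums[i - 1] != num - 1
        if i == last:
            # Do not build a C-terminal variant unless this is the real C terminus.
            no_next = num != chain_last
        else:
            no_next = kept_resnums[i + 1] != num + 1
        flags.append([no_prev, no_next])
    return flags

# A chain numbered 1..120, cropped to two interface segments plus the C terminus.
flags = res_not_connected_flags([45, 46, 47, 88, 89, 120], chain_first=1, chain_last=120)
# flags == [[True, False], [False, False], [False, True],
#           [True, False], [False, True], [True, False]]
```

Residue 45 is flagged at position 0 because it is not the true N terminus, while residue 120 keeps its C-terminal variant because it really is the end of the chain.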
Because tmol represents structures with a higher chemical granularity than most other ML packages, it has to resolve the chemical structure of the molecules coming in to it from other sources. Even the process of reading in a PDB file requires this chemical type resolution. The inputs to tmol's PoseStack construction function are
- a CanonicalOrdering object,
- a PackedBlockTypes object, and
- a set of three or more tensors typically bundled together in a dictionary referred to as the "canonical form"
The canonical form dictionary must contain:
- "chain_id": a tensor of torch.int32 of size [n_poses x max_n_residues], specifying the chain identifier for each residue in each pose,
- "res_types": a tensor of torch.int32 of size [n_poses x max_n_residues], specifying the integer representation of each residue's three-letter code in line with the ordering specified by the CanonicalOrdering object, where masked-out residues are indicated by a sentinel value of -1, and
- "coords": a tensor of torch.float32 of size [n_poses x max_n_residues x max_n_atoms_per_residue x 3], where the position in the third dimension is used to indicate which atom is being described, and where atoms that are not being given to tmol should have their coordinates given as numpy.NaN
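The layout of these three entries can be sketched without tmol or PyTorch. The helper below is hypothetical: it pads poses of different lengths into nested lists shaped like the tensors above, using -1 for masked-out residue types and NaN for atoms that are not provided. The residue-type integers and the atom count of 4 are placeholders (real values come from the CanonicalOrdering object), and filling chain_id with 0 for masked-out residues is an arbitrary choice that this page does not specify. The nested lists would then be converted to tensors of the listed sizes; coords must be floating point, since NaN has no integer representation.

```python
import math

def pad_canonical_form(poses, max_n_atoms):
    """Pad a list of poses into canonical-form-shaped nested lists.

    Each pose is a list of residues (chain, res_type, atoms), where atoms
    maps a canonical atom index to an (x, y, z) tuple.
    """
    max_n_res = max(len(pose) for pose in poses)
    chain_id, res_types, coords = [], [], []
    for pose in poses:
        pose_chain = [0] * max_n_res    # placeholder value for masked residues
        pose_types = [-1] * max_n_res   # -1 marks a masked-out residue
        pose_coords = [[[math.nan] * 3 for _ in range(max_n_atoms)]
                       for _ in range(max_n_res)]
        for r, (chain, res_type, atoms) in enumerate(pose):
            pose_chain[r] = chain
            pose_types[r] = res_type
            for atom_index, xyz in atoms.items():
                pose_coords[r][atom_index] = list(xyz)
        chain_id.append(pose_chain)
        res_types.append(pose_types)
        coords.append(pose_coords)
    return {"chain_id": chain_id, "res_types": res_types, "coords": coords}

# Two toy poses of different lengths; type indices 0, 5, 12 are made up.
pose_a = [
    (0, 0, {0: (0.0, 0.0, 0.0), 1: (1.46, 0.0, 0.0), 2: (2.0, 1.4, 0.0)}),
    (0, 5, {0: (3.3, 1.5, 0.0)}),
]
pose_b = [(0, 12, {0: (0.0, 0.0, 0.0)})]

canonical = pad_canonical_form([pose_a, pose_b], max_n_atoms=4)
# canonical["res_types"] == [[0, 5], [12, -1]]
```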
In addition, the canonical form may also contain:
- "disulfides": a tensor of torch.int64 of size [n_dslf x 3] which lists disulfides as tuples of (pose_index, res1_index, res2_index). In many modeling problems, the indices of disulfide-bonded residues are known up front and can be given to tmol to avoid the step of detecting disulfide bonds based on distance. There are two reasons to skip this step: 1) it is possible that a model might not place two disulfide-bonded residues close enough together for tmol to declare them to be disulfide-bonded, and thus tmol will be of no help in pushing these residues closer together, and 2) this step takes place on the CPU.
- "find_additional_disulfides": a boolean that controls whether or not the disulfide-detection step should be performed
- "res_not_connected": a tensor of torch.bool of size [n_poses x max_n_residues x 2]. This tensor is used to indicate that a given (polymeric) residue is not connected to its previous (position 0) or next (position 1) residue; for termini residues, a value of True will cause the residue to not be built with its down (position 0) or up (position 1) termini-variant types. The purpose is to allow the user to include a subset of the residues in a protein where a series of "gap" residues can be omitted between i and i+1 without those two residues being treated as if they are chemically bonded. This will keep the Ramachandran term from scoring nonsense dihedral angles and will keep the cart-bonded term from scoring nonsense bond lengths and angles.
- "return_chain_ind": a boolean that when True alters the return type of this function so that it will be a tuple with the first element being the PoseStack and the second element being a tensor of torch.int32 giving the chain_id of the re-indexed residues of the PoseStack. There are two things that should be noted.
  1. PoseStack does not keep track of a chain identifier; chain is essentially an emergent property of the chemical bonds. However, PoseStack can be used to represent disconnected segments of a single chain, in which case the chain identifier cannot be perfectly recovered from the set of chemical bonds. At the moment, if you wish to keep track of the chain identifier for a particular residue, that must be stored separately from the PoseStack.
  2. Keeping track of chain identifier is made more challenging by the fact that PoseStack construction will excise any residues with a residue type of -1 (i.e. gap residues), and all residues appearing after those gap residues are given "new indices" (that is, they will appear earlier in the list of non-gap residues). For convenience, this function returns the chain_id after the gap residues have been removed.
- "return_atom_mapping": a boolean that when True alters the return type of this function so that it will be a tuple with the first element being the PoseStack and the last two elements being tensors t1 and t2 that describe the mapping from atoms in the canonical-form tensor to their PoseStack index; this could be used to update the coordinates in a PoseStack without rebuilding it (as long as the chemical identity is meant to be unchanged) or perhaps to remap derivatives to or from pose stack ordering. If requested, the atom mapping will be the last two values returned by this function, as two tensors:
  ps, t1, t2 = pose_stack_from_canonical_form(..., return_atom_mapping=True)
  can_ord_coords[t1[:, 0], t1[:, 1], t1[:, 2]] = ps.coords[t2[:, 0], t2[:, 1]]
  where t1 is a tensor of size [nats x 3] where
  - position [i, 0] is the pose index,
  - position [i, 1] is the residue index, and
  - position [i, 2] is the canonical-ordering atom index
  and t2 is a tensor of size [nats x 2] where
  - position [i, 0] is the pose index, and
  - position [i, 1] is the pose-ordered atom index
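The re-indexing that happens when gap residues are excised can be mimicked in a few lines of plain Python. This is only a sketch of the behavior described for "return_chain_ind", not tmol's code: the function name is hypothetical, it handles a single pose, and it uses lists where tmol would use tensors.

```python
def excise_gaps(res_types, chain_id):
    """Drop residues whose type is -1 from one pose.

    Returns (new_index, kept_chain_id): new_index[i] is the position of
    original residue i among the non-gap residues (or -1 for a gap), and
    kept_chain_id lists the chain identifiers of the surviving residues.
    """
    new_index = []
    kept_chain_id = []
    for res_type, chain in zip(res_types, chain_id):
        if res_type == -1:
            new_index.append(-1)
        else:
            new_index.append(len(kept_chain_id))
            kept_chain_id.append(chain)
    return new_index, kept_chain_id

# Seven canonical-form slots, three of which are gaps; type integers are made up.
res_types = [5, -1, -1, 7, 2, -1, 9]
chain_id = [0, 0, 0, 0, 1, 1, 1]

new_index, kept_chain_id = excise_gaps(res_types, chain_id)
# new_index == [0, -1, -1, 1, 2, -1, 3]
# kept_chain_id == [0, 0, 1, 1]
```

The residue at original position 3 becomes residue 1, which is why a chain_id stored against the original numbering no longer lines up unless it is re-indexed in the same way.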
(Describe tmol's residue-type-resolution logic)
Note that tmol.pose_stack_from_rosettafold2 has to strip out the "H" atom from the N-terminal residue as that atom truly ought to be named "1H."
The steps of scoring a PoseStack are
- Creating a ScoreFunction
- Rendering a WholePoseScoringModule or a BlockPairScoringModule for the given PoseStack
- Using the rendered module's __call__ method
pose_stack = ...
sfxn = tmol.beta2016_score_function(pose_stack.device)
wpsm = sfxn.render_whole_pose_scoring_module(pose_stack)
per_pose_weighted_energy = wpsm(pose_stack.coords)
tmol depends on multiple external libraries, notably PyTorch and the pydata software stack, and utilizes conda to manage installation of these dependencies.
A development and test environment can be bootstrapped via the dev_setup script. This script requires a functional conda installation and, by default, initializes a conda environment named tmol. It is recommended that you use direnv to ensure that the tmol environment is activated.
Before diving in below, remember that this workflow is only mandatory for master. Development branches can, and should, be organized with your personal best practices.
The primary goal of our shared development workflow is to maintain a stable, high quality master branch. To aid in this potentially Sisyphean task, we utilize a set of simple, inviolable, core principles:
All changes to master should be performed via pull request flow, with a PR serving as a core point of development, discussion, testing and review. We close pull requests via squash or rebase, so that master contains a tidy, linear project history.
A pull request should land as an "atomic" unit of work, representing a single set of related changes. A larger feature may span multiple pull requests, however each pull request should stand alone. If a request appears to be growing "too large" to review, utilize a stacked pull to partition the work.
We maintain an automated test suite executed via buildkite. The test suite must always be passing for master, and is available for any open branch via pull request.
Review resources: