title |
---|
Cheminformatics |
Tim Dudgeon, Simon Bray, Gianmauro Cuccuru, Björn Grüning, Rachael Skyner, Jack Scantlebury, Susan Leung, Frank von Delft
This repo serves as a companion to our recent docking simulations on the SARS-CoV-2 main protease.
It contains descriptions of workflows and exact versions of all software used. The goals of this study were to:
- Underscore the importance of access to raw data
- Demonstrate that existing community efforts in curation and deployment of computational chemistry software can reliably support rapid reproducible research during global crises
The Diamond Light Source's XChem team recently completed a successful fragment screen on the SARS-CoV-2 main protease (MPro), which provided 55 fragment hits (which can be viewed interactively here). In an effort to identify candidate molecules for binding, InformaticsMatters, the XChem group and the European Galaxy team have joined forces to construct and execute a Galaxy workflow for performing and evaluating molecular docking on a massive scale.
The diagram below describes the workflow used in this work. Further details of the steps can be found in the compound enumeration and Docking and scoring workflow sections.
An initial list of ~42,000 candidate molecules was assembled by using the Fragalysis fragment network to elaborate from the initial fragment hits. The fragment network takes a large set of compounds and splits them into parts: rings, linkers and substituents. These parts form the nodes in a graph network. The edges between these nodes describe how the parts can be linked together to make new molecules. With this information, we can modify an initial hit by searching the network for new parts to add to it, with each transformation described along an edge of the graph.
This was done using the InformaticsMatters Fragnet Search APIs, querying a database of ~64M molecules available from Enamine REAL, ChemSpace and MolPort, with query parameters of 2 edge traversals, a change in heavy atom count of 5 and a change in ring atom count of 2.
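The search described above can be illustrated with a small, self-contained sketch. It is not the Fragnet Search API: the node names, atom counts, edges and the `expand` function are all hypothetical, and the query parameters are treated as upper limits on the number of edge traversals and on the change in heavy atom count (`hac`) and ring atom count (`rac`) relative to the starting hit, which is an assumption.

```python
from collections import deque

# Toy fragment network (hypothetical data, not taken from the real database).
# Each node carries a heavy atom count (hac) and a ring atom count (rac).
NODES = {
    "A": {"hac": 10, "rac": 6},
    "B": {"hac": 12, "rac": 6},
    "C": {"hac": 14, "rac": 6},
    "D": {"hac": 17, "rac": 11},
    "E": {"hac": 16, "rac": 6},
}
# Undirected edges: neighbouring nodes differ by one added or removed part.
EDGES = {
    "A": ["B"],
    "B": ["A", "C", "D"],
    "C": ["B", "E"],
    "D": ["B"],
    "E": ["C"],
}


def expand(start, max_hops=2, max_hac_change=5, max_rac_change=2):
    """Breadth-first search from `start`, keeping nodes within the limits."""
    origin = NODES[start]
    seen = {start}
    queue = deque([(start, 0)])
    hits = []
    while queue:
        node, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not traverse beyond the allowed number of edges
        for neighbour in EDGES[node]:
            if neighbour in seen:
                continue
            seen.add(neighbour)
            queue.append((neighbour, hops + 1))
            props = NODES[neighbour]
            # Keep the neighbour only if its size stays close to the start.
            if (abs(props["hac"] - origin["hac"]) <= max_hac_change
                    and abs(props["rac"] - origin["rac"]) <= max_rac_change):
                hits.append(neighbour)
    return sorted(hits)


print(expand("A"))
```

In this toy network, node `D` is reachable within two edges but is rejected because its atom counts differ too much from the start, while `E` lies three edges away and is never visited.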
The enumerated compounds were used as inputs for the docking and scoring workflow. The workflow consists of the following steps, each of which was carried out using tools installed on the European Galaxy server:
- Charge enumeration of those 42,000 candidate molecules to generate ~159,000 docking candidates.
- Generation of 3D conformations based on SMILES strings of the candidate molecules.
- Preparation of active site for docking using rDock.
- Docking of molecules into each of the MPro binding sites using rDock, generating 25 docking poses for each molecule.
- Evaluation of the docking poses using a deep learning approach developed at the University of Oxford, employing augmentation of training data with incorrectly docked ligands to prompt the model to learn from protein-ligand interactions. The algorithm was deployed on the European Galaxy server inside a Docker container, thanks to work by InformaticsMatters and the European Galaxy team.
- Scoring of the top-scoring pose from each molecule against the original fragment screening hit ligands using the SuCOS MAX shape and feature overlay algorithm, again deployed on the European Galaxy server by InformaticsMatters and the European Galaxy team.
This workflow was repeated for each of the 17 fragment screening crystal structures that were available at the time: Mpro -x1249, -x0072, -x0104, -x0107, -x0161, -x0195, -x0305, -x0354, -x0387, -x0434, -x0540, -x0678, -x0874, -x0946, -x0995, -x1077 and -x1093 (more hits have been found since).
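The selection of a single pose per molecule before the final scoring step can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the molecule identifiers and the scores are hypothetical, and a higher model score is assumed to indicate a better pose.

```python
# Hypothetical pose records: (molecule id, pose index, model score).
poses = [
    ("mol-1", 1, 0.42),
    ("mol-1", 2, 0.91),
    ("mol-2", 1, 0.77),
    ("mol-2", 2, 0.35),
    ("mol-2", 3, 0.80),
]


def best_pose_per_molecule(records):
    """Return {molecule id: (pose index, score)} keeping the highest score.

    Assumes a higher score means a better pose.
    """
    best = {}
    for mol_id, pose_idx, score in records:
        if mol_id not in best or score > best[mol_id][1]:
            best[mol_id] = (pose_idx, score)
    return best


top = best_pose_per_molecule(poses)
print(top)
```

In the workflow itself each molecule has 25 poses rather than the two or three shown here; the grouping logic is the same.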
Of these steps, docking is the most compute-intensive. Here, the project benefited from the enormous distributed compute capacity which underlies the European Galaxy project. Over 5000 CPUs were made available, provided by Diamond’s STFC-IRIS cluster at Harwell, UK and the de.NBI cloud in Freiburg, Germany. With each docking job requiring 1 CPU, thousands of docking jobs could thus run in parallel, allowing millions of poses to be docked over a single weekend. The pose scoring step, while less computationally expensive, was accelerated thanks to GPUs provided by de.NBI and STFC. In total, the entire workflow described here took around 120,000 hours of CPU time (over 13 years) to complete.
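The figures quoted above can be put in perspective with a short back-of-the-envelope calculation. It uses only the rounded numbers given in the text and assumes, unrealistically, perfect parallelism with all 5000 CPUs busy the whole time, so the wall-clock value is a lower bound and not a measurement.

```python
CPU_HOURS = 120_000        # approximate total CPU time quoted in the text
CPUS = 5_000               # approximate number of CPUs made available
HOURS_PER_YEAR = 24 * 365  # ignoring leap years

# Total CPU time expressed in years on a single CPU.
cpu_years = CPU_HOURS / HOURS_PER_YEAR

# Idealised wall-clock time if every CPU were fully used throughout.
ideal_wall_hours = CPU_HOURS / CPUS

print(f"{cpu_years:.1f} CPU-years, at least {ideal_wall_hours:.0f} h wall-clock")
```

Under these assumptions the CPU time corresponds to roughly 13.7 single-CPU years, or about one day of wall-clock time on the full pool, which is consistent with the docking having been completed over a weekend.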
All data is publicly available via https://usegalaxy.eu, together with the workflows used for data generation, and we are working to provide more detailed documentation that will allow other users to perform similar studies, including on other systems. Histories for each fragment structure are provided here.
Having identified promising candidate ligands, we are now looking for funding to purchase compounds as a basis for further experimental study.
In addition, we will be looking at newly released data; see Updates: Analysis of additional data.
The experiments have been performed using the Galaxy platform and open source tools from BioConda and conda-forge. Tools were run using cloud resources provided by de.NBI and STFC.