Accurate and efficient structure elucidation from routine one-dimensional NMR spectra using multitask machine learning
Official implementation of:
Accurate and efficient structure elucidation from routine one-dimensional NMR spectra using multitask machine learning
Frank Hu, Michael S. Chen, Grant M. Rotskoff, Matthew W. Kanan, Thomas E. Markland
Abstract: Rapid determination of molecular structures can greatly accelerate workflows across many chemical disciplines. However, elucidating structure using only one-dimensional (1D) NMR spectra, the most readily accessible data, remains an extremely challenging problem because of the combinatorial explosion of the number of possible molecules as the number of constituent atoms is increased. Here, we introduce a multitask machine learning framework that predicts the molecular structure (formula and connectivity) of an unknown compound solely based on its 1D 1H and/or 13C NMR spectra. First, we show how a transformer architecture can be constructed to efficiently solve the task, traditionally performed by chemists, of assembling large numbers of molecular fragments into molecular structures. Integrating this capability with a convolutional neural network, we build an end-to-end model for predicting structure from spectra that is fast and accurate. We demonstrate the effectiveness of this framework on molecules with up to 19 heavy (non-hydrogen) atoms, a size for which there are trillions of possible structures. Without relying on any prior chemical knowledge such as the molecular formula, we show that our approach predicts the exact molecule 69.6% of the time within the first 15 predictions, reducing the search space by up to 11 orders of magnitude.
Paper available at: https://pubs.acs.org/doi/10.1021/acscentsci.4c01132
Data available on Zenodo at: https://zenodo.org/records/13892026
We recommend having Anaconda installed as your package manager and to run this code on a Linux-based GPU cluster. To install NMR2Struct, do the following:
git clone https://github.com/MarklandGroup/NMR2Struct.git
cd NMR2Struct
conda env create -f environment.yml
conda activate NMR_env
pip install -e .
Make sure to run conda update conda
before installing to ensure your anaconda distribution is up to date.
All of the functionality in NMR2Struct is controlled via a YAMl configuration file system. A YAML file consists of sections with fields that can have different user-specified values. While most fields in a YAML file are populated with strings, fields can also be set to booleans (True
and False
), integers, floats, lists of values, and None types with null
. The YAML files in NMR2Struct have the following sections:
- global_args: Controls general arguments such as the seed, number of GPUs, where to save everything, and tensor datatypes
- data: Determines how input and target data are processed for training and inference, and where to find data files
- model: Model architecture arguments that are used for instantiating the model
- training: Arguments for the optimization protocol, such as the data split, number of epochs, optimizer, etc.
- analysis: Arguments for computing performance metrics on the model
For the data
section specifically, the spectra_file
, label_file
, and smiles_file
fields contain lists of files and there should be the same number of elements in each list, even if some elements are set to null
. There is an element-wise correspondence of the elements in each list, so the first spectrum, label, and smiles file together forms one dataset, and so on. This is designed so that the training and losses can be computed separately for each dataset and is useful for more advanced early-stopping experiments. The files should have the following formats:
- spectra: These should be hdf5 files that contain one dataset called 'spectra' where each row is a one-dimensional vector representing the concatenated 1H NMR and 13C NMR spectrum.
- label: These should be hdf5 files that contain one dataset called 'substructure_labels' where each row is a one-dimensional binary vector representing the substructures detected in each molecule.
- smiles: This should be a numpy file that contains a single array with all SMILES strings encoded into binary format. Be sure to do
x.decode('utf-8')
when reading strings from this format.
This section lists how you should set the value for specific alphabet and token related fields within the YAML configuration files. These fields are automatically populated during training based on the dataset and saved as a completed training YAML config file that you can use. However, if you want to infer a model without training, it is necessary to know what the token value should be. This is broken down into two sections, one for the substrcture-to-structure transformer and one for the multitask model.
Field | Value |
---|---|
model.model_args.source_size |
number of substructures + 1 (e.g., 958 for the 957 substructures) |
model.model_args.target_size |
length of alphabet + 3 |
model.model_args.src_pad_token |
0 |
model.model_args.tgt_pad_token |
length of alphabet |
inference.run_inference_args.pred_gen_opts.tgt_start_token |
length of alphabet + 1 |
inference.run_inference_args.pred_gen_opts.tgt_stop_token |
length of alphabet + 2 |
Field | Value |
---|---|
model.model_args.structure_model_args.source_size |
28045 |
model.model_args.structure_model_args.target_size |
length of alphabet + 3 |
model.model_args.structure_model_args.src_pad_token |
null |
model.model_args.structure_model_args.tgt_pad_token |
length of alphabet |
model.model_args.substructure_model_args.source_size |
28045 |
model.model_args.substructure_model_args.src_pad_token |
null |
inference.run_inference_args.pred_gen_opts.tgt_start_token |
length of alphabet + 1 |
inference.run_inference_args.pred_gen_opts.tgt_stop_token |
length of alphabet + 2 |
To train a substructure-to-structure transformer, follow these steps:
- Modify the training configuration file under
example_configs/training_substruct_to_struct.yaml
to point to the correct substructure file atdata.label_file
and the correct smiles file under thedata.smiles_file
. Thedata.spectra_file
field can be set to a list ofnull
values because spectra are not used for training the substructure-to-structure model. - Modify the config file to point to the correct splitting file at
training.splits
. The splitting file should be a dictionary of three fields which specify the indices that should go into the training, validation, and testing set, respectively. If you do not want to provide a splitting dictionary, set this field tonull
in which case the splitting is done based on thetrain_size
,val_size
, andtest_size
fractions. - Copy the training config file into a directory where you want to run the training.
- Run the following command (assuming you renamed your YAML file to be
config.yaml
):
nmr_train config.yaml
A similar procedure is used for training a spectrum-to-structure multitask model, but make sure to:
- Use the
example_configs/training_spectrum_to_struct.yaml
file as the starting point and set thespectra_file
field to list the correct spectra files. - Set the
model.model_args.structure_model_ckpt
to the correct checkpoint if using a pretrained transformer, otherwise leavenull
. - Set the
data.alphabet
as eithernull
if not using a pretrained transformer or to the alphabet used by the pretrained transformer.
Warning
This inference workflow is intended for running inference on a model on a dataset that was used for training, i.e. getting predictions over a test set to then compute metrics on. If you want to infer a single example, please refer to the section on inference with specific examples.
To infer a substructure-to-structure transformer for SMILES generation, first make sure to set the alphabet correctly with:
inference:
...
run_inference_args:
...
pred_gen_opts:
...
alphabet: checkpoints/alphabet.npy
which is the alphabet generated from training.
Note
If you want to use the alphabet that was used in the original model from the paper, put the path to the file example_configs/alphabet.npy
instead.
Once the alphabet is set, just run:
nmr_infer config.yaml 0 1
where config.yaml
is your modified training script from the training section which should contain an inference section tailored for the substructure-to-structure transformer. The 0 and 1 refer to the local rank and total number of GPU processes and is used to control multi-GPU inference. For most cases, inference on one GPU is sufficient, so we use a local rank of 0 and 1 GPU process. Make sure to set the inference.splits
field correctly. The same procedure holds for inferring SMILES from a spectrum-to-structure multitask model. By default, the inference
sections of the provided YAML files are configured to have the models generate SMILES strings.
To infer substructures from the multitask model, modify the inference section as follows, being sure to fill in the splits file which denotes which parts of the dataset belong to the test set:
inference:
model_selection: lowest
splits:
- PATH TO SPLITS FILE OR NULL
train_size: 0.8
val_size: 0.1
test_size: 0.1
dloader_args:
shuffle: false
batch_size: 1024
sets_to_run: ['test']
run_inference_args:
pred_gen_fn: 'infer_multitask_model_substructures'
pred_gen_opts:
track_gradients: true
write_freq: 100
Running inference writes the raw predictions as an hdf5 file to the save directory that you specified under the global_args
section, which should also be where the model checkpoints and tensorboard files are saved.
To infer the structures and/or substructures from a specific set of spectra, we recommend using the single spectrum inference entry point. First, be sure to set the inference.run_inference_args.pred_gen_opts.tgt_start_token
and inference.run_inference_args.pred_gen_opts.tgt_stop_token
correctly according to the section on token conventions.
To use this entry point, do the following:
nmr_infer_single_spectrum \
--config YAML_CONFIG_FILE \
--hnmr_file HNMR_FILE \
--cnmr_file CNMR_FILE \
--hnmr_shifts HNMR_SHIFTS \
--cnmr_shifts CNMR_SHIFTS \
--ckpt CKPT \
--normalize
Note
This entry point only works for inferring structure/substructures from spectra using a multitask model.
The arguments are as follows:
--config
: A YAML configuration file that contains at least theglobal_args
,model
, andinference
sections. Theinference
section should be configured correctly for SMILES or substructure inference. You can copy the correct inference settings from the YAML configuration files inexample_configs
.--hnmr_file
: A text or tab separated csv file containing pairs of (ppm, intensity) values on each row separated by a space. If not using an 1H NMR spectrum, pass "None" to this argument.--cnmr_file
: A text file containing a comma separated list of the carbon ppm shifts in the NMR. If not using an 13C NMR spectrum, pass "None" to this argument.--hnmr_shifts
: A pickle file containing the shift grid as a 1D numpy array that is used to interpolate the 1H NMR spectrum. You can find the shift grid used in the paper inexample_configs/HNMR_shifts.p
.--cnmr_shifts
: A pickle file containing the shift grid for discretizing the 13C NMR spectrum. The grid used in the paper can be found inexample_configs/CNMR_shifts.p
.--ckpt
: The model checkpoint you want to use. Its architecture should match the one specified in--config
.---normalize
: Toggles normalization of the 1H NMR spectrum by dividing the spectrum by the its highest intensity.
At the end of the inference, the predictions will be saved to the global_args.savedir
as an hdf5 file under the test
group. You must pass in at least one spectrum, either an 1H or 13C NMR, to the respective argument.
Warning
Analysis is only useful when you are training a model on a dataset with known targets and you want to compare its predictions against the target values. If you are just doing a one-off inference using the model, analysis will not work.
If you want to calculate metrics for substructures, configure the analysis
section of your YAML file as follows:
analysis:
analysis_type: "substructure"
pattern: "predictions_dataset_0_[0-9]+.h5"
This will calculate metrics such as RMSEs, F1-scores, BCE values, etc. and saves the results as a dictionary at global_args.savedir
.
For SMILES analysis, configure the analysis
section as follows:
analysis:
analysis_type: "SMILES"
pattern: "predictions_dataset_0_[0-9]+.h5"
f_addn_args:
substructures: PATH TO SUBSTRUCTURE FILE
where the substructures is set to point to the example_configs/substructures_957.p
file which contains the SMART strings for the 957 substructures used. The results will be saved to
a hdf5 file called processed_predictions.h5
located at global_args.savedir
.
NMR2Struct uses Tensorboard to visualize the learning curves when training the models, and it is installed with the environment. To use Tensorboard, do the following:
cd /path/to/training/directory
tensorboard --logdir=./
This should automatically do port-forwarding to your browser where you will see the Tensorboard dashboard. Tensorboard searches for all event files in a parent directory, so you can visualize multiple runs at once if they are all contained within the same parent directory.