This repository is for Token-Mol 1.0: tokenized drug design with a large language model.
The code for pre-training and fine-tuning has been tested on Windows and Linux systems.
The code for reinforcement learning with docking runs stably only on Linux.
- Python >= 3.8
- PyTorch >= 1.13.1
- RDKit >= 2022.09.1
- transformers >= 4.24.0
- networkx >= 2.8.4
- pandas >= 1.5.3
- scipy == 1.10.0
- easydict (any version)
- CUDA 11.6
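To sanity-check the environment before running anything, a minimal script like the following (ours, not part of the repository) prints the installed versions:

```python
# A minimal environment check (illustrative, not from the repository):
# confirms the dependencies above are importable and prints their versions.
import torch, rdkit, transformers, networkx, pandas, scipy

for name, module in [("PyTorch", torch), ("RDKit", rdkit),
                     ("transformers", transformers), ("networkx", networkx),
                     ("pandas", pandas), ("scipy", scipy)]:
    print(f"{name}: {module.__version__}")
print("CUDA available:", torch.cuda.is_available())
```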
For docking/reinforcement learning:
| Software | Source |
| --- | --- |
| QuickVina2 | Download |
| ADFRsuite | Download |
| Open Babel | Download |
Warning: ADFRsuite must be added to the system path, which causes the `python` symlink to point to the Python 2.7 executable in the ADFRsuite `bin` directory. We therefore recommend using the `python3` command instead of `python`.
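To confirm the toolchain is reachable after editing the path, a quick check such as the one below may help; the executable names (`qvina2`, `obabel`, `prepare_receptor`) are our assumptions and may differ on your installation:

```python
# Illustrative PATH check; executable names are assumptions and may differ.
import shutil

for tool in ["python3", "qvina2", "obabel", "prepare_receptor"]:
    print(f"{tool}: {shutil.which(tool)}")
```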
You can download the datasets directly, or request access, according to the following table:
| Task | Source or access |
| --- | --- |
| Pre-training | GEOM |
| Conformation generation | Tora3D |
| Property prediction | MoleculeNet & ADME |
| Pocket-based generation | ResGen |
We pre-trained the model with the GPT-2 architecture using HuggingFace 🤗 Transformers.
The pre-trained model can be downloaded directly from Zenodo.
Warning: The model should be further fine-tuned before being used for any downstream tasks.
Before running fine-tuning, move the weight and configuration files of the pre-trained model into the `Pretrain_model` folder.
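As a rough sanity check, if `Pretrain_model` follows the standard HuggingFace layout (`config.json` plus a weight file), it should load with the usual Transformers call. This is only a sketch under that assumption, not the loading logic of the repository's scripts:

```python
# A minimal sketch, assuming Pretrain_model follows the standard HuggingFace
# layout; the repository's fine-tuning scripts handle loading themselves.
from transformers import GPT2LMHeadModel

model = GPT2LMHeadModel.from_pretrained("Pretrain_model")
print(f"Loaded GPT-2-style model with {model.num_parameters():,} parameters")
```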
The validation set is preset in the `data` folder; the processed training set can be downloaded here.
# Paths of the training and validation sets are preset in the code
python pocket_fine_tuning_rmse.py --batch_size 32 --epochs 40 --lr 5e-3 --every_step_save_path Trained_model/pocket_generation
The fine-tuned model will be saved at `Trained_model/pocket_generation.pt`. Alternatively, you can download the fine-tuned checkpoint `pocket_generation.pt` directly here.
An encoded single pocket (`example/ARA2A.pkl`) or multiple pockets (`data/test_protein_represent.pkl`) can be used as input for generation.
The total number of generated molecules per pocket = batch size × epochs; for example, `--batch 25 --epoch 4` below yields 100 molecules per pocket.
# single pocket
python gen.py --model_path ./Trained_model/pocket_generation.pt --protein_path ./example/ARA2A.pkl --output_path ARA2A.csv --batch 25 --epoch 4
# multiple pockets
python gen.py --model_path ./Trained_model/pocket_generation.pt --protein_path ./data/test_protein_represent.pkl --output_path test.csv --batch 25 --epoch 4
The pocket can be encoded with GVP. The original pockets in PDB format are attached at `data/test_set.zip` and `example/3rfm_a2a-pocket10.pdb`.
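If you want to peek at an encoded pocket, a minimal sketch (assuming the file is a standard Python pickle; the exact object structure depends on the GVP encoder) is:

```python
# A minimal sketch for inspecting an encoded pocket file.
import pickle

with open("example/ARA2A.pkl", "rb") as f:
    pocket = pickle.load(f)
print(type(pocket))
```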
The generated molecules should be processed and converted from sequences into RDKit `Mol` objects before being used in subsequent applications.
The output file (`*.csv`) is used as input to `confs_to_mols_pocket.py` (or `confs_to_mols_pocket_multi.py` for the preset test set).
We recommend manually changing the following two lines in the code.
# Customize the file name
csv_file = 'output.csv'
# Modify the reshape dimension to match the number of molecules generated per pocket (default 100)
generate_mol = pd.read_csv(csv_file, header=None).values.reshape(-1, 100).tolist()
Then run it directly:
python confs_to_mols_pocket.py
Processed molecules will be saved in the `results` folder in `pickle` format.
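As an illustration of downstream use, assuming the results pickle holds a list of RDKit `Mol` objects as described above (file names here are ours), the molecules could be exported to an SDF file:

```python
# A minimal sketch; assumes the results pickle is a list of RDKit Mol objects
# (file names are illustrative).
import pickle
from rdkit import Chem

with open("results/output.pkl", "rb") as f:
    mols = pickle.load(f)

writer = Chem.SDWriter("generated.sdf")
written = 0
for mol in mols:
    if mol is not None:  # skip molecules that failed conversion
        writer.write(mol)
        written += 1
writer.close()
print(f"Wrote {written} of {len(mols)} molecules")
```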
You can customize the reward score in `reinforce/reward_score.py`; a detailed description is provided in the code.
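For illustration only, a custom reward along these lines could combine drug-likeness (QED) with a molecular-weight penalty; the actual function signature expected by the code is defined in `reinforce/reward_score.py`:

```python
# A hypothetical custom reward: maps an RDKit Mol to a float in [0, 1],
# combining QED with a soft penalty on molecular weight above 500 Da.
# The real interface is defined in reinforce/reward_score.py.
from rdkit import Chem
from rdkit.Chem import QED, Descriptors

def reward_score(mol: Chem.Mol) -> float:
    if mol is None:
        return 0.0
    qed = QED.qed(mol)
    mw_penalty = max(0.0, (Descriptors.MolWt(mol) - 500.0) / 500.0)
    return max(0.0, qed - 0.5 * mw_penalty)
```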
Before running reinforcement learning, the target pocket needs to be specified and encoded.
We strongly recommend running the code on a multi-GPU machine, as too small a batch size prevents efficient gradient updates.
python ./reinforce/reinforce_multi_gpu.py --restore-from ./Trained_model/pocket_generation.pt --protein-dir ./reinforce/usecase --agent ./reinforce.pt --world-size 4 --batch-size 8 --num-steps 1000 --max-length 85
| Args | Description |
| --- | --- |
| `--restore-from` | Model checkpoint |
| `--protein-dir` | Path of target pocket |
| `--agent` | Agent model save path |
| `--world-size` | World size (DDP) |
| `--batch-size` | Batch size on a single GPU |
| `--num-steps` | Total running steps |
| `--max-length` | Max length of sequence |
In our tests, the code with the above arguments runs on a machine with at least 4× Quadro RTX 8000 GPUs (48 GB VRAM each).
You can also check the molecules generated at each step in `every_steps_saved.pkl` and the detailed reward terms in `reward_terms*.csv`.
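A minimal sketch for inspecting these outputs (assuming `every_steps_saved.pkl` is a standard pickle of per-step molecule lists, which is our reading of the text above):

```python
# Illustrative inspection of the RL outputs; the pickle's internal structure
# is repository-specific, so treat this as a starting point.
import glob
import pickle
import pandas as pd

with open("every_steps_saved.pkl", "rb") as f:
    molecules_per_step = pickle.load(f)
print(f"Loaded records for {len(molecules_per_step)} steps")

for path in sorted(glob.glob("reward_terms*.csv")):
    print(path)
    print(pd.read_csv(path).head())
```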