
Token-Mol

License: MIT

This repository is for Token-Mol 1.0: tokenized drug design with a large language model.

Preprint

Environment

The code for pre-training and fine-tuning has been tested on Windows and Linux systems.
The code for reinforcement learning with docking runs stably only on Linux.

Python Dependencies

Python >= 3.8
PyTorch >= 1.13.1
RDKit >= 2022.09.1
transformers >= 4.24.0
networkx >= 2.8.4
pandas >= 1.5.3
scipy == 1.10.0
easydict (any version)
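For reference, a pip-based setup might look like the following (a sketch only; versions follow the list above, and the PyTorch build should match your CUDA version, e.g. CUDA 11.6 as noted below):

pip install "torch>=1.13.1" "rdkit>=2022.09.1" "transformers>=4.24.0" "networkx>=2.8.4" "pandas>=1.5.3" "scipy==1.10.0" easydict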

Software Dependencies

CUDA 11.6

For docking/reinforcement learning:

Software Source
QuickVina2 Download
ADFRsuite Download
Open Babel Download

Warning: When using ADFRsuite, it must be added to the system path, which causes the python symlink to point to the Python 2.7 executable in the ADFRsuite bin directory. We therefore recommend using the python3 command instead of python.

Data

You can download the datasets directly or request access via the following table:

Task Source or access
Pre-training GEOM
Conformation generation Tora3D
Property prediction MoleculeNet & ADME
Pocket-based generation ResGen

Pre-training

We pre-trained the model with a GPT-2 architecture using HuggingFace🤗 Transformers.
The pre-trained model can be downloaded directly from Zenodo.

Warning: The model should be further fine-tuned before being used for any downstream tasks.

Fine-tuning

Before running fine-tuning, move the weight and configuration files of the pre-trained model into the Pretrain_model folder.
The validation set is preset in the data folder; the processed training set can be downloaded here.

# Paths of the training set and validation set are preset in the code
python pocket_fine_tuning_rmse.py --batch_size 32 --epochs 40 --lr 5e-3 --every_step_save_path Trained_model/pocket_generation

The fine-tuned model will be saved as Trained_model/pocket_generation.pt.

Generation

You can directly download the fine-tuned checkpoint pocket_generation.pt here.

The encoded single pocket example/ARA2A.pkl or the multi-pocket file data/test_protein_represent.pkl can be used as input for generation.

Total number of generated molecules per pocket = batch size * epochs (e.g. --batch 25 --epoch 4 yields 100 molecules per pocket).

# single pocket
python gen.py --model_path ./Trained_model/pocket_generation.pt --protein_path ./example/ARA2A.pkl --output_path ARA2A.csv --batch 25 --epoch 4
# multiple pockets
python gen.py --model_path ./Trained_model/pocket_generation.pt --protein_path ./data/test_protein_represent.pkl --output_path test.csv --batch 25 --epoch 4

Pockets can be encoded with GVP. The original pockets in PDB format are provided in data/test_set.zip and example/3rfm_a2a-pocket10.pdb.
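If you want to verify an encoded pocket file before generation, it can be inspected as a standard Python pickle (a minimal sketch; the exact structure of the stored representation depends on the GVP encoding and must be readable in your environment):

import pickle

# Sanity check: load an encoded pocket and inspect its type (structure is encoder-specific)
with open('example/ARA2A.pkl', 'rb') as f:
    pocket = pickle.load(f)
print(type(pocket))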

Post-processing

The generated molecules should be post-processed: the output sequences are converted to RDKit Mol objects, which can then be used for subsequent applications.

The output *.csv file is used as input to confs_to_mols_pocket.py (or confs_to_mols_pocket_multi.py for the preset test set).

We recommend manually adjusting the following two lines in the script; the reshape dimension should equal batch size * epochs (e.g. 25 * 4 = 100 for the generation commands above).

# Customize file names
csv_file = 'output.csv'
# Modify reshape dimension to the number of generated molecules, default 100
generate_mol = pd.read_csv(csv_file, header=None).values.reshape(-1, 100).tolist()

Then run directly:

python confs_to_mols_pocket.py

Processed molecules will be saved to the results folder in pickle format.
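If you need the molecules in another format, the pickle can be loaded and re-exported with RDKit. A minimal sketch, assuming the pickle holds a list of RDKit Mol objects and using a hypothetical file name results/output.pkl (adjust to the file produced by your run):

import pickle
from rdkit import Chem

# Load the post-processed molecules (file name is illustrative; use your actual output)
with open('results/output.pkl', 'rb') as f:
    mols = pickle.load(f)

# Write valid molecules to an SDF file for downstream use
writer = Chem.SDWriter('generated_molecules.sdf')
for mol in mols:
    if mol is not None:
        writer.write(mol)
writer.close()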

Reinforcement learning

Reward score

You can customize the reward score in reinforce/reward_score.py; a detailed description is provided in the code.
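As an illustration of the kind of term you might plug in (a sketch only, not the script's actual interface), the function below scores drug-likeness with RDKit's QED; in practice it would be combined with the docking-based terms in reinforce/reward_score.py:

from rdkit import Chem
from rdkit.Chem import QED

def qed_reward(smiles: str) -> float:
    # Hypothetical reward term: QED in [0, 1], or 0.0 for invalid SMILES
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return 0.0
    return QED.qed(mol)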

Running

Before running reinforcement learning, the target pocket needs to be specified and encoded.

We strongly recommend running the code on a multi-GPU machine, as too small a batch size prevents efficient gradient updates.

python ./reinforce/reinforce_multi_gpu.py --restore-from ./Trained_model/pocket_generation.pt --protein-dir ./reinforce/usecase --agent ./reinforce.pt --world-size 4 --batch-size 8 --num-steps 1000 --max-length 85 
Args Description
--restore-from Model checkpoint
--protein-dir Path of target pocket
--agent Agent model save path
--world-size World size (DDP)
--batch-size Batch size on a single GPU
--num-steps Total running steps
--max-length Max length of sequence

In our tests, the code with the above arguments runs on a machine with at least 4 Quadro RTX 8000 GPUs (48 GB VRAM each).

You can also check the molecules generated at each step in every_steps_saved.pkl and the detailed reward terms in reward_terms*.csv.
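These outputs can be inspected with standard tools (a sketch only; the reward_terms file name below is illustrative, so pick the one written by your run):

import pickle
import pandas as pd

# Molecules saved at each RL step
with open('every_steps_saved.pkl', 'rb') as f:
    step_molecules = pickle.load(f)

# Reward terms logged per step (replace with your actual reward_terms*.csv file)
reward_terms = pd.read_csv('reward_terms_0.csv')
print(reward_terms.head())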
