First, install the dependencies from requirements.txt, preferably in a virtual environment.
For instance, with conda run:

```bash
conda create -n chemlml python=3.9
conda activate chemlml
pip install -r requirements.txt
```
Then run `python train_ChEBI-20.py` to train on the ChEBI-20 dataset. Parameters include (an example invocation follows this list):
- accumulation_steps: Number of batches to accumulate before each gradient update.
- warm_up_steps and lr_factor: Parameters for the Noam optimizer schedule, which typically scales the learning rate as lr_factor * d_model^(-0.5) * min(step^(-0.5), step * warm_up_steps^(-1.5)).
- text_encoder: Text encoder for ChemLML: "ChemT5" for the encoder from Text+Chem T5, "SciBERT" for the SciBERT model, and "galactica-" + ["125m", "1.3b", "6.7b"] for the different scales of the Galactica model.
- molecule_decoder: Molecule decoder for ChemLML: "MolGen" for the default MolGen model and "MolGen-7B" for the MolGen-7B model.
- freeze_encoder: Whether to freeze the text encoder.
- use_wandb: Whether to log with wandb.
- wandb_key: Your wandb API key.
- skip_valid: Whether to skip the validation set.
- eval_epoch: Evaluate on the validation set every eval_epoch epochs.
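As a minimal sketch, assuming these parameters are exposed as command-line flags with the same names (an assumption; check train_ChEBI-20.py for the actual interface), a training run might look like:

```bash
# Hypothetical invocation; the flag names assume the parameters above
# map directly to command-line arguments.
python train_ChEBI-20.py \
    --text_encoder SciBERT \
    --molecule_decoder MolGen \
    --accumulation_steps 4 \
    --warm_up_steps 4000 \
    --lr_factor 1.0 \
    --freeze_encoder \
    --eval_epoch 5
```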
Run `python test.py` to test on different datasets. The only difference is the dataset_name parameter: select from "ChEBI-20", "PubChem_filtered", or "PubChem_unfiltered". PubChem_unfiltered must first be downloaded from Zenodo into the PubChem subdirectory.
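For example, assuming dataset_name is passed as a command-line flag (an assumption; check test.py for the actual interface):

```bash
# Hypothetical invocation; the flag name mirrors the parameter above.
python test.py --dataset_name PubChem_filtered
```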
To evaluate the results, run `python evaluation/fingerprint_metrics.py` and specify the test file.
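For instance, if the script takes the test output file as a positional argument (an assumption; the actual argument may differ, and the path below is a placeholder):

```bash
# Hypothetical invocation with a placeholder path to your test output.
python evaluation/fingerprint_metrics.py results/ChEBI-20_test_outputs.txt
```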
The pretrained ChemLML models are available on Zenodo.
Chemical Language Model Linker: blending text and molecules with modular adapters
Yifan Deng, Spencer S. Ericksen, Anthony Gitter (2024)
arXiv:2410.20182 [cs.LG]
See the ChEBI-20 and PubChem dataset subdirectories for details and licenses.
- The ChEBI-20 dataset is from MolT5.
- The PubChem dataset is from Mol-Instructions and PubChem. The unfiltered version is hosted on Zenodo.
The evaluation code fingerprint_metrics.py is from the MolT5 repository and is available under the BSD 3-Clause License, Copyright (c) 2023, blender-nlp.
ChemLML uses the following models from Hugging Face:
- MolT5: https://huggingface.co/laituan245/molt5-base-smiles2caption
- Text+Chem T5: https://huggingface.co/GT4SD/multitask-text-and-chemistry-t5-base-standard
- MolGen: https://huggingface.co/zjunlp/MolGen-large
- MolGen-7B: https://huggingface.co/zjunlp/MolGen-7b
- Fine-tuned LLaMA2-7B: https://huggingface.co/zjunlp/llama2-molinst-molecule-7b. To use LLaMA, request access at https://huggingface.co/meta-llama/Llama-2-7b-hf, then paste your Hugging Face token into line 37 of model.py.
- SciBERT: https://huggingface.co/allenai/scibert_scivocab_uncased
- Galactica series: https://huggingface.co/facebook/galactica-125m
- MolXPT: https://huggingface.co/zequnl/molxpt
See the Hugging Face model cards for licenses, limitations, and citations.