# LLM4SYN

LLM4SYN is a framework that leverages large language models (LLMs) to predict chemical synthesis pathways for inorganic materials. It includes three core models:
- LHS2RHS: Predicts products from reactants.
- RHS2LHS: Predicts reactants from products.
- TGT2CEQ: Generates the full chemical equation given a target compound.
These models are fine-tuned using a text-mined synthesis database, improving prediction accuracy from under 40% to around 90%. The framework also introduces a generalized Tanimoto similarity (GTS) for accurate evaluation of chemical equations.
## Table of Contents

- [Folder Structure](#folder-structure)
- [Download Dataset](#download-dataset)
- [Set Up Hugging Face and WandB Accounts](#set-up-hugging-face-and-wandb-accounts)
- [Environment Setup](#environment-setup)
- [Training the Model](#training-the-model)
- [Testing the Model](#testing-the-model)
- [Available Models](#available-models)
- [Evaluate Generated Equations](#evaluate-generated-equations)
- [References](#references)
- [Citation](#citation)
## Folder Structure

Create the following folders for data and models:

```
./data/
./models/
```
## Download Dataset

Download the text-mined synthesis dataset [1] and place the `solid-state_dataset_2019-06-27_upd.json` file in `./data/`.
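Once the file is in place, you can sanity-check it with a few lines of Python (a minimal sketch; the top-level schema of the JSON is not documented here, so inspect the keys before relying on any field names):

```python
import json

# Load the text-mined synthesis dataset (same path as data_path in env_config.py).
with open("./data/solid-state_dataset_2019-06-27_upd.json") as f:
    data = json.load(f)

# Print the top-level structure before assuming any particular schema.
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))
elif isinstance(data, list):
    print(len(data), "records; first record keys:", list(data[0].keys()))
```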
## Set Up Hugging Face and WandB Accounts

### Hugging Face

- Sign up at Hugging Face.
- Generate API keys for "Writing" and "Reading".
- Duplicate `env_config_template.py` and rename it to `env_config.py`.
- Paste your Hugging Face API keys (`hf_api_key_r`, `hf_api_key_w`) and username (`hf_usn`) in `env_config.py` (see the sketch after this list).
- Set the path to the `.json` data file as `data_path` in `env_config.py`.
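A filled-in `env_config.py` might look like the following (a sketch with placeholder values; the variable names come from the steps above, but `env_config_template.py` is the authoritative layout):

```python
# env_config.py -- placeholder values shown; never commit real API keys.
hf_api_key_r = "hf_xxxxxxxxxxxxxxxx"  # Hugging Face "Reading" token
hf_api_key_w = "hf_xxxxxxxxxxxxxxxx"  # Hugging Face "Writing" token
hf_usn = "your-hf-username"           # Hugging Face username
data_path = "./data/solid-state_dataset_2019-06-27_upd.json"  # dataset location
```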
### WandB

- Sign up at WandB.
- Log in to WandB from the terminal using `wandb login`.
## Environment Setup

Ensure that the following libraries are installed:

```
Python==3.10.0
torch==2.0.0+cu118
transformers==4.33.2
wandb==0.16.0
accelerate==0.3.0
huggingface_hub==0.16.4
datasets==2.14.5
bokeh==3.3.4
# Other basic libraries like numpy, matplotlib, etc.
```
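To confirm the pinned versions are the ones active in your environment, a quick check (a convenience sketch, not part of the repo):

```python
import sys
import torch, transformers, wandb, accelerate, huggingface_hub, datasets

# Compare the printed versions against the pins listed above.
print("python", sys.version.split()[0])
for mod in (torch, transformers, wandb, accelerate, huggingface_hub, datasets):
    print(mod.__name__, mod.__version__)
```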
## Training the Model

Choose a task from the following options: `['lhs2rhs', 'rhs2lhs', 'lhsope2rhs', 'rhsope2lhs', 'tgt2ceq', 'tgtope2ceq']`. If needed, set `ver_tag`. Then, run the training script:

```
python train_llm4syn.py
```
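Where `task` and `ver_tag` are set is repo-specific; the lines below only illustrate plausible values (an assumption, not the repo's actual code):

```python
# Illustrative settings only (names taken from the instructions above).
task = 'tgt2ceq'    # one of: lhs2rhs, rhs2lhs, lhsope2rhs, rhsope2lhs, tgt2ceq, tgtope2ceq
ver_tag = 'v1.2.1'  # optional version tag; appears in the saved model name
```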
## Testing the Model

To test a trained model, set `task`, `model_tag`, and `ver_tag` to match the trained model saved on Hugging Face. Then, run:

```
python test_llm4syn.py
```
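Judging from the model names in the table below, the saved checkpoints follow the pattern `{task}_{model_tag}_{ver_tag}` (an inference from the table, not a documented rule). For example, to target `RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1`:

```python
# Illustrative settings matching a published checkpoint.
task = 'tgt2ceq'
model_tag = 'dgpt2'   # distilgpt2 backbone
ver_tag = 'v1.2.1'
# => Hub model: RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1
```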
## Available Models

| Model Name | Pre-trained Model | Task Description | Link |
|---|---|---|---|
| RyotaroOKabe/lhs2rhs_dgpt2_v1.2.1 | distilgpt2 | Predict RHS given LHS | [Link](https://huggingface.co/RyotaroOKabe/lhs2rhs_dgpt2_v1.2.1) |
| RyotaroOKabe/rhs2lhs_dgpt2_v1.2.1 | distilgpt2 | Predict LHS given RHS | [Link](https://huggingface.co/RyotaroOKabe/rhs2lhs_dgpt2_v1.2.1) |
| RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1 | distilgpt2 | Predict CEQ given TGT | [Link](https://huggingface.co/RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1) |
| RyotaroOKabe/lhsope2rhs_dgpt2_v1.2.1 | distilgpt2 | Predict RHS given LHS with additional OPE | [Link](https://huggingface.co/RyotaroOKabe/lhsope2rhs_dgpt2_v1.2.1) |
| RyotaroOKabe/rhsope2lhs_dgpt2_v1.2.1 | distilgpt2 | Predict LHS given RHS with additional OPE | [Link](https://huggingface.co/RyotaroOKabe/rhsope2lhs_dgpt2_v1.2.1) |
| RyotaroOKabe/tgtope2ceq_dgpt2_v1.2.1 | distilgpt2 | Predict CEQ given TGT with additional OPE | [Link](https://huggingface.co/RyotaroOKabe/tgtope2ceq_dgpt2_v1.2.1) |

Here LHS/RHS are the reactant/product sides of the equation, CEQ is the full chemical equation, TGT is the target compound, and OPE denotes the synthesis operations included as additional input.
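Each checkpoint is a standard causal LM on the Hugging Face Hub, so it can also be loaded directly with `transformers`. Below is a minimal sketch; the bare target-compound prompt is an assumption, since the exact prompt format used in training is defined by the repo's scripts:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "LiFePO4"  # target compound; real prompts may use task-specific separators
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```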
## Evaluate Generated Equations

To evaluate the model-generated equations, open and run the notebook `eval_llm4syn.ipynb`.
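The evaluation uses the generalized Tanimoto similarity (GTS) mentioned above. As a rough illustration, here is a plain continuous Tanimoto similarity between two composition vectors; GTS generalizes this idea to full multi-compound equations, so refer to the paper for the exact definition:

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Continuous Tanimoto similarity between two composition vectors."""
    dot = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b) - dot)
    return dot / denom if denom > 0 else 1.0

# Example: compare Fe2O3 and Fe3O4 as (Fe, O) element-count vectors.
print(tanimoto(np.array([2.0, 3.0]), np.array([3.0, 4.0])))  # 0.9
```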
## References

[1]

```bibtex
@article{kononova2019text,
  title={Text-mined dataset of inorganic materials synthesis recipes},
  author={Kononova, Olga and Huo, Haoyan and He, Tanjin and Rong, Ziqin and Botari, Tiago and Sun, Wenhao and Tshitoyan, Vahe and Ceder, Gerbrand},
  journal={Scientific Data},
  volume={6},
  number={1},
  pages={203},
  year={2019},
  publisher={Nature Publishing Group UK London}
}
```
## Citation

If you find this code or dataset useful, please cite the following paper:

```bibtex
@article{okabe2024large,
  title={Large Language Model-Guided Prediction Toward Quantum Materials Synthesis},
  author={Okabe, Ryotaro and West, Zack and Chotrattanapituk, Abhijatmedhi and Cheng, Mouyang and Carrizales, Denisse C{\'o}rdova and Xie, Weiwei and Cava, Robert J and Li, Mingda},
  journal={arXiv preprint arXiv:2410.20976},
  year={2024}
}
```