# LLM4SYN

LLM4SYN is a framework that leverages large language models (LLMs) to predict chemical synthesis pathways for inorganic materials. It includes three core models:
- LHS2RHS: Predicts products from reactants.
- RHS2LHS: Predicts reactants from products.
- TGT2CEQ: Generates the full chemical equation given a target compound.
These models are fine-tuned using a text-mined synthesis database, improving prediction accuracy from under 40% to around 90%. The framework also introduces a generalized Tanimoto similarity (GTS) for accurate evaluation of chemical equations.
## Table of Contents

- [Folder Structure](#folder-structure)
- [Download Dataset](#download-dataset)
- [Set Up Hugging Face and WandB Accounts](#set-up-hugging-face-and-wandb-accounts)
- [Environment Setup](#environment-setup)
- [Training the Model](#training-the-model)
- [Testing the Model](#testing-the-model)
- [Available Models](#available-models)
- [Evaluate Generated Equations](#evaluate-generated-equations)
- [References](#references)
- [Citation](#citation)
## Folder Structure

Create the following folders for data and models:

```
./data/
./models/
```
## Download Dataset

Download the text-mined synthesis dataset [1] and place the `solid-state_dataset_2019-06-27_upd.json` file in `./data/`.
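Once the file is in place, you can sanity-check it with a few lines of Python (a minimal sketch; the top-level schema of the JSON is not documented here, so inspect the keys before relying on any field names):

```python
import json

# Load the text-mined synthesis dataset (same path as data_path in env_config.py).
with open("./data/solid-state_dataset_2019-06-27_upd.json") as f:
    data = json.load(f)

# Print the top-level structure before assuming any particular schema.
print(type(data))
if isinstance(data, dict):
    print(list(data.keys()))
elif isinstance(data, list):
    print(len(data), "records; first record keys:", list(data[0].keys()))
```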
## Set Up Hugging Face and WandB Accounts

### Hugging Face

- Sign up at Hugging Face.
- Generate API keys for "Writing" and "Reading".
- Duplicate `env_config_template.py` and rename it to `env_config.py`.
- Paste your Hugging Face API keys (`hf_api_key_r`, `hf_api_key_w`) and username (`hf_usn`) in `env_config.py` (see the sketch after this list).
- Set the path to the `.json` data file as `data_path` in `env_config.py`.
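A filled-in `env_config.py` might look like the following (a sketch with placeholder values; the variable names come from the steps above, but `env_config_template.py` is the authoritative layout):

```python
# env_config.py -- placeholder values shown; never commit real API keys.
hf_api_key_r = "hf_xxxxxxxxxxxxxxxx"  # Hugging Face "Reading" token
hf_api_key_w = "hf_xxxxxxxxxxxxxxxx"  # Hugging Face "Writing" token
hf_usn = "your-hf-username"           # Hugging Face username
data_path = "./data/solid-state_dataset_2019-06-27_upd.json"  # dataset location
```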
### WandB

- Sign up at WandB.
- Log in to WandB from the terminal using `wandb login`.
## Environment Setup

Ensure that the following libraries are installed:

```
Python==3.10.0
torch==2.0.0+cu118
transformers==4.33.2
wandb==0.16.0
accelerate==0.3.0
huggingface_hub==0.16.4
datasets==2.14.5
bokeh==3.3.4
# Other basic libraries like numpy, matplotlib, etc.
```
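To confirm the pinned versions are the ones active in your environment, a quick check (a convenience sketch, not part of the repo):

```python
import sys
import torch, transformers, wandb, accelerate, huggingface_hub, datasets

# Compare the printed versions against the pins listed above.
print("python", sys.version.split()[0])
for mod in (torch, transformers, wandb, accelerate, huggingface_hub, datasets):
    print(mod.__name__, mod.__version__)
```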
## Training the Model

Choose a task from the following options: `['lhs2rhs', 'rhs2lhs', 'lhsope2rhs', 'rhsope2lhs', 'tgt2ceq', 'tgtope2ceq']`. If needed, set `ver_tag`. Then, run the training script:

```
python train_llm4syn.py
```
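Where `task` and `ver_tag` are set is repo-specific; the lines below only illustrate plausible values (an assumption, not the repo's actual code):

```python
# Illustrative settings only (names taken from the instructions above).
task = 'tgt2ceq'    # one of: lhs2rhs, rhs2lhs, lhsope2rhs, rhsope2lhs, tgt2ceq, tgtope2ceq
ver_tag = 'v1.2.1'  # optional version tag; appears in the saved model name
```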
## Testing the Model

To test a trained model, set `task`, `model_tag`, and `ver_tag` to match the trained model saved on Hugging Face. Then, run:

```
python test_llm4syn.py
```
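Judging from the model names in the table below, the saved checkpoints follow the pattern `{task}_{model_tag}_{ver_tag}` (an inference from the table, not a documented rule). For example, to target `RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1`:

```python
# Illustrative settings matching a published checkpoint.
task = 'tgt2ceq'
model_tag = 'dgpt2'   # distilgpt2 backbone
ver_tag = 'v1.2.1'
# => Hub model: RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1
```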
## Available Models

| Model Name | Pre-trained Model | Task Description | Link |
|---|---|---|---|
| RyotaroOKabe/lhs2rhs_dgpt2_v1.2.1 | distilgpt2 | Predict RHS given LHS | [Link](https://huggingface.co/RyotaroOKabe/lhs2rhs_dgpt2_v1.2.1) |
| RyotaroOKabe/rhs2lhs_dgpt2_v1.2.1 | distilgpt2 | Predict LHS given RHS | [Link](https://huggingface.co/RyotaroOKabe/rhs2lhs_dgpt2_v1.2.1) |
| RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1 | distilgpt2 | Predict CEQ given TGT | [Link](https://huggingface.co/RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1) |
| RyotaroOKabe/lhsope2rhs_dgpt2_v1.2.1 | distilgpt2 | Predict RHS given LHS with additional OPE | [Link](https://huggingface.co/RyotaroOKabe/lhsope2rhs_dgpt2_v1.2.1) |
| RyotaroOKabe/rhsope2lhs_dgpt2_v1.2.1 | distilgpt2 | Predict LHS given RHS with additional OPE | [Link](https://huggingface.co/RyotaroOKabe/rhsope2lhs_dgpt2_v1.2.1) |
| RyotaroOKabe/tgtope2ceq_dgpt2_v1.2.1 | distilgpt2 | Predict CEQ given TGT with additional OPE | [Link](https://huggingface.co/RyotaroOKabe/tgtope2ceq_dgpt2_v1.2.1) |

Here LHS/RHS are the reactant/product sides of the equation, CEQ is the full chemical equation, TGT is the target compound, and OPE denotes the synthesis operations included as additional input.
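Each checkpoint is a standard causal LM on the Hugging Face Hub, so it can also be loaded directly with `transformers`. Below is a minimal sketch; the bare target-compound prompt is an assumption, since the exact prompt format used in training is defined by the repo's scripts:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "LiFePO4"  # target compound; real prompts may use task-specific separators
inputs = tok(prompt, return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=64, do_sample=False,
                     pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))
```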
## Evaluate Generated Equations

To evaluate the model-generated equations, open and run the notebook `eval_llm4syn.ipynb`.
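The evaluation uses the generalized Tanimoto similarity (GTS) mentioned above. As a rough illustration, here is a plain continuous Tanimoto similarity between two composition vectors; GTS generalizes this idea to full multi-compound equations, so refer to the paper for the exact definition:

```python
import numpy as np

def tanimoto(a: np.ndarray, b: np.ndarray) -> float:
    """Continuous Tanimoto similarity between two composition vectors."""
    dot = float(np.dot(a, b))
    denom = float(np.dot(a, a) + np.dot(b, b) - dot)
    return dot / denom if denom > 0 else 1.0

# Example: compare Fe2O3 and Fe3O4 as (Fe, O) element-count vectors.
print(tanimoto(np.array([2.0, 3.0]), np.array([3.0, 4.0])))  # 0.9
```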
## References

[1]

```bibtex
@article{kononova2019text,
  title={Text-mined dataset of inorganic materials synthesis recipes},
  author={Kononova, Olga and Huo, Haoyan and He, Tanjin and Rong, Ziqin and Botari, Tiago and Sun, Wenhao and Tshitoyan, Vahe and Ceder, Gerbrand},
  journal={Scientific Data},
  volume={6},
  number={1},
  pages={203},
  year={2019},
  publisher={Nature Publishing Group UK London}
}
```
## Citation

If you find this code or dataset useful, please cite the following paper:

```bibtex
@article{okabe2024large,
  title={Large Language Model-Guided Prediction Toward Quantum Materials Synthesis},
  author={Okabe, Ryotaro and West, Zack and Chotrattanapituk, Abhijatmedhi and Cheng, Mouyang and Carrizales, Denisse C{\'o}rdova and Xie, Weiwei and Cava, Robert J and Li, Mingda},
  journal={arXiv preprint arXiv:2410.20976},
  year={2024}
}
```