
Large Language Model-Guided Prediction Toward Quantum Materials Synthesis (LLM4SYN)

LLM4SYN is a framework that leverages large language models (LLMs) to predict chemical synthesis pathways for inorganic materials. It includes three core models:

  1. LHS2RHS: Predicts products from reactants.
  2. RHS2LHS: Predicts reactants from products.
  3. TGT2CEQ: Generates the full chemical equation given a target compound.

These models are fine-tuned using a text-mined synthesis database, improving prediction accuracy from under 40% to around 90%. The framework also introduces a generalized Tanimoto similarity (GTS) for accurate evaluation of chemical equations.


Table of Contents

  1. Folder Structure
  2. Download Dataset
  3. Set Up Hugging Face and WandB Accounts
  4. Environment Setup
  5. Training the Model
  6. Testing the Model
  7. Available Models
  8. Evaluate Generated Equations
  9. References
  10. Citation

Folder Structure

  1. Create the following folders for data and models:
    ./data/  
    ./models/  
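The two directories above can be created from Python as well (a minimal sketch, equivalent to `mkdir data models`):

```python
from pathlib import Path

# Create the working directories the training/testing scripts expect
for d in ("data", "models"):
    Path(d).mkdir(exist_ok=True)
```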
    

Download Dataset

Download the dataset from the following sources and place the solid-state_dataset_2019-06-27_upd.json file in ./data/:
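A quick sanity check that the file is in place (a sketch; it makes no assumption about the JSON's internal structure):

```python
import json
from pathlib import Path

data_path = Path("./data/solid-state_dataset_2019-06-27_upd.json")
if data_path.exists():
    with open(data_path) as f:
        recipes = json.load(f)
    print(f"Loaded dataset ({type(recipes).__name__})")
else:
    print(f"Place the dataset at {data_path} first")
```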

Set Up Hugging Face and WandB Accounts

Hugging Face

  1. Sign up at Hugging Face.
  2. Generate API keys for "Writing" and "Reading".
  3. Duplicate env_config_template.py and rename it to env_config.py.
  4. Paste your Hugging Face API keys (hf_api_key_r, hf_api_key_w) and Username (hf_usn) in env_config.py.
  5. Set the path to the .json data file as data_path in env_config.py.
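Based on steps 3–5, `env_config.py` presumably ends up looking like the following (all values here are placeholders; the variable names come from the steps above):

```python
# env_config.py (copied from env_config_template.py; values are placeholders)
hf_api_key_r = "hf_..."   # Hugging Face "Reading" API key
hf_api_key_w = "hf_..."   # Hugging Face "Writing" API key
hf_usn = "your-username"  # Hugging Face username

# Path to the text-mined synthesis dataset
data_path = "./data/solid-state_dataset_2019-06-27_upd.json"
```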

WandB

  1. Sign up at WandB.
  2. Login to WandB from the terminal using wandb login.

Environment Setup

Ensure that the following libraries are installed:

Python==3.10.0   
torch==2.0.0+cu118    
transformers==4.33.2     
wandb==0.16.0   
accelerate==0.3.0   
huggingface_hub==0.16.4   
datasets==2.14.5   
bokeh==3.3.4   
# Other basic libraries like numpy, matplotlib, etc.

Training the Model

Choose a task from the following options: ['lhs2rhs', 'rhs2lhs', 'lhsope2rhs', 'rhsope2lhs', 'tgt2ceq', 'tgtope2ceq'].
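As a hypothetical illustration of the six task directions (the actual prompt templates are defined in the repo's training script; `make_pair` and the `ope` string here are invented for this sketch), each task maps one side of a reaction record to another:

```python
# Hypothetical sketch only: "ope" variants additionally carry
# synthesis-operation text, represented here by a plain string.
reaction = {
    "lhs": "BaCO3 + TiO2",             # reactants
    "rhs": "BaTiO3 + CO2",             # products
    "tgt": "BaTiO3",                   # target compound
    "ope": "ball-mill, calcine 1100C", # operations (illustrative)
}

def make_pair(task: str, rxn: dict) -> tuple:
    """Return an (input, target) text pair for a given task."""
    ceq = f"{rxn['lhs']} -> {rxn['rhs']}"  # full chemical equation
    pairs = {
        "lhs2rhs": (rxn["lhs"], rxn["rhs"]),
        "rhs2lhs": (rxn["rhs"], rxn["lhs"]),
        "tgt2ceq": (rxn["tgt"], ceq),
        "lhsope2rhs": (f"{rxn['lhs']} ; {rxn['ope']}", rxn["rhs"]),
        "rhsope2lhs": (f"{rxn['rhs']} ; {rxn['ope']}", rxn["lhs"]),
        "tgtope2ceq": (f"{rxn['tgt']} ; {rxn['ope']}", ceq),
    }
    return pairs[task]
```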

If needed, set ver_tag. Then, run the training script:

python train_llm4syn.py

Testing the Model

To test a trained model, set the task, model_tag, and ver_tag to match the trained model saved on HuggingFace. Then, run:

python test_llm4syn.py

Available Models

| Model Name | Pre-trained Model | Task Description | Link |
|---|---|---|---|
| RyotaroOKabe/lhs2rhs_dgpt2_v1.2.1 | distilgpt2 | Predict RHS given LHS | Link |
| RyotaroOKabe/rhs2lhs_dgpt2_v1.2.1 | distilgpt2 | Predict LHS given RHS | Link |
| RyotaroOKabe/tgt2ceq_dgpt2_v1.2.1 | distilgpt2 | Predict CEQ given TGT | Link |
| RyotaroOKabe/lhsope2rhs_dgpt2_v1.2.1 | distilgpt2 | Predict RHS given LHS with additional OPE | Link |
| RyotaroOKabe/rhsope2lhs_dgpt2_v1.2.1 | distilgpt2 | Predict LHS given RHS with additional OPE | Link |
| RyotaroOKabe/tgtope2ceq_dgpt2_v1.2.1 | distilgpt2 | Predict CEQ given TGT with additional OPE | Link |
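A sketch of pulling one of these checkpoints with the `transformers` library; `repo_name` and `load_llm4syn` are helper names invented here (they are not part of the repo), and the repo id pattern follows the table above:

```python
def repo_name(task: str, ver_tag: str = "v1.2.1", usn: str = "RyotaroOKabe") -> str:
    """Build a Hub repo id following the naming pattern in the table above."""
    return f"{usn}/{task}_dgpt2_{ver_tag}"

def load_llm4syn(task: str = "lhs2rhs"):
    """Download the tokenizer and model from the Hugging Face Hub (needs network)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer  # heavy import kept local
    repo = repo_name(task)
    return AutoTokenizer.from_pretrained(repo), AutoModelForCausalLM.from_pretrained(repo)
```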

Evaluate Generated Equations

To evaluate the model-generated equations, open and run the notebook:

eval_llm4syn.ipynb
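As a rough illustration of the Tanimoto-style similarity underlying the evaluation, here is the classical Tanimoto coefficient on element-count vectors. This is only a sketch: the paper's generalized Tanimoto similarity (GTS) extends this to full chemical equations, and the `tanimoto` function here is invented for illustration:

```python
def tanimoto(a: dict, b: dict) -> float:
    """Tanimoto similarity T = a.b / (|a|^2 + |b|^2 - a.b) between
    two compositions given as element -> count mappings."""
    elements = set(a) | set(b)
    dot = sum(a.get(e, 0) * b.get(e, 0) for e in elements)
    na = sum(v * v for v in a.values())
    nb = sum(v * v for v in b.values())
    denom = na + nb - dot
    return dot / denom if denom else 0.0

# Identical compositions score 1.0; disjoint ones score 0.0.
batio3 = {"Ba": 1, "Ti": 1, "O": 3}
```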

References

[1] 
@article{kononova2019text,
  title={Text-mined dataset of inorganic materials synthesis recipes},
  author={Kononova, Olga and Huo, Haoyan and He, Tanjin and Rong, Ziqin and Botari, Tiago and Sun, Wenhao and Tshitoyan, Vahe and Ceder, Gerbrand},
  journal={Scientific data},
  volume={6},
  number={1},
  pages={203},
  year={2019},
  publisher={Nature Publishing Group UK London}
}

Citation

If you find this code or dataset useful, please cite the following paper:

@article{okabe2024large,
  title={Large Language Model-Guided Prediction Toward Quantum Materials Synthesis},
  author={Okabe, Ryotaro and West, Zack and Chotrattanapituk, Abhijatmedhi and Cheng, Mouyang and Carrizales, Denisse C{\'o}rdova and Xie, Weiwei and Cava, Robert J and Li, Mingda},
  journal={arXiv preprint arXiv:2410.20976},
  year={2024}
}
