RTL-Repo Benchmark

This repository contains the code and data for the RTL-Repo benchmark, introduced in the paper "RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects".

📰 News

[May 10, 2024]: RTL-Repo has been accepted to IEEE LAD'24, the First IEEE International Workshop on LLM-Aided Design!

👋 Overview

RTL-Repo is a benchmark for evaluating how well LLMs generate Verilog code autocompletions within large, complex codebases. It tests whether a model can understand and retain the context of an entire Verilog repository and generate new code that is correct, relevant, logically consistent, and adherent to coding conventions and guidelines, while staying aware of all components and modules in the project. This gives a realistic evaluation of a model's performance in real-world RTL design scenarios.

RTL-Repo comprises over 4000 code samples from GitHub repositories, each containing the context of all Verilog code in its repository. It offers the hardware design community a resource for assessing and training LLMs for Verilog code generation in complex, multi-file RTL projects.

The dataset is available on the 🤗 HuggingFace Hub as ahmedallam/RTL-Repo, and you can load it with the following code:

from datasets import load_dataset
rtl_repo = load_dataset('ahmedallam/RTL-Repo', split='test')
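
To get a feel for the samples before running a model, a small helper like the one below can report which fields hold text and how long they are. This is only a sketch: it makes no assumption about the dataset's column names, `summarize_string_fields` is a hypothetical helper that is not part of this repository, and it relies solely on rows iterating as dictionaries, as the rows of a `datasets.Dataset` do.

```python
def summarize_string_fields(rows, limit=100):
    """Return {field: (min_chars, max_chars)} over the first `limit` rows.

    `rows` is any iterable of dict-like records; non-string values are skipped.
    """
    stats = {}
    for i, row in enumerate(rows):
        if i >= limit:
            break
        for key, value in row.items():
            if isinstance(value, str):
                lo, hi = stats.get(key, (len(value), len(value)))
                stats[key] = (min(lo, len(value)), max(hi, len(value)))
    return stats


# Hypothetical usage with the split loaded above:
# for field, (shortest, longest) in summarize_string_fields(rtl_repo).items():
#     print(f"{field}: {shortest}-{longest} characters")
```

Since each sample carries the Verilog context of its repository, the longest text field gives a quick indication of how much context a model has to fit.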

💽 Installation

Before proceeding with the installation, please ensure you have PyTorch installed. Refer to the PyTorch installation page for the specific installation command for your platform.

Install the benchmark with the following commands:

git clone https://github.com/AUCOHL/RTL-Repo.git
cd RTL-Repo
pip install -r requirements.txt

🚀 Usage

Generate model predictions

To run your model on our benchmark's full test dataset, simply execute this command:

python src/generate_model_predictions.py --model_name <model_name_or_path> --model_max_tokens <model_max_tokens> --temperature <temperature>

Evaluate

After generating predictions from your model, execute this script to produce the evaluation statistics:

python src/evaluate.py --prediction_path <prediction_path>

Example Output (GPT-3.5):

Statistics for the predictions are as follows:

Total count: 1174
Edit similarity: 61.37819
Exact match: 33.816
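
As a rough illustration of what the two reported statistics measure, the sketch below computes an exact-match percentage and a character-level edit similarity over a list of predictions. It assumes whitespace-stripped comparison and uses Python's `difflib.SequenceMatcher` as the similarity measure; the function names are hypothetical, and `src/evaluate.py` may define both metrics differently.

```python
from difflib import SequenceMatcher


def exact_match(predictions, references):
    """Percentage of predictions identical to their reference (whitespace-stripped)."""
    if not references:
        return 0.0
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return 100.0 * hits / len(references)


def edit_similarity(predictions, references):
    """Mean character-level similarity ratio, scaled to the range 0-100."""
    if not references:
        return 0.0
    total = sum(
        SequenceMatcher(None, p.strip(), r.strip()).ratio()
        for p, r in zip(predictions, references)
    )
    return 100.0 * total / len(references)


# Illustrative inputs only, not taken from the benchmark.
preds = ["assign y = a & b;", "wire [7:0] data;"]
refs = ["assign y = a & b;", "wire [7:0] data_in;"]
print(f"Exact match: {exact_match(preds, refs):.1f}")
print(f"Edit similarity: {edit_similarity(preds, refs):.1f}")
```

Exact match only rewards a fully correct line, while edit similarity gives partial credit to near misses, which is why the two numbers in the leaderboard can differ widely for the same model.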

🏆 Leaderboard

Model Name	Publisher	Number of Parameters	Open-Source?	Verilog-Specific?	Edit Similarity	Exact Match
GPT-4	OpenAI	N/A	❌	❌	71.87	48.5
GPT-3.5	OpenAI	N/A	❌	❌	61.4	33.8
RTLCoder-DeepSeek	Liu et al.	6.7B	✅	✅	48.1	16.2
Starcoder2	BigCode	15B	✅	❌	48.0	17.0
VeriGen	Thakur et al.	16B	✅	✅	43.9	9.5
RTLCoder-Mistral	Liu et al.	7B	✅	✅	37.9	8.3

✍️ Citation

If you find our work helpful, please use the following citation.

@misc{allam2024rtlrepo,
      title={RTL-Repo: A Benchmark for Evaluating LLMs on Large-Scale RTL Design Projects}, 
      author={Ahmed Allam and Mohamed Shalan},
      year={2024},
      eprint={2405.17378},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
