
Add a new dataset Mercury #238

Merged: 22 commits merged into bigcode-project:main from the mercury branch on May 29, 2024

Conversation

@Elfsong (Contributor) commented on May 26, 2024

  • Motivation first:

    TL;DR: Mercury is a dataset for evaluating the computational efficiency of Python code generation.

    Amidst the recent strides in evaluating Large Language Models for Code (Code-LLMs), existing benchmarks have mainly focused on functional correctness, overlooking the importance of computational efficiency.

    To fill this gap, we present Mercury, the first computational efficiency benchmark for Code-LLMs. It comprises 1,889 Python tasks, each with enough solutions to support a runtime distribution. Based on this distribution, we introduce a new metric, Beyond, which computes a runtime-percentile-weighted Pass score to reflect functional correctness and computational efficiency simultaneously.

    On Mercury, leading Code-LLMs achieve 65% on Pass but less than 50% on Beyond. Given that an ideal Beyond score would be aligned with the Pass score, this indicates that while Code-LLMs exhibit impressive capabilities in generating functionally correct code, there remains a notable gap in their computational efficiency. Finally, our empirical experiments reveal that Direct Preference Optimization (DPO) serves as a robust baseline for enhancing computational efficiency compared with Supervised Fine-Tuning (SFT), which paves a promising avenue for future exploration of efficient code generation. Our code and data are available on GitHub: https://github.com/Elfsong/Mercury

  • Write a full paragraph describing the feature;

    In this work, we introduce Mercury, a novel code generation benchmark designed to assess and improve the computational efficiency of Code-LLMs. It comprises 1,889 Python programming tasks stratified into three difficulty levels and split into two datasets, one for model evaluation and one for fine-tuning. For each evaluation task, we provide a test-case generator to remedy the shortfall in test-case coverage. The primary challenge in measuring computational efficiency is normalizing absolute runtimes across tasks with diverse runtime ranges. We therefore collect and locally execute numerous historical solutions for each task to form a runtime distribution, and evaluate LLM-generated code by its runtime percentile on that distribution rather than by its absolute runtime. Furthermore, to mitigate performance discrepancies caused by unrelated processes and diverse hardware configurations, we run tasks in an isolated sandbox environment when establishing the local runtime distributions. A rough sketch of this percentile-based scoring is given after the usage snippet below. More details can be found in the paper: https://arxiv.org/abs/2402.07844

  • Provide a code snippet that demonstrates its future use;

accelerate launch --main_process_port 30000 main.py \
    --model bigcode/starcoder2-7b \
    --load_in_4bit \
    --max_length_generation 2048 \
    --tasks mercury \
    --n_samples 5 \
    --temperature 0.2 \
    --batch_size 5 \
    --allow_code_execution \
    --save_generations \
    --metric_output_path starcoder2-7b-mercury-result.json
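For intuition only, here is a minimal Python sketch of the percentile-based Beyond scoring described above. It is an illustration under assumptions, not the harness's or the paper's actual implementation: the function names (beyond_score, aggregate_beyond) and inputs (a passed flag, a measured runtime, and a runtime_distribution list of locally measured historical-solution runtimes) are hypothetical.

from typing import List

def beyond_score(passed: bool, runtime: float, runtime_distribution: List[float]) -> float:
    """Illustrative Beyond-style score for a single task (hypothetical sketch).

    runtime_distribution holds locally measured runtimes of historical solutions.
    A failing generation scores 0; a passing one scores its runtime percentile,
    i.e. the fraction of historical solutions that are slower than it.
    """
    if not passed or not runtime_distribution:
        return 0.0
    slower = sum(1 for t in runtime_distribution if t > runtime)
    return slower / len(runtime_distribution)

def aggregate_beyond(per_task_scores: List[float]) -> float:
    """Average per-task scores over the benchmark, analogous to averaging Pass."""
    return sum(per_task_scores) / len(per_task_scores) if per_task_scores else 0.0

# Example: a correct solution whose runtime beats 4 of 5 historical solutions.
print(beyond_score(True, 0.12, [0.10, 0.15, 0.20, 0.25, 0.30]))  # -> 0.8

In this simplified view, a correct but slow solution still earns a low Beyond score, which is why Beyond lags Pass for models that generate functionally correct yet inefficient code.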

@Elfsong (Contributor, Author) commented on May 26, 2024

@SivilTaram FYI

@loubnabnl (Collaborator) left a comment


Hi, thank you very much for submitting this interesting benchmark! LGTM, just two comments:

@Elfsong (Contributor, Author) commented on May 28, 2024

@loubnabnl Thank you so much for reviewing this code:)

> did you make sure the current implementation matches the scores reported in your paper for one of the public LLMs?

Yes. The scores reported in our paper are based on this implementation. We are also working on publishing a public leaderboard page.

> can you add some documentation about how to use the benchmark in the docs https://github.com/bigcode-project/bigcode-evaluation-harness/tree/main/docs

Sure. The instructions have been added. See this commit.

@loubnabnl (Collaborator) left a comment


Thanks! Ready to merge 🚀

@loubnabnl merged commit f0f2b52 into bigcode-project:main on May 29, 2024
1 check passed
@Elfsong deleted the mercury branch on May 30, 2024 05:37
phuonglvh pushed a commit to phuonglvh/bigcode-evaluation-harness that referenced this pull request Nov 15, 2024