This project is a Julia version of HumanEval. Our goal is to gain a better understanding of the latest LLMs' performance with the Julia programming language.
model | evalplus * | basic ** |
---|---|---|
gpt-4-0125-preview | 0.774 | 0.823 |
gpt-4-turbo | 0.756 | 0.823 |
mistral-large-instruct-2407 | 0.744 | 0.823 |
gpt-4o | 0.738 | 0.817 |
claude-3-5-sonnet-20240620 | 0.72 | 0.823 |
gpt-4-1106-preview | 0.72 | 0.805 |
DeepSeek-Coder-V2-Instruct | 0.695 | 0.774 |
DeepSeek-V2-Chat | 0.689 | 0.756 |
Llama-3.1-405B-Instruct | 0.628 | 0.744 |
claude-3-opus-20240229 | 0.61 | 0.689 |
Qwen2-72B-Instruct | 0.598 | 0.665 |
Phind-CodeLlama-34B-v2 | 0.591 | 0.659 |
gpt-3.5-turbo-0125 | 0.591 | 0.652 |
mistral-large-latest | 0.573 | 0.659 |
gpt-3.5-turbo-0613 | 0.567 | 0.64 |
gpt-3.5-turbo-1106 | 0.555 | 0.628 |
DeepSeek-Coder-33B-instruct | 0.543 | 0.598 |
Magicoder-S-DS-6.7B | 0.543 | 0.616 |
WizardCoder-33B-V1.1 | 0.543 | 0.604 |
Qwen1.5-110B-Chat | 0.53 | 0.598 |
yi-large | 0.524 | 0.652 |
deepseek-coder-6.7b-instruct | 0.488 | 0.549 |
CodeLlama-70b-Instruct-hf | 0.457 | 0.561 |
code-millenials-34b | 0.439 | 0.5 |
Magicoder-S-CL-7B | 0.402 | 0.463 |
CodeLlama-34b-Instruct-hf | 0.311 | 0.366 |
Starling-LM-7B-alpha | 0.299 | 0.354 |
Yi-34B-Chat | 0.232 | 0.317 |
\* evalplus: scores are calculated based on the extended test cases from EvalPlus.

\*\* basic: scores are calculated based on the test cases from HumanEval only.
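To help interpret the table (assuming all of the standard 164 HumanEval problems are included, as in the original benchmark), each score is simply the fraction of problems whose generated solution passes every test:

$$\text{score} = \frac{\text{number of problems solved}}{164}, \qquad \frac{127}{164} \approx 0.774, \qquad \frac{135}{164} \approx 0.823$$

The two example fractions correspond to the top row of the table.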
By default, all results are `pass@1` scores calculated with greedy decoding. Models are deployed with vLLM, which uses the predefined chat template stored in the tokenizer. Feel free to create an issue if you'd like some other models to be evaluated.

To evaluate a model yourself, first deploy it behind an OpenAI-compatible endpoint, such as vLLM or Ollama. We'll need the `OPENAI_API_KEY` and `OPENAI_BASE_URL` in the next step.
To test models from Anthropic, you should set `ANTHROPIC_API_KEY` and `ANTHROPIC_BASE_URL` instead.
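For the deployment step, a minimal sketch using vLLM's OpenAI-compatible server might look like the following; the model path and served name are placeholders to replace with your own:

```bash
# Sketch only: serve a local model behind an OpenAI-compatible API on port 8000.
# /PATH/TO/YOUR/MODEL and YOUR_MODEL_NAME are placeholders.
python -m vllm.entrypoints.openai.api_server \
    --model /PATH/TO/YOUR/MODEL \
    --served-model-name YOUR_MODEL_NAME \
    --port 8000
```

With the endpoint listening at `http://localhost:8000/v1`, point the Docker image below at it: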
```bash
docker run -it --rm \
    -v /PATH/TO/SAVE/RESULTS/generations:/workspace/HumanEval.jl/generations \
    -e OPENAI_API_KEY=YOUR_SECRET \
    -e OPENAI_BASE_URL=http://localhost:8000/v1 \
    -e RETESTITEMS_NWORKERS=16 \
    -e RETESTITEMS_TESTITEM_TIMEOUT=15 \
    -e MODEL=gpt-3.5-turbo-0613 \
    ghcr.io/01-ai/humaneval.jl:latest
```
In the command above:

- `/PATH/TO/SAVE/RESULTS/generations`: this folder will contain raw responses from the model, extracted Julia code snippets, and unit test results.
- `YOUR_SECRET`: it should be the same as the one you provided when deploying the server.
- `RETESTITEMS_NWORKERS`: adjust it to the number of cores in your test environment. It specifies how many workers are used to run tests.
- `RETESTITEMS_TESTITEM_TIMEOUT`: the default `15` seconds should be enough to pass all the test cases.
- `MODEL`: the model name you specified when deploying the model. If you use vLLM, it should be the same as the value of `--served-model-name`.
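For an Anthropic model, a minimal sketch of the same command might look like this (assuming the public Anthropic endpoint as the base URL, with the model name taken from the table above; adjust both to your setup):

```bash
# Sketch only: the Anthropic variables replace the OpenAI ones.
docker run -it --rm \
    -v /PATH/TO/SAVE/RESULTS/generations:/workspace/HumanEval.jl/generations \
    -e ANTHROPIC_API_KEY=YOUR_SECRET \
    -e ANTHROPIC_BASE_URL=https://api.anthropic.com \
    -e RETESTITEMS_NWORKERS=16 \
    -e RETESTITEMS_TESTITEM_TIMEOUT=15 \
    -e MODEL=claude-3-5-sonnet-20240620 \
    ghcr.io/01-ai/humaneval.jl:latest
```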
To run the evaluation from source instead of the Docker image:

- Make sure you have the latest Julia installed.
- Clone this project and enter its root directory.
- Start the Julia REPL with the following command:
```bash
OPENAI_API_KEY=debug OPENAI_BASE_URL=http://localhost:8000/v1 RETESTITEMS_NWORKERS=16 RETESTITEMS_TESTITEM_TIMEOUT=15 MODEL=gpt-3.5-turbo-0613 julia --project
```
The meaning of the environment variables is the same as above.
- Execute the following commands in the Julia REPL:
```julia
julia> import Pkg; Pkg.instantiate();

julia> include("src/evaluation.jl")

julia> evaluate("YOUR_MODEL_NAME")
```
Once finished, the results will be displayed. You may find more details under the `generations` directory.
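For a non-interactive run, the same steps can be chained from the shell. This is just a convenience sketch combining the commands above, with the same placeholder values:

```bash
# Sketch only: run the evaluation without an interactive REPL session.
# The environment variable values and model name are placeholders.
OPENAI_API_KEY=debug \
OPENAI_BASE_URL=http://localhost:8000/v1 \
RETESTITEMS_NWORKERS=16 \
RETESTITEMS_TESTITEM_TIMEOUT=15 \
MODEL=YOUR_MODEL_NAME \
julia --project -e 'import Pkg; Pkg.instantiate(); include("src/evaluation.jl"); evaluate("YOUR_MODEL_NAME")'
```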
Related projects:

- nuprl/MultiPL-E contains Julia prompts transformed from the original Python version of HumanEval. However, based on my limited Julia programming experience, the prompts are not always accurate or idiomatic.
- Julia-LLM-Leaderboard, which focuses on practicality and simplicity.
- EvalPlus Leaderboard
What's next:

- Explore advanced techniques to improve LLMs' performance with code in general, especially how to iteratively refine code.
- Julia-specific LLM training/fine-tuning. We want to know the minimum requirements to train a code LLM.
- Improve the Yi series models' performance with code.
We're hiring! If you're interested in working on code LLMs at 01.ai, please contact [email protected].
Some questions worth discussing:

- What are the differences compared to the original Python version?
- What are the limitations of this project?
- How do LLMs perform compared to humans?
- How difficult is each problem?
- Is GPT-4 good enough?
- How can we make this evaluation higher quality?
- How should we measure hallucinations?
- Are there any other metrics we should care about beyond pass@k?
- Why does Yi-34B-Chat perform so poorly?
- This project heavily relies on many features provided by ReTestItems.jl. Many thanks to Nick Robinson for his help during development.