
Add TestGenEval benchmark #5534

Open · kjain14 wants to merge 46 commits into main
Conversation

@kjain14 commented Dec 11, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

Adds a new unit test generation benchmark, TestGenEval: https://arxiv.org/abs/2410.00752
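
For a quick look at the data, the lite split used later in this thread (kjain14/testgenevallite) can be loaded directly; a minimal sketch, assuming the HuggingFace datasets library:

  from datasets import load_dataset

  # Load the lite split of the benchmark (dataset id taken from the
  # eval command used later in this thread).
  dataset = load_dataset('kjain14/testgenevallite', split='test')
  print(len(dataset), dataset.column_names)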


Give a summary of what the PR does, explaining any non-trivial design decisions

PR includes changes to:

  • Measure coverage
  • Measure mutation score
  • Push Docker images for TestGenEval with testing dependencies
  • Add prompts for measuring CodeAct performance
  • Compute a wide range of lexical metrics (ROUGE, CodeBLEU, readability, etc.; see the sketch below)

Note: This is a clean version of PR #5534 that contains only the TestGenEval changes.
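
As a reference for the lexical metrics above, here is a minimal sketch of computing one of them (ROUGE-L) with the rouge-score package; the PR's actual implementations live in evaluation/benchmarks/testgeneval/metrics.py and may differ:

  from rouge_score import rouge_scorer

  # Illustrative only: score a generated test against a reference test.
  scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
  scores = scorer.score(
      'def test_add(): assert add(1, 2) == 3',  # reference
      'def test_add(): assert add(2, 2) == 4',  # prediction
  )
  print(scores['rougeL'].fmeasure)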

@kjain14 marked this pull request as ready for review December 16, 2024 19:55
@neubig self-requested a review December 16, 2024 20:08
@neubig (Contributor) left a comment

Thanks @kjain14!

Sorry it took me a while to get to reviewing this. The README was still a bit WIP so I asked OpenHands to update it, and now I think it reflects the flow better. I was able to run the benchmark.

However, I'm a bit stuck on evaluation; I'm getting this when I try to install:

$ poetry install --with testgeneval
...
  Unable to find installation candidates for torch (2.5.1)
  - Installing importlib-resources (6.4.5)
  - Installing kubernetes (31.0.0)
  - Installing llama-cloud (0.1.5)
  - Installing llama-index-embeddings-openai (0.3.0)
  - Installing llama-index-program-openai (0.3.0)
  - Installing llama-parse (0.5.15)
  - Installing mmh3 (5.0.1)
  - Installing onnxruntime (1.20.1)
  - Installing opentelemetry-instrumentation-fastapi (0.46b0)
  - Installing orjson (3.10.12)
  - Installing posthog (3.7.2)
  - Installing pypdf (5.1.0)
  - Installing pypika (0.48.9)
  - Installing rapidfuzz (3.11.0)
  - Installing scikit-learn (1.5.2)
  - Installing striprtf (0.0.26)
  - Updating synchronicity (0.9.7 -> 0.9.8)
  - Installing torch (2.5.1): Failed

  RuntimeError

  Unable to find installation candidates for torch (2.5.1)

  at ~/Library/Application Support/pypoetry/venv/lib/python3.13/site-packages/poetry/installation/chooser.py:74 in choose_for
       70│ 
       71│             links.append(link)
       72│ 
       73│         if not links:
    →  74│             raise RuntimeError(f"Unable to find installation candidates for {package}")
       75│ 
       76│         # Get the best link
       77│         chosen = max(links, key=lambda link: self._sort_key(package, link))
       78│ 

Cannot install torch.

Are you able to reproduce this? If so, it'd be nice if you could figure out the incompatibility. If you can't repro, I'll try to investigate further.

@kjain14 (Author) commented Dec 27, 2024

Hmm, I tried today and am not able to reproduce this. I wonder what may be causing it?

I also think this is unrelated to the testgeneval dependencies (it is caused by the llama group dependencies, which pin torch==2.5.1).

@neubig self-requested a review December 28, 2024 19:57
@neubig (Contributor) commented Dec 28, 2024

Hmm, I'll take another look.

@neubig self-assigned this Dec 30, 2024
@neubig (Contributor) commented Feb 9, 2025

Sorry again this took me so long, but I'm looking at this now. I overcame my previous issue but encountered the problem below:

...
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/run_infer.py", line 118, in truncate_prompt
    encoding = tiktoken.encoding_for_model(model)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 105, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 92, in encoding_name_for_model
    raise KeyError(
KeyError: 'Could not automatically map openai/claude-3-5-sonnet-20241022 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

This was due to prompt truncation. If truncation is necessary in OpenHands, I think it's something we should handle on the OpenHands side, not the benchmark side, so I removed the code for now, and things seem to be working OK with Claude (although it failed on some instances). I'll update once I've run a full eval.
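
(For reference, a minimal sketch of a fallback that would avoid this KeyError; it assumes cl100k_base is an acceptable approximation when tiktoken does not recognize the model name:)

  import tiktoken

  def encoding_for_model_with_fallback(model: str) -> tiktoken.Encoding:
      # tiktoken only knows OpenAI model names; fall back to a generic
      # encoding for ids like openai/claude-3-5-sonnet-20241022.
      try:
          return tiktoken.encoding_for_model(model)
      except KeyError:
          # Assumption: cl100k_base is close enough for counting tokens.
          return tiktoken.get_encoding('cl100k_base')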

@neubig (Contributor) commented Feb 9, 2025

OK, run_infer.py seems to be working, but I'm not sure about evaluation.

The README says to use ./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh, but this file does not exist; only ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh does. @kjain14, could you elaborate on how you ran evaluation?

@neubig assigned kjain14 and unassigned neubig Feb 9, 2025
@neubig (Contributor) commented Feb 10, 2025

Hi @kjain14, I think this is getting pretty close, but now I'm having an issue with codebleu:

poetry run python evaluation/benchmarks/testgeneval/eval_infer.py --eval-num-workers 1 --input-file evaluation/evaluation_outputs/outputs/ --dataset kjain14/testgenevallite --split test
/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.
  warn("{} is deprecated. Use {} instead.".format(old, new), FutureWarning)
Traceback (most recent call last):
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/eval_infer.py", line 22, in <module>
    from evaluation.benchmarks.testgeneval.metrics import (
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/metrics.py", line 305, in <module>
    "Java8": Evaluator("java"),
             ^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/Evaluator.py", line 38, in __init__
    self.parser_language = Language(this_dir / 'parser' / 'my-languages.so', lang)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py", line 132, in __init__
    self.lib = cdll.LoadLibrary(fspath(path_or_ptr))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: dlopen(/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so, 0x0006): tried: '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file)

@kjain14 (Author) commented Feb 11, 2025

This should be fixed now (the file was being gitignored previously).
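
(For context, a minimal sketch of how such a my-languages.so bundle is typically built with the legacy py-tree-sitter API (0.21 and earlier), matching the Language(path, name) call in the traceback above; the grammar checkout paths are illustrative:)

  from tree_sitter import Language

  # Legacy API (removed in py-tree-sitter 0.22+): compile grammar
  # checkouts into one shared library that Language(path, name) loads.
  Language.build_library(
      'evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so',
      ['vendor/tree-sitter-python', 'vendor/tree-sitter-java'],  # illustrative
  )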

@enyst (Collaborator) commented Feb 11, 2025

Just a thought about this addition (the vendored CodeBLEU code): could we have the codebleu library as a Python dependency instead?

In general, we have optional dependencies for evaluation in the poetry 'evaluation' group. Do you think it can be done that way?

@kjain14 (Author) commented Feb 13, 2025

This is possible, but needs the tree-sitter version to be upgraded (is there a reason why it is pinned currently?)

@neubig mentioned this pull request Feb 15, 2025
@neubig (Contributor) commented Feb 16, 2025

I'm working on upgrading the tree-sitter version!

@enyst (Collaborator) commented Feb 17, 2025

@kjain14, tree-sitter was updated in main; you may want to see if it works now?

@kjain14 (Author) commented Feb 17, 2025

It seems like the codebleu package only works with a very specific range of tree-sitter versions (higher than the previous v0.21.0 but lower than the current version). Could we adjust OpenHands to work with such a version (or, alternatively, we can just use the code I have)?

Looks like there is a PR to do this on the codebleu repo, but no response: k4black/codebleu#76

Because codebleu (0.7.0) depends on tree-sitter (>=0.22.0,<0.23.0)
 and no versions of codebleu match >0.7.0,<0.8.0, codebleu (>=0.7.0,<0.8.0) requires tree-sitter (>=0.22.0,<0.23.0).
So, because openhands-ai depends on both tree-sitter (>=0.24.0,<0.25.0) and codebleu (^0.7.0), version solving failed.
