
Add TestGenEval benchmark #5534

Open · kjain14 wants to merge 46 commits into main
Conversation

@kjain14 commented Dec 11, 2024

End-user friendly description of the problem this fixes or functionality that this introduces

Adds a new unit test generation benchmark, TestGenEval: https://arxiv.org/abs/2410.00752
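
For a quick look at the data, the lite split used later in this thread (kjain14/testgenevallite) can be loaded directly; a minimal sketch, assuming the HuggingFace datasets library:

  from datasets import load_dataset

  # Load the lite split of the benchmark (dataset id taken from the
  # eval command used later in this thread).
  dataset = load_dataset('kjain14/testgenevallite', split='test')
  print(len(dataset), dataset.column_names)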


Give a summary of what the PR does, explaining any non-trivial design decisions

PR includes changes to:

  • Measure coverage
  • Measure mutation score
  • Push Docker images for TestGenEval with testing dependencies
  • Add prompts for measuring CodeAct performance
  • Compute a wide range of lexical metrics (ROUGE, CodeBLEU, readability, etc.; see the sketch below)

Note: This is a clean version of PR #5534 that contains only the TestGenEval changes.
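
As a reference for the lexical metrics above, here is a minimal sketch of computing one of them (ROUGE-L) with the rouge-score package; the PR's actual implementations live in evaluation/benchmarks/testgeneval/metrics.py and may differ:

  from rouge_score import rouge_scorer

  # Illustrative only: score a generated test against a reference test.
  scorer = rouge_scorer.RougeScorer(['rougeL'], use_stemmer=True)
  scores = scorer.score(
      'def test_add(): assert add(1, 2) == 3',  # reference
      'def test_add(): assert add(2, 2) == 4',  # prediction
  )
  print(scores['rougeL'].fmeasure)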

@kjain14 marked this pull request as ready for review December 16, 2024 19:55
@neubig self-requested a review December 16, 2024 20:08
@neubig (Contributor) left a comment

Thanks @kjain14!

Sorry it took me a while to get to reviewing this. The README was still a bit WIP so I asked OpenHands to update it, and now I think it reflects the flow better. I was able to run the benchmark.

However, I'm a bit stuck on evaluation; I'm getting this when I try to install:

$ poetry install --with testgeneval
...
  Unable to find installation candidates for torch (2.5.1)
  - Installing importlib-resources (6.4.5)
  - Installing kubernetes (31.0.0)
  - Installing llama-cloud (0.1.5)
  - Installing llama-index-embeddings-openai (0.3.0)
  - Installing llama-index-program-openai (0.3.0)
  - Installing llama-parse (0.5.15)
  - Installing mmh3 (5.0.1)
  - Installing onnxruntime (1.20.1)
  - Installing opentelemetry-instrumentation-fastapi (0.46b0)
  - Installing orjson (3.10.12)
  - Installing posthog (3.7.2)
  - Installing pypdf (5.1.0)
  - Installing pypika (0.48.9)
  - Installing rapidfuzz (3.11.0)
  - Installing scikit-learn (1.5.2)
  - Installing striprtf (0.0.26)
  - Updating synchronicity (0.9.7 -> 0.9.8)
  - Installing torch (2.5.1): Failed

  RuntimeError

  Unable to find installation candidates for torch (2.5.1)

  at ~/Library/Application Support/pypoetry/venv/lib/python3.13/site-packages/poetry/installation/chooser.py:74 in choose_for
       70│ 
       71│             links.append(link)
       72│ 
       73│         if not links:
    →  74│             raise RuntimeError(f"Unable to find installation candidates for {package}")
       75│ 
       76│         # Get the best link
       77│         chosen = max(links, key=lambda link: self._sort_key(package, link))
       78│ 

Cannot install torch.

Are you able to reproduce this? If so, it'd be nice if you could figure out the incompatibility. If you can't repro, I'll try to investigate further.

@kjain14 (Author) commented Dec 27, 2024

Hmm, I tried today and am not able to reproduce this. I wonder what may be causing it?

I also think this is unrelated to the testgeneval dependencies (it is caused by the llama group dependencies, which pin torch==2.5.1).

@neubig self-requested a review December 28, 2024 19:57
@neubig (Contributor) commented Dec 28, 2024

Hmm, I'll take another look.

@neubig self-assigned this Dec 30, 2024
@neubig (Contributor) commented Feb 9, 2025

Sorry again this took me so long, but I'm looking at this now. I overcame my previous issue but encountered the problem below:

...
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/run_infer.py", line 118, in truncate_prompt
    encoding = tiktoken.encoding_for_model(model)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 105, in encoding_for_model
    return get_encoding(encoding_name_for_model(model_name))
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tiktoken/model.py", line 92, in encoding_name_for_model
    raise KeyError(
KeyError: 'Could not automatically map openai/claude-3-5-sonnet-20241022 to a tokeniser. Please use `tiktoken.get_encoding` to explicitly get the tokeniser you expect.'

This was due to prompt truncation. If truncation is necessary in OpenHands, I think it's something we should handle on the OpenHands side, not the benchmark side, so I removed the code for now, and things seem to be working OK with Claude (although it failed on some instances). I'll update once I've run a full eval.
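
(For reference, a minimal sketch of a fallback that would avoid this KeyError; it assumes cl100k_base is an acceptable approximation when tiktoken does not recognize the model name:)

  import tiktoken

  def encoding_for_model_with_fallback(model: str) -> tiktoken.Encoding:
      # tiktoken only knows OpenAI model names; fall back to a generic
      # encoding for ids like openai/claude-3-5-sonnet-20241022.
      try:
          return tiktoken.encoding_for_model(model)
      except KeyError:
          # Assumption: cl100k_base is close enough for counting tokens.
          return tiktoken.get_encoding('cl100k_base')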

@neubig (Contributor) commented Feb 9, 2025

OK, run_infer.py seems to be working, but I'm not sure about evaluation.

The README says to use ./evaluation/benchmarks/testgeneval/scripts/eval_infer.sh, but this file does not exist; only ./evaluation/benchmarks/testgeneval/scripts/eval_infer_remote.sh does. @kjain14, could you elaborate on how you ran evaluation?

@neubig assigned kjain14 and unassigned neubig Feb 9, 2025
@neubig (Contributor) commented Feb 10, 2025

Hi @kjain14, I think this is getting pretty close, but now I'm having an issue with codebleu:

poetry run python evaluation/benchmarks/testgeneval/eval_infer.py --eval-num-workers 1 --input-file evaluation/evaluation_outputs/outputs/ --dataset kjain14/testgenevallite --split test
/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py:36: FutureWarning: Language(path, name) is deprecated. Use Language(ptr, name) instead.
  warn("{} is deprecated. Use {} instead.".format(old, new), FutureWarning)
Traceback (most recent call last):
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/eval_infer.py", line 22, in <module>
    from evaluation.benchmarks.testgeneval.metrics import (
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/metrics.py", line 305, in <module>
    "Java8": Evaluator("java"),
             ^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/Evaluator.py", line 38, in __init__
    self.parser_language = Language(this_dir / 'parser' / 'my-languages.so', lang)
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/Library/Caches/pypoetry/virtualenvs/openhands-ai-mHbwbJq6-py3.12/lib/python3.12/site-packages/tree_sitter/__init__.py", line 132, in __init__
    self.lib = cdll.LoadLibrary(fspath(path_or_ptr))
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 460, in LoadLibrary
    return self._dlltype(name)
           ^^^^^^^^^^^^^^^^^^^
  File "/Users/gneubig/miniconda3/envs/openhands/lib/python3.12/ctypes/__init__.py", line 379, in __init__
    self._handle = _dlopen(self._name, mode)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^
OSError: dlopen(/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so, 0x0006): tried: '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/System/Volumes/Preboot/Cryptexes/OS/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file), '/Users/gneubig/work/OpenHands/evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so' (no such file)

@kjain14 (Author) commented Feb 11, 2025

This should be fixed now (the file was being gitignored previously).
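
(For context, a minimal sketch of how such a my-languages.so bundle is typically built with the legacy py-tree-sitter API (0.21 and earlier), matching the Language(path, name) call in the traceback above; the grammar checkout paths are illustrative:)

  from tree_sitter import Language

  # Legacy API (removed in py-tree-sitter 0.22+): compile grammar
  # checkouts into one shared library that Language(path, name) loads.
  Language.build_library(
      'evaluation/benchmarks/testgeneval/CodeBLEU/parser/my-languages.so',
      ['vendor/tree-sitter-python', 'vendor/tree-sitter-java'],  # illustrative
  )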

@enyst (Collaborator) commented Feb 11, 2025

Just a thought about this addition (the vendored CodeBLEU code): could we have the codebleu library as a Python dependency instead?

In general, we have optional dependencies for evaluation in the poetry 'evaluation' group. Do you think it can be done that way?

@kjain14 (Author) commented Feb 13, 2025

This is possible, but needs the tree-sitter version to be upgraded (is there a reason why it is pinned currently?)

@neubig mentioned this pull request Feb 15, 2025
@neubig (Contributor) commented Feb 16, 2025

I'm working on upgrading the tree-sitter version!

@enyst (Collaborator) commented Feb 17, 2025

@kjain14, tree-sitter was updated in main; you may want to see if it works now?

@kjain14 (Author) commented Feb 17, 2025

It seems like the codebleu package only works with a very specific range of tree-sitter versions (higher than the previous v0.21.0 but lower than the current version). Could we adjust OpenHands to work with such a version (or, alternatively, we can just use the code I have)?

Looks like there is a PR to do this on the codebleu repo, but no response: k4black/codebleu#76

Because codebleu (0.7.0) depends on tree-sitter (>=0.22.0,<0.23.0)
 and no versions of codebleu match >0.7.0,<0.8.0, codebleu (>=0.7.0,<0.8.0) requires tree-sitter (>=0.22.0,<0.23.0).
So, because openhands-ai depends on both tree-sitter (>=0.24.0,<0.25.0) and codebleu (^0.7.0), version solving failed.
