Add TestGenEval benchmark #5534
base: main
Merge OpenHands
merging openhands
Thanks @kjain14!
Sorry it took me a while to get to reviewing this. The README was still a bit WIP so I asked OpenHands to update it, and now I think it reflects the flow better. I was able to run the benchmark.
However, I'm a bit stuck on evaluation. I get this when I try to install:
$ poetry install --with testgeneval
...
- Updating botocore (1.35.84 -> 1.35.87)
- Installing importlib-resources (6.4.5)
- Installing kubernetes (31.0.0)
- Installing llama-cloud (0.1.5)
- Installing llama-index-embeddings-openai (0.3.0)
- Installing llama-index-program-openai (0.3.0)
- Installing llama-parse (0.5.15)
- Installing mmh3 (5.0.1)
- Installing onnxruntime (1.20.1)
- Installing opentelemetry-instrumentation-fastapi (0.46b0)
- Installing orjson (3.10.12)
- Installing posthog (3.7.2)
- Installing pypdf (5.1.0)
- Installing pypika (0.48.9)
- Installing rapidfuzz (3.11.0)
- Installing scikit-learn (1.5.2)
- Installing striprtf (0.0.26)
- Updating synchronicity (0.9.7 -> 0.9.8)
- Installing torch (2.5.1): Failed
RuntimeError
Unable to find installation candidates for torch (2.5.1)
at ~/Library/Application Support/pypoetry/venv/lib/python3.13/site-packages/poetry/installation/chooser.py:74 in choose_for
70│
71│ links.append(link)
72│
73│ if not links:
→ 74│ raise RuntimeError(f"Unable to find installation candidates for {package}")
75│
76│ # Get the best link
77│ chosen = max(links, key=lambda link: self._sort_key(package, link))
78│
Cannot install torch.
Are you able to reproduce this? If so it'd be nice if you could figure out the incompatibility. If you can't repro I'll try to investigate further.
Hmm, I tried today and could not reproduce this. Any idea what might be causing it? I also don't think it is related to the testgeneval dependencies: it comes from the llama group dependencies, which list torch==2.5.1.
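Since "Unable to find installation candidates" usually means no published wheel matches the interpreter and platform being resolved for, one way to narrow this down is to print those details on both machines and compare them. The snippet below is only a diagnostic sketch using the standard library; the function name `describe_environment` is made up for illustration, and it does not query PyPI or Poetry. It should be run with the interpreter of the project's virtual environment, for example via `poetry run python`.

```python
import platform
import sys
import sysconfig


def describe_environment() -> dict:
    """Collect the interpreter details that determine which wheels can be installed."""
    return {
        # Major.minor version of the running interpreter, e.g. "3.12"
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        # CPython, PyPy, ...
        "implementation": platform.python_implementation(),
        # Platform string used when building and matching wheels
        "platform": sysconfig.get_platform(),
        # CPU architecture reported by the OS
        "machine": platform.machine(),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```

If the two environments differ in Python version or architecture, that difference is a reasonable first suspect.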
Hmm, I'll take another look.
Sorry again this took me so long, but I'm looking at this now. I overcame my previous issue but encountered the problem below:
This was due to prompt truncation. If truncation is necessary in OpenHands, I think it should be handled on the OpenHands side, not the benchmark side, so I removed the code for now. Things seem to be working OK with Claude, although it failed on some instances. I'll update once I've run a full eval.
OK, the README says to use
Hi @kjain14, I think this is getting pretty close, but now I'm having an issue with codebleu:
This should be fixed now (it was being gitignored previously).
This is possible, but it needs the tree-sitter version to be upgraded (is there a reason why it is currently pinned?).
I'm working on upgrading the tree-sitter version!
@kjain14 tree-sitter was updated in main; you may want to check whether it works now.
Main of OpenHands
It seems like the codebleu package only works with a very specific version of tree-sitter (higher than the previous v0.21.0 but lower than the current version). Could we adjust it to work with this version? Alternatively, we can just use the code I have. There is a PR to do this on the codebleu repo, but it has had no response: k4black/codebleu#76
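To check whether a given environment falls inside a compatible window, a small helper like the one below could be used. This is a sketch under assumptions: the exact compatible range is not stated here, so the upper bound in the usage line is a placeholder, and the helper names (`parse_version`, `in_range`, `check_package`) are hypothetical. It handles only plain numeric versions such as `0.22.3`.

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.22.3' into (0, 22, 3); parsing stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def in_range(installed: str, lower_exclusive: str, upper_exclusive: str) -> bool:
    """Return True if lower_exclusive < installed < upper_exclusive."""
    low = parse_version(lower_exclusive)
    high = parse_version(upper_exclusive)
    return low < parse_version(installed) < high


def check_package(name: str, lower_exclusive: str, upper_exclusive: str) -> str:
    """Report whether the installed distribution `name` is inside the window."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name} is not installed"
    verdict = "inside" if in_range(installed, lower_exclusive, upper_exclusive) else "outside"
    return f"{name} {installed} is {verdict} ({lower_exclusive}, {upper_exclusive})"


if __name__ == "__main__":
    # The lower bound comes from the comment above; the upper bound is a placeholder.
    print(check_package("tree-sitter", "0.21.0", "0.23.0"))
```

Running this in the evaluation environment shows at a glance whether the installed tree-sitter is the likely cause of a codebleu failure.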
End-user friendly description of the problem this fixes or functionality that this introduces
Adds a new unit test generation benchmark TestGenEval: https://arxiv.org/abs/2410.00752
Give a summary of what the PR does, explaining any non-trivial design decisions
PR includes changes to measure:
Note: This is a clean version of PR #5534 that contains only the TestGenEval changes.