Describe the bug
To validate a patch for an issue, SWE-bench collects and executes only the test files that are changed in the corresponding PR. We found that some LLM-generated patches pass all fail_to_pass and pass_to_pass tests but fail on unchanged test cases that the oracle patch passes. This implies that these LLM-generated patches are actually incorrect and would typically not be accepted by developers. As a result, the leaderboard overestimates the effectiveness of several tools.
To improve the robustness of the validation, would it be better to run all tests instead of only the test files modified in the PR?
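To make the gap concrete, here is a minimal sketch of how resolution is decided today, assuming the public Hugging Face release of the dataset (SWE-bench_Lite is used only for illustration) and the FAIL_TO_PASS / PASS_TO_PASS fields it exposes. The is_resolved() helper is a simplification of the harness's grading logic, not its exact code: the verdict depends solely on the two test lists, both drawn from PR-modified test files, so a regression in an unchanged test file cannot affect it.

```python
# Minimal sketch (not the actual harness code) of how "resolved" is decided.
# Assumes the public SWE-bench dataset layout, where FAIL_TO_PASS / PASS_TO_PASS
# are JSON-encoded lists of test identifiers taken from PR-modified test files.
import json
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
inst = ds[0]

fail_to_pass = json.loads(inst["FAIL_TO_PASS"])  # must flip to PASSED under the patch
pass_to_pass = json.loads(inst["PASS_TO_PASS"])  # must remain PASSED under the patch

def is_resolved(status_map: dict[str, str]) -> bool:
    """status_map: test name -> status ('PASSED', 'FAILED', ...) parsed from the eval log."""
    return (
        all(status_map.get(t) == "PASSED" for t in fail_to_pass)
        and all(status_map.get(t) == "PASSED" for t in pass_to_pass)
    )

# Any test outside these two lists -- i.e., any test file the PR did not touch --
# never influences the verdict, which is exactly the gap described above.
```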
Steps/Code to Reproduce
For each patch corresponding to a resolved issue:
1. Find all the test files in the target project.
2. Execute every test file, one file at a time, against both the model patch and the oracle patch:
   - Environment: the per-instance image provided by SWE-bench.
   - Test execution: the eval.sh script from the SWE-bench log directory, run on one test file at a time.
   - Result parsing: the get_logs_eval() function in swebench/harness/grading.py.
3. Record any test that fails on the model patch but passes on the oracle patch (a sketch of this step follows the note below).
Note: Some tests can also fail on the oracle patch, possibly due to environment issues (e.g., the best environment for Django might be Windows, since that is what its GitHub Actions workflow uses) or to problems with the testing method (e.g., Sympy's bin/test doesn't recognize @XFAIL).
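The following is a minimal sketch of the recording step under stated assumptions: the full-suite eval logs for the model patch and for the oracle (gold) patch have already been written to disk, one log file per test file, and get_logs_eval() returns a (status map, success flag) pair. Its exact signature differs across SWE-bench versions (newer releases may also take a test spec), and the directory layout here is purely illustrative.

```python
# Sketch of step 3: diff per-test statuses between the model-patch run and the gold run.
# Assumes one eval log per test file in each directory; get_logs_eval()'s signature is
# version-dependent, so treat the call below as illustrative rather than canonical.
from pathlib import Path
from swebench.harness.grading import get_logs_eval

MODEL_LOG_DIR = Path("logs/full_suite/model")  # hypothetical layout
GOLD_LOG_DIR = Path("logs/full_suite/gold")    # hypothetical layout

def collect_statuses(log_dir: Path) -> dict[str, str]:
    """Merge 'test name -> PASSED/FAILED/ERROR/...' maps across all per-file logs."""
    merged: dict[str, str] = {}
    for log_fp in sorted(log_dir.glob("*.log")):
        status_map, ok = get_logs_eval(str(log_fp))  # assumed (path) -> (dict, bool)
        if ok:
            merged.update(status_map)
    return merged

model_statuses = collect_statuses(MODEL_LOG_DIR)
gold_statuses = collect_statuses(GOLD_LOG_DIR)

# A regression: the model patch fails or errors a test that the gold patch passes.
regressions = sorted(
    test for test, status in model_statuses.items()
    if status in ("FAILED", "ERROR") and gold_statuses.get(test) == "PASSED"
)
for test in regressions:
    print(test)
```

Repeating this comparison over all resolved instances of a submission yields the per-instance regression lists reported below.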
Expected Results
We ran all the developer-written tests for the patches submitted by 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 and found 21 cases where at least one test failed on the model patch while passing on the oracle patch.
Here are some examples:
django__django-12663
test_forward_in_lookup_filters_correctly
FAIL: test_forward_in_lookup_filters_correctly (foreign_object.tests.MultiColumnFKTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/testbed_model/tests/foreign_object/tests.py", line 133, in test_forward_in_lookup_filters_correctly
attrgetter('person_id')
File "/testbed_model/django/test/testcases.py", line 1052, in assertQuerysetEqual
return self.assertEqual(list(items), values, msg=msg)
AssertionError: Lists differ: [2, 3] != [2]
First list contains 1 additional elements.
First extra element 1:
3
- [2, 3]
+ [2]
django__django-13406
test_cast_to_text_field
ERROR: test_cast_to_text_field (db_functions.comparison.test_cast.CastTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/testbed_model/tests/db_functions/comparison/test_cast.py", line 142, in test_cast_to_text_field
self.assertEqual(Author.objects.values_list(Cast('age', models.TextField()), flat=True).get(), '1')
File "/testbed_model/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/testbed_model/django/db/models/query.py", line 863, in values_list
clone.query.set_values(fields)
File "/testbed_model/django/db/models/sql/query.py", line 2235, in set_values
self.add_fields(field_names, True)
File "/testbed_model/django/db/models/sql/query.py", line 1919, in add_fields
join_info = self.setup_joins(name.split(LOOKUP_SEP), opts, alias, allow_many=allow_m2m)
AttributeError: 'Cast' object has no attribute 'split'
sympy__sympy-12419
test_Sum_doit
=================================== FAILURES ===================================
________________________________ test_Sum_doit _________________________________
def test_Sum_doit():
assert Sum(n*Integral(a**2), (n, 0, 2)).doit() == a**3
assert Sum(n*Integral(a**2), (n, 0, 2)).doit(deep=False) == \
3*Integral(a**2)
assert summation(n*Integral(a**2), (n, 0, 2)) == 3*Integral(a**2)
# test nested sum evaluation
s = Sum( Sum( Sum(2,(z,1,n+1)), (y,x+1,n)), (x,1,n))
assert 0 == (s.doit() - n*(n+1)*(n-1)).factor()
> assert Sum(KroneckerDelta(m, n), (m, -oo, oo)).doit() == Piecewise((1, And(-oo < n, n < oo)), (0, True))
E assert 1 == Piecewise((1, (-oo < n) & (n < oo)), (0, True))
E + where 1 = doit()
E + where doit = Sum(KroneckerDelta(m, n), (m, -oo, oo)).doit
E + where Sum(KroneckerDelta(m, n), (m, -oo, oo)) = Sum(KroneckerDelta(m, n), (m, -oo, oo))
E + where KroneckerDelta(m, n) = KroneckerDelta(m, n)
E + and Piecewise((1, (-oo < n) & (n < oo)), (0, True)) = Piecewise((1, (-oo < n) & (n < oo)), (0, True))
sympy/concrete/tests/test_sums_products.py:557: AssertionError
sphinx-doc__sphinx-8120
test_babel_with_language_de
_________________________ test_babel_with_language_de __________________________
app = <SphinxTestApp buildername='latex'>
status = <_io.StringIO object at 0x7fbfb3d21550>
warning = <_io.StringIO object at 0x7fbfb3d21e50>
@pytest.mark.sphinx(
'latex', testroot='latex-babel',
confoverrides={'language': 'de'})
def test_babel_with_language_de(app, status, warning):
app.builder.build_all()
result = (app.outdir / 'python.tex').read_text()
print(result)
print(status.getvalue())
print(warning.getvalue())
assert '\documentclass[letterpaper,10pt,ngerman]{sphinxmanual}' in result
assert '\usepackage{babel}' in result
assert '\usepackage{times}' in result
assert '\usepackage[Sonny]{fncychap}' in result
assert ('\addto\captionsngerman{\renewcommand{\contentsname}{Table of content}}\n'
in result)
assert '\shorthandoff{"}' in result
# sphinxmessages.sty
result = (app.outdir / 'sphinxmessages.sty').read_text()
print(result)
assert r'\def\pageautorefname{Seite}' in result
> assert r'\addto\captionsngerman{\renewcommand{\figurename}{Fig.\@{} }}' in result
E AssertionError: assert '\addto\captionsngerman{\renewcommand{\figurename}{Fig.\@{} }}' in '%
% sphinxmessages.sty
%
% message resources for Sphinx
%
\ProvidesPackage{sphinxmessages}[2019/01/04 v2.0 Loca... }}
\def\fnum@table{\tablename\thetable{}}
\addto\captionsngerman{\renewcommand{\literalblockname}{List.}}'
tests/test_build_latex.py:563: AssertionError
Actual Results
System Information
Linux - 5.4.0-166-generic #183-Ubuntu SMP Mon Oct 2 11:28:33 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.15