Describe the bug
To validate a patch for an issue, SWE-bench collects and executes only the test files that are changed in the corresponding PR. We found that some LLM-generated patches pass all fail_to_pass and pass_to_pass tests but fail on unchanged test cases that the oracle patch passes. This implies that these LLM-generated patches are actually incorrect and would typically not be accepted by developers. As a result, the leaderboard overestimates the effectiveness of several tools.
To improve the robustness of the validation, would it be better to run all tests instead of only the test files modified in the PR?
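To make the gap concrete, here is a minimal sketch of how resolution is decided today, assuming the public Hugging Face release of the dataset (SWE-bench_Lite is used only for illustration) and the FAIL_TO_PASS / PASS_TO_PASS fields it exposes. The is_resolved() helper is a simplification of the harness's grading logic, not its exact code: the verdict depends solely on the two test lists, both drawn from PR-modified test files, so a regression in an unchanged test file cannot affect it.

```python
# Minimal sketch (not the actual harness code) of how "resolved" is decided.
# Assumes the public SWE-bench dataset layout, where FAIL_TO_PASS / PASS_TO_PASS
# are JSON-encoded lists of test identifiers taken from PR-modified test files.
import json
from datasets import load_dataset

ds = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")
inst = ds[0]

fail_to_pass = json.loads(inst["FAIL_TO_PASS"])  # must flip to PASSED under the patch
pass_to_pass = json.loads(inst["PASS_TO_PASS"])  # must remain PASSED under the patch

def is_resolved(status_map: dict[str, str]) -> bool:
    """status_map: test name -> status ('PASSED', 'FAILED', ...) parsed from the eval log."""
    return (
        all(status_map.get(t) == "PASSED" for t in fail_to_pass)
        and all(status_map.get(t) == "PASSED" for t in pass_to_pass)
    )

# Any test outside these two lists -- i.e., any test file the PR did not touch --
# never influences the verdict, which is exactly the gap described above.
```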
Steps/Code to Reproduce
For each patch corresponding to a resolved issue:
1. Find all the test files in the target project.
2. Execute every test file, one file at a time, against both the model patch and the oracle patch:
   - Environment: the per-instance image provided by SWE-bench.
   - Test execution: the eval.sh script from the SWE-bench log directory, run on one test file at a time.
   - Result parsing: the get_logs_eval() function in swebench/harness/grading.py.
3. Record any test that fails on the model patch but passes on the oracle patch (a sketch of this step follows the note below).
Note: Some tests can also fail on the oracle patch, possibly due to environment issues (e.g., the best environment for Django might be Windows, since that is what its GitHub Actions workflow uses) or to problems with the testing method (e.g., Sympy's bin/test doesn't recognize @XFAIL).
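The following is a minimal sketch of the recording step under stated assumptions: the full-suite eval logs for the model patch and for the oracle (gold) patch have already been written to disk, one log file per test file, and get_logs_eval() returns a (status map, success flag) pair. Its exact signature differs across SWE-bench versions (newer releases may also take a test spec), and the directory layout here is purely illustrative.

```python
# Sketch of step 3: diff per-test statuses between the model-patch run and the gold run.
# Assumes one eval log per test file in each directory; get_logs_eval()'s signature is
# version-dependent, so treat the call below as illustrative rather than canonical.
from pathlib import Path
from swebench.harness.grading import get_logs_eval

MODEL_LOG_DIR = Path("logs/full_suite/model")  # hypothetical layout
GOLD_LOG_DIR = Path("logs/full_suite/gold")    # hypothetical layout

def collect_statuses(log_dir: Path) -> dict[str, str]:
    """Merge 'test name -> PASSED/FAILED/ERROR/...' maps across all per-file logs."""
    merged: dict[str, str] = {}
    for log_fp in sorted(log_dir.glob("*.log")):
        status_map, ok = get_logs_eval(str(log_fp))  # assumed (path) -> (dict, bool)
        if ok:
            merged.update(status_map)
    return merged

model_statuses = collect_statuses(MODEL_LOG_DIR)
gold_statuses = collect_statuses(GOLD_LOG_DIR)

# A regression: the model patch fails or errors a test that the gold patch passes.
regressions = sorted(
    test for test, status in model_statuses.items()
    if status in ("FAILED", "ERROR") and gold_statuses.get(test) == "PASSED"
)
for test in regressions:
    print(test)
```

Repeating this comparison over all resolved instances of a submission yields the per-instance regression lists reported below.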
Expected Results
We ran all the developer-written tests for the patches submitted by 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 and found 21 cases where at least one test failed on the model patch while passing on the oracle patch.
Here are some examples:
django__django-12663
test_forward_in_lookup_filters_correctly
FAIL: test_forward_in_lookup_filters_correctly (foreign_object.tests.MultiColumnFKTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/testbed_model/tests/foreign_object/tests.py", line 133, in test_forward_in_lookup_filters_correctly
attrgetter('person_id')
File "/testbed_model/django/test/testcases.py", line 1052, in assertQuerysetEqual
return self.assertEqual(list(items), values, msg=msg)
AssertionError: Lists differ: [2, 3] != [2]
First list contains 1 additional elements.
First extra element 1:
3
- [2, 3]
+ [2]
django__django-13406
test_cast_to_text_field
ERROR: test_cast_to_text_field (db_functions.comparison.test_cast.CastTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/testbed_model/tests/db_functions/comparison/test_cast.py", line 142, in test_cast_to_text_field
self.assertEqual(Author.objects.values_list(Cast('age', models.TextField()), flat=True).get(), '1')
File "/testbed_model/django/db/models/manager.py", line 85, in manager_method
return getattr(self.get_queryset(), name)(*args, **kwargs)
File "/testbed_model/django/db/models/query.py", line 863, in values_list
clone.query.set_values(fields)
File "/testbed_model/django/db/models/sql/query.py", line 2235, in set_values
self.add_fields(field_names, True)
File "/testbed_model/django/db/models/sql/query.py", line 1919, in add_fields
join_info = self.setup_joins(name.split(LOOKUP_SEP), opts, alias, allow_many=allow_m2m)
AttributeError: 'Cast' object has no attribute 'split'
sympy__sympy-12419
test_Sum_doit
=================================== FAILURES ===================================
________________________________ test_Sum_doit _________________________________
def test_Sum_doit():
assert Sum(n*Integral(a**2), (n, 0, 2)).doit() == a**3
assert Sum(n*Integral(a**2), (n, 0, 2)).doit(deep=False) == \
3*Integral(a**2)
assert summation(n*Integral(a**2), (n, 0, 2)) == 3*Integral(a**2)
# test nested sum evaluation
s = Sum( Sum( Sum(2,(z,1,n+1)), (y,x+1,n)), (x,1,n))
assert 0 == (s.doit() - n*(n+1)*(n-1)).factor()
> assert Sum(KroneckerDelta(m, n), (m, -oo, oo)).doit() == Piecewise((1, And(-oo < n, n < oo)), (0, True))
E assert 1 == Piecewise((1, (-oo < n) & (n < oo)), (0, True))
E + where 1 = doit()
E + where doit = Sum(KroneckerDelta(m, n), (m, -oo, oo)).doit
E + where Sum(KroneckerDelta(m, n), (m, -oo, oo)) = Sum(KroneckerDelta(m, n), (m, -oo, oo))
E + where KroneckerDelta(m, n) = KroneckerDelta(m, n)
E + and Piecewise((1, (-oo < n) & (n < oo)), (0, True)) = Piecewise((1, (-oo < n) & (n < oo)), (0, True))
sympy/concrete/tests/test_sums_products.py:557: AssertionError
sphinx-doc__sphinx-8120
test_babel_with_language_de
_________________________ test_babel_with_language_de __________________________
app = <SphinxTestApp buildername='latex'>
status = <_io.StringIO object at 0x7fbfb3d21550>
warning = <_io.StringIO object at 0x7fbfb3d21e50>
@pytest.mark.sphinx(
'latex', testroot='latex-babel',
confoverrides={'language': 'de'})
def test_babel_with_language_de(app, status, warning):
app.builder.build_all()
result = (app.outdir / 'python.tex').read_text()
print(result)
print(status.getvalue())
print(warning.getvalue())
assert '\documentclass[letterpaper,10pt,ngerman]{sphinxmanual}' in result
assert '\usepackage{babel}' in result
assert '\usepackage{times}' in result
assert '\usepackage[Sonny]{fncychap}' in result
assert ('\addto\captionsngerman{\renewcommand{\contentsname}{Table of content}}\n'
in result)
assert '\shorthandoff{"}' in result
# sphinxmessages.sty
result = (app.outdir / 'sphinxmessages.sty').read_text()
print(result)
assert r'\def\pageautorefname{Seite}' in result
> assert r'\addto\captionsngerman{\renewcommand{\figurename}{Fig.\@{} }}' in result
E AssertionError: assert '\addto\captionsngerman{\renewcommand{\figurename}{Fig.\@{} }}' in '%
% sphinxmessages.sty
%
% message resources for Sphinx
%
\ProvidesPackage{sphinxmessages}[2019/01/04 v2.0 Loca... }}
\def\fnum@table{\tablename\thetable{}}
\addto\captionsngerman{\renewcommand{\literalblockname}{List.}}'
tests/test_build_latex.py:563: AssertionError
Actual Results
System Information
Linux - 5.4.0-166-generic #183-Ubuntu SMP Mon Oct 2 11:28:33 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
Python 3.10.15