"Successful" patches fail on unexecuted developer-written tests #280

Open
PrinzyW opened this issue Jan 10, 2025 · 0 comments
Labels
bug Something isn't working


Describe the bug

To validate a patch for an issue, SWE-bench collects and executes only the test files that are changed in the corresponding PR. We found that some LLM-generated patches pass all fail_to_pass and pass_to_pass tests but fail on some unchanged test cases, whereas the oracle patch passes these unchanged test cases. This implies that these LLM-generated patches are actually incorrect and would typically not be accepted by developers. As a result, the leaderboard overestimates the effectiveness of several tools.

To improve the robustness of the validation, would it be better to run all tests, instead of only the test files modified in the PR?
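To make the difference concrete, here is a minimal sketch contrasting the current acceptance check with the stricter one proposed above. The status maps, the `PASSED` label, the test names, and both function names are illustrative assumptions, not the SWE-bench harness API.

```python
from typing import Dict, Iterable

PASSED = "PASSED"  # assumed status label; the real harness may use other values


def resolved_current(status: Dict[str, str],
                     fail_to_pass: Iterable[str],
                     pass_to_pass: Iterable[str]) -> bool:
    """Current-style check: only the listed tests have to pass."""
    required = list(fail_to_pass) + list(pass_to_pass)
    return all(status.get(test) == PASSED for test in required)


def resolved_strict(model_status: Dict[str, str],
                    oracle_status: Dict[str, str],
                    fail_to_pass: Iterable[str],
                    pass_to_pass: Iterable[str]) -> bool:
    """Proposed check: in addition, every test that passes on the
    oracle patch must also pass on the model patch."""
    if not resolved_current(model_status, fail_to_pass, pass_to_pass):
        return False
    return all(model_status.get(test) == PASSED
               for test, result in oracle_status.items()
               if result == PASSED)


# Hypothetical results for one instance (test names are made up).
f2p = ["test_new_behavior"]
p2p = ["test_same_file_old"]
oracle = {"test_new_behavior": PASSED,
          "test_same_file_old": PASSED,
          "test_other_file": PASSED}
model = {"test_new_behavior": PASSED,
         "test_same_file_old": PASSED,
         "test_other_file": "FAILED"}  # lives in a test file the PR did not touch

print(resolved_current(model, f2p, p2p))         # accepted by the current check
print(resolved_strict(model, oracle, f2p, p2p))  # rejected by the stricter check
```

Comparing against the oracle patch rather than requiring every test to pass keeps tests that are broken for unrelated reasons from penalizing the model patch.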

Steps/Code to Reproduce

For each patch corresponding to a resolved issue:

  1. Find all the test files in the target project.
  2. Execute all test files, both on the model patch and the oracle patch:
    1. For the underlying environment, we used the instance image of SWE-bench.
    2. To run the tests, we used the eval.sh script from the SWE-bench log directory, running one test file at a time.
    3. For parsing the test results, we used the get_logs_eval() function in swebench/harness/grading.py.
  3. Record any test that fails on the model patch but passes on the oracle patch.
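Steps 1 and 3 can be sketched roughly as follows. This is not the script we ran: the file-name patterns, the function names, and the `{test_name: status}` map shape with a `"PASSED"` label are assumptions made for illustration, and step 2 (running eval.sh inside the instance image) is left out.

```python
from pathlib import Path
from typing import Dict, List

# Assumed naming conventions for test files; individual projects may differ.
TEST_PATTERNS = ("test_*.py", "*_test.py", "tests.py")


def find_test_files(repo_root: str) -> List[Path]:
    """Step 1: collect candidate test files anywhere under repo_root."""
    root = Path(repo_root)
    found = set()
    for pattern in TEST_PATTERNS:
        found.update(p for p in root.rglob(pattern) if p.is_file())
    return sorted(found)


def regressions(model_status: Dict[str, str],
                oracle_status: Dict[str, str]) -> List[str]:
    """Step 3: tests that pass on the oracle patch but not on the model patch.

    A test missing from model_status (for example, because its file failed
    to import) is counted as a regression as well.
    """
    return sorted(
        name for name, status in oracle_status.items()
        if status == "PASSED" and model_status.get(name) != "PASSED"
    )


# Hypothetical per-test results for one instance.
oracle_status = {"t_a": "PASSED", "t_b": "PASSED", "t_c": "FAILED", "t_d": "PASSED"}
model_status = {"t_a": "PASSED", "t_b": "FAILED", "t_c": "FAILED"}
print(regressions(model_status, oracle_status))
```

Tests that fail on both patches (`t_c` here) are ignored, so failures caused by the environment rather than by the patch do not show up in the result.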

Note: Some tests may also fail on the oracle patch, possibly due to environment issues (e.g., the best environment for Django might be Windows, since that's what they use in their GitHub Actions workflow) or problems with the test runner (e.g., bin/test for SymPy doesn't recognize @XFAIL).

Expected Results

We ran all the developer-written tests for the patches submitted by 20241029_OpenHands-CodeAct-2.1-sonnet-20241022 and found 21 cases where at least one test failed on the model patch while passing on the oracle patch.

Here are some examples:

  • django__django-12663

    test_forward_in_lookup_filters_correctly

    FAIL: test_forward_in_lookup_filters_correctly (foreign_object.tests.MultiColumnFKTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/testbed_model/tests/foreign_object/tests.py", line 133, in test_forward_in_lookup_filters_correctly
        attrgetter('person_id')
      File "/testbed_model/django/test/testcases.py", line 1052, in assertQuerysetEqual
        return self.assertEqual(list(items), values, msg=msg)
    AssertionError: Lists differ: [2, 3] != [2]
    
    First list contains 1 additional elements.
    First extra element 1:
    3
    
    - [2, 3]
    + [2]
    
  • django__django-13406

    test_cast_to_text_field

    ERROR: test_cast_to_text_field (db_functions.comparison.test_cast.CastTests)
    ----------------------------------------------------------------------
    Traceback (most recent call last):
      File "/testbed_model/tests/db_functions/comparison/test_cast.py", line 142, in test_cast_to_text_field
        self.assertEqual(Author.objects.values_list(Cast('age', models.TextField()), flat=True).get(), '1')
      File "/testbed_model/django/db/models/manager.py", line 85, in manager_method
        return getattr(self.get_queryset(), name)(*args, **kwargs)
      File "/testbed_model/django/db/models/query.py", line 863, in values_list
        clone.query.set_values(fields)
      File "/testbed_model/django/db/models/sql/query.py", line 2235, in set_values
        self.add_fields(field_names, True)
      File "/testbed_model/django/db/models/sql/query.py", line 1919, in add_fields
        join_info = self.setup_joins(name.split(LOOKUP_SEP), opts, alias, allow_many=allow_m2m)
    AttributeError: 'Cast' object has no attribute 'split'
    
  • sympy__sympy-12419

    test_Sum_doit

    =================================== FAILURES ===================================
    ________________________________ test_Sum_doit _________________________________
    
        def test_Sum_doit():
            assert Sum(n*Integral(a**2), (n, 0, 2)).doit() == a**3
            assert Sum(n*Integral(a**2), (n, 0, 2)).doit(deep=False) == \
                3*Integral(a**2)
            assert summation(n*Integral(a**2), (n, 0, 2)) == 3*Integral(a**2)
        
            # test nested sum evaluation
            s = Sum( Sum( Sum(2,(z,1,n+1)), (y,x+1,n)), (x,1,n))
            assert 0 == (s.doit() - n*(n+1)*(n-1)).factor()
        
    >       assert Sum(KroneckerDelta(m, n), (m, -oo, oo)).doit() == Piecewise((1, And(-oo < n, n < oo)), (0, True))
    E       assert 1 == Piecewise((1, (-oo < n) & (n < oo)), (0, True))
    E        +  where 1 = doit()
    E        +    where doit = Sum(KroneckerDelta(m, n), (m, -oo, oo)).doit
    E        +      where Sum(KroneckerDelta(m, n), (m, -oo, oo)) = Sum(KroneckerDelta(m, n), (m, -oo, oo))
    E        +        where KroneckerDelta(m, n) = KroneckerDelta(m, n)
    E        +  and   Piecewise((1, (-oo < n) & (n < oo)), (0, True)) = Piecewise((1, (-oo < n) & (n < oo)), (0, True))
    
    sympy/concrete/tests/test_sums_products.py:557: AssertionError
    
  • sphinx-doc__sphinx-8120

    test_babel_with_language_de

    _________________________ test_babel_with_language_de __________________________
    
    app = <SphinxTestApp buildername='latex'>
    status = <_io.StringIO object at 0x7fbfb3d21550>
    warning = <_io.StringIO object at 0x7fbfb3d21e50>
    
        @pytest.mark.sphinx(
            'latex', testroot='latex-babel',
            confoverrides={'language': 'de'})
        def test_babel_with_language_de(app, status, warning):
            app.builder.build_all()
            result = (app.outdir / 'python.tex').read_text()
            print(result)
            print(status.getvalue())
            print(warning.getvalue())
            assert '\documentclass[letterpaper,10pt,ngerman]{sphinxmanual}' in result
            assert '\usepackage{babel}' in result
            assert '\usepackage{times}' in result
            assert '\usepackage[Sonny]{fncychap}' in result
            assert ('\addto\captionsngerman{\renewcommand{\contentsname}{Table of content}}\n'
                    in result)
            assert '\shorthandoff{"}' in result
        
            # sphinxmessages.sty
            result = (app.outdir / 'sphinxmessages.sty').read_text()
            print(result)
            assert r'\def\pageautorefname{Seite}' in result
    >       assert r'\addto\captionsngerman{\renewcommand{\figurename}{Fig.\@{} }}' in result
    E       AssertionError: assert '\addto\captionsngerman{\renewcommand{\figurename}{Fig.\@{} }}' in '%
    % sphinxmessages.sty
    %
    % message resources for Sphinx
    %
    \ProvidesPackage{sphinxmessages}[2019/01/04 v2.0 Loca... }}
    \def\fnum@table{\tablename\thetable{}}
    
    \addto\captionsngerman{\renewcommand{\literalblockname}{List.}}'
    
    tests/test_build_latex.py:563: AssertionError
    

Actual Results

System Information

Linux - 5.4.0-166-generic 183-Ubuntu SMP Mon Oct 2 11:28:33 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Python 3.10.15

PrinzyW added the bug label on Jan 10, 2025