Training Software Engineering Agents and Verifiers with SWE-Gym

Jiayi Pan^*,1, Xingyao Wang^*,2, Graham Neubig³, Navdeep Jaitly⁴, Heng Ji², Alane Suhr^{^,1}, Yizhe Zhang^{^,4}

¹UC Berkeley, ²UIUC, ³CMU, ⁴Apple
_{^*Equal contribution, ^{^}Equal supervision}

We present SWE-Gym, the first environment for training real-world software engineering agents. We use it to train strong LM agents that achieve state-of-the-art open results on SWE-Bench, with early, promising scaling characteristics as we increase training and inference-time compute.

Progress in agents for software engineering has been limited by the lack of training environments that both include rigorous verification for reinforcement learning and cover the expansive tasks encountered in real-world repository-level engineering.

We introduce SWE-Gym: An Open Environment for Training Software Engineering Agents & Verifiers. Our baselines achieve new open SOTA - 32%/26% on SWE-Bench Verified/Lite, with promising scaling trends.

SWE-Gym enables scalable improvements for software engineering agents at both training and inference time. Our current results is primarily bottlenecked by training and inference compute, rather than the size of our environment.

SWE-Gym Environment

We create SWE-Gym, the first environment for training SWE agents, with 2.4K real tasks from 11 Python repos & a Lite split of 234 instances. SWE-Gym combines real-world Python tasks, repository context, executable environments, and test verification to train agents for solving software engineering problems.

SWE-Gym trains LMs as agents

When fine-tuned on less than 500 agent-environment interaction trajectories sampled from it from GPT-4o and Claude 3.5 Sonnet, we achieve +14% absolute gains on SWE-Bench Verified with an 32B LM-powered OpenHands agent.

SWE-Gym enables self-improvement

SWE-Gym is also effective across agent scaffolds. With rejection sampling fine-tuning and MoatlessTools scaffold, our 32B and 7B models achieve 20% and 10% respectively on SWE-Bench Lite through self-improvement.

SWE-Gym enables inference-time scaling

SWE-Gym enables inference-time scaling through verifiers trained on agent trajectories.
These verifiers identify most promising solutions via best-of-n selection, together with our learned agents, they achieve 32%/26% on SWE-Bench Verified/Lite, a new open SoTA.

Inference Time Scaling for Moatless Agent

Inference Time Scaling for OpenHands Agent

Our baselines on SWE-Gym shows strong scaling trends

Lastly, our ablations reveal strong scaling trends - performance is now bottlenecked by train and inference compute, rather than the size of our dataset. Pushing and improving these scaling trends further is an exciting direction for future work.

Reproducing Results

The Dataset

To access SWE-Gym dataset, checkout our huggingface hub page SWE-Gym

The environment constants are currently saved at SWE-Bench-Fork

We also have pre-built docker images for each instance under xingyaoww/sweb.eval.x86_64 prefix at docker hub.

The Experiments See docs/OpenHands.md and docs/MoatlessTools.md for instructions on reproducing results with our training and inference-time results for OpenHands and MoatlessTools agents.

📚 Citation

@misc{pan2024trainingsoftwareengineeringagents,
      title={Training Software Engineering Agents and Verifiers with SWE-Gym}, 
      author={Jiayi Pan and Xingyao Wang and Graham Neubig and Navdeep Jaitly and Heng Ji and Alane Suhr and Yizhe Zhang},
      year={2024},
      eprint={2412.21139},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2412.21139}, 
}

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
assets		assets
docs		docs
scripts		scripts
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Training Software Engineering Agents and Verifiers with SWE-Gym

SWE-Gym Environment

SWE-Gym trains LMs as agents

SWE-Gym enables self-improvement

SWE-Gym enables inference-time scaling

Our baselines on SWE-Gym shows strong scaling trends

Reproducing Results

📚 Citation

About

Releases

Packages

Languages

License

TatsuHaguioC23/SWE-Gym

Folders and files

Latest commit

History

Repository files navigation

Training Software Engineering Agents and Verifiers with SWE-Gym

SWE-Gym Environment

SWE-Gym trains LMs as agents

SWE-Gym enables self-improvement

SWE-Gym enables inference-time scaling

Our baselines on SWE-Gym shows strong scaling trends

Reproducing Results

📚 Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages