Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Container snapshots/restore feature and other critical bug fixes #6441

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

kripper
Copy link
Contributor

@kripper kripper commented Jan 23, 2025

PR Summary

This PR introduces a feature to snapshot/restore the sandbox storage + several important fixes and features to enhance the overall functionality of the system.

Key Changes:

New Feature: Container Snapshot and Restore

This feature introduces the ability to snapshot and restore container storage. It mounts Docker's storage directory (/var/lib/docker) on a virtual Btrfs volume using a loop device within the OpenHands (OH) environment. This approach has been tested on Windows using VS Code and Dev Containers, with OH running in a Docker container and creating sandboxes in nested containers.

Motivation:

The primary goal of this feature is to enable reverting changes made by OH across the entire system. This allows users to effectively undo any system modifications introduced during interactions within OH, promoting a safer and more predictable development and testing environment.

Important Notes:

Current Status:

  • No additional features will be added at this moment.
  • Frontend developers are welcome to implement a button to allow users to revert snapshots (one per user message).

Usage Instructions:

To enable the snapshot and restore functionality, add the following configuration to your sandbox settings:

[sandbox]
docker_snapshots = true

If disabled, the Docker configuration will revert upon restarting OH.

Snapshot Creation:

  • Every time a user sends a message, a new snapshot of the container storage will be created.
  • The logs will display the command required to restore each snapshot.

Testing

  • The file openhands/storage/docker_snapshots.py contains a test case (unit-tests in comments) to debug and understand how it works.

Feel free to reach out if there are any issues or further questions.

@rbren rbren changed the title Fixes and Enhancement for v0.22 Fix for reattaching to Docker when using host network Jan 24, 2025
- Added sandbox.docker_snapshots config (bool)
- Mounts a btrfs loop device for docker storage
- Can be enabled/disabled (revert docker config change)
@kripper kripper changed the title Fix for reattaching to Docker when using host network Container snapshots and restore and other fixes. Jan 25, 2025
@kripper kripper changed the title Container snapshots and restore and other fixes. Container snapshots and restore and other critical bug fixes Jan 25, 2025
@kripper kripper changed the title Container snapshots and restore and other critical bug fixes Container snapshots/restore feature and other critical bug fixes Jan 25, 2025
@enyst
Copy link
Collaborator

enyst commented Jan 25, 2025

This is a great feature, thank you for working on it!

It might be good to try to keep the bug fixes separate from new feature(s). One smaller fixes PR and one large feature PR would mean that the fixes PR is faster to review and merge. If you need the fixes too, you can create/rebase the feature branch on top on the fixes branch. It shouldn't cause much trouble when merged, then.

Up to you, it just would be better IMHO. Because I think you are correct that the fixes are needed (independently), so it would be nice if we can take them in before the snapshots feature is ready.

@kripper
Copy link
Contributor Author

kripper commented Jan 25, 2025

It might be good to try to keep the bug fixes separate from new feature(s).

Here I isolated the critical bug fixes:
#6460

@rbren
Copy link
Collaborator

rbren commented Jan 29, 2025

Let's remove the dev_container stuff from this PR to keep changes isolated.

Feel free to open a second PR for dev_container, but no guarantees there. The maintainers explicitly chose to remove dev_container, and I don't want to add it back without gathering consensus first

@diwu-sf
Copy link
Contributor

diwu-sf commented Feb 6, 2025

I tried this on mac and it doesn't really work, and it looks like it now requires sudo root privileges, which feels unsafe to give to an app.

On a real deployment (SaaS in kubernetes, on-prem corporate, etc..) that serves multiple users, this would need a significantly different implementation (support from the runtime infra itself).

What's the plan for making this actually work in a real production deployment?

@kripper
Copy link
Contributor Author

kripper commented Feb 6, 2025

I tried this on mac and it doesn't really work, and it looks like it now requires sudo root privileges, which feels unsafe to give to an app.

What exactly didn't work? any error msg?

You can try the test case commented here:
https://github.com/All-Hands-AI/OpenHands/pull/6441/files#diff-56f99ca967ee59d7ce3134b274798e48f4fce237dce9c8dd6d2b3c419781fb28R3

@kripper
Copy link
Contributor Author

kripper commented Feb 6, 2025

and it looks like it now requires sudo root privileges, which feels unsafe to give to an app.

I'm running OH inside a container and sandboxes as nested containers (and sometimes those sandboxes create nested-nested-containers). I don't understand why people is running OH on the host. Probably not windows users.

Docker runs as root on my OH env.
If you want to restart a sandbox using a restored snapshot, you will need root access (sudo).
Alternatively, you could run the commands manually yourself, or setup your environment to run docker as a normal user, but I'm not supporting this use case.

@kripper
Copy link
Contributor Author

kripper commented Feb 6, 2025

support from the runtime infra itself

The runtime operates within the sandbox. While the runtime can trigger the snapshot/restore operation, the actual handling must occur outside the sandbox. Otherwise, you won't be able to capture the entire sandbox storage during the snapshot.

What's the plan for making this actually work in a real production deployment?

You mean running OH in production like a "OH as a service"?
Or you mean running a sandbox in production?
(Using "OH" and "in production" in the same sentence still sound a bit strange to me).

We are snapshotting sandboxes (docker containers) and docker containers are widely used in production.
If you are interested in taking and restoring snapshots on a bare metal host or VM, the same implemented strategy could be applied, but instead of taking a snapshot of the sandbox's container storage, we can take a snapshot of a mounted Btrfs volume.
And as I mentioned before, this wouldn't be a complete snapshot, only of the designated mounted volume.

Related: For #6457 we will probably need to snapshot a designated workspace. This is another additional requirement that might also align well with your use case.

And increased timeout, needed for slow headless docker in docker environments.
It's a sparse file (no real storage allocation)
@kripper kripper deleted the branch All-Hands-AI:main February 11, 2025 15:34
@kripper kripper closed this Feb 11, 2025
@kripper kripper deleted the main branch February 11, 2025 15:34
@kripper kripper restored the main branch February 11, 2025 15:37
@kripper kripper reopened this Feb 11, 2025
@kripper
Copy link
Contributor Author

kripper commented Feb 13, 2025

I tried this on mac and it doesn't really work,

Make sure to chmod +x openhands/storage/docker_snapshots.py
Fixed in 780acd0

@rbren
Copy link
Collaborator

rbren commented Feb 17, 2025

@kripper I think we need to break this into multiple PRs

The reattach on ExposedPorts is a simple bug, we should be able to merge that fix quickly

The snapshot stuff seems super cool, would love to get that in

The devcontainer stuff is controversial, and I'd love to have a separate PR where we can discuss whether or not to merge that

@rbren
Copy link
Collaborator

rbren commented Feb 17, 2025

Oh nice looks like a good chunk of it was broken out here! https://github.com/All-Hands-AI/OpenHands/pull/6460/files 🙏

@kripper
Copy link
Contributor Author

kripper commented Feb 17, 2025

Ok. I will remove the DevContainer folder from this PR in the future. My expectation was the other PR to be merged and so to avoid the additional effort of preparing PRs with cherry picked commits.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants