Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quickstart tutorial #1044

Merged
merged 3 commits into from
Jun 7, 2024
Merged

Conversation

incaseoftrouble
Copy link
Contributor

@incaseoftrouble incaseoftrouble commented May 26, 2024

First draft for the quickstart guide. It should include a short point how to add it for docker / what to consider there and maybe some basic troubleshooting without getting too large (I think this should remain at about 1-2 pages so it doesn't look more daunting than necessary).

Copy link
Member

@PhilippWendler PhilippWendler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot!

Besides the idea of this tutorial, I also like the overall structure, its scope, length, and level of detail.

I mostly have suggestions and questions about tweaking the phrasing a little bit.

I guess we should add more cross references from the Documentation section of the readme, the doc INDEX, and runexec.md?

README.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
README.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
@incaseoftrouble incaseoftrouble force-pushed the quickstart branch 2 times, most recently from ecafad6 to 3da2bf3 Compare May 28, 2024 11:33
@PhilippWendler PhilippWendler marked this pull request as draft May 28, 2024 13:41
@PhilippWendler PhilippWendler changed the title WIP Quickstart draft Quickstart tutorial May 28, 2024
Copy link
Member

@PhilippWendler PhilippWendler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could I please ask you to go through the tutorial and replace "run" wherever it is used as a verb with "execute"? Because "run" occurs so often with different meanings, we try to avoid with as a verb.

Looks great for me now, all comments are minor editorial stuff.

Do you have it marked as draft because you want to add something or can I merge?

README.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
@incaseoftrouble
Copy link
Contributor Author

incaseoftrouble commented May 28, 2024

Comments should be resolved.

I would really want to add a small "how to" for running in docker / podman but I couldn't get it to work.

$ podman run --security-opt unmask=/proc/* --security-opt unmask=/sys/fs/cgroup --security-opt seccomp=unconfined -it test
root@19ba3e5977c2:/# PYTHONPATH=/opt/benchexec.whl python3 -m benchexec.check_cgroups
<--- lots of warnings --->

BenchExec seems to be running in a Podman container without enabled cgroups.
Please pass "--security-opt unmask=/sys/fs/cgroup" to your "podman run" command.

and

$ docker run --privileged -v /sys/fs/cgroup/:/sys/fs/cgroup:rw -it test 
root@2736516436fc:/# PYTHONPATH=/opt/benchexec.whl python3 -m benchexec.check_cgroups
<--- same warnings --->

System is using cgroups v2 but not systemd.
If you are using BenchExec within a container, please ensure that cgroups are properly delegated into the container.
Otherwise please configure your system such that BenchExec can use cgroups.

I couldn't find anything else in the manuals ...

@PhilippWendler
Copy link
Member

Hm, seems I need to tweak the error-case detection and the error messages again...

The actual problem is that in your tests, BenchExec is not the only process in the container (there is also the shell) and that the container is also not running some cgroup manager (like systemd). This means that all processes in the container are in the same cgroup, so there is no cgroup with BenchExec alone and neither can we ask someone to create a cgroup for us. This is why BenchExec fails.

It should work if you ensure that only BenchExec is running in the container, i.e., start it directly as entry point of the container instead of spawning an interactive container.

I would love to find a way to handle this in BenchExec, but I do not really want to mess with the cgroups of other processes, and this would be required.

@incaseoftrouble
Copy link
Contributor Author

incaseoftrouble commented May 28, 2024

EDIT: I completely removed my writing-while-thinking comments, the results are now moved to separate issues.

It seems that docker run --privileged --cap-drop=all --cgroupns=host is needed to work with cgroups-v2, but there are some other issues related to benchexec.

I think it would be really beneficial to have a tutorial of how to actually manage the cgroups in an interactive container. A lot of artifact evaluations want everything to be self-contained inside the image (i.e. include data processing etc.). But since I believe this requires more work, it is probably best if we delay this.

@incaseoftrouble incaseoftrouble marked this pull request as ready for review May 28, 2024 18:33
@PhilippWendler
Copy link
Member

EDIT: I completely removed my writing-while-thinking comments, the results are now moved to separate issues.

Thanks. Continuing separately makes a lot of sense.

Regarding the removed comments, I want to add that it should have worked with Podman already, it is just that after my last comment you seem to have switched to Docker only, and there got some arguments wrong. I also tried to improve the documentation in this regard in 901b50c. If you find something that does not work as documented now, this would be another issue.

It seems that docker run --privileged --cap-drop=all --cgroupns=host is needed to work with cgroups-v2, but there are some other issues related to benchexec.

--cgroupns=host is not needed for me and would actually be contrary to what one wants. With cgroups v2, nesting of containers finally works properly, but with --cgroupns=host one disables this.

I think it would be really beneficial to have a tutorial of how to actually manage the cgroups in an interactive container. A lot of artifact evaluations want everything to be self-contained inside the image (i.e. include data processing etc.). But since I believe this requires more work, it is probably best if we delay this.

It is tricky, because use cases and systems are so diverse. We can probably not hope to give a tutorial "this is how it works", but we can likely give an example "this is how you can do it in this particular case". But what needs to be done for example already differs between the case "interactive container, configure cgroups, start BenchExec" and "non-interactive container with script wrapping BenchExec".

@incaseoftrouble
Copy link
Contributor Author

incaseoftrouble commented May 29, 2024

--cgroupns=host is not needed for me and would actually be contrary to what one wants. With cgroups v2, nesting of containers finally works properly, but with --cgroupns=host one disables this.

I needed that to be able to modify subtree_control easily. With --cgroupns=private, the docker container is put into a cgroup with empty subtree_control and this cannot be modified if there is a process in there. So, another solution is to create a sub-cgroup of the container's group, move the interactive shell there, set the subtree_control of the container's cgroup, create another group for benchexec, and then start benchexec inside that one. This seems kinda brittle.

But being able to run easily in an interactive docker setting is (I think) really important.

It is tricky, because use cases and systems are so diverse. We can probably not hope to give a tutorial "this is how it works", but we can likely give an example "this is how you can do it in this particular case"

Totally agree, but there should be a tutorial that works easily for a fairly recent setup (i.e. Ubuntu/Debian & docker). I think it is fine to strongly recommend podman, but docker should really work, too.

EDIT: I found a somewhat reliable way for docker, I will add it to the quickstart

@PhilippWendler
Copy link
Member

--cgroupns=host is not needed for me and would actually be contrary to what one wants. With cgroups v2, nesting of containers finally works properly, but with --cgroupns=host one disables this.

I needed that to be able to modify subtree_control easily.

Oh, I see now also from the explanation in the other issue. But this means that it works only because you escape the cgroups of the (outer) container completely. So for example any resource limits configured on the Docker container won't work. I won't recommend this.

So, another solution is to create a sub-cgroup of the container's group, move the interactive shell there, set the subtree_control of the container's cgroup, create another group for benchexec, and then start benchexec inside that one. This seems kinda brittle.

Well, this is exactly how cgroups v2 are supposed to be used (and what BenchExec does internally for itself), so this is not brittle per se. It starts getting brittle once you think about "moving the interactive shell there", because if you have more than that shell process (e.g., background processes), then you also need to move them. And if someone starts runexec twice in parallel in a script, you will have problems with race conditions in this step. Reading the existing list of processes and moving them can't be done atomically. And there is also no atomic "move process P from A to B" operation.

But being able to run easily in an interactive docker setting is (I think) really important.

Not arguing against this.

EDIT: I found a somewhat reliable way for docker, I will add it to the quickstart

Maybe discuss in a separate issue first? If we find something like this, I don't want to have it only inside the tutorial, but also in the main documentation. And if the final solution needs several iterations, it is nicer to have the discussion recorded in a specific issue.

@incaseoftrouble
Copy link
Contributor Author

It starts getting brittle once you think about "moving the interactive shell there"

Yes, this is what I meant. But now that I understood what is the issue I think it is fine to just put this into the entrypoint of the docker container (which is what I did)

Maybe discuss in a separate issue first?

Done in #1048

@incaseoftrouble
Copy link
Contributor Author

incaseoftrouble commented Jun 6, 2024

Ok, I merged the two things and discovered an important bit: Even for non-interactive cases, we need the "complicated" setup due to permissions!

If we only run podman run ... tag runexec ... this process is started as root inside the podman container -- which we explicitly advise against. However, if you directly start the container as user, that user does not have permissions to modify cgroups, so runexec will fail. So, in any case, we need to chown the root cgroup to the non-root user and then switch users only afterwards!

So either we remove that part ("benchexec should not be run as root" -> "its fine inside a container for non-interactive setups") or simply say "this is the way to do it"

@PhilippWendler
Copy link
Member

If we only run podman run ... tag runexec ... this process is started as root inside the podman container -- which we explicitly advise against. However, if you directly start the container as user, that user does not have permissions to modify cgroups, so runexec will fail. So, in any case, we need to chown the root cgroup to the non-root user and then switch users only afterwards!

Out of interest: How do you directly start a container as user?

And are you talking about Docker here or about rootless Podman? I wasn't aware that there is support for this, in particular that you can use something like su in a rootless Podman container (you cannot in a BenchExec container).

So either we remove that part ("benchexec should not be run as root" -> "its fine inside a container for non-interactive setups") or simply say "this is the way to do it"

Ok, so the following is true:

  • Outside of a container, BenchExec should not run as root.
  • In a Docker container (with --privileged or not), BenchExec should not run as root (because container root is host root). (There is AFAIK an optional mode in Docker to make use of user namespaces like Podman does, but I don't know exactly how to use it and I don't think anybody uses it, so I ignore this.)

To be honest, for the case of a rootless Podman I hadn't thought much about whether root is acceptable or not. My first intuition would be that it is probably acceptable.


After thinking more about this:

If you managed to use su in a rootless Podman container, then the container was not configured fully rootlessly, because this would not be possible. Instead what must have happened is that you have the suid tool newuidmap installed and configured (/etc/subuid) such that your user account is allowed to use a certain range of other uids as well. Podman optionally makes use of this to create containers with more than one user inside. You can probably verify this by looking under which uids the processes appear outside of all containers.

In a setup with Podman and usable newuidmap, using su provides a slight additional security boundary compared to not using it (for example, a process escaping both containers would not have access to the user's files).

However, I don't want to say "you have to have newuidmap". In a setup with Podman but without newuidmap, su is not usable and I think we can live with that.

What would be middle ground is to have a Podman container with just one user inside (no newuidmap required), but that user not having the uid 0 but some other uid. This is what BenchExec does by default, and Podman's default is what containerexec --root does. But I am not sure whether Podman supports this and if cgroups would be usable.

So, we would want to recommend su for Docker containers, but for rootless Podman is not necessary and restricts the potential use cases slightly. I would be fine with explaining the case for rootless Podman in detail (with Dockerfile and script) and telling the users what they additionally need to do for Docker in prose.


If you are interested in even more details about this (feel free to ignore, but for cgroups you were quite interested, so I am leaving this here):

There are several layers of isolation with regards to containers:
0. docker run --privileged provides almost no isolation and is trivial to escape, probably even accidentally.

  1. docker run is what AFAIK kernel developers call a privileged and insecure container, but Docker seems to get away with it in practice, so it is maybe not trivially escapable. This has still no separation between a user inside a container and the same uid outside the container.
  2. A container with a user namespace, but started as root. This is AFAIK the first level for which kernel developers would agree that some security boundary exists, but I don't think anyone wants to make a guarantee. If you start BenchExec as root, this is what you get.
  3. A container with a user namespace, not started as root, but with some suid helpers for setup. This is "rootless Podman" if you have newuidmap installed. The container itself is fully rootless and as secure as the next level, just with more usable users inside.
  4. A container with a user namespace, not started as root. This is the only way to start a container on a standard Linux system without the need for cooperation of the system administrator at any point. (At least for vanilla Linux, many distributions restrict even this.) This is what BenchExec provides if you start it as a regular user.

The main reason why I do not recommend 2. is that it is easy to overlook something during container setup, and then there is immediately an escape to being root at the host. I think there are also more things that a container tool needs to take care of if 2. is the target instead of 4., and BenchExec was written only for 4.

And 2. inside 0. or 1. (BenchExec in Docker) is just the same as 2. so we recommend 4. in 0. (Docker+su). But 2. inside 3. or 4. (BenchExec in rootless Podman) should not be worse than just 4. (BenchExec outside of container). The combination 4. in 3. is the solution with Podman+su. 4. in 4. is technically possible (you can nest BenchExec containers), but I don't know whether it is possible if Podman is the outer layer.

@incaseoftrouble
Copy link
Contributor Author

Out of interest: How do you directly start a container as user?

Add USER in the Dockerfile or --user=xy to the run command.

If you managed to use su in a rootless Podman container, then the container was not configured fully rootlessly, because this would not be possible.

IDK :D podman run -t debian:latest id gives me uid=0(root) gid=0(root) groups=0(root)

My /etc/subuid says <user>:100000:65536. I think this newuidmap was installed as a dependency of docker / podman.

You can probably verify this by looking under which uids the processes appear outside of all containers.

Running as 0 inside the container: My account; creating a new user and su into that: Some high id (100999)

So I guess that confirms the suspicion + I think I understand the issue.

telling the users what they additionally need to do for Docker in prose.

What will happen then is that people copy-paste the "unsafe" Dockerfile and use it with docker I think (we can be fine with that and explicitly say "this is at your own risk"; I'm just saying that this is probably what would happen).

I would be happy with this variant. Especially because the su variant (a) looks confusing and (b) (as I understand it) breaks on a "recommended" setup (podman without newuidmap); and going into detail what needs to be done then just makes this more daunting. I think especially for a "basic guide" it is better if the approach is universal than being perfectly secure.

but Docker seems to get away with it in practice

I think the working assumption is that docker images are trusted; but yeah.

@PhilippWendler
Copy link
Member

Hm, all this container handling by Podman seems to be even more complex. It seems that differently from what I thought and what I claimed above it does not use a user namespace per container, you need to request that explicitly with --userns=. It does seem to put all containers that you run into a shared user namespace, though (source). And if you want to have a separate user namespace without newuidmap (i.e., case 5 that BenchExec does by default) then you even need to manually specify a single-uid-mapping with --uidmap.

So although security-wise Podman containers are worse than I thought the complexity to explain all what is necessary for a proper solution is way beyond what we can and want to manage here.

This is just to leave my latest knowledge here, I will now review the current state of the PR.

Copy link
Member

@PhilippWendler PhilippWendler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot, looks great already! I have just minor comments.

doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/benchexec-in-container.md Outdated Show resolved Hide resolved
doc/quickstart.md Outdated Show resolved Hide resolved
@incaseoftrouble
Copy link
Contributor Author

All comments should be addressed

Copy link
Member

@PhilippWendler PhilippWendler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Two small typos, otherwise ready to merge for me. Do you agree?

purpose of the `init.sh` script. Only then can we create a separate, empty
cgroup dedicated to BenchExec.

Note that this also is the reason why non-interactive setups do not this. Then,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Note that this also is the reason why non-interactive setups do not this. Then,
Note that this also is the reason why non-interactive setups do not need this. Then,

containerized environments (such as Docker or Podman) and shows how you can
create your own interactive Docker image (or adapt existing ones) to use
BenchExec within it. We focus mainly on the (nowadays standard) cgroups v2,
a brief guidline for the (outdated) cgroups v1 is provided
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
a brief guidline for the (outdated) cgroups v1 is provided
a brief guideline for the (outdated) cgroups v1 is provided

@PhilippWendler
Copy link
Member

Oh, I overlooked the conflicts. The changes from 901b50c and 29480b3 need to be merged here.

One is the removal of -security-opt unmask=/sys/fs/cgroup and the other is this text that was added to INSTALL.md. I think most of it is covered already by the new file, but can you double check? Maybe what I wrote about manual creation of the cgroup makes sense to have in the background section?

@incaseoftrouble
Copy link
Contributor Author

incaseoftrouble commented Jun 7, 2024

One is the removal of -security-opt unmask=/sys/fs/cgroup

With this (podman run --cgroups=split --security-opt unmask=/proc/* --security-opt seccomp=unconfined -it ...) I get mkdir: cannot create directory '/sys/fs/cgroup/init': Read-only file system. I kept this in for now; maybe we can find a solution?

I merged the rest I think.

@PhilippWendler
Copy link
Member

PhilippWendler commented Jun 7, 2024

One is the removal of -security-opt unmask=/sys/fs/cgroup

With this (podman run --cgroups=split --security-opt unmask=/proc/* --security-opt seccomp=unconfined -it ...) I get mkdir: cannot create directory '/sys/fs/cgroup/init': Read-only file system. I kept this in for now; maybe we can find a solution?

Hm, maybe a version difference? I guess with the unmask it should work all the time.

I merged the rest I think.

Thank you very much for your extensive help with all this! It is a great improvement of our documentation.

I will merge this now and of course also make sure to list you as a contributor for the next release.

@PhilippWendler PhilippWendler merged commit 297ecd5 into sosy-lab:main Jun 7, 2024
7 checks passed
@incaseoftrouble
Copy link
Contributor Author

Hm, maybe a version difference? I guess with the unmask it should work all the time.

podman version 3.4.4

Thank you very much for your extensive help with all this! It is a great improvement of our documentation.

No problem, its helping me twofold, too. Esp. having runexec for my evaluation makes my life easier :)

list you as a contributor for the next release.

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Development

Successfully merging this pull request may close these issues.

2 participants