Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Moby CI is failing due to stdio issue introduced in f2f16213 init: close internal fds before execve #4384

Closed
AkihiroSuda opened this issue Aug 24, 2024 · 17 comments
Labels
Milestone

Comments

@AkihiroSuda
Copy link
Member

v1.1.0-929-gf2f16213 (f2f1621 init: close internal fds before execve) seems the first commit that is causing the issue reported by @kolyshkin ("Most failures are related to not getting output from a container run.").

=== Failed
=== FAIL: amd64.integration.container TestCreateWithCDIDevices (0.87s)
    cdi_test.go:68: assertion failed: [] does not contain FOO=injected

=== FAIL: amd64.integration.container TestDiff (0.34s)
    diff_test.go:28: assertion failed: 
        --- expected
        +++ items
          []container.FilesystemChange(
        - 	{{Kind: s"A", Path: "/foo"}, {Kind: s"A", Path: "/foo/bar"}},
        + 	nil,
          )
        

=== FAIL: amd64.integration.container TestNetworkNat (0.36s)
    nat_test.go:33: assertion failed: error is not nil: dial tcp 172.17.0.3:8080: connect: connection refused
=== FAIL: amd64.integration.plugin.logging TestReadPluginNoRead/explicitly_enabled_caching (1.00s)
    read_test.go:90: assertion failed: strings.TrimSpace(buf.String()) is not "hello world": []
    --- FAIL: TestReadPluginNoRead/explicitly_enabled_caching (1.00s)

=== FAIL: amd64.integration.plugin.logging TestReadPluginNoRead/default (1.03s)
    read_test.go:90: assertion failed: strings.TrimSpace(buf.String()) is not "hello world": []
    --- FAIL: TestReadPluginNoRead/default (1.03s)

=== FAIL: amd64.integration.plugin.logging TestReadPluginNoRead (5.20s)
    read_test.go:93: [d8b21c48ed581] daemon is not started

...

Looks like stdout is closed too early on some condition.

cc @cyphar @kolyshkin @rata @thaJeztah


Its parent commit v1.1.0-928-g8e1cd2f5 (8e1cd2f, init: verify after chdir that cwd is inside the container) still fails due to other issues.

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=false (0.32s)
    capabilities_linux_test.go:104: test produced invalid error: "writing sync procError: write sync: file already closed\nexec /bin/cat: operation not permitted", expected "exec /bin/cat: operation not permitted". Stdout:""
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.32s)
=== FAIL: amd64.integration.container TestUpdateRestartPolicy (0.35s)
    update_test.go:31: assertion failed: error is not nil: Error response from daemon: Cannot update container a33223f3221da201c20e58724db3bc9b04586d80b305096181f8b46b4cd95fd3: runc did not terminate successfully: exit status 1: cannot set cpu limit: container could not join or create cgroup
        : unknown

...

Originally posted by @AkihiroSuda in moby/moby#48366 (comment)

@rata
Copy link
Member

rata commented Aug 24, 2024

@AkihiroSuda awesome, thanks a lot!

So 8e1cd2f still fails with other errors, but its parent passes all the test? Or which is the last good commit?

f2f1621 is present in 1.13, and the backport seems to apply cleany. It is not obvious to me now why it is not causing issues in 1.1.

Did you manage to repro this locally in some way? I see you did all in CI, but I wonder now with the offending commit, if you see what might be missing locally

@rata
Copy link
Member

rata commented Aug 26, 2024

Reverting f2f1621 on top of main (the conflicts are just the context, it is trivial), fixes a lot of the tests: https://github.com/moby/moby/actions/runs/10558009347/job/29246756508?pr=48336. This fixes rootless and systemd+22.04, docker-py and maybe others on graphdriver too, and most of snapshotter tests are fixed too.

It still doesn't fix everything, though.

@rata
Copy link
Member

rata commented Aug 26, 2024

So, the first bad commit is "nsenter: cloned_binary: remove bindfd logic entirely" (b999376). The parent commit, 23e41ef, passes all the tests just fine. Here are the tests: https://github.com/moby/moby/actions/runs/10567858267/job/29277756893?pr=48336.

Of course we can't try to revert that in main, as the C file doesn't exist anymore. Any ideas welcome.

Here are my bisect logs, you can also see it here on the moby PR (moby/moby#48336):

$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# good: [f8ad20f500bf75edd86041657ee762bce116f8f5] runc kill: drop -a option
git bisect good f8ad20f500bf75edd86041657ee762bce116f8f5
# status: waiting for bad commit, 1 good commit known
# bad: [321aa20c49568ea2e2a3f708a50ccffb1bd67c79] scripts: add proper 386 and amd64 target triples and builds
git bisect bad 321aa20c49568ea2e2a3f708a50ccffb1bd67c79
# good: [dbe8434359ca35af1c1e10df42b1f4391c1e1010] Merge pull request #3949 from kinvolk/rata/idmap-improve-errors
git bisect good dbe8434359ca35af1c1e10df42b1f4391c1e1010
# bad: [c6e7b1a8ec55d0d3f3d6e72c7ea1d8bd6bca0f04] libct: initProcess.start: fix sync logic
git bisect bad c6e7b1a8ec55d0d3f3d6e72c7ea1d8bd6bca0f04
# good: [c77aaa3f95ec7cffbb24c53b0ee8a0b6d19d8839] criu checkpoint/restore: print errors from criu log
git bisect good c77aaa3f95ec7cffbb24c53b0ee8a0b6d19d8839
# bad: [80320b76b9d5d330b887c5d60331395ecd579a74] Merge pull request #3974 from kolyshkin/ci-cache
git bisect bad 80320b76b9d5d330b887c5d60331395ecd579a74
# bad: [acab6f6416142a302f9c324b3f1b66a1e46e29ef] Merge pull request #3931 from cyphar/remove-bindfd
git bisect bad acab6f6416142a302f9c324b3f1b66a1e46e29ef
# bad: [0d890ad66f4d1cb8f3591eff60cb49a2cf2e80d0] nsenter: cloned_binary: use MFD_EXEC and F_SEAL_EXEC
git bisect bad 0d890ad66f4d1cb8f3591eff60cb49a2cf2e80d0
# bad: [b999376fb237195265081a8b8ba3fd3bd6ef8c2c] nsenter: cloned_binary: remove bindfd logic entirely
git bisect bad b999376fb237195265081a8b8ba3fd3bd6ef8c2c
# first bad commit: [b999376fb237195265081a8b8ba3fd3bd6ef8c2c] nsenter: cloned_binary: remove bindfd logic entirely

@cyphar
Copy link
Member

cyphar commented Aug 27, 2024

It's very weird that the bindfd patch would cause these kinds of errors -- that patch just removes one of the methods we use for re-exec and the executable you use for re-exec should have no impact on anything the container process does.

As for closing too early, we don't close stdio so closing it early would also be very weird...

Let me see if I can reproduce one of these errors outside of Docker's CI.

Of course we can't try to revert that in main, as the C file doesn't exist anymore. Any ideas welcome.

You could reimplement in Go without too much effort, but I don't think it'd be worth it because we don't want to have that code anymore (it was removed for a reason) so it would only be used for testing. The bindfd approach was less secure than memfds and every iteration of it caused performance issues.

@rata
Copy link
Member

rata commented Aug 27, 2024

Adding more info to the puzzle.

If I disable the runc-binary sealing (no self/exe clone at all) and I revert "init: close internal fds before execve" (f2f1621) from runc main, then moby CI passes just fine: https://github.com/moby/moby/actions/runs/10581986818/job/29321023053?pr=48336

You can see the runc patches here: https://github.com/rata/runc/commits/rata/moby-fixes/. And the moby PR is: moby/moby#48336

(btw, to speed the moby CI, in the moby PR I've removed the windows tests and some build-kit and other tests that IIRC never failed) I've added back all the tests, after finding the result, and indeed all are green (except a ppcle build that seems to be unrelated and probably passes if we re run).

@rata
Copy link
Member

rata commented Aug 28, 2024

Reverting only one of the commits doesn't fix all the tests, I need both "reverts" for all tests to pass.

@cyphar
Copy link
Member

cyphar commented Aug 29, 2024

To clarify, these are the only failing tests, right? EDIT: Ah I was looking at the errors from #4384 (comment) not the original issue. However, this comment is still relevant for these tests:

  • TestBuildUserNamespaceValidateCapabilitiesAreV2
  • TestBuildSquashParent (only for ubuntu-20.04/graphdriver/systemd)

It seems those two tests use ContainerLogs without waiting for the container to finish running, and there might be a race condition where stdcopy.StdCopy finishes early. In principle that shouldn't happen (StdCopy shouldn't see an EOF until all writers are closed, but EOF is something invented by userspace and so it's possible that a read is returning early when it shouldn't if you add more execs), but there are so many layers of mulitplexing in Docker's stdio handling that it's fairly hard to track where the issue comes from.

Most of the other tests that use ContainerLogs wait for the container with:

poll.WaitOn(t, container.IsStopped(ctx, apiClient, cID), poll.WithDelay(100*time.Millisecond))

or similar. Does adding that to the failing tests also stop the failures?

@cyphar

This comment was marked as outdated.

@rata
Copy link
Member

rata commented Aug 29, 2024

@cyphar it seems that was right on spot, thanks!

So, using just runc 1.2.0-rc.2 and adding a lot of "wait for the container to be running" in moby and some other fixes in moby, fixes almost all the tests. Let me elaborate.

It does seem that moby is not waiting for the container to finish, therefore tests that check for the output sometimes works and sometimes doesn't. I've manually added waits (as @cyphar suggested) to the failing tests and that fixed some issues. Then I added other forms of waits (like if you start a server and you want to connect, wait for the container to be running and retry the connection instead of just failing if immediately the server is not ready) and it fixed another issues too.

There was also a problem on the daemon initialization, that sometimes EINTR is return and the daemon failed to start, instead of retrying.

Fixing all these problems in moby and just testing with runc 1.2.0-rc.2 makes almost only the docker-py tests fail.

The docker-py tests seems to come from here (https://github.com/docker/docker-py) and the failing test indeed don't seem to wait for the container to be ready. So it seems it is the same issue. I haven't tried to modify those tests, though.

One issue that I've seen on CI was that sometimes a container exits with code 137. I guess that is OOM? So, maybe this means that we might be using slightly more mem and moby might need to request more mem on the workers. Maybe those test returned 137 before and this is not new, though. It doesn't happen very frequently, I think, but this is how it looks like when it happens:

=== FAIL: amd64.integration.container TestWaitRestartedContainer/next-exit (5.78s)
    wait_test.go:236: assertion failed: 5 (expectedCode int64) != 137 (waitRes.StatusCode int64)
    --- FAIL: TestWaitRestartedContainer/next-exit (5.78s)

So, it seems the moby failures are not due to runc behaving incorrectly and fixing the race conditions on the moby repo they can be solved. However, it seems on moby CI runc is slower than before and that is why we need to add different kind of "wait" now. This CAN be an issue for other users not checking the container status before retrieving logs.

I'm not sure if that is something we should worry, though. I think probably not.

On one hand, it appears that the slowness difference is only seen on CI (I couldn't repro on my computer, on an Azure VM that was not very big, @kolyshkin couldn't repro either, etc.), and it makes sense, github actions are probably quite limited. On the other hand, we can explore ways to regain the speed back (e.g. it seems we can gain the speed back we revert the binary-sealing to fix the 2019 CVE and the closing of fds). However, I don't see any simple way to regain that speed and, given that is unclear this has any real world impact, I wouldn't explore that (I've tried using runtime.GC() to garbage-collect open fds not used, instead of closing the fds. It seems that might work, but I didn't find it gives us as many guarantees as just closing the fds).

So, IMHO, I think there is nothing to do in runc, at least for now.

What do others think?

My changes to moby are here, as always: https://github.com/moby/moby/pull/48336/files

@cyphar
Copy link
Member

cyphar commented Aug 29, 2024

Ah, I expected that to only help with some of the issues. What about the procError ones? I didn't expect those to be races with the log reading as well... EDIT: I misread the original comment. That issue has already been fixed.

On the other hand, we can explore ways to regain the speed back (e.g. it seems we can gain the speed back we revert the binary-sealing to fix the 2019 CVE and the closing of fds).

Unfortunately, copying the binary is going to be a bit slower than just doing a mount.

I am working on kernel patches that would let us avoid doing binary copying (by making the re-open of /proc/self/exe always fail). The general idea has gotten a positive reception but I suspect the implementation I have at the moment won't be accepted as-is... I've been a bit busy with libpathrs and some seccomp work the past month or so, but I'm hoping to send an RFC in a month or two.

@lifubang
Copy link
Member

lifubang commented Aug 30, 2024

What about the procError ones? I didn't expect those to be races with the log reading as well...

Yes, if execve returns an error, we should send procError to parent process, but if execve success, we have no way to close the log pipe fd, we will leak it to the container process. So it seems that there is no good way to send the execve error to parent process.

What do others think?

Awesome work!
I also think there is nothing to do in runc, except try our best to improve the performance of runc.

I am working on kernel patches that would let us avoid doing binary copying (by making the re-open of /proc/self/exe always fail). The general idea has gotten a positive reception but I suspect the implementation I have at the moment won't be accepted as-is... I've been a bit busy with libpathrs and some seccomp work the past month or so, but I'm hoping to send an RFC in a month or two.

Wish success.

@cyphar
Copy link
Member

cyphar commented Aug 30, 2024

Yes, if execve returns an error, we should send procError to parent process, but if execve success, we have no way to close the log pipe fd, we will leak it to the container process. So it seems that there is no good way to send the execve error to parent process.

@lifubang We close the pipe earlier anyway to tell the runc parent that setup is done -- you fixed the procError in #4171. The best you can do is stderr logging. I misread the original issue and was talking about this error (which was fixed in #4171 -- it only showed up because it was in the middle of a bisect):

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=false (0.32s)
    capabilities_linux_test.go:104: test produced invalid error: "writing sync procError: write sync: file already closed\nexec /bin/cat: operation not permitted", expected "exec /bin/cat: operation not permitted". Stdout:""
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.32s)

This problem has been fixed and the current logic looks reasonable to me.

@rata
Copy link
Member

rata commented Aug 30, 2024

Ah, I expected that to only help with some of the issues. What about the procError ones? I didn't expect those to be races with the log reading as well...

I haven't seen that error. Do you have links handy @cyphar ?

@cyphar
Copy link
Member

cyphar commented Aug 30, 2024

@rata I misread the original issue description, I was looking at this error (but it was from a bisect, it was fixed in #4171):

=== FAIL: amd64.integration.capabilities TestNoNewPrivileges/CapabilityRequested=false (0.32s)
    capabilities_linux_test.go:104: test produced invalid error: "writing sync procError: write sync: file already closed\nexec /bin/cat: operation not permitted", expected "exec /bin/cat: operation not permitted". Stdout:""
    --- FAIL: TestNoNewPrivileges/CapabilityRequested=false (0.32s)

I'll strike out that part of the comment since you both asked me about it 😅

@rata
Copy link
Member

rata commented Aug 30, 2024

@cyphar haha, sorry and thanks! :)

@cyphar
Copy link
Member

cyphar commented Oct 21, 2024

It seems this was fixed, right?

@rata
Copy link
Member

rata commented Oct 21, 2024

@cyphar Yes, IMHO we can close this.

On one hand, with the overlayfs protection for /proc/self/exe this is faster and almost no test fail on moby CI (only one in one worker, instead of lot in all the workers that run those tests, that are A LOT). On the other hand, I think moby should wait for the containers to run before trying to check logs in any case.

I'm closing this, if I'm missing something or someone disagrees, please feel free to re-open.

@rata rata closed this as completed Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants