seccomp: prepend -ENOSYS stub to all filters #2750
I think this is the least-disruptive way of implementing it, at the cost of not handling holes in the syscall table correctly (they will still get
Will this have a performance overhead?
No, it's only a handful of seccomp filter instructions. With my test programs, it only generates ~10 instructions to insert at the beginning of the filter while the actual
yalue/native_endian#1 sent to fix the
What happens if a policy explicitly permits What I like about this is that I expect that it will adapt to the set of system calls recognized by libseccomp. So even if the policy lists
Yes, and this is an intentional decision (in the long term it will be done by kernel version rather than the syscall numbers directly). The reason for this is that the recommended design for a seccomp profile is as an allow list (not a deny list) so any syscall not included is presumed to have been audited by the author and determined to "not be safe". Right now it's more of a footgun than if we ask the user to explicitly specify a minimum kernel version or something similar -- though I'm not sure it'll ever stop being a footgun completely. At the very least we've eliminated the "all future syscalls are banned" footgun. Requiring the user to explicitly specify
Yup, that's right. If a syscall isn't known by seccomp we treat it as though it isn't present in the filter at all and thus it won't be the newest syscall for any architecture.
Should part of this be implemented in https://github.com/seccomp/libseccomp-golang/?
@kolyshkin Long-term this functionality should live in libseccomp or whatever we use for filter generation in the future, but there's still lots of work that needs to happen in libseccomp before this functionality exists.
Any chance to post the list of the remaining work to https://github.com/seccomp/libseccomp/issues ?
Would it be possible to extend kernel to expose available syscall numbers via sysfs? For kernel without the sysfs patch, we would need to hard-code the largest syscall number.
There's already seccomp/libseccomp#11 and seccomp/libseccomp#286 which explain this issue and discuss possible solutions. The code in this PR really isn't the best way of solving the problem (ideally the
libseccomp already provides syscall number information -- that's what I'm using in this patch. That comment in the commit is explaining why we can't use libseccomp alone to solve this problem (libseccomp doesn't allow you to do conditional logic based on syscall number -- which is what I use in this PR to implement the Exposing syscall information from the kernel is something that has been discussed upstream many times, but I'm not sure that we're any closer to implementing it to be honest (though BTF is opening the door to that possibility).
This could be done, but it's completely orthogonal to the problem at hand. There is no need to know anything about the set of syscalls supported by the kernel you're running on. What you want to know is the set of syscalls that make up the API set the seccomp policy was written for. This was already discussed to death in #2151
I think we should split functions and maybe also Go package to reduce code complexity. We may also want BPF unit tests and RISC-V support, but all of them can be worked out later. So LGTM.
Moby CI with this is currently failing (moby/moby#41900), but the failures are probably unrelated.
I've added some unit tests after noticing that there were actually several bugs in the filter generation found by the unit tests. Now the unit tests pass and the filters generated are far more correct for multi-architecture filters (though again, we will never actually need this functionality in practice).
Now that we have a unit testing framework for seccomp-cBPF filters we can add some tests for the entire seccomp filter in the future (but for right now I think this is okay).
This allows applications to detect whether the kernel supports a syscall or not. Previously, the error was unconditionally EPERM. There are many reports of glibc failing on new syscalls in containerized environments when the host runs an old kernel.

More about the motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about the defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, the FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json
Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
Reviewed-By: Alexander Bokovoy <[email protected]>
Having -EPERM as the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

1. libseccomp has several limitations which require us to emulate
   behaviour we want:

   a. We cannot do logic based on syscall number, meaning we cannot
      specify a "largest known syscall number";
   b. libseccomp doesn't know in which kernel version a syscall was
      added, and has no API for "minimum kernel version" so we cannot
      simply ask libseccomp to generate sane -ENOSYS rules for us;
   c. Additional seccomp rules for the same syscall are not treated as
      distinct rules -- if rules overlap, libseccomp will merge them.
      This means we cannot add per-syscall -EPERM fallbacks;
   d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
   e. libseccomp does not allow you to specify multiple rules for a
      single argument, making it impossible to invert OR rules for
      arguments.

2. The runtime-spec does not have any way of specifying:

   a. The errno for the default action;
   b. The minimum kernel version or "newest syscall at time of profile
      creation"; nor
   c. Which syscalls were intentionally excluded from the allow list
      (weird syscalls that are no longer used were excluded entirely,
      but Docker et al expect those syscalls to get EPERM not ENOSYS).

3. Certain syscalls should not return -ENOSYS (especially only for
   certain argument combinations) because this could also trigger glibc
   confusion. This means we have to return -EPERM for certain syscalls
   but not as a global default.

4. There is not an obvious (and reasonable) upper limit to syscall
   numbers, so we cannot create a set of rules for each syscall above
   the largest syscall number in libseccomp. This means we must handle
   inverse rules as described below.

5. Any syscall can be specified multiple times, which can make
   generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.
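
The rule above can be modelled in a few lines of Go. This is only a sketch of the decision the stub makes, not the cBPF that gets generated, and it ignores per-architecture handling. `largestSyscall` and `stubReturnsENOSYS` are hypothetical helper names, the syscall numbers are x86_64 sample data, and treating an empty profile as "no stub" is an assumption of this sketch.

```go
package main

import "fmt"

// largestSyscall returns the highest syscall number mentioned in a profile.
// found is false when the profile lists no syscalls at all.
func largestSyscall(profile []int) (largest int, found bool) {
	for _, nr := range profile {
		if !found || nr > largest {
			largest, found = nr, true
		}
	}
	return largest, found
}

// stubReturnsENOSYS models the prepended stub: it reports true when the
// requested syscall number is above every number in the profile, and false
// when the call should fall through to the real filter.
func stubReturnsENOSYS(profile []int, nr int) bool {
	largest, found := largestSyscall(profile)
	return found && nr > largest
}

func main() {
	// x86_64 numbers, used purely as sample data:
	// read=0, write=1, openat=257.
	profile := []int{0, 1, 257}

	fmt.Println(stubReturnsENOSYS(profile, 439)) // faccessat2 on x86_64
	fmt.Println(stubReturnsENOSYS(profile, 2))   // open on x86_64
}
```

In this model a syscall such as open (2) that is absent from the profile but numbered below its largest entry is not caught by the stub and still reaches the filter's default action.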

Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortunately we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.

The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.
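
As a small sketch of that condition, assuming the action is available as its runtime-spec string name (the `needsENOSYSStub` helper is hypothetical and not the function used in this PR):

```go
package main

import "fmt"

// needsENOSYSStub reports whether the stub would be prepended for a given
// default action. Only the two permissive actions named in the commit
// message are excluded; everything else is treated as blocking.
func needsENOSYSStub(defaultAction string) bool {
	switch defaultAction {
	case "SCMP_ACT_ALLOW", "SCMP_ACT_LOG":
		return false
	default:
		return true
	}
}

func main() {
	actions := []string{"SCMP_ACT_ERRNO", "SCMP_ACT_KILL", "SCMP_ACT_LOG", "SCMP_ACT_ALLOW"}
	for _, action := range actions {
		fmt.Printf("%s: stub=%v\n", action, needsENOSYSStub(action))
	}
}
```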

Fixes #2151
Signed-off-by: Aleksa Sarai [email protected]