Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

seccomp: prepend -ENOSYS stub to all filters #2750

Merged
merged 3 commits into from
Jan 31, 2021
Merged

seccomp: prepend -ENOSYS stub to all filters #2750

merged 3 commits into from
Jan 31, 2021

Conversation

cyphar
Copy link
Member

@cyphar cyphar commented Jan 19, 2021

Having -EPERM is the default was a fairly significant mistake from a
future-proofing standpoint in that it makes any new syscall return a
non-ignorable error (from glibc's point of view). We need to correct
this now because faccessat2(2) is something glibc critically needs to
have support for, but they're blocked on container runtimes because we
return -EPERM unconditionally (leading to confusion in glibc). This is
also a problem we're probably going to keep running into in the future.

Unfortunately there are several issues which stop us from having a clean
solution to this problem:

  1. libseccomp has several limitations which require us to emulate
    behaviour we want:

    a. We cannot do logic based on syscall number, meaning we cannot
    specify a "largest known syscall number";
    b. libseccomp doesn't know in which kernel version a syscall was
    added, and has no API for "minimum kernel version" so we cannot
    simply ask libseccomp to generate sane -ENOSYS rules for us.
    c. Additional seccomp rules for the same syscall are not treated as
    distinct rules -- if rules overlap, seccomp will merge them. This
    means we cannot add per-syscall -EPERM fallbacks;
    d. There is no inverse operation for SCMP_CMP_MASKED_EQ;
    e. libseccomp does not allow you to specify multiple rules for a
    single argument, making it impossible to invert OR rules for
    arguments.

  2. The runtime-spec does not have any way of specifying:

    a. The errno for the default action;
    b. The minimum kernel version or "newest syscall at time of profile
    creation"; nor
    c. Which syscalls were intentionally excluded from the allow list
    (weird syscalls that are no longer used were excluded entirely,
    but Docker et al expect those syscalls to get EPERM not ENOSYS).

  3. Certain syscalls should not return -ENOSYS (especially only for
    certain argument combinations) because this could also trigger glibc
    confusion. This means we have to return -EPERM for certain syscalls
    but not as a global default.

  4. There is not an obvious (and reasonable) upper limit to syscall
    numbers, so we cannot create a set of rules for each syscall above
    the largest syscall number in libseccomp. This means we must handle
    inverse rules as described below.

  5. Any syscall can be specified multiple times, which can make
    generation of hotfix rules much harder.

As a result, we have to work around all of these things by coming up
with a heuristic to stop the bleeding. In the future we could hopefully
improve the situation in the runtime-spec and libseccomp.

The solution applied here is to prepend a "stub" filter which returns
-ENOSYS if the requested syscall has a larger syscall number than any
syscall mentioned in the filter. The reason for this specific rule is
that syscall numbers are (roughly) allocated sequentially and thus newer
syscalls will (usually) have a larger syscall number -- thus causing our
filters to produce -ENOSYS if the filter was written before the syscall
existed.

Sadly this is not a perfect solution because syscalls can be added
out-of-order and the syscall table can contain holes for several
releases. Unfortuntely we do not have a nicer solution at the moment
because there is no library which provides information about which Linux
version a syscall was introduced in. Until that exists, this workaround
will have to be good enough.

The above behaviour only happens if the default action is a blocking
action (in other words it is not SCMP_ACT_LOG or SCMP_ACT_ALLOW). If the
default action is permissive then we don't do any patching.

Fixes #2151
Signed-off-by: Aleksa Sarai [email protected]

@cyphar
Copy link
Member Author

cyphar commented Jan 19, 2021

I think this is the least-disruptive way of implementing it, at the cost of not handling holes in the syscall table correctly (they will still get -EPERM not -ENOSYS).

/cc @richfelker @fweimer-rh

go.mod Outdated Show resolved Hide resolved
@AkihiroSuda
Copy link
Member

Will this have a performance overhead?

@cyphar
Copy link
Member Author

cyphar commented Jan 19, 2021

Will this have a performance overhead?

No, it's only a handful of seccomp filter instructions. With my test programs, it only generates ~10 instructions to insert at the beginning of the filter while the actual libseccomp-generated filter is ~900 instructions long. BPF filters are also really fast anyway.

@cyphar
Copy link
Member Author

cyphar commented Jan 19, 2021

yalue/native_endian#1 sent to fix the make cross failure. The package is missing ppc64le.

@fweimer-rh
Copy link

What happens if a policy explicitly permits process_madvise, without mentioning faccessat2? Wouldn't that cause faccessat2 to fail with EPERM? (On x86-64, the numbers are 440 and 439.) I guess this is just something policy authors need to be aware of, for a while at least?

What I like about this that I expect that it will adapt to the set of system calls recognized by libseccomp. So even if the policy lists faccessat2, but libseccomp does not support it, there's no system call number for it, the ENOSYS boundary will be below faccessat2, and faccesat2 will fail with ENOSYS in the container, as expected.

@cyphar
Copy link
Member Author

cyphar commented Jan 19, 2021

@fweimer-rh

What happens if a policy explicitly permits process_madvise, without mentioning faccessat2? Wouldn't that cause faccessat2 to fail with EPERM? (On x86-64, the numbers are 440 and 439.) I guess this is just something policy authors need to be aware of, for a while at least?

Yes, and this is an intentional decision (in the long term it will be done by kernel version rather than the syscall numbers directly). The reason for this is that the recommended design for a seccomp profile is as an allow list (not a deny list) so any syscall not included is presumed to have been audited by the author and determined to "not be safe". Right now it's more of a footgun than if we ask the user to explicitly specify a minimum kernel version or something similar -- though I'm not sure it'll ever stop being a footgun completely. At the very least we've eliminated the "all future syscalls are banned" footgun.

Requring the user to explicitly specify -EPERM rules for syscalls to indicate they really don't want them has two issues:

  • libseccomp doesn't support specifying the default action for a rule (I don't know why). And you can't make the default action -ENOSYS because then complicated rules won't fail with -EPERM anymore.
  • It makes filter generators like those in Docker more complicated when runc in principle has all the information it needs.

What I like about this that I expect that it will adapt to the set of system calls recognized by libseccomp. So even if the policy lists faccessat2, but libseccomp does not support it, there's no system call number for it, the ENOSYS boundary will be below faccessat2, and faccesat2 will fail with ENOSYS in the container, as expected.

Yup, that's right. If a syscall isn't known by seccomp we treat it as though it isn't present in the filter at all and thus it won't be the newest syscall for any architecture.

@kolyshkin

This comment has been minimized.

@kolyshkin
Copy link
Contributor

Should part of this be implemented in https://github.com/seccomp/libseccomp-golang/?

@cyphar
Copy link
Member Author

cyphar commented Jan 20, 2021

@kolyshkin Long-term this functionality should live in libseccomp or whatever we use for filter generation in the future, but there's still lots of work that needs to happen in libseccomp before this functionality exists.

@AkihiroSuda
Copy link
Member

@kolyshkin Long-term this functionality should live in libseccomp or whatever we use for filter generation in the future, but there's still lots of work that needs to happen in libseccomp before this functionality exists.

Any chance to post the list of the remaining work to https://github.com/seccomp/libseccomp/issues ?
This feature seems very useful for non-runc applications as well. So I'd like to see this as a part of the libseccomp C library.

@AkihiroSuda
Copy link
Member

AkihiroSuda commented Jan 21, 2021

We cannot do logic based on syscall number, meaning we cannot specify a "largest known syscall number";

Would it be possible to extend kernel to expose available syscall numbers via sysfs?

For kernel without the sysfs patch, we would need to hard-code the largest syscall number.

@cyphar
Copy link
Member Author

cyphar commented Jan 21, 2021

@AkihiroSuda

Any chance to post the list of the remaining work to https://github.com/seccomp/libseccomp/issues ?

There's already seccomp/libseccomp#11 and seccomp/libseccomp#286 which explain this issue and discuss possible solutions.

The code in this PR really isn't the best way of solving the problem (ideally the -ENOSYS handling would be at the tail of the filter, and it would be based on kernel version infromation) -- but in order to solve the problem nicely it'd be necessary to put that code in libseccomp.

Would it be possible to extend kernel to expose available syscall numbers via sysfs?

libseccomp already provides syscall number information -- that's what I'm using in this patch. That comment in the commit is explaining why we can't use libseccomp alone to solve this problem (libseccomp doesn't allow you to do conditional logic based on syscall number -- which is what I use in this PR to implement the -ENOSYS behaviour).

Exposing syscall information from the kernel is something that has been discussed upstream many times, but I'm not sure that we're any closer to implementing it to be honest (though BTF is opening the door to that possibility).

@richfelker
Copy link

Would it be possible to extend kernel to expose available syscall numbers via sysfs?

This could be done, but it's completely orthogonal to the problem at hand. There is no need to know anything about the set of syscalls supported by the kernel you're running on. What you want to know is the set of syscalls that make up the API set the seccomp policy was written for.

This was already discussed to death in #2151

AkihiroSuda
AkihiroSuda previously approved these changes Jan 21, 2021
Copy link
Member

@AkihiroSuda AkihiroSuda left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should split functions and maybe also Go package to reduce code complexity. We may also want BPF unit tests and RISC-V support, but all of them can be worked out later. So LGTM.

Moby CI with this is currently failing (moby/moby#41900) , but failures are probably unrelated.

@cyphar

This comment has been minimized.

@cyphar
Copy link
Member Author

cyphar commented Jan 25, 2021

I've added some unit tests after noticing that there were actually several bugs in the filter generation found by the unit tests. Now the unit tests pass and the filters generated are far more correct for multi-architecture filters (though again, we will never actually need this functionality in practice).

@cyphar
Copy link
Member Author

cyphar commented Jan 27, 2021

Now that we have a unit testing framework for seccomp-cBPF filters we can add some tests for the entire seccomp filter in the future (but for right now I think this is okay).

silencebay added a commit to silencebay/clash-tproxy that referenced this pull request Jul 5, 2021
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 15, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
abbra pushed a commit to freeipa/freeipa that referenced this pull request Oct 18, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
Reviewed-By: Alexander Bokovoy <[email protected]>
flo-renaud pushed a commit to flo-renaud/freeipa that referenced this pull request Oct 18, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
flo-renaud pushed a commit to flo-renaud/freeipa that referenced this pull request Oct 18, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 19, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error unconditionally was EPERM.
There are many issues about glibc failed to new syscalls in containerized
environments for which host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
abbra pushed a commit to freeipa/freeipa that referenced this pull request Oct 19, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
Reviewed-By: Alexander Bokovoy <[email protected]>
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 21, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error unconditionally was EPERM.
There are many issues about glibc failed to new syscalls in containerized
environments for which host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
rcritten pushed a commit to flo-renaud/freeipa that referenced this pull request Oct 21, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
abbra pushed a commit to freeipa/freeipa that referenced this pull request Oct 23, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
Reviewed-By: Alexander Bokovoy <[email protected]>
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Oct 25, 2021
This allows application to detect whether the kernel supports
syscall or not. Previously, an error unconditionally was EPERM.
There are many issues about glibc failed to new syscalls in containerized
environments for which host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json
ailispaw added a commit to bargees/moby that referenced this pull request May 31, 2022
- seccomp: prepend -ENOSYS stub to all filters
  opencontainers/runc#2750
stanislavlevin added a commit to stanislavlevin/freeipa that referenced this pull request Sep 9, 2024
This allows application to detect whether the kernel supports
syscall or not. Previously, an error was unconditionally EPERM.
There are many issues about glibc failed with new syscalls in containerized
environments if their host run on old kernel.

More about motivation for ENOSYS over EPERM:
opencontainers/runc#2151
opencontainers/runc#2750

See about defaultErrnoRet introduction:
opencontainers/runtime-spec#1087

Previously, FreeIPA profile was vendored from
https://github.com/containers/podman/blob/main/vendor/github.com/containers/common/pkg/seccomp/seccomp.json

Now it is merged directly from
https://github.com/containers/common/blob/main/pkg/seccomp/seccomp.json

Fixes: https://pagure.io/freeipa/issue/9008
Signed-off-by: Stanislav Levin <[email protected]>
Reviewed-By: Alexander Bokovoy <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

seccomp filter should return ENOSYS for unknown syscalls
7 participants