-
-
Notifications
You must be signed in to change notification settings - Fork 407
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle syscalls via allowlist instead of denylist #4462
base: main
Are you sure you want to change the base?
Conversation
I still need to verify that this successfully blocks the syscalls we want to block. |
Ugh, this doesn't actually work. |
I'm sorry, I don't see a way to make this work without being able to add more than one seccomp program to bwrap. I'm way outside my understanding of libseccomp at this point, so I'm not sure what the semantics are of adding more than one non-mutually-exclusive rule for the same syscall number (first match wins? last match wins? most-permissive wins? least-permissive wins?) - but the semantics are not documented, so I'd be reluctant to rely on them as a security boundary in any case. The key problem is that libseccomp doesn't have a MASKED_NEQ filter mode (seccomp/libseccomp#310), so we cannot turn the ioctl deny rule (used with default-allow) into an allow rule (to be used with default-deny). If we are willing to change the semantics of the ioctl deny rule to only allow the "canonicalized" ioctl numbers, then we could allow every ioctl number from 0 up to some suitably large number, except for the single ioctl we want to block - but that would be a huge number of rules, hence a huge BPF program? |
Hmmmm, this is tricky and risky. Will definitely break stuff in the future. We tried this in WebKit once upon a time, and it worked nicely until some pesky kernel developers decided they should be allowed to create new syscalls. (Note for clarity: this is a joke. ;) Thus memfd_create was born. One day, libxshmfence decided it was going to start using memfd_create, and suddenly WebKit couldn't render anything anymore, because this syscall was not on its allowlist. It never hit users because that version of the sandbox was experimental (it failed at the proof of concept stage), but I noticed, and I've been scared of seccomp ever since. Firefox and Chrome have similar problems from time to time, not necessarily due to new syscalls (although I don't doubt those may sometimes cause problems), but because their seccomp filters are especially aggressive and fragile. So the more secure we make the seccomp filters, the more fragile they become, the more unexpected breakage users will experience from time to time. Of course, you already know most of this, but it's still good to write it down here so others can see the context behind why this is a tough design decision to make. Personally, I think existing syscall denylist is a better sweet spot between security and robustness, but I understand you've come to the conclusion that the allowlist approach makes sense after thinking hard about it, and I acknowledge the allowlist is the only way to prevent new kernel versions from introducing new sandbox escapes without flatpak knowing about it. :/ So perhaps it's the right call, even if we occasionally come to regret it when it breaks software.
Me too. This stuff is hard. Sorry for being no help.... |
OK, this part I think I do understand, if only because I've been researching it as fast as I could...
A seccomp program can do one of various actions when it sees a syscall it doesn't like. The ones that work without special support from user-space are that it can use The idea of making "recent" syscalls fail with (For example libxshmfence falls back from This is more compatible than using It's also a lot more compatible than using I think Tracker uses The down side of using |
Hm, that's a pretty good point. I think WebKit was actually using SCMP_ACT_TRAP to pause the web process, then it let a broker process decide what to do. Maybe it was executing the kill. I don't remember for sure. But because your filters return ENOSYS, libxshmfence would clearly just fall back to its original code, and everything should be fine. So that libxshmfence example actually serves to illustrate that maybe this is OK after all. I suppose you've convinced me. (Well, we still have to figure out if it's possible to solve the problems you've discovered....) |
The three most promising routes forward that I can see all require coordination with other packages:
|
I think this only works if the process does not make the system call while signals are blocked, otherwise the process will be terminated by |
I personally think that longterm the --disable-userns switch is better, because it hits the problem at the source. It is conceivable that there are other ways than syscalls (like files in /proc or whatever) to cause user namespace stuff, and those would also be neutered by that route. That said, it makes a lot of sense to also support multiple seccomp fds to bubblewrap, as that seems like a simple and generally useful thing to support. |
Yes, in an ideal world we'd use all three of the solutions I mentioned. |
I wonder if we should also pre-compute the selinux rules when doing the release and ship them with the tarball. That way we could avoid problems with the libseccomp version on the host. In fact, we could avoid a libseccomp runtime dependency at all. |
seccomp rules, do you mean? I don't think that works: they're architecture-specific. We'd have to precompute a rules blob for every architecture supported by libseccomp, and for every possible Flatpak configuration (e.g. Pre-computing the seccomp rules is also bad from the point of view of building from the actual preferred form for modification, which is both legally desirable (because LGPL) and practically desirable (because distros can't apply security fixes or other important fixes if they can't build from source). |
Yeah, its not ideal. We could compute them at build-time, which would get rid of the runtime dependency, but we'd still have the current example for older distros as it would build against an old libseccomp. Maybe we could vendor a recent version... |
Why are you trying to avoid depending on libseccomp? |
Depending on libseccomp is fine, the problem is being able to depend on modern libseccomp. For example, Debian 10 has libseccomp 2.3.3 which doesn't know about any syscalls newer than v4.15-rc7, but the Debian 10 kernel is 4.19, and people might reasonably run Debian 10 on a backported 5.10.x kernel from Debian 11 for better hardware support. Ubuntu has backported libseccomp 2.5.1 all the way back to their 16.04 branch (released in 2016) so this is clearly something that distros can do, and when the Debian security team get back to me about this issue, that's one option that I'm going to raise. |
Not depending on host libseccomp should allow to get features like Inverse of MASKED_EQ right away instead of waiting few years after they land in upstream. Is it right? |
I think that's a non-starter. If distributions don't want to backport libseccomp itself, then they are absolutely not going to want to vendor an entire newer copy of libseccomp into flatpak; and as an upstream and downstream maintainer, I'm not willing to take responsibility for the security and correct functionality of the whole libseccomp codebase. It's quite stressful enough having to backport security fixes in code I do understand.
If it was feasible to avoid depending on the host libseccomp, then yes, that would be the benefit; but for the reasons discussed above, I don't think this is sufficiently feasible that it's a useful direction to be pursuing. |
6cfa838
to
ea081f3
Compare
Rebased, with a submodule update to pick up containers/bubblewrap#459. This is not suitable to be merged as-is, because for users of a system copy of bubblewrap, it requires a version that doesn't exist yet: we should have a bubblewrap release before proceeding. However, I wanted to get this to a testable state before releasing bubblewrap 0.5.1 (or 0.6.0 or whatever the next one should be).
The test from #4505 still works on a modern OS (Debian unstable). Not tested on ye olde OSs yet, but our CI should help with this.
I think ideally we should do both, and if someone with more kernel knowledge reviews containers/bubblewrap#452 then I'm happy to have that too; but I don't feel that I'm qualified to review that one myself, sorry. |
30452df
to
de2b2c7
Compare
I've now successfully tested this on Ubuntu 18.04 (with an unrelated test failure in I'm leaving this marked as a draft because we should get bubblewrap 0.6.0 released first (for which I'd like to include containers/bubblewrap#475 if possible), but other than that I think this is ready for review. |
* Add `--add-seccomp` (prerequisite for flatpak#4462) * Add a warning when repeated options are ignored * Add a Meson build system * Invoke bash via `PATH` * Exit early when `argc == 0` Signed-off-by: Simon McVittie <[email protected]>
de2b2c7
to
024cb72
Compare
* Add `--add-seccomp` (prerequisite for #4462) * Add a warning when repeated options are ignored * Add a Meson build system * Invoke bash via `PATH` * Exit early when `argc == 0` Signed-off-by: Simon McVittie <[email protected]>
Now that we have bubblewrap 0.6.1, I'd like to consider this for Flatpak 1.13.x. The benefit of this is that if a new kernel with new and exciting syscalls triggers another vulnerability like CVE-2021-41133, we will only need to add those syscalls to a deny-list in older branches like 1.12.x (if we are still supporting those branches), and users of 1.13.x/1.14.x will already be protected from the vulnerability. As a philosophy, having a security mechanism based on "enumerating badness" (a deny-list) is fundamentally flawed, and what I hear from people who know about seccomp is that an allow-list is the only safe way. The cost is that the syscall filter will be more complicated (potentially harming performance in syscall-heavy workloads like games), and Flatpak apps will be unable to use new syscalls until we have added them to our allow-list. I think that's a price we have to pay if we want to reduce kernel attack surface. Because we're using |
Any chance someone could review this, containers/bubblewrap#488 and/or containers/bubblewrap#452? I'd be more comfortable about Flatpak's resilience against vulnerabilities like CVE-2021-41133 if we could get one of those into 1.14.x. |
Overall this looks good to me. I have a performance worry for the allow list though. There is a ton of syscalls in that list, and adding them one by one may lead to a ruleset that is slow. Is it possible to pre-compute a list of all the syscalls that are contiguous and add rules instead of the form "allow syscalls from zero to 50" instead of 50 separate rules? Or does libseccomp do some optimization like that? |
AFAIK libseccomp has no such implicit optimization. If you know the code running with a filter, you can benchmark the syscalls and give hot syscalls a higher priority. That's produces a filter where syscalls are check in order of their priority (instead of an undefined order) letting hot syscalls return earlier. For cases like this (not known code + large filter) you can enable generation of a btree optimized filter with O(log n) by setting Maintainer of libseccomp-rs (Rust bindings for libseccomp) |
This change is less important than it used to be, now that we have
It is probably possible, but realistically I am not going to be able to implement that any time soon. My entire knowledge of libseccomp is what I needed to learn in order to fix past CVEs and write this branch, so I'd appreciate it if someone with better seccomp expertise could take this over. |
I said this back in 2022 and it's still true:
For performance-critical, syscall-heavy workloads, perhaps we could consider having an |
I don't think this is even worth considering unless we know there exists a real performance problem for some real application. |
Such performance problem was reported for some games. |
Yeah, and that was with the current small filters. But with an allowlist each syscall would have to pass a filter that compared it against all possible syscalls. |
Previously, system call filtering (to prevent builders from storing files with setuid/setgid permission bits or extended attributes) was performed using a blocklist. While this looks simple at first, it actually carries significant security and maintainability risks: after all, the kernel may add new syscalls to achieve the same functionality one is trying to block, and it can even be hard to actually add the syscall to the blocklist when building against a C library that doesn't know about it yet. For a recent demonstration of this happening in practice to Nix, see the introduction of fchmodat2 [0] [1]. The allowlist approach does not share the same drawback. While it does require a rather large list of harmless syscalls to be maintained in the codebase, failing to update this list (and roll out the update to all users) in time has rather benign effects; at worst, very recent programs that already rely on new syscalls will fail with an error the same way they would on a slightly older kernel that doesn't support them yet. Most importantly, no unintended new ways of performing dangerous operations will be silently allowed. Another possible drawback is reduced system call performance due to the larger filter created by the allowlist requiring more computation [2]. However, this issue has not convincingly been demonstrated yet in practice, for example in systemd or various browsers. To the contrary, it has been measured that the the actual filter constructed here has approximately the same overhead as a very simple filter blocking only one system call. This commit tries to keep the behavior as close to unchanged as possible. The system call list is in line with libseccomp 2.5.5 and glibc 2.39, which are the latest versions at the point of writing. Since libseccomp 2.5.5 is already a requirement and the distributions shipping this together with older versions of glibc are mostly not a thing any more, this should not lead to more build failures any more. [0] NixOS/nixpkgs#300635 [1] NixOS/nix#10424 [2] flatpak/flatpak#4462 (comment) Change-Id: I541be3ea9b249bcceddfed6a5a13ac10b11e16ad
fec2526
to
a9ed230
Compare
This has been on hold for long enough that I no longer remember its status clearly, and the testing I previously did is no longer necessarily valid. I think we need to take this out of scope for 1.16.x if we want to release 1.16.x in a finite time. The main concern about this change was performance. I would appreciate it if someone with more time than me could take over what needs to happen in this branch:
|
* seccomp -> blocklist_ctx * seccomp_tmpf -> blocklist_tmpf Signed-off-by: Simon McVittie <[email protected]>
This should avoid future vulnerabilities similar to CVE-2021-41133. If a new VFS-manipulating system call is added to the kernel and libseccomp but not added to our list of known system calls, then it won't enter our allowlist, so it will fail with ENOSYS in the sandbox. If a new system call is added to our list of known system calls, but is not supported by libseccomp, then adding it to both our blocklist (if applicable) and our allowlist will fail with -EFAULT. Our blocklist will be unable to make it fail with EPERM or check its arguments, but our allowlist will make it fail with ENOSYS, so we're still protected. Signed-off-by: Simon McVittie <[email protected]>
This means we don't have to know the syscall numbers of recently-added system calls, which means we can drop common/flatpak-syscalls-private.h. Signed-off-by: Simon McVittie <[email protected]>
* Add cachestat() * Add futex_*() (notably used by Wine/Proton) * Add fchmodat2() * Add listmount(), statmount() * Add llseek() * Add lsm_get_self_attr(), lsm_list_modules(), lsm_set_self_attr() * Add map_shadow_stack() * Add mseal() * Add process_mrelease() * Add riscv_hwprobe() (CPUID-like) * Add set_mempolicy_home_node() * Add uretprobe() None of these seem like something that needs to be blocked, so don't add them to the denylist. I've done this as a separate commit rather than squashing it into the commit that added common/known-syscalls.txt, to illustrate what we will need to do periodically. Signed-off-by: Simon McVittie <[email protected]>
a9ed230
to
833272e
Compare
This is hopefully a better solution for CVE-2021-41133.
run: Rename variables to allow for addition of an allowlist
run: Add an allowlist of known system calls
This should avoid future vulnerabilities similar to CVE-2021-41133.
If a new VFS-manipulating system call is added to the kernel and
libseccomp but not added to our list of known system calls, then it won't
enter our allowlist, so it will fail with ENOSYS in the sandbox.
If a new system call is added to our list of known system calls, but is
not supported by libseccomp, then adding it to both our blocklist (if
applicable) and our allowlist will fail with -EFAULT. Our blocklist will
be unable to make it fail with EPERM or check its arguments, but our
allowlist will make it fail with ENOSYS, so we're still protected.
run: Solve CVE-2021-41133 via allowlist, not blocklist
This means we don't have to know the syscall numbers of recently-added
system calls, which means we can drop common/flatpak-syscalls-private.h.
Update list of known syscalls to Linux v6.11.0-rc1
None of these seem like something that needs to be blocked, so don't
add them to the denylist.
I've done this as a separate commit rather than squashing it into the
commit that added common/known-syscalls.txt, to illustrate what we will
need to do periodically.