container-selinux, OpenShift SELinux MCS contexts and types, Kata containers / gVisor / Firecracker #6

westurner · 2023-01-13T11:27:47Z

https://github.com/containers/container-selinux
- https://github.com/containers/container-selinux/blob/main/container.if contexts
- https://github.com/containers/container-selinux/blob/main/container.fc file path <-> context mappings
https://opensource.com/article/18/2/selinux-labels-container-runtimes

When a container runtime like Docker, as well as some of the new ones we have been working on (podman, CRI-O, and Buildah), creates a container, it picks a random MCS label to run the container. The MCS labels consist of two random numbers between 0 and 1,023 and have to be unique. They are prefixed with a c or category. SELinux also needs a sensitivity level s0.

So an MCS label looks like s0:c1,c2. Note that s0:c2,c1 is the same thing. Also, the two numbers may not be the same; SELinux would translate s0:c1,c1 as s0:c1. This gives us approximately (1024*1024)/2 - 1024 categories—about 500,000 unique containers on a host.
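
The pair arithmetic above can be checked directly. This is a minimal Python sketch, assuming categories c0 through c1023 and exactly two categories per label; `canonical_mcs` is a hypothetical helper for illustration, not part of any SELinux tooling.

```python
from math import comb

NUM_CATEGORIES = 1024  # categories c0..c1023, per the article


def canonical_mcs(a: int, b: int, sensitivity: str = "s0") -> str:
    """Return a normalized MCS label for two category numbers.

    Order does not matter and duplicates collapse, so (2, 1) and (1, 2)
    give the same label and (1, 1) collapses to a single category.
    """
    for c in (a, b):
        if not 0 <= c < NUM_CATEGORIES:
            raise ValueError(f"category out of range: {c}")
    cats = sorted({a, b})
    return f"{sensitivity}:" + ",".join(f"c{c}" for c in cats)


# Exact number of unordered pairs of two distinct categories.
unique_pairs = comb(NUM_CATEGORIES, 2)
```

Here `canonical_mcs(2, 1)` yields `s0:c1,c2`, and `unique_pairs` is 523,776, which agrees with the article's "about 500,000" estimate.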

We originally created MCS labeling back in 2008 for virtual machines, and it was often referred to as sVirt. We figured that running a half-million VMs on a single machine would not happen for a few years. With containers, the number might end up being threatened. But we could always go to three or more categories for each label, although the algorithm becomes more complicated.

SELinux does more than just MCS labeling. The process and content also get assigned SELinux "Types." Processes usually run with the container_t type, and content is created with the container_file_t type.

Process system_u:system_r:container_t:s0:c1,c2
Content system_u:object_r:container_file_t:s0:c1,c2
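
To make the four fields of those contexts explicit, here is a small Python sketch that splits a context string into user, role, type, and level. `SELinuxContext` and `parse_context` are hypothetical names for illustration; real tooling would normally go through libselinux rather than string splitting.

```python
from typing import NamedTuple


class SELinuxContext(NamedTuple):
    user: str
    role: str
    type: str
    level: str  # sensitivity plus optional categories, e.g. "s0:c1,c2"


def parse_context(context: str) -> SELinuxContext:
    """Split 'user:role:type:level' into its four fields.

    The level itself contains a colon (s0:c1,c2), so split at most
    three times and keep the remainder as the level.
    """
    parts = context.split(":", 3)
    if len(parts) != 4:
        raise ValueError(f"expected user:role:type:level, got {context!r}")
    return SELinuxContext(*parts)


process = parse_context("system_u:system_r:container_t:s0:c1,c2")
content = parse_context("system_u:object_r:container_file_t:s0:c1,c2")
```

The process and its content differ in role and type but share the same level, which is what ties a container to its own files.
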
https://cloud.redhat.com/blog/a-guide-to-openshift-and-uids

oc get pod -o jsonpath='{range .items[*]}{@.metadata.name}{" runAsUser: "}{@.spec.containers[*].securityContext.runAsUser}{" fsGroup: "}{@.spec.securityContext.fsGroup}{" seLinuxOptions: "}{@.spec.securityContext.seLinuxOptions.level}{"\n"}{end}'
[...]
As can be seen from the previous output, all the Pods in the same namespace are running with the same UID, GID, and SELinux labels. Notice these are unprivileged Pods running with an unprivileged UID & GID.
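
One way to check that claim mechanically is to parse the lines printed by the `oc get pod` jsonpath command above and collect the distinct (UID, fsGroup, SELinux level) triples. This is a rough Python sketch; the pod names and values in `sample` are made up for illustration and are not taken from a real cluster.

```python
import re

# Matches one line of the jsonpath output shown above.
LINE = re.compile(
    r"^(?P<name>\S+) runAsUser: (?P<uid>.*?) fsGroup: (?P<fsgroup>.*?)"
    r" seLinuxOptions: (?P<level>.*)$"
)


def distinct_identities(output: str) -> set[tuple[str, str, str]]:
    """Collect the distinct (uid, fsGroup, level) triples across pods."""
    identities = set()
    for line in output.splitlines():
        match = LINE.match(line)
        if match:
            identities.add((match["uid"], match["fsgroup"], match["level"]))
    return identities


# Hypothetical output for two pods in one namespace.
sample = """\
app-1 runAsUser: 1000620000 fsGroup: 1000620000 seLinuxOptions: s0:c25,c10
app-2 runAsUser: 1000620000 fsGroup: 1000620000 seLinuxOptions: s0:c25,c10
"""
identities = distinct_identities(sample)
```

If every pod in the namespace shares one identity, `identities` contains exactly one triple.
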
https://www.google.com/search?q=%22openshift%22+per-container+selinux+context+%22SCC%22+%22RBAC%22
"Introduction to Security Contexts and SCCs" https://cloud.redhat.com/blog/introduction-to-security-contexts-and-sccs
"Pod Security Admission in OpenShift 4.11" https://cloud.redhat.com/blog/pod-security-admission-in-openshift-4.11
```
kind: PodSecurityConfiguration
apiVersion: pod-security.admission.config.k8s.io/v1beta1
defaults:
  enforce: "privileged"
  enforce-version: "latest"
  audit: "restricted"
  audit-version: "latest"
  warn: "restricted"
  warn-version: "latest"
exemptions:
  usernames:
```
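
How these cluster-wide defaults combine with per-namespace settings can be sketched in Python. This assumes the upstream Kubernetes convention that a namespace label of the form `pod-security.kubernetes.io/<mode>` takes precedence over the configured default for that mode; `effective_levels` and the sample namespace labels are hypothetical.

```python
# Defaults copied from the PodSecurityConfiguration above.
DEFAULTS = {"enforce": "privileged", "audit": "restricted", "warn": "restricted"}

LABEL_PREFIX = "pod-security.kubernetes.io/"


def effective_levels(namespace_labels: dict[str, str]) -> dict[str, str]:
    """Return the level applied per mode, letting namespace labels
    override the cluster-wide defaults."""
    return {
        mode: namespace_labels.get(LABEL_PREFIX + mode, default)
        for mode, default in DEFAULTS.items()
    }


# Hypothetical namespace that opts in to enforcing the "baseline" level.
levels = effective_levels({LABEL_PREFIX + "enforce": "baseline"})
```

With no labels set, a namespace falls back to the defaults: nothing is blocked (`enforce: privileged`), while audit events and warnings are produced against the `restricted` level.
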
- k8s: PodSecurityPolicy (deprecated) -> [simplified] Pod Security Admission
- OpenShift: Security Context Constraints (similar to PodSecurityPolicy) plus k8s Pod Security Admission
  
  In OpenShift, there is an OpenShift-specific dedicated pod admission system called Security Context Constraints. This system resembles the now deprecated PodSecurityPolicy admission, even though there have been many changes throughout the years of its existence. Our aim is to keep the Security Context Constraints pod admission system while also allowing users to have access to the Kubernetes Pod Security Admission. The following text describes what we did in order to make it possible in 4.11, and what we plan to do next in 4.12.
StackRox:
https://cloud.redhat.com/blog/red-hat-releases-open-source-stackrox-to-the-community
https://www.stackrox.io/blog/what-is-ebpf/
https://docs.cilium.io/en/stable/bpf/
https://docs.cilium.io/en/stable/concepts/#concepts
- https://docs.cilium.io/en/stable/concepts/overview/
https://zhimin-wen.medium.com/selinux-policy-for-openshift-containers-40baa1c86aa5 for cilium ebpf filtering w/ openshift
https://docs.openshift.com/container-platform/4.11/authentication/managing-security-context-constraints.html
https://access.redhat.com/documentation/en-us/openshift_container_platform/4.11/html/authentication_and_authorization/managing-pod-security-policies
Security context constraints allow an administrator to control:
- Whether a pod can run privileged containers with the allowPrivilegedContainer flag.
- Whether a pod is constrained with the allowPrivilegeEscalation flag.
- The capabilities that a container can request
- The use of host directories as volumes
- The SELinux context of the container
- The container user ID
- The use of host namespaces and networking
- The allocation of an FSGroup that owns the pod volumes
- The configuration of allowable supplemental groups
- Whether a container requires write access to its root file system
- The usage of volume types
- The configuration of allowable seccomp profiles
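
As a toy illustration of two of the controls listed above (privileged containers and requested capabilities), the Python sketch below compares a container `securityContext` with a simplified SCC-like dict. It borrows the SCC field names `allowPrivilegedContainer` and `allowedCapabilities`, but the matching logic is an assumption for illustration only and does not reproduce the real admission plugin (for example, it ignores wildcards, defaults, and dropped capabilities).

```python
def scc_violations(scc: dict, security_context: dict) -> list[str]:
    """Return human-readable reasons a container would be rejected.

    Only two controls are modeled: privileged mode and added capabilities.
    """
    problems = []

    wants_privileged = security_context.get("privileged", False)
    if wants_privileged and not scc.get("allowPrivilegedContainer", False):
        problems.append("privileged container not allowed")

    allowed = set(scc.get("allowedCapabilities") or [])
    requested = set(security_context.get("capabilities", {}).get("add", []))
    extra = sorted(requested - allowed)
    if extra:
        problems.append("capabilities not allowed: " + ", ".join(extra))

    return problems


# Hypothetical restrictive constraint and an over-reaching request.
restrictive = {"allowPrivilegedContainer": False, "allowedCapabilities": None}
request = {"privileged": True, "capabilities": {"add": ["NET_ADMIN"]}}
found = scc_violations(restrictive, request)
```

An empty `securityContext` produces no violations in this model, while the sample request is flagged for both privileged mode and the `NET_ADMIN` capability.
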
"Understanding OpenShift sandboxed containers" https://docs.openshift.com/container-platform/4.11/sandboxed_containers/understanding-sandboxed-containers.html

support for running Kata Containers as an additional optional runtime. The new runtime supports containers in dedicated virtual machines (VMs), providing improved workload isolation.
https://xphyr.net/post/kata_ocp/ :

While CGroups and Namespaces are a powerful way of defining isolation between applications, faults have been found that allow breaking out of their CGroups jail. Additional measures such as SELinux can assist with keeping applications inside their container, but sometimes your application or workload needs more isolation than CGroups, Namespaces, and SELinux can provide.

There are multiple proposed solutions to this isolation challenge, including Amazon Firecracker, gVisor, and Kata Containers. Google’s gVisor takes one approach to solve this problem and leverages a guest kernel in user space to sandbox containerized applications. Because gVisor is re-implementing all the Linux kernel syscalls, there can be issues with compatibility, and not all syscalls have been fully implemented. Alternatively, both Firecracker and Kata Containers leverage a tried and true technology, virtualization, to create a complete sandbox around your containerized application. While Kata can work on a stand-alone machine, by directly integrating with containerd or CRI-O, it is mainly used as a part of a Kubernetes cluster.

The use of virtualization may seem like abandoning the last 5+ years of progress with containers, but this is not really the case. Kata Containers (Kata) creates a virtual machine instance leveraging one of the four supported hypervisors; however, these are not your traditional virtual machines. Kata Containers creates a VM using a highly optimized Linux guest kernel designed for running containerized workloads, with a highly optimized boot path for quick start time. Boot times for these virtual machine instances can be under 5 seconds, as the kernel boot log in the linked post shows.
