When deploying any pods/containers on to Kubernetes or OpenShift, deploy with minimal privilege, minimal access to capabilities and avoiding root (0) user where possible (via security contexts and security context constraints (SCCs)).
Due to needing to deploy containers that likely need some kernel parameters being amended via sysctl
(cough, cough, SonarQube because of Elasticsearch...) I wanted to look to make the deployment as secure as possible with regards to security contexts.
Note: SCCs control the actions that a pod can perform and what it has the ability to access. The equivalent in Kubernetes are pod security policies (PSPs).
If using an initContainer to amend sysctls as part of a deployment, the SCC (which can easily be converted into a PSP for Kubernetes) included in this repo reduces the permissions/privilege that can be set for initContainer/containers in the deployment as much as possible. The deployment configuration is also configured to reduce the securityContext permissions/privileges assigned to the initContainer/Containers to the level they require. With the configuration files in this repo, to control the deployment security, the securityContext
settings should be monitored to make sure developers don't amend them and elevate beyond the intended privilege. Separation of duty regarding the setting of (security) values in the deployment and amending the settings in the controlling SCC is also important.
Ideally the initContainer amending sysctls would be removed from the deployment and instead implemented via another mechanism such as a Job or Daemonset. This implementation can then be controlled via a separate SCC/PSP and the main application container can be controlled by a restricted SCC/PSP to avoid over-providing privilege.
The problem with a Job
is that it'll only be deployed once and thus if the application container needs to be relocated to another worker node, the sysctl settings will not be reapplied. To overcome this, a DaemonSet
can be used, whereby all nodes in the cluster will have sysctl settings update, or a subset of worker nodes can be amended via labelling (Kubernetes tainting).
Note: (as stated here) it is good practice to consider nodes with special sysctl settings as tainted within a cluster, and only schedule pods onto them which need those sysctl settings. It is suggested to use the Kubernetes taints and toleration feature to implement this.
As per here:
Because SonarQube uses an embedded Elasticsearch, make sure that your Docker host configuration complies with the Elasticsearch production mode requirements and File Descriptors configuration. For example, on Linux, you can set the recommended values for the current session by running the following commands as root on the host:
- sysctl -w vm.max_map_count=262144
- sysctl -w fs.file-max=65536
- ulimit -n 65536
- ulimit -u 4096
Sysctl is is described here:
sysctl is used to modify kernel parameters at runtime. The parameters available are those listed under /proc/sys/. Procfs is required for sysctl support in Linux. You can use sysctl to both read and write sysctl data.
The sysctl.c code is located here:
I was keen to understand if CAP_SYS_ADMIN is required to make sysctl changes. There are only two references to CAP_SYS_ADMIN
in kernel v3.10 (this version was looked at as it is the kernel on the hosts in the OpenShift cluster I'm using).
/*
* Taint values can only be increased
* This means we can safely use a temporary.
*/
static int proc_taint(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
struct ctl_table t;
unsigned long tmptaint = get_taint();
int err;
if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;
and
ifdef CONFIG_PRINTK
static int proc_dointvec_minmax_sysadmin(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos)
{
if (write && !capable(CAP_SYS_ADMIN))
return -EPERM;
Some detail regarding the capable
function is detailed here:
The function capable(<CAPABILITY_NAME>)
checks whether the current process has <CAPABILITY_NAME>
as an effective capability.
These two links define capable
as a function:
However note kernel version is 2.6.15.6
, so we need to check the version applicable to us. Perform oc describe node <node_name> | grep Kernel
to show kernel version on node being used:
oc describe node <node_name> | grep Kernel
Kernel Version: 3.10.0-1160.2.2.el7.x86_64
Therefore we need to find where this function is declared in kernel v3.10. Some detail given here:
/**
* capable - Determine if the current task has a superior capability in effect
* @cap: The capability to be tested for
*
* Return true if the current task has the given superior capability currently
* available for use, false if not.
*
* This sets PF_SUPERPRIV on the task if the capability is available on the
* assumption that it's about to be used.
*/
kernel.tainted
will is a non-zero value if the kernel has been tainted. Values can be viewed here.
Effectively, 'tainted' means that the kernel is in a state other than what it would be in if it were built fresh from the open source origin and used in a way that it had been intended. It is a way of flagging a kernel to warn people (e.g., developers) that there may be unknown reasons for it to be unreliable, and that debugging it may be difficult or impossible.
proc_dointvec_minmac_sysadmin
is related to dmesg_restrict
. The dmesg
command is used to see or control the kernel ring buffer.
kernel.demsg_restrict
indicates whether unprivileged users are prevented from using dmesg to view messages from the kernel’s log buffer. When dmesg_restrict is set to (0) there are no restrictions. When dmesg_restrict is set set to (1), users must have CAP_SYSLOG to use dmesg.
As stated here:
The kernel syslog contains debugging information that is often useful during exploitation of other vulnerabilities, such as kernel heap addresses. Rather than futilely attempt to sanitize hundreds (or thousands) of printk statements and simultaneously cripple useful debugging functionality, it is far simpler to create an option that prevents unprivileged users from reading the syslog.
As per here
When dmesg_restrict is set to 1 CAP_SYS_SYSLOG is needed to read the kernel ring buffer. But a root user without CAP_SYS_ADMIN is able to reset dmesg_restrict to 0. This is an issue when e.g. LXC (Linux Containers) are used and complete user space is running without CAP_SYS_ADMIN. An unprivileged and jailed root user can bypass the dmesg_restrict protection. With this patch writing to dmesg_restrict is only allowed when root has CAP_SYS_ADMIN.
It appears that the capability CAP_SYS_ADMIN
is only required for changes related to tainting and viewing kernel syslog.
Docker privileged mode is detailed here:
When the operator executes docker run
--privileged
, Docker will enable access to all devices on the host as well as set some configuration in AppArmor or SELinux to allow the container nearly all the same access to the host as processes running outside containers on the host.
It is worth a read of this, which states (amongst other things):
The following sysctls are known to be namespaced:
- kernel.shm*,
- kernel.msg*,
- kernel.sem,
- fs.mqueue.*,
- net.*.
Sysctls which are not namespaced are called node-level and must be set manually by the cluster admin, either by means of the underlying Linux distribution of the nodes (e.g. via /etc/sysctls.conf) or using a DaemonSet with privileged containers.
In the example script in this repo, the sysctls are:
- vm.max_map_count
- fs.file-max
Neither of which fall into the "namespaced" category.
The following tests have been performed to check settings:
Pod SC fsGroup | Pod SC runAsUser | sysctl runAsUser | sysctl privileged | defaultAddCapabilities | requiredDropCapabilities | sysctl -n Result | sysctl -w Result |
---|---|---|---|---|---|---|---|
1000 |
1000 |
1001 |
true |
[] |
(not incl. SYS_ADMIN) | sysctl read | sysctl: error setting key 'vm.max_map_count': Permission denied |
1000 |
1000 |
0 |
true |
[] |
(not incl. SYS_ADMIN) | sysctl read | sysctl successfully set |
1000 |
1000 |
1001 |
false |
SYS_ADMIN |
(not incl. SYS_ADMIN) | sysctl read | sysctl: error setting key 'vm.max_map_count': Read-only file system |
1000 |
1000 |
0 |
false |
SYS_ADMIN |
(not incl. SYS_ADMIN) | sysctl read | sysctl: error setting key 'vm.max_map_count': Read-only file system |
1000 |
1000 |
0 |
true |
[] |
(incl. SYS_ADMIN) | sysctl read | sysctl successfully set |
1000 |
1000 |
0 |
true |
[] |
ALL |
sysctl read | sysctl successfully set |
(Note: sysctl -n
is used to read the setting, -n
disables printing of the key name when printing values. sysctl -w
is used to write a sysctl setting change).
Summary: Testing has given the following outcomes:
- SYS_ADMIN capability alone is not sufficient to make sysctl changes as
/proc/sys
is only made available with read permission (note: SYS_ADMIN is not actually required for the specific sysctls being set above). - When testing with
prvilieged:true
, it has been shown that root (0) user must be used otherwise permission is denied. - The privileged setting makes
/proc/sys
available with write permission. - It also appears that no capabilities are required for sysctls to be set, just root user and privileged container.
To make non-namespaced sysctl calls, a container must be privileged
. Keen to avoid over-assigning privilege and to minimise attack vector, this config uses an SCC where:
- all capabilities must be dropped (
requiredDropCapabilities: -ALL
) - access to the underlying host IPC, network, PID, posts and directory volume plugin are not permitted
SecurityContexts allow for some granularity of configuration between the pods, initContainers and containers within. Therefore in this example, the Pod is configured with fsGroup:1000
and runAsUser: 1000
, the initContainer will run privileged (privileged: true
) and as user root/0 (runAsUser: 0
).
The container main-container
isn't configured with any securityContexts to override the pod settings and so inherits them.
The volume init-sysctl
is mounted at /tmp/scripts
with user=root and group=1000 (the latter due to the pod setting fsGroup:1000
). Further to this, due to this configuration in the deployment.yaml:
volumes:
- name: init-sysctl
configMap:
name: matt-sysctl
items:
- key: init_sysctl.sh
path: init_sysctl.sh
The file init_sysctl.sh
is present in the /tmp/scripts
directory. It is created with user and group both being root.
The initContainer is configured to create some logs, this can be viewed via: oc logs <pod_name> -c init-sysctl
and with the correct settings, will give results similar to:
uid=0(0) gid=0(root) groups=10(wheel),1000
total 16
drwxrwsrwx 3 root 1000 4096 Nov 14 16:45 .
drwxrwxrwt 1 root root 4096 Nov 14 16:45 ..
drwxr-sr-x 2 root 1000 4096 Nov 14 16:45 ..2020_11_14_16_45_12.427726936
lrwxrwxrwx 1 root root 31 Nov 14 16:45 ..data -> ..2020_11_14_16_45_12.427726936
lrwxrwxrwx 1 root root 21 Nov 14 16:45 init_sysctl.sh -> ..data/init_sysctl.sh
sysctl -n vm.max_map_count = 524288
sysctl -n fs.file-max = 131072
vm.max_map_count = 524288
fs.file-max = 131072
ulimit -n 131072
ulimit -u 8192
vm.max_map_count = 524288
1048576
vm.max_map_count=524288
The SCC that can be used with SonarQube can be configured as per the sysctl-scc.yaml
in this repo.
Set all allowHost
settings to false:
allowHostDirVolumePlugin: false
allowHostIPC: false
allowHostNetwork: false
allowHostPID: false
allowHostPorts: false
Setting allowPrivilegedContainer
(along with allowPrivilegeEscalation
and defaultAllowPrivilegeEscalation
to true):
allowPrivilegedContainer: true
allowPrivilegeEscalation: true
defaultAllowPrivilegeEscalation: true
Otherwise, the following error appears for replicaSet: 'cannot set allowPrivilegeEscalation
to false and privileged
to true'.
All capabilities are required to be dropped, via:
allowedCapabilities: []
defaultAddCapabilities: []
requiredDropCapabilities:
- ALL
Finally, as root (0) user is required by the initContainer; fsGroup, runAsUser and supplementalGroups can all be set to runAsAny
.
Currently looking into why the mounted secret is root:root
rather than having the same group applied as fsGroup does on the volume itself. This seems to be related: kubernetes/kubernetes#81089 but I cannot find the explanation regarding the K8s functionality/reasoning that is causing this to happen. Although you can set the file permissions on the secret, as per here, it looks as though there isn't yet the functionality to set user or group and this is what the referenced github issue is potentially seeking to do.
Look further into DaemonSet configuration for sysctl setting and consider implication.