Before discussing resource enforcement, a preface on load average is necessary - or more pedantically, run-queue size (the terms are used interchangeably). Run-queue size is the total number of work parcels enqueued across all logical processors at a given instant. For example, a run-queue size of 7 on a server with 4 logical processors means - assuming equal distribution - that each logical processor is processing 1 parcel of work and 3 of the 4 have another parcel immediately behind it. Parcels compute on the microsecond scale, so momentary spikes - what's reported as a 1-minute load average - indicate only that a spike occurred and nothing more. A 15-minute load average paints a better picture of typical server load, which can hint at deeper issues such as insufficient hardware.
Logical processors are the number of processors listed in `/proc/cpuinfo`. ApisCP precomputes this number on start and applies it to any CPU calculations (as well as `USER_HZ`).
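Both figures are easy to inspect on a live system with standard coreutils/procps commands (output illustrative):

```bash
# Count logical processors, as ApisCP does at startup
nproc
grep -c '^processor' /proc/cpuinfo
# 1-, 5-, and 15-minute load averages (run-queue sizes)
uptime
# 14:07:11 up 12 days, 3:42, 1 user, load average: 3.02, 1.87, 0.95
```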
Storage is tracked using the native quota facilities of the operating system. XFS and Ext4 filesystems are supported, and both provide equivalent tracking abilities. This tracking facility is called a "quota". Quotas exist as hard and soft: an account may never exceed its hard quota, whereas the disposition of a soft quota is at the discretion of the application - it may be simply advisory or it may fail.
File data is allocated in blocks. Each block is 4 KB. A file always occupies entire blocks, even if there is insufficient data to fill a 4 KB block. Thus a 7 KB file may report as 7 KB, but it is charged for 8 KB of storage by the operating system.
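Block rounding is easy to demonstrate with a scratch file - `ls` reports the logical size while `du` reports the blocks charged:

```bash
# Create a 7 KB (7,168 byte) file
head -c 7168 /dev/urandom > blocks.bin
ls -l blocks.bin   # logical size: 7168 bytes
du -k blocks.bin   # charged size: 8 KB (two 4 KB blocks)
```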
::: details
```bash
repquota -ua | sort -n --key=3
```
shows storage usage for every user, ordered by blocks used. Each figure, with the exception of 0, is in KB and will always be perfectly divisible by the filesystem block size, 4 KB.
:::
::: tip
Storage quotas are controlled by the `quota` service name in the `diskquota` service class.
:::
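As a sketch, a storage quota might be assigned through EditDomain; the `quota` and `units` parameter names below are assumptions following the service class convention above:

```bash
# Sketch: assign a 10 GB storage quota to domain.com
# ("quota"/"units" parameter names assumed per the diskquota service class)
EditDomain -c diskquota,quota=10 -c diskquota,units=GB domain.com
```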
Before discussing inode quotas, let's talk about inodes. An inode provides metadata about a file: it contains no file data, but instead information about the file, directory, or device. inodes have a fixed size, typically 256 or 512 bytes each. With XFS and cyclic redundancy checks - a blessing for ensuring data integrity - inodes are always 512 bytes. Ext4 uses 256-byte inodes by default.
Larger inode sizes mean more information can be stored about a file. File size, creation time, modification time, owner, and group are mandatory attributes found in every inode. Additional attributes include access control lists (granular access rights to a file); extended attributes (arbitrary data about a file); and SELinux security contexts, used by a separate subsystem that defines unambiguous operational boundaries for a file.
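`stat` surfaces the mandatory attributes, while `getfattr`/`setfattr` read and write extended attributes (the `user.note` attribute name below is arbitrary):

```bash
stat somefile                                # size, times, owner, group, inode number
setfattr -n user.note -v "example" somefile  # attach an arbitrary extended attribute
getfattr -d somefile                         # dump extended attributes
ls -Z somefile                               # show the SELinux security context
```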
File names are not stored in an inode but in a dentry, another storage structure that records the file names a directory contains along with references to their inodes. Each dentry is stored in multiples of 4 KB.
Bringing this together:
For a directory containing 2 files, one 4 KB and one 7 KB: the total charged for these 3 storage items is 4 KB (directory) + 4 KB (file) + 8 KB (file) = 16 KB. They also create 3 inodes, which are not directly charged to the account's storage quota but are still stored in the filesystem, responsible for approximately 1.5 KB of additional storage (3 × 512 bytes). These inodes are instead charged to the inode quota.
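The same accounting can be reproduced by hand - a sketch assuming 4 KB blocks and 512-byte inodes:

```bash
mkdir demo
head -c 4096 /dev/zero > demo/a   # exactly one 4 KB block
head -c 7168 /dev/zero > demo/b   # 7 KB rounds up to two 4 KB blocks
du -ks demo                       # 16 KB: 4 (directory) + 4 (a) + 8 (b)
find demo | wc -l                 # 3 inodes: the directory plus two files
```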
::: tip
Inode quotas are controlled by the `quota` service name in the `fquota` service class.
:::
Doing some quick math, the maximum number of files a 40 GB quota would allow is approximately 9,320,675 1-byte files (each consuming a 4 KB block plus a 0.5 KB inode) - still quite a bit.
::: details
A zero-byte file doesn't generate a 4 KB block of file storage, but it still generates an inode. This is why 1 byte is used instead of 0 bytes.
:::
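The arithmetic behind that figure, in shell integer math (0.5 KB is the 512-byte XFS inode from above):

```bash
# 40 GB quota in KB, divided by the per-file cost of 4.5 KB (4 KB block + 0.5 KB inode)
echo $(( 40 * 1024 * 1024 * 10 / 45 ))
# 9320675
```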
The maximum number of inodes on XFS is 2^64; on Ext4, 2^32. You could fill up an XFS server with 9.3 million 1-byte files every second for 62,000 years before reaching its limit!

Ext4, on the other hand, wouldn't last 10 minutes, assuming you could find a process that generates 9.3 million files every second.

In either situation you're liable to run out of storage before inodes. On XFS systems, 2^64 inodes would require 8,388,608 PB of storage for the inodes alone.

If you're on XFS, don't worry about counting inodes.
On Ext4/Ext3 platforms, the CAP_SYS_RESOURCE capability allows a process to bypass quota enforcement. XFS does not honor this bypass, so services that must create files while running as root or with CAP_SYS_RESOURCE can fail at file creation. Do not setuid or setgid a directory in a way that may cause an essential service, such as Apache, to fail on boot if quotas prohibit file creation.
It is possible to disable quota enforcement on XFS, while still counting usage, using `xfs_quota`:

```bash
# findmnt resolves which block device /home/virtual resides on
xfs_quota -xc 'disable -ugv' "$(findmnt -n -o source --target /home/virtual)"
```
This affects quota enforcement globally, so use it wisely. Likewise, don't forget to re-enable it:

```bash
xfs_quota -xc 'enable -ugv' "$(findmnt -n -o source --target /home/virtual)"
```
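To verify whether enforcement is active, `xfs_quota`'s `state` command reports accounting and enforcement status for the mount:

```bash
xfs_quota -xc 'state' "$(findmnt -n -o source --target /home/virtual)"
```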
Bandwidth is tallied every night during logrotation. Logrotation runs via `/etc/cron.daily/logrotate`; its start time may be adjusted using the `cron.start-range` Scope. A secondary task, `bwcron.service`, runs every night at 12 AM server time (see the `system.timezone` Scope). Enforcement is carried out during this window. Disposition is configured by the `bandwidth` Tuneable. The following table summarizes several options.
| Parameter | Description |
| --- | --- |
| `resolution` | For archiving; bandwidth is rounded down and binned every *n* seconds. Smaller resolutions increase storage requirements. |
| `stopgap` | Bandwidth stopgap expressed as a percentage. `100` terminates a site as soon as it exceeds its allotted bandwidth. The default, `200`, suspends a site once it has exceeded 200% of its bandwidth. `95` would suspend a site when it's within 5% of its bandwidth quota. |
| `notify` | Bandwidth notification threshold expressed as a percentage. As a sanity check, `bandwidth_notify` <= `bandwidth_stopgap`. Setting `0` would effectively notify every night. |
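As a sketch, these Tuneables could be adjusted through the `cp.config` Scope; the section and option names below mirror the table but are assumptions:

```bash
# Sketch: suspend at 95% of quota, notify at 80% (key names assumed)
cpcmd scope:set cp.config bandwidth stopgap 95
cpcmd scope:set cp.config bandwidth notify 80
```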
An email is sent to the customer every night informing them of overage. The template, located in `resources/views/email/bandwidth`, may be customized using typical ApisCP customization rules.
Memory is a misunderstood and complex topic. linuxatemyram.com addresses many unfounded surface complaints about free memory in a healthy system. cgroup memory accounting doesn't stray from this complexity. Before discussing technical challenges in accounting (and CoW semantics), let's start with some basics.
```bash
# Set a ceiling of 512 MB for all processes
EditDomain -c cgroup,memory=512 domain.com
# Switch to the domain.com account
su domain.com
# Generate up to 512 MB; some memory is reserved by the shell
yes | tr \\n x | head -c $((512*1024*1024)) | grep n
# Once the limit is reached, the process terminates with "Killed"
```
A site may consume up to 512 MB of memory before the OOM killer is invoked. When an OOM condition is reached, further memory allocations fail, an event is logged in the memory controller, and the offending application ends abruptly.
`dmesg` notes an OOM killer invocation on the process:
```
[2486967.059804] grep invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=600
[2486967.059926] Task in /site133 killed as a result of limit of /site133
[2486967.059929] memory: usage 524288kB, limit 524288kB, failcnt 153
[2486967.059930] memory+swap: usage 525060kB, limit 9007199254740988kB, failcnt 0
[2486967.059932] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[2486967.059933] Memory cgroup stats for /site133: cache:0KB rss:524288KB rss_huge:0KB mapped_file:0KB swap:772KB inactive_anon:31404KB active_anon:492884KB inactive_file:0KB active_file:0KB unevictable:0KB
[2486967.059957] [ pid ]   uid  tgid total_vm    rss nr_ptes swapents oom_score_adj name
[2486967.060381] [22040]     0 22040    51698   1729      57       10             0 su
[2486967.060384] [22041]  9730 22041    29616    971      14      322           600 bash
[2486967.060446] [25889]  9730 25889    27014     86      11        0           600 yes
[2486967.060449] [25890]  9730 25890    27020    154      11        0           600 tr
[2486967.060452] [25891]  9730 25891    27016    166      11        0           600 head
[2486967.060455] [25892]  9730 25892   224814 130523     268        0           600 grep
[2486967.060459] Memory cgroup out of memory: Kill process 25892 (grep) score 710 or sacrifice child
[2486967.067494] Killed process 25892 (grep), UID 9730, total-vm:899256kB, anon-rss:521228kB, file-rss:864kB, shmem-rss:0kB
```
::: tip "OOM" is an initialism for "out of memory". Killer is... a killer. OOM killer is invoked by the kernel to judiciously terminate processes when it's out of memory either on the system or control group. :::
Using Metrics, OOM events can be easily tracked. `cpcmd -d domain.com telemetry:get c-memory-oom` reports the latest OOM value for a site. A free-form query is also available that provides similar information for all sites.
```sql
SELECT
    domain,
    value,
    MAX(ts)
FROM
    metrics
JOIN
    siteinfo USING (site_id)
JOIN
    metric_attributes USING (attr_id)
WHERE
    name = 'c-memory-oom'
AND
    value > 0
AND
    ts > NOW() - INTERVAL '1 DAY'
GROUP BY (domain, value);
```
As an alternative, `range` can be used to examine the sum over a window.

```bash
cpcmd telemetry:range c-memory-oom -86400 null 12
```
::: details
The `c-memory-oom` attribute is summed over the last day (86,400 seconds) for site ID 12. `false` may be specified after the site ID to list records individually.
:::
CPU utilization comes in two forms: user and system (real time is the sum of user + system). User time is spent incrementing a loop, adding numbers, or templating a theme. System time is spent when a process communicates with the kernel directly to perform a privileged function, such as opening a file, forking a process, or communicating over a network socket.
In typical operation, user will always be an order of magnitude higher than system. `time` can help you understand how. Don't worry if it doesn't make sense yet; we'll walk through it.
```bash
strace -c -- /bin/sh -c 'time (let SUM=0; for i in $(seq 1 1000) ; do SUM+=$i ; stat / > /dev/null; done)'
```

```
real    0m2.231s
user    0m0.777s
sys     0m1.336s

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96    1.335815      667908         2         1 wait4
  0.01    0.000152          11        14           mmap
  0.01    0.000080          10         8           mprotect
  0.01    0.000073           9         8           open
  0.00    0.000049          12         4           read
  0.00    0.000031           4         8           close
  0.00    0.000029           2        16           rt_sigprocmask
  0.00    0.000024           3         7           fstat
  0.00    0.000024           2        10           rt_sigaction
  0.00    0.000022          11         2           munmap
  0.00    0.000016           3         5           brk
  0.00    0.000016          16         1           execve
  0.00    0.000011           6         2           stat
  0.00    0.000007           7         1         1 access
  0.00    0.000004           4         1           getrlimit
  0.00    0.000004           4         1           getpgrp
  0.00    0.000003           3         1           getpid
  0.00    0.000003           3         1           uname
  0.00    0.000003           3         1           getuid
  0.00    0.000003           3         1           getgid
  0.00    0.000003           3         1           geteuid
  0.00    0.000003           3         1           getegid
  0.00    0.000003           3         1           getppid
  0.00    0.000003           3         1           arch_prctl
  0.00    0.000000           0         1           rt_sigreturn
  0.00    0.000000           0         1           clone
------ ----------- ----------- --------- --------- ----------------
100.00    1.336381                   100         2 total
```
For Site Administrators, "user" and "system" are 24-hour recorded totals using the same mechanism that counts run-queue size and task duration. As there are 86,400 seconds in a day per logical core, in theory the maximal value would approach 86,400 seconds × *n* processors; but with several hundred processes running on a server, it would be impossible for any one task group to ever reach this total (like absolute zero).
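Assuming a cgroup v1 layout like the `site1` paths used elsewhere in this section, the raw counters behind these totals live in the cpuacct controller:

```bash
# user/system CPU consumed by the site, in USER_HZ ticks (see getconf CLK_TCK)
cat /sys/fs/cgroup/cpuacct/site1/cpuacct.stat
# user 123456
# system 7890
```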
A PID is a process ID. A process ID is any thread: a single-threaded application creates 1 process ID, while a multithreaded application creates up to *n* process IDs. The nuance is important because process enforcement counts threads, not processes. In the example below, MySQL is charged with 37 processes; in a typical view with `ps`, this may appear as only 1 process on the surface.
Let's set the process limit to 100 and induce a fork bomb, which rapidly spawns up to 100 processes before summarily exiting:
```bash
EditDomain -c cgroup,proclimit=100 -D site1
su site1
# Uncomment the following line to run a fork bomb
# :(){ :|:& };:
# Output:
# bash: fork: retry: No child processes
# bash: fork: retry: No child processes
# bash: fork: retry: No child processes
```
Then confirm the PID ceiling by investigating the contents of `pids.max` in the pids controller:

```bash
cat /sys/fs/cgroup/pids/site1/pids.max
```
Likewise, if 100 threads were created using a tool such as GNU Parallel, a similar result would be seen once the thread count hits 100.
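For example, a sketch with GNU Parallel (`site1` path assumed as above):

```bash
# Spawn up to 100 concurrent jobs, then watch the pids counter saturate
seq 1 100 | parallel -j100 sleep {} &
cat /sys/fs/cgroup/pids/site1/pids.current   # climbs toward pids.max (100)
```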
::: tip One of many layers
A secondary defense, in the event no such cgroup protection is applied, exists in `FST/siteinfo/etc/security/limits.d/10-apnscp-user.conf`, which sets a generous limit of 2,048 processes. This can be adjusted by setting `limit_nproc` in Bootstrapper and running the `system/limits` role.
:::
::: warning
Program behavior is often unspecified when it can no longer create additional threads or processes. proclimit should be used judiciously to prevent abuse, not as a prod for users to upgrade to a more profitable package type.
:::
IO restrictions are classified as read and write.

```bash
# Limit write bandwidth to 2 MB/s
EditDomain -c cgroup,writebw=2 domain.com
# Apply the min of blkio,writebw/blkio,writeiops
# Both are equivalent assuming 4 KB blocks (512 IOPS × 4 KB = 2 MB/s)
EditDomain -c cgroup,writebw=2 -c blkio,writeiops=512 domain.com
```
IO and CPU weighting may be set via `ioweight` and `cpuweight` respectively. `ioweight` requires use of the CFQ/BFQ IO elevators.
```bash
# Default weight is 100
# Halve IO priority, double CPU priority
EditDomain -c cgroup,ioweight=50 -c cgroup,cpuweight=200 domain.com
```
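Under cgroup v1 these weights surface as controller files - a sketch, with the `site1` path assumed as elsewhere in this section:

```bash
cat /sys/fs/cgroup/blkio/site1/blkio.weight   # IO weight (honored by CFQ/BFQ)
cat /sys/fs/cgroup/cpu/site1/cpu.shares       # relative CPU weighting
```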
cgroup reports all memory consumed within the OS by applications, which includes filesystem caches and network buffers. Cache can be automatically expunged when needed by the OS. To expunge the cache forcefully, write "1" to `/proc/sys/vm/drop_caches`. For example, working with "site1", the first site created on the server:
```bash
cat /sys/fs/cgroup/memory/site1/memory.usage_in_bytes
# Value is total RSS + TCP buffer + FS cache
echo 1 > /proc/sys/vm/drop_caches
# Value is now RSS
cat /sys/fs/cgroup/memory/site1/memory.usage_in_bytes
```
This can be confirmed by examining `memory.stat` in the cgroup home. Likewise, memory reported by a process may be higher than memory reported by cgroup, because cgroup only accounts for memory uniquely reserved by the application: a fork shares its parent's memory pages and copies on write, at which point the newly claimed memory is charged to the cgroup.
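One way to observe the difference - a sketch that sums per-process RSS (which double-counts shared pages) and compares it against the cgroup's own `rss` figure:

```bash
# Sum VmRSS (KB) over every task in the site's memory cgroup
for pid in $(cat /sys/fs/cgroup/memory/site1/tasks); do
  awk '/^VmRSS/ {print $2}' "/proc/$pid/status" 2>/dev/null
done | awk '{total += $1} END {print total " KB summed RSS"}'
# Compare against the cgroup's charge (bytes); shared pages counted once
grep -w rss /sys/fs/cgroup/memory/site1/memory.stat
```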