Resource enforcement

Before discussing resource enforcement, a preface on load average is necessary, or more pedantically: run-queue size (both terms are used interchangeably). Run-queue size is the total number of work parcels enqueued across all logical processors at any given instant. For example, a run-queue size of 7 on a server with 4 logical processors means - assuming equal assignment - that each logical processor is processing 1 parcel of work and 3 of the 4 have another parcel immediately behind it. Parcels compute on the microsecond scale, so momentary spikes - what's reported as a 1-minute load average - only indicate there's a spike and nothing more. A 15-minute load average paints a better picture of typical server load, which can hint at deeper issues such as insufficient hardware.
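
The kernel exposes these averages directly:

# 1-, 5-, and 15-minute load averages, followed by runnable/total tasks and the most recent PID
cat /proc/loadavg
# Same averages in a human-friendly form
uptime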

Logical processors are the number of processor entries listed in /proc/cpuinfo. ApisCP precomputes this number on start and applies it to any CPU calculations (as well as user hz).
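
To see the figure ApisCP computes on start:

# Each "processor" stanza in /proc/cpuinfo is one logical processor
grep -c ^processor /proc/cpuinfo
# nproc reports the same figure for the current process' affinity
nproc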

Storage

Storage is tracked using native quota facilities of the operating system. XFS and Ext4 filesystems are supported, which provide equivalent tracking abilities. This tracking facility is called a "quota". Quotas exist as hard and soft.

An account may not exceed its hard quota, whereas the disposition of a soft quota is at the discretion of the application: it may be treated as merely advisory or as a hard failure.

File data is allocated in blocks. Each block is 4 KB. A file always occupies whole blocks, even when its data does not fill the last 4 KB block. Thus a 7 KB file may appear as 7 KB on disk, but is charged for 8 KB of storage by the operating system.

::: details repquota -ua | sort -n --key=3 will show block usage for all users ordered by size. Each number, with the exception of 0, is in KB and will always be perfectly divisible by the filesystem block size, 4 KB. :::

::: tip Storage quotas are controlled by quota service name in the diskquota service class. :::

inode quotas

Before discussing inode quotas, let's talk about an inode. inodes provide metadata about files. They don't contain file data, but instead information about the file or directory or device. inodes have a fixed size, typically 256 or 512 bytes each. With XFS and cyclic redundancy checks - a blessing for ensuring data integrity - these inodes are always 512 bytes. Ext4 uses 256 byte inodes by default.

Larger inode sizes mean more information can be stored about a file. File size, creation time, modification time, owner, and group are mandatory attributes found in an inode. Additional attributes include access control lists (granular access rights to a file), extended attributes (arbitrary data about a file), and SELinux security contexts (used by a separate subsystem that defines unambiguous operational boundaries for a file).

File names are not stored in an inode but in a dentry (directory entry): a separate storage block that maps the names a directory contains to their inodes. Each dentry is stored in multiples of 4 KB.
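
To check the inode size in use - a quick sketch assuming /home/virtual is the filesystem of interest, as elsewhere in this document:

# XFS: isize= reports the inode size in bytes (512 with CRCs enabled)
xfs_info /home/virtual | grep isize
# Ext4: 256 bytes by default
tune2fs -l "$(findmnt -n -o source --target /home/virtual)" | grep 'Inode size'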

Bringing this together:

For a directory containing 2 files, one 4 KB and one 7 KB: the total charged for these 3 storage items is 4 KB (directory) + 4 KB (file) + 8 KB (file), or 16 KB. These items also create 3 inodes that are not directly charged to the account's storage quota but are still stored in the filesystem, accounting for approximately 1.5 KB of additional storage. They are instead charged to the inode quota.
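
A minimal sketch of this accounting, assuming a 4 KB block size; file names are arbitrary:

mkdir demo && cd demo
head -c 4096 /dev/urandom > a.bin   # exactly one 4 KB block
head -c 7168 /dev/urandom > b.bin   # 7 KB of data, rounded up to two blocks
du -k a.bin b.bin                   # 4 and 8 - charged block usage
du -sk .                            # 16 on a filesystem that charges the directory a full block
find . | wc -l                      # 3 inodes: the directory and two files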

:::tip Inode quotas are controlled by quota service name in the fquota service class. :::

Doing some quick math, the maximum number of files a 40 GB storage quota would allow for is approximately 9,320,675 1-byte files - still quite a bit.
quota-inode minimums

::: details A zero byte file doesn't generate a 4 KB block of file storage, but still generates an inode. This is why 1 byte is used instead of 0 bytes. :::
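
One way to reproduce the 9,320,675 figure - assuming each 1-byte file is charged one 4 KB data block plus a 512 byte (XFS) inode of on-disk overhead:

# 40 GB expressed in KB
echo $((40 * 1024 * 1024))     # 41943040
# Divide by 4.5 KB per file (4 KB data block + 0.5 KB inode)
echo $((41943040 * 2 / 9))     # 9320675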

The maximum number of inodes on XFS is 2^64; on Ext4 it is 2^32. You could fill up an XFS server with 9.3 million 1-byte files every second for 62,000 years before reaching its limit!

Ext4 on the other hand wouldn't last 10 minutes, assuming you could find a process to generate 9.3 million files every second.

In either situation you're liable to run out of storage before inodes. On XFS systems, 2^64 inodes would require 8,388,608 PB of storage.

If you're on XFS, don't worry about counting inodes.

XFS/Ext4 idiosyncrasies

On Ext4/Ext3 platforms, CAP_SYS_RESOURCE allows a process to bypass quota enforcement. XFS does not honor this bypass even when a user or process has the CAP_SYS_RESOURCE capability set. Thus services that must create files while running as root or with CAP_SYS_RESOURCE can fail when those files would exceed quota. Do not sgid or suid a directory in a way that may cause an essential service, such as Apache, to fail on boot if quotas prohibit the write.

It is possible to disable quota enforcement on XFS, while still counting usage, using xfs_quota:

# findmnt just resolves which block device /home/virtual resides on
xfs_quota -xc 'disable -ugv' "$(findmnt -n -o source --target /home/virtual)"

This affects quota enforcement globally, so use it wisely. Likewise, don't forget to re-enable it:

xfs_quota -xc 'enable -ugv' "$(findmnt -n -o source --target /home/virtual)"

Bandwidth

Bandwidth is tallied every night during logrotation. Logrotation runs via /etc/cron.daily/logrotate. Its start time may be adjusted using the cron.start-range Scope. A secondary task, bwcron.service, runs every night at 12 AM server time (see the system.timezone Scope). Enforcement is carried out during this window. Disposition is configured by the bandwidth Tuneable. The following table summarizes several options.

| Parameter | Description |
| --- | --- |
| resolution | For archiving; bandwidth is rounded down and binned every n seconds. Smaller resolutions increase storage requirements. |
| stopgap | Bandwidth stopgap expressed as a percentage. 100 terminates a site when it's over its allotted bandwidth. The default setting, 200, suspends the site when it has exceeded 200% of its bandwidth. 95 would suspend a site when it's within 5% of its bandwidth quota. |
| notify | Bandwidth notification threshold expressed as a percentage. As a sanity check, bandwidth_notify <= bandwidth_stopgap. Setting 0 would effectively notify every night. |
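
As a sketch of adjusting these, assuming the tuneables map to the [bandwidth] section of config.ini addressed via the cp.config Scope (values are illustrative):

# Notify at 90% of quota, suspend at 110%
cpcmd scope:set cp.config bandwidth notify 90
cpcmd scope:set cp.config bandwidth stopgap 110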

An email is sent to the customer every night informing them of overage. This template, located in resources/views/email/bandwidth, may be customized using typical ApisCP customization rules.

Memory

Memory is a misunderstood and complex topic. linuxatemyram.com addresses many common but unfounded complaints about low free memory on a healthy system. cgroup memory accounting doesn't stray from this complexity. Before discussing the technical challenges in accounting (and copy-on-write semantics), let's start with some basics.

# Set ceiling of 512 MB for all processes
EditDomain -c cgroup,memory=512 domain.com
# Switch to domain.com account
su domain.com
# Generate up to 512 MB, some memory is reserved by the shell
yes | tr \\n x | head -c $((512*1024*1024)) | grep n
# Once memory has been reached, process terminates with "Killed"

A site may consume up to 512 MB of memory before the OOM killer is invoked. When an OOM condition is reached, further memory allocation fails, an event is logged in the memory controller, and the offending application ends abruptly.

dmesg notes an OOM killer invocation on the process,

[2486967.059804] grep invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=600
[2486967.059926] Task in /site133 killed as a result of limit of /site133
[2486967.059929] memory: usage 524288kB, limit 524288kB, failcnt 153
[2486967.059930] memory+swap: usage 525060kB, limit 9007199254740988kB, failcnt 0
[2486967.059932] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
[2486967.059933] Memory cgroup stats for /site133: cache:0KB rss:524288KB rss_huge:0KB mapped_file:0KB swap:772KB inactive_anon:31404KB active_anon:492884KB inactive_file:0KB active_file:0KB unevictable:0KB
[2486967.059957] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
[2486967.060381] [22040]     0 22040    51698     1729      57       10             0 su
[2486967.060384] [22041]  9730 22041    29616      971      14      322           600 bash
[2486967.060446] [25889]  9730 25889    27014       86      11        0           600 yes
[2486967.060449] [25890]  9730 25890    27020      154      11        0           600 tr
[2486967.060452] [25891]  9730 25891    27016      166      11        0           600 head
[2486967.060455] [25892]  9730 25892   224814   130523     268        0           600 grep
[2486967.060459] Memory cgroup out of memory: Kill process 25892 (grep) score 710 or sacrifice child
[2486967.067494] Killed process 25892 (grep), UID 9730, total-vm:899256kB, anon-rss:521228kB, file-rss:864kB, shmem-rss:0kB
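
The memory controller also exposes counters that can be read directly - a sketch assuming cgroup v1 paths and a site group named site1:

# Number of times the cgroup hit its memory limit (matches "failcnt" above)
cat /sys/fs/cgroup/memory/site1/memory.failcnt
# Configured ceiling in bytes (512 MB = 536870912)
cat /sys/fs/cgroup/memory/site1/memory.limit_in_bytes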

::: tip "OOM" is an initialism for "out of memory". Killer is... a killer. OOM killer is invoked by the kernel to judiciously terminate processes when it's out of memory either on the system or control group. :::

Using Metrics, OOM events can be easily tracked. cpcmd -d domain.com telemetry:get c-memory-oom reports the latest OOM value for a site. A free-form query is also available that provides similar information for all sites.

SELECT 
	domain, 
	value, 
	MAX(ts) 
FROM 
	metrics 
JOIN 
	siteinfo USING (site_id) 
JOIN 
	metric_attributes USING (attr_id) 
WHERE 
	name = 'c-memory-oom' 
	AND 
	value > 0 
	AND 
	ts > NOW() - INTERVAL '1 DAY' 
GROUP BY (domain, value);

As an alternative, telemetry:range can be used to examine the sum over a window.

cpcmd telemetry:range c-memory-oom -86400 null 12

::: details The c-memory-oom attribute is summed over the last day (86400 seconds) for site ID 12. false may be specified after the site ID to list each record individually. :::

CPU

CPU utilization comes in two forms: user and system (total CPU time is the sum of user + system). User time is spent incrementing over a loop, adding numbers, or templating a theme. System time is spent when a process communicates with the kernel directly to perform a privileged function, such as opening a file, forking a process, or communicating over a network socket.

In typical operation, user will always be an order of magnitude higher than system. time can help you understand the breakdown. Don't worry if it doesn't make sense yet; we'll walk through it.

strace -c -- /bin/sh -c 'time  (let SUM=0; for i in $(seq 1 1000) ; do SUM+=$i ; stat / > /dev/null; done)'

real    0m2.231s
user    0m0.777s
sys     0m1.336s
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 99.96    1.335815      667908         2         1 wait4
  0.01    0.000152          11        14           mmap
  0.01    0.000080          10         8           mprotect
  0.01    0.000073           9         8           open
  0.00    0.000049          12         4           read
  0.00    0.000031           4         8           close
  0.00    0.000029           2        16           rt_sigprocmask
  0.00    0.000024           3         7           fstat
  0.00    0.000024           2        10           rt_sigaction
  0.00    0.000022          11         2           munmap
  0.00    0.000016           3         5           brk
  0.00    0.000016          16         1           execve
  0.00    0.000011           6         2           stat
  0.00    0.000007           7         1         1 access
  0.00    0.000004           4         1           getrlimit
  0.00    0.000004           4         1           getpgrp
  0.00    0.000003           3         1           getpid
  0.00    0.000003           3         1           uname
  0.00    0.000003           3         1           getuid
  0.00    0.000003           3         1           getgid
  0.00    0.000003           3         1           geteuid
  0.00    0.000003           3         1           getegid
  0.00    0.000003           3         1           getppid
  0.00    0.000003           3         1           arch_prctl
  0.00    0.000000           0         1           rt_sigreturn
  0.00    0.000000           0         1           clone
------ ----------- ----------- --------- --------- ----------------
100.00    1.336381                   100         2 total

Site Administrator glance

For Site Administrators, "user" and "system" are 24-hour recorded totals using the same mechanism that counts run-queue size and task duration. As there are 86,400 seconds in a day per logical core, in theory the maximal value would approach 86,400 seconds * <n processors>; but with several hundred processes running on a server, it is impossible for any one task group to ever reach this total (much like absolute zero, it can only be approached).
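
A sketch of reading the raw counters behind this accounting - assuming cgroup v1 paths and a site group named site1:

# user and system time, expressed in USER_HZ ticks (typically 1/100 s)
cat /sys/fs/cgroup/cpuacct/site1/cpuacct.stat
# Total CPU time in nanoseconds across all logical processors
cat /sys/fs/cgroup/cpuacct/site1/cpuacct.usage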

Process

A PID is a process ID. From the kernel's perspective, every thread receives its own PID: a single-threaded application creates 1 process ID, while a multithreaded application creates up to n process IDs. The nuance is important because process enforcement counts threads, not processes. In the example below, MySQL is charged with 37 processes; in a typical ps view, this may appear as only 1 process on the surface.

Threaded vs non-threaded PID view of MySQL
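
A quick way to compare the two views, using mysqld as an example process name:

# One row per process, with NLWP showing the thread count
ps -C mysqld -o pid,nlwp,cmd
# One row per thread (LWP) - this is the figure process enforcement counts
ps -C mysqld -L -o pid,lwp,cmd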

Let's set the process limit to 100 and induce a fork bomb, which rapidly spawns up to 100 processes before further fork attempts fail:

EditDomain -c cgroup,proclimit=100 -D site1
su site1
# Uncomment following line to run a fork bomb
# :(){ ; :|:& };:

# Output:
# bash: fork: retry: No child processes
# bash: fork: retry: No child processes
# bash: fork: retry: No child processes

And confirm the ceiling took effect by investigating the contents of pids.max in the pids controller (pids.current in the same directory holds the live task count),

cat /sys/fs/cgroup/pids/site1/pids.max

Likewise, if 100 threads were created using a tool such as GNU Parallel, a similar result would be seen once the thread count hits 100.

::: tip One of many layers A secondary defense, in the event no such cgroup protection is applied, exists in FST/siteinfo/etc/security/limits.d/10-apnscp-user.conf, which sets a generous limit of 2,048 processes. This can be adjusted by setting limit_nproc in Bootstrapper and running the system/limits role. :::
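
A sketch of that adjustment - the value 4096 is illustrative and this assumes the usual Scope + Bootstrapper workflow:

# Persist the new ceiling in Bootstrapper's variables
cpcmd scope:set cp.bootstrapper limit_nproc 4096
# Re-run the role that applies the limits.d template
upcp -sb system/limits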

::: warning Program behavior is often unspecified when it can no longer create additional threads or processes. proclimit should be used judiciously to prevent abuse, not act as a prod for users to upgrade to a more profitable package type. :::

IO

IO restrictions are classified by read and write.

EditDomain -c cgroup,writebw=2 domain.com
# Apply the min of blkio,writebw/blkio,writeiops
# Both are equivalent assuming 4 KB blocks
EditDomain -c cgroup,writebw=2 -c blkio,writeiops=512 domain.com

IO and CPU weighting may be set via ioweight and cpuweight respectively. ioweight requires usage of the CFQ/BFQ IO elevators.

# Default weight is 100
# Halve IO priority, double CPU priority
EditDomain -c cgroup,ioweight=50 -c cgroup,cpuweight=200 domain.com
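
The active elevator can be checked per block device; sda below is an example device name:

# Active scheduler is shown in [brackets]
cat /sys/block/sda/queue/scheduler
# Switch to BFQ at runtime, if the module is available
echo bfq > /sys/block/sda/queue/scheduler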

Emergency stopgaps

Troubleshooting

Memory reported is different than application memory

cgroup reports all memory consumed within the OS by applications, which includes filesystem caches + network buffers. Cache can be automatically expunged when needed by the OS. To expunge the cache forcefully, write "1" to /proc/sys/vm/drop_caches. For example, working with "site1" or the first site created on the server:

cat /sys/fs/cgroup/memory/site1/memory.usage_in_bytes
# Value is total RSS + TCP buffer + FS cache
echo 1 > /proc/sys/vm/drop_caches
# Value is now RSS
cat /sys/fs/cgroup/memory/site1/memory.usage_in_bytes

This can be confirmed by examining memory.stat in the cgroup home. Likewise, memory reported by a process may be higher than memory reported by cgroup; this is because cgroup only accounts for memory uniquely reserved by the application. A fork shares its parent's memory pages and copies on write, at which point the newly claimed memory is charged to the cgroup.
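
A sketch of pulling that breakdown from memory.stat, again assuming cgroup v1 paths and the site1 group:

# cache = page cache, rss = anonymous memory uniquely charged to the group
grep -E '^(total_)?(rss|cache) ' /sys/fs/cgroup/memory/site1/memory.stat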