
Feature: System Health

Garrett LeSage edited this page May 22, 2018 · 4 revisions

System health overlaps with PCP (Performance Co-Pilot) integration without being contained in it: health includes metrics that can only be derived from PCP, as well as metrics that do not depend on PCP at all.

Goals

  • Display system status at a glance
  • Provide ways to easily fix issues
  • When an easy fix is not possible, provide information on how to resolve a problem

Examples

When no health issues are detected, the system should report a healthy state. The lists below are a starting set of possible issues with a system, not an exhaustive one; more could be added later.

General health issues

Some of these issues have simple fixes that Cockpit can apply automatically.

  1. Security-related software updates available
    • Click to view the software updates page
  2. Issues mounting filesystems (as specified in fstab, etc.)
    • Display issues with the filesystem mounts, along with errors while mounting
  3. Insufficient storage space on partitions
    • Show partitions with small amounts of space
  4. SMART issues
    • Display issue, which may include:
      • Bad sectors on a disk
      • IO issues
  5. Swap is currently active
    • Display warning that swap is active (PCP needs to be installed for more details; see below)
  6. Issues with bringing up network interfaces
    • Show problematic network interfaces; click to switch to the network page
  7. Enabled & running systemd service keeps restarting
    • Click to display service's page with its log visible
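
As a rough illustration of how two of the checks above (insufficient storage space and active swap) might be detected without PCP, here is a minimal Python sketch. The function names and the 10% free-space threshold are assumptions made for this example; they are not part of Cockpit. The swap check parses the `SwapTotal` and `SwapFree` fields of `/proc/meminfo`.

```python
import shutil


def low_space_mounts(mount_points, min_free_fraction=0.10, usage=shutil.disk_usage):
    """Return (mount_point, free_fraction) pairs for mounts below the threshold.

    min_free_fraction is a hypothetical default, not a value from the design.
    `usage` is injectable so the check can be exercised without real disks.
    """
    flagged = []
    for mount_point in mount_points:
        stats = usage(mount_point)
        if stats.total == 0:
            # Pseudo-filesystems can report zero size; skip them.
            continue
        free_fraction = stats.free / stats.total
        if free_fraction < min_free_fraction:
            flagged.append((mount_point, free_fraction))
    return flagged


def swap_in_use_kib(meminfo_text):
    """Return SwapTotal - SwapFree (in KiB) from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            fields[key.strip()] = int(parts[0])
    return fields.get("SwapTotal", 0) - fields.get("SwapFree", 0)


if __name__ == "__main__":
    print(low_space_mounts(["/"]))
    try:
        with open("/proc/meminfo") as f:
            print("swap in use (KiB):", swap_in_use_kib(f.read()))
    except FileNotFoundError:
        print("/proc/meminfo not available on this platform")
```

A non-zero result from `swap_in_use_kib` would correspond to the "swap is currently active" warning; which partitions to check and what counts as "small" are left open by the list above.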

PCP-derived health issues

Several detectable issues require PCP to be installed for the detection to be accurate and/or useful. Most of these will not have a simple 1-click solution; instead, they will require displaying information and/or digging a bit further.

  1. CPU load is constantly too high
    • Identify top offenders over a window of time and provide actions to stop/restart services and/or kill processes
  2. Not enough memory is free
    • Identify top offenders and provide actions (similar to CPU load); suggest upgrading RAM
  3. Swap is often used (PCP-enhanced version of swap rule above)
    • Related to the "not enough memory is free" issue (above)
    • Show top memory offenders while swap is active (this should help identify the offenders)
  4. Network is constantly saturated
    • Show processes transferring the most data
  5. Excessive waiting for storage (disk is >85% busy)
  6. Huge page fragmentation/defragmentation (memory is fragmented and the system is spending a lot of time shuffling chunks of memory around to defragment it)
  7. Network errors exist
  8. Packet receive (RX) queue is too small, causing many packets to be dropped
    • Provide a means to specify a new queue length, with a suggested default
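
Several of the rules above hinge on a condition holding "constantly" or "over a window of time" rather than at a single instant. The following Python sketch shows one possible way to express that, assuming metric samples have already been collected (for example from PCP). The function names, the 90% "sustained" fraction, and the sample data are hypothetical; only the 85% disk-busy threshold comes from the list above.

```python
def is_sustained(samples, threshold, min_fraction=0.9):
    """Return True if at least min_fraction of the samples exceed threshold.

    min_fraction is an assumed tuning knob for what "constantly" means.
    """
    if not samples:
        return False
    over = sum(1 for value in samples if value > threshold)
    return over / len(samples) >= min_fraction


def top_offenders(per_process_samples, count=3):
    """Rank processes by mean usage over the window, highest first.

    per_process_samples maps a process name to its list of usage samples.
    """
    averages = [
        (name, sum(values) / len(values))
        for name, values in per_process_samples.items()
        if values
    ]
    averages.sort(key=lambda item: item[1], reverse=True)
    return averages[:count]


if __name__ == "__main__":
    # Hypothetical disk-busy percentages sampled over a window.
    disk_busy = [90, 92, 88, 95, 91]
    print("disk saturated:", is_sustained(disk_busy, threshold=85))

    # Hypothetical per-process CPU percentages over the same window.
    cpu = {"httpd": [50, 70], "backup": [10, 20], "cron": [5, 5]}
    print("top offenders:", top_offenders(cpu, count=2))
```

Requiring a fraction of the window rather than every sample keeps a brief dip from resetting the alert; whether that trade-off is appropriate would depend on the metric.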

Mockups

(Mockups are rough sketches and are not intended to be finalized or "pixel-perfect".)
