
Feature: System Health

Garrett LeSage edited this page May 22, 2018 · 4 revisions

System health overlaps with PCP (Performance Co-Pilot) integration but is not the same thing: it includes health metrics that are only available through PCP as well as metrics that do not depend on PCP.

Goals

  • Display system status at a glance
  • Provide ways to easily fix issues
  • When an easy fix is not possible, provide information on how to resolve a problem

Examples

When no health issues are detected, the system should report a healthy state. The lists below are a starting point rather than an exhaustive catalog of what can go wrong with a system; more issues could be added later.

General health issues

  1. Security-related software updates available
  2. Issues mounting filesystems (as specified in fstab, etc.)
  3. Insufficient storage space on partitions
  4. SMART issues
    • Bad sectors on a disk
    • I/O issues
  5. Swap is active
  6. Issues with bringing up network interfaces
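
As an illustration of how one of the general checks above could feed the at-a-glance status, here is a minimal sketch of the "insufficient storage space" check. It is not the project's implementation: the function names and the 10% free-space threshold are assumptions made for this example, and only Python's standard `shutil.disk_usage` call is used.

```python
import shutil
from typing import List


def is_low_on_space(total_bytes: int, free_bytes: int,
                    min_free_fraction: float = 0.10) -> bool:
    """Return True when the free share of a filesystem is below the threshold.

    The 10% default is an assumed value for illustration only.
    """
    if total_bytes <= 0:
        # Pseudo-filesystems can report a size of zero; nothing to flag.
        return False
    return free_bytes / total_bytes < min_free_fraction


def storage_issues(mount_points: List[str],
                   min_free_fraction: float = 0.10) -> List[str]:
    """Collect one human-readable issue per mount point that is low on space."""
    issues = []
    for mount_point in mount_points:
        usage = shutil.disk_usage(mount_point)  # named tuple: total, used, free
        if is_low_on_space(usage.total, usage.free, min_free_fraction):
            issues.append(f"Insufficient storage space on {mount_point}")
    return issues


def summarize(issues: List[str]) -> str:
    """Report a healthy state when no issues were detected."""
    return "healthy" if not issues else f"{len(issues)} issue(s) detected"


if __name__ == "__main__":
    # Check only the root filesystem; a real check would walk all mounts.
    print(summarize(storage_issues(["/"])))
```

Keeping the threshold comparison in a pure function separates the policy (what counts as "insufficient") from the measurement, so the same summary logic could later aggregate other checks from the list.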

PCP-derived health issues

Detecting the following issues accurately or usefully requires PCP to be installed:

  1. CPU load is constantly too high
  2. Not enough memory is free
  3. Network is constantly saturated
  4. Swap is often used
  5. Excessive waiting for storage (disk is >85% busy)
  6. Huge page fragmentation/defragmentation (memory is fragmented and the system spends a lot of time moving chunks of memory around to defragment it)
  7. Network errors exist
  8. Packet receive (RX) queue is too small, causing many packets to be dropped
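
Several items above hinge on words like "constantly" and "often", which only make sense over a history of samples. The sketch below shows one possible way to turn such a history into a yes/no answer. It assumes the samples have already been collected (for example from PCP) into a plain list; the function name and the 90% "sustained" fraction are assumptions for illustration, while the 85% disk-busy threshold comes from the list above.

```python
from typing import Sequence


def is_sustained(samples: Sequence[float], threshold: float,
                 min_fraction: float = 0.9) -> bool:
    """Return True when at least min_fraction of samples exceed threshold.

    min_fraction=0.9 is an assumed reading of "constantly"; a single
    spike in an otherwise quiet window should not raise an issue.
    """
    if not samples:
        # Without history there is nothing to judge, so report no issue.
        return False
    above = sum(1 for value in samples if value > threshold)
    return above / len(samples) >= min_fraction


if __name__ == "__main__":
    # Hypothetical disk-busy percentages, one per sampling interval.
    disk_busy = [92.0, 88.5, 97.1, 90.3, 86.2, 91.7, 89.9, 95.0, 87.4, 93.8]
    if is_sustained(disk_busy, threshold=85.0):
        print("Excessive waiting for storage (disk is >85% busy)")
```

This is also why these checks depend on recorded metrics: a single instantaneous reading cannot distinguish a brief spike from a sustained problem.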

Mockups

(Mockups are rough sketches and are not intended to be finalized or "pixel-perfect".)
