
Health Check Tool

Marek Czernek edited this page Jan 31, 2025 · 6 revisions

Health Check

Introducing Health Check Tool

Health Check is a tool based on the Health Check RFC. Currently, the tool is a disconnected solution: it takes a supportconfig as its input and does not connect to a live system.

Based on the supportconfig, use Health Check to:

  • Search and visualize errors in log files.
  • Visualize state of the analyzed system.
  • Parse the configuration of the system and detect possibly incorrect configuration values.

Because Health Check is disconnected from a live Uyuni or SUSE Multi Linux Manager (MLM) node, all data is based on the supportconfig. When you modify your configuration and want to verify that the modification is correct, you must:

  • Take a new supportconfig
  • Reprovision the Health Check tool

Architecture

The Health Check tool consists of the following parts:

  • Loki - a log aggregation database that stores the logs sent by Promtail.
  • Promtail - a log parser that parses logs from a supportconfig and saves them in Loki.
  • Supportconfig-exporter - a tool that parses the server configuration from supportconfig and serves specified data by using an HTTP server.
  • Grafana - a visualization layer.
  • Manager - code that starts and stops the tool.
Source Code for Diagram
graph LR
    subgraph "Containers"
        promtail["Promtail"] --> loki["Loki"]
        grafana["Grafana"] --> loki
        grafana --> supportconfig-exporter["supportconfig-exporter"]
    end
    subgraph "filesystem"
      supportconfig["supportconfig directory"] --> promtail
      supportconfig --> supportconfig-exporter
    end
    user["User"] --> grafana

Currently, the tool uses the following ports:

Tool                    External Port  Internal Port
Grafana                 3000           3000
Loki                    3100           3100
Promtail                9081           9081
supportconfig_exporter  9000           9000

Caution

All ports bind to all network interfaces and can expose potentially sensitive data to anyone who can reach the host over the network.
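
To see which of these ports are actually listening on all interfaces of your host, a check along the following lines might help. This is a minimal sketch, not part of the Health Check tool: the function names is_exposed and list_exposed are hypothetical, the port list is copied from the table above, and the listing relies on the ss utility from iproute2 being available.

```shell
#!/usr/bin/env bash
# Ports copied from the table above (assumption: defaults are unchanged).
HEALTH_CHECK_PORTS="3000 3100 9081 9000"

# Succeeds when a listening address such as "0.0.0.0:3000", "*:3000",
# or "[::]:3000" is a wildcard bind on one of the Health Check ports.
is_exposed() {
  local addr="$1" host port p
  port="${addr##*:}"   # text after the last colon
  host="${addr%:*}"    # text before the last colon
  case "$host" in
    "0.0.0.0"|"*"|"[::]") ;;   # wildcard bind, keep checking
    *) return 1 ;;             # bound to a specific address
  esac
  for p in $HEALTH_CHECK_PORTS; do
    [ "$p" = "$port" ] && return 0
  done
  return 1
}

# Print every wildcard listener that matches a Health Check port.
# Column 4 of `ss -tlnH` is the local address and port.
list_exposed() {
  ss -tlnH | awk '{print $4}' | while read -r addr; do
    if is_exposed "$addr"; then echo "exposed: $addr"; fi
  done
}
```

Run list_exposed after starting the tool. If it prints any line, consider restricting access to those ports, for example with a host firewall.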

Installation Process

TODO

Usage

To start the tool, first generate a supportconfig from your Uyuni or MLM server, and extract the supportconfig archive into a directory, for example /tmp/supportconfig.
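
The extraction step could look like the following sketch. It assumes that the supportconfig is a compressed tar archive and that GNU tar is installed; the helper name unpack_supportconfig and the archive file name in the usage line are hypothetical.

```shell
#!/usr/bin/env bash
# Extract a supportconfig archive into a target directory.
# Usage: unpack_supportconfig ARCHIVE TARGET_DIR
unpack_supportconfig() {
  archive="$1"
  target="$2"
  # Create the target directory if it does not exist yet.
  mkdir -p "$target" || return 1
  # When reading from a file, GNU tar detects the compression
  # (xz, bzip2, gzip) on its own, so no -J, -j, or -z flag is needed.
  tar -xf "$archive" -C "$target"
}
```

For example, unpack_supportconfig ./my-supportconfig.txz /tmp/supportconfig. If the archive contains a single top-level directory, it is probably that directory, not /tmp/supportconfig itself, that you want to pass to the tool.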

Note

The Health Check tool requires the MLM supportconfig plugin for a large part of its functionality. Ensure you have installed the MLM supportconfig plugin before generating the supportconfig.

Then, use the run command to start the Health Check tool:

$ health-check --supportconfig_path /tmp/supportconfig \
  run

The previous command configures Grafana to filter logs from the past 7 days by default. Use the --since parameter to modify the default time window.

The following command sets the default Grafana time window to one year (365 days):

$ health-check --supportconfig_path /tmp/supportconfig \
  --since 365 run
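
If you know the date of the oldest event you care about, you can compute the value for --since instead of guessing it. The following is a small sketch under the assumption that --since takes a number of days, as the examples above suggest. The helper name days_between is hypothetical, and it requires GNU date for the -d option.

```shell
#!/usr/bin/env bash
# Print the number of whole days from START to END (default: now).
# Dates are interpreted in UTC to avoid daylight-saving offsets.
# Usage: days_between START [END]
days_between() {
  start="$(date -u -d "$1" +%s)" || return 1
  end="$(date -u -d "${2:-today}" +%s)" || return 1
  echo $(( (end - start) / 86400 ))
}

days_between 2024-01-01 2025-01-01   # prints 366 (2024 is a leap year)
```

You could then pass the result to the run command, for example --since "$(days_between 2024-06-01)". Add one day to the result if you want to be sure that the whole start day is covered.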

Navigating Grafana

When Health Check finishes deploying, open localhost:3000 in your web browser to see the Grafana interface.

Note

If you need to log in, use admin as both the user name and the password.

In the left Home menu, click Dashboards, then click Supportconfig with Logs. This is a pre-provisioned dashboard that contains data about logs, errors, and configuration parsed from the provided supportconfig directory.

If you see any alerts firing, click View alert rule next to the alert, then click Details. Alerts typically have a summary that explains why the alert is firing.

FAQ and Troubleshooting

I don't see any logs in Grafana

There are multiple causes for this issue, for example:

  • Promtail hasn't finished parsing the logs yet.
  • Grafana hasn't finished streaming the logs from Loki yet.
  • The Grafana time window excludes the errors. For example, most errors have a year-old timestamp and you are looking at a time window of 6 months.
  • There are no log entries with errors or warnings in the supportconfig.
  • There is a bug in our collecting or filtering mechanisms.

Note

Promtail only parses log files specific to Uyuni and MLM, such as Salt logs, HTTPd logs, MLM-specific logs, and similar. Promtail does not parse all log files. For example, journalctl logs are not parsed. It is possible that the system is experiencing errors even when Grafana shows no errors.
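
Before working through the causes above, it can help to confirm that the services respond at all. The following sketch probes the standard readiness endpoints of Grafana (/api/health), Loki (/ready), and Promtail (/ready). It assumes that the default ports listed in the Architecture section are in use and that Promtail serves HTTP on port 9081; the function names probe and check_services are hypothetical.

```shell
#!/usr/bin/env bash
# Report whether an HTTP endpoint answers with a success status.
# Usage: probe NAME URL
probe() {
  name="$1"
  url="$2"
  # -f makes curl fail on HTTP errors; --max-time bounds the wait.
  if curl -fsS --max-time 5 -o /dev/null "$url" 2>/dev/null; then
    echo "$name: ok"
  else
    echo "$name: not reachable or not ready ($url)"
    return 1
  fi
}

# Probe the three services; ports are assumptions taken from this page.
check_services() {
  probe grafana  http://localhost:3000/api/health
  probe loki     http://localhost:3100/ready
  probe promtail http://localhost:9081/ready
}
```

Run check_services once the tool is deployed. If all three report ok and Grafana still shows nothing, querying Loki directly, for example with curl -s http://localhost:3100/loki/api/v1/labels, shows whether any log streams have been ingested so far.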

If you have ruled out all of these causes and an error you found in the supportconfig is not displayed in Grafana when you believe it should be, please create an issue.

Health Check consumes too many system resources

This is a known problem. Several parts of Health Check can be resource-intensive. Use the podman stats command to find out which part consumes the most resources, for example:

$ podman stats --format "{{.Name}}\t{{.MemUsage}}\t{{.AVGCPU}}"
health-check-grafana			0B / 32.35GB	0.98%
health_check_loki			0B / 32.35GB	11.95%
health_check_supportconfig_exporter	0B / 32.35GB	0.04%
...output omitted...
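
With several containers in the list, a small filter can pick out the heaviest CPU consumer. This is a sketch only: the helper name top_cpu is hypothetical, and it assumes the output format shown above, with the container name in the first column and the CPU percentage in the last one.

```shell
#!/usr/bin/env bash
# Read `podman stats` lines on stdin (name first, CPU percentage last)
# and print the container with the highest CPU value.
top_cpu() {
  awk '{ cpu = $NF; sub(/%$/, "", cpu); cpu += 0
         if (NR == 1 || cpu > max) { max = cpu; name = $1 } }
       END { if (NR > 0) printf "%s %.2f%%\n", name, max }'
}
```

For example, pipe a single snapshot into the filter: podman stats --no-stream --format "{{.Name}}\t{{.MemUsage}}\t{{.AVGCPU}}" | top_cpu. For the sample output above, the filter prints health_check_loki 11.95%.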

Promtail Resource Consumption

Currently, Promtail contains a memory leak that eventually requires a restart of the Promtail container.

To work around this problem, after you start seeing logs and other data from the time frame in which you are interested, you can stop the Promtail container:

$ podman stop health_check_promtail

You can start the container at a later date if you need Promtail to resume parsing logs:

$ podman start health_check_promtail

Grafana Resource Consumption

Grafana can be CPU-intensive due to the number of operations and alerts it uses. Currently, Grafana spawns around 22 threads, which might cause problems on some machines. There is no workaround as of yet.
