Health Check Tool
Health Check is a tool based on the Health Check RFC. Currently, the tool is a disconnected solution: it takes a supportconfig as its input.
Based on the supportconfig, use Health Check to:
- Search and visualize errors in log files.
- Visualize state of the analyzed system.
- Parse the configuration of the system and detect possibly incorrect configuration values.
Because Health Check is disconnected from a live Uyuni or SUSE Multi Linux Manager (MLM) node, all data is based on the supportconfig. When you modify your configuration and want to verify that the modification is correct, you must:
- Take a new supportconfig
- Reprovision the Health Check tool
Health Check tool consists of the following parts:
- Loki - a log aggregation database that stores the logs sent by Promtail.
- Promtail - a log parser that parses logs from a supportconfig and saves them in Loki.
- Supportconfig-exporter - a tool that parses the server configuration from supportconfig and serves specified data by using an HTTP server.
- Grafana - a visualization layer.
- Manager - code that starts and stops the tool.
![](https://private-user-images.githubusercontent.com/16302538/408146974-3bfa1c7c-77ac-4589-b616-c70dcb56a6c5.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3MzkxMDU3NzAsIm5iZiI6MTczOTEwNTQ3MCwicGF0aCI6Ii8xNjMwMjUzOC80MDgxNDY5NzQtM2JmYTFjN2MtNzdhYy00NTg5LWI2MTYtYzcwZGNiNTZhNmM1LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTAyMDklMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwMjA5VDEyNTExMFomWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTgwYmYyZTJkMTI2NzEwNGNmNjBlMTYxODhkYzc5YmNjN2M3NmM2ODkxYzk0ZjQ0YWI0ZmYwMzdlZTRjNjFlZmQmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.eMdqUMbOEFdVmqhqcePwa4ZHXsEVXzK_RgGfPNCfm_4)
Source Code for Diagram
graph LR
    subgraph "Containers"
        promtail["Promtail"] --> loki["Loki"]
        grafana["Grafana"] --> loki
        grafana --> supportconfig-exporter["supportconfig-exporter"]
    end
    subgraph "filesystem"
        supportconfig["supportconfig directory"] --> promtail
        supportconfig --> supportconfig-exporter
    end
    user["User"] --> grafana
Currently, the tool uses the following ports:
Tool | External Port | Internal Port |
---|---|---|
Grafana | 3000 | 3000 |
Loki | 3100 | 3100 |
Promtail | 9081 | 9081 |
supportconfig_exporter | 9000 | 9000 |
Caution
All ports bind to all network interfaces, and expose potentially sensitive data.
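Because every port is published on all interfaces, it can help to confirm what is actually listening before other people can reach the host. The following is a minimal sketch, not part of Health Check: the helper name `health_check_listeners` is hypothetical, and it only filters the output of `ss -tln` (from iproute2) for the four ports listed in the table above.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Health Check.
# Reads `ss -tln` style output on stdin and prints the local address of every
# listener on one of the ports from the table (3000, 3100, 9081, 9000).
health_check_listeners() {
    # Column 1 is the socket state, column 4 is "Local Address:Port".
    awk '$1 == "LISTEN" && $4 ~ /:(3000|3100|9081|9000)$/ { print $4 }'
}

# Usage (run on the host where Health Check is deployed):
#   ss -tln | health_check_listeners
```

An address such as `0.0.0.0:3000` or `*:3000` in the output means the port is reachable on every interface, so a host firewall rule may be worth considering.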
TODO
To start the tool, first generate a supportconfig from your Uyuni or MLM server, and extract the supportconfig archive into a directory, for example `/tmp/supportconfig`.
Note
The Health Check tool requires the MLM supportconfig plugin for a large part of its functionality. Ensure you have installed the MLM supportconfig plugin before generating the supportconfig.
Then, use the `run` command to start the Health Check tool:
$ health-check --supportconfig_path /tmp/supportconfig \
run
The previous command configures Grafana to show logs from the past 7 days by default. Use the `--since` parameter, which takes a number of days, to change the default time window.
The following command sets the default Grafana time window to one year:
$ health-check --supportconfig_path /tmp/supportconfig \
--since 365 run
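If you know the date of the oldest event you care about, the value for `--since` can be derived rather than guessed. This is a minimal sketch, assuming GNU `date` is available; the helper name `days_since` is hypothetical and not part of Health Check.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Health Check.
# Prints the number of whole days from START (YYYY-MM-DD) to END
# (YYYY-MM-DD, defaults to the current time). Requires GNU date for -d.
days_since() {
    local start_s end_s
    # Work in UTC so daylight saving changes do not skew the difference.
    start_s=$(date -u -d "$1" +%s) || return 1
    end_s=$(date -u -d "${2:-today}" +%s) || return 1
    echo $(( (end_s - start_s) / 86400 ))
}

# Usage:
#   health-check --supportconfig_path /tmp/supportconfig \
#       --since "$(days_since 2024-03-01)" run
```

Adding a day or two of margin to the result avoids cutting off events close to the boundary.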
When the deployment finishes, open `localhost:3000` in your web browser to see the Grafana interface.
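Grafana may need a moment before it answers requests, so the page might not load on the first try. The following is a minimal sketch, assuming `curl` is installed and that Grafana responds on its standard `/api/health` endpoint at the external port from the table. The helper name `wait_for_url` and its default retry values are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Health Check.
# Polls URL until it returns a successful HTTP status.
# Arguments: URL [ATTEMPTS=30] [DELAY_SECONDS=2]
wait_for_url() {
    local url=$1 attempts=${2:-30} delay=${3:-2} i
    for (( i = 1; i <= attempts; i++ )); do
        # -f: fail on HTTP errors, -s: silent, -o /dev/null: discard the body.
        if curl -fs -o /dev/null --max-time 5 "$url"; then
            return 0
        fi
        sleep "$delay"
    done
    return 1
}

# Usage:
#   wait_for_url http://localhost:3000/api/health && echo "Grafana is up"
```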
Note
If you need to log in, use `admin` as both the user name and the password.
In the left Home menu, click Dashboards, then click Supportconfig with Logs. This is a pre-provisioned dashboard that contains data about logs, errors, and configuration parsed from the provided supportconfig directory.
If you see any alerts firing, click View alert rule next to the alert, then click Details. Alerts typically have a summary that explains why the alert is firing.
If Grafana shows no errors or logs, there are multiple possible causes, for example:
- Promtail hasn't finished parsing the logs yet.
- Grafana hasn't finished streaming the logs from Loki yet.
- The Grafana time window excludes the errors. For example, most errors have a year-old timestamp and you are looking at a time window of six months.
- There are no logs with error or warning in the supportconfig.
- There is a bug in our collecting or filtering mechanisms.
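To rule out the case where the supportconfig simply contains no errors or warnings, you can search the extracted directory directly. This is a rough sketch, assuming a case-insensitive match on the words "error" and "warning" is a close enough approximation. The patterns that Promtail and the dashboard actually use are not described here and may differ. The helper name `count_error_lines` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Health Check.
# Prints how many lines under DIR mention "error" or "warning".
count_error_lines() {
    local dir=$1
    # -r: recurse, -I: skip binary files, -i: ignore case,
    # -h: omit file names, -E: extended regular expression.
    grep -rIihE 'error|warning' "$dir" 2>/dev/null | wc -l | tr -d ' '
}

# Usage:
#   count_error_lines /tmp/supportconfig
```

A result of `0` suggests that the empty dashboard is accurate. A large number only shows that candidate lines exist, not that they are in files Promtail parses.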
Note
Promtail only parses log files specific to Uyuni and MLM, such as Salt logs, HTTPd logs, MLM-specific logs, and similar. Promtail does not parse all log files. For example, journalctl logs are not parsed. It is possible that the system is experiencing errors even when Grafana shows no errors.
If you have ruled out all of the problems and an error you found in supportconfig is not displayed in Grafana when you believe it should be, please create an issue.
High resource consumption by Health Check is a known problem, and several parts of the tool can be resource-intensive. Use the `podman stats` command to find out which part consumes the most resources, for example:
$ podman stats --format "{{.Name}}\t{{.MemUsage}}\t{{.AVGCPU}}"
health-check-grafana 0B / 32.35GB 0.98%
health_check_loki 0B / 32.35GB 11.95%
health_check_supportconfig_exporter 0B / 32.35GB 0.04%
...output omitted...
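With many containers, picking out the heaviest one by eye is error-prone. The following is a minimal sketch, assuming the fields are tab-separated as the format string above requests and that the CPU value is the third field. The helper name `busiest_container` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Health Check.
# Reads "name<TAB>memory<TAB>cpu%" lines on stdin and prints the name of
# the container with the highest CPU value.
busiest_container() {
    awk -F'\t' '
        { cpu = $3; sub(/%$/, "", cpu) }                          # strip the trailing percent sign
        NR == 1 || cpu + 0 > max { max = cpu + 0; name = $1 }     # remember the largest value so far
        END { if (NR > 0) print name }
    '
}

# Usage (--no-stream prints one sample and exits):
#   podman stats --no-stream \
#       --format "{{.Name}}\t{{.MemUsage}}\t{{.AVGCPU}}" | busiest_container
```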
Currently, Promtail contains a memory leak that eventually requires a restart of the Promtail container.
To work around this problem, after you start seeing logs and other data from the time frame in which you are interested, you can stop the Promtail container:
$ podman stop health_check_promtail
You can start the container at a later date if you need Promtail to resume parsing logs:
$ podman start health_check_promtail
Grafana can be CPU-intensive due to the number of operations and alerts it uses. Currently, Grafana spawns around 22 threads, which might cause problems on some machines. There is no workaround yet.