Service monitoring

Configuration of the Harvester agent to collect metrics

Harvester can push out service metrics. This requires psutil >= 5.4.8 and Harvester code from after 12 Feb 2019. To enable it, add the following to your harvester configuration file:

[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid

disk_volumes is optional and takes a comma-separated list of volumes.
pidfile is mandatory only when using uwsgi.
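
As a rough illustration of how these options can be read, the sketch below parses the section with Python's standard configparser. The helper name load_service_monitor_settings is hypothetical and this is not Harvester's own configuration loader; it only mirrors the rules stated above (disk_volumes optional and comma separated, pidfile optional).

```python
import configparser

# Same values as the example section above
EXAMPLE_CFG = """
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
"""


def load_service_monitor_settings(text):
    """Hypothetical helper: read the [service_monitor] section from config text."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["service_monitor"]
    # disk_volumes is optional: split the comma separated list and drop blanks
    raw_volumes = section.get("disk_volumes", "")
    volumes = [v.strip() for v in raw_volumes.split(",") if v.strip()]
    return {
        "active": section.getboolean("active", fallback=False),
        "disk_volumes": volumes,
        # pidfile is only needed with uwsgi, so it may be absent
        "pidfile": section.get("pidfile", None),
    }


settings = load_service_monitor_settings(EXAMPLE_CFG)
print(settings)
```

A section without disk_volumes or pidfile yields an empty volume list and a pidfile of None in this sketch.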

The logs will be written to panda-service_monitor.log. A healthy snippet is:

2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG    Running service monitor
2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG    Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0
2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG    Disk usage of data: 69.0 %
...
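
For quick checks it can help to pull the numbers out of these lines. The sketch below is based only on the three sample lines above; the regular expressions and the function name parse_monitor_line are illustrative assumptions, and any other message format in the real log is simply skipped.

```python
import re

# Patterns derived from the sample log lines only (an assumption, not a spec)
MEMORY_RE = re.compile(
    r"Memory usage: (?P<mib>[\d.]+) MiB/(?P<pct>[\d.]+)%, CPU usage: (?P<cpu>[\d.]+)"
)
DISK_RE = re.compile(r"Disk usage of (?P<volume>\S+): (?P<pct>[\d.]+) %")


def parse_monitor_line(line):
    """Return a dict of metrics found in one log line, or None if nothing matched."""
    m = MEMORY_RE.search(line)
    if m:
        return {
            "memory_mib": float(m["mib"]),
            "memory_pct": float(m["pct"]),
            "cpu": float(m["cpu"]),
        }
    m = DISK_RE.search(line)
    if m:
        return {"disk_volume": m["volume"], "disk_pct": float(m["pct"])}
    return None


sample = [
    "2019-03-28 03:36:15,559 panda.log.service_monitor: DEBUG    Running service monitor",
    "2019-03-28 03:36:15,576 panda.log.service_monitor: DEBUG    Memory usage: 178.6640625 MiB/2.5024127947056387%, CPU usage: 0.0",
    "2019-03-28 03:36:15,589 panda.log.service_monitor: DEBUG    Disk usage of data: 69.0 %",
]
parsed = [parse_monitor_line(line) for line in sample]
print(parsed)
```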

Configuration of the central alerting agent

Once Harvester is pushing out service metrics, you need to configure the thresholds and alerts on the alerting agent. Fill in the XML file below and send it to the service managers, who will add it to the configuration directory:

<?xml version="1.0"?>
<instances>
    <instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
        <hostlist>
            <host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
                <contacts>
                    <email>WHO TO NOTIFY 1</email>
                    <email>WHO TO NOTIFY 2</email>
                </contacts>
                <metrics>
                    <metric name="lastsubmittedworker" enable="True">
                        <value>30</value>
                    </metric>
                    <metric name="lastheartbeat" enable="True">
                        <value>30</value>
                    </metric>
                    <metric name="memory" enable="True">
                        <memory_warning>50</memory_warning>
                        <memory_critical>80</memory_critical>
                    </metric>
                    <metric name="cpu" enable="True">
                        <cpu_warning>50</cpu_warning>
                        <cpu_critical>80</cpu_critical>
                    </metric>
                    <metric name="disk" enable="True">
                        <disk_warning>75</disk_warning>
                        <disk_critical>80</disk_critical>
                    </metric>
                </metrics>
            </host>
            <!-- ... YOU CAN ADD MULTIPLE HOSTS -->
        </hostlist>
    </instance>
</instances>

lastsubmittedworker and lastheartbeat: example values are 30 (minutes), 60d... You can disable the metric when you don't expect regular worker submission.
disk_warning/critical, cpu_warning/critical, memory_warning/critical: thresholds expressed in %, for example 50.
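
To make the threshold semantics concrete, here is a minimal sketch that reads warning and critical values from a file shaped like the template above and classifies a measured percentage. It is not the alerting agent's code: the function names, the placeholder IDs, and the rule that a value at or above a threshold triggers that level are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the template with placeholder identifiers (hypothetical values)
EXAMPLE_XML = """<?xml version="1.0"?>
<instances>
    <instance harvesterid="example_id" instanceisenable="True">
        <hostlist>
            <host hostname="example.host" hostisenable="True">
                <metrics>
                    <metric name="memory" enable="True">
                        <memory_warning>50</memory_warning>
                        <memory_critical>80</memory_critical>
                    </metric>
                    <metric name="disk" enable="True">
                        <disk_warning>75</disk_warning>
                        <disk_critical>80</disk_critical>
                    </metric>
                </metrics>
            </host>
        </hostlist>
    </instance>
</instances>
"""


def load_thresholds(xml_text, hostname):
    """Collect (warning, critical) pairs for the enabled percentage metrics of one host."""
    root = ET.fromstring(xml_text)
    thresholds = {}
    for host in root.iter("host"):
        if host.get("hostname") != hostname:
            continue
        for metric in host.iter("metric"):
            name = metric.get("name")
            warning = metric.findtext(f"{name}_warning")
            critical = metric.findtext(f"{name}_critical")
            # Skip disabled metrics and those without warning/critical children
            if metric.get("enable") == "True" and warning and critical:
                thresholds[name] = (float(warning), float(critical))
    return thresholds


def classify(value, warning, critical):
    """Assumed rule: at or above a threshold counts as reaching that level."""
    if value >= critical:
        return "critical"
    if value >= warning:
        return "warning"
    return "ok"


thresholds = load_thresholds(EXAMPLE_XML, "example.host")
print(thresholds)
print(classify(69.0, *thresholds["disk"]))
```

With the disk thresholds of 75 and 80, the 69.0 % reading from the healthy log snippet would be classified as ok under this assumed rule.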
