Service monitoring

Fernando Barreiro edited this page Mar 28, 2019 · 7 revisions

Harvester can push out service metrics. This requires psutil >= 5.4.8. To enable it, add a service_monitor section to the Harvester configuration:

[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
  • disk_volumes is optional and takes a comma-separated list of volumes
  • pidfile is mandatory only when running under uwsgi
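
As an illustration of how these options fit together, the sketch below reads a service_monitor section with Python's standard configparser. It is not Harvester's own loading code: the function name read_service_monitor and the fallback values for missing options are assumptions made for this example.

```python
import configparser

# Same section as shown above, embedded here so the example is self-contained.
EXAMPLE_CFG = """
[service_monitor]
active = True
disk_volumes = data,data1
pidfile = /var/log/harvester/panda_harvester.pid
"""


def read_service_monitor(cfg_text):
    """Return the [service_monitor] options as a plain dict (hypothetical helper)."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    section = parser["service_monitor"]

    # disk_volumes is optional: a missing or empty value gives an empty list.
    raw_volumes = section.get("disk_volumes", fallback="")
    volumes = [v.strip() for v in raw_volumes.split(",") if v.strip()]

    return {
        # Assumption: treat a missing "active" option as disabled.
        "active": section.getboolean("active", fallback=False),
        "disk_volumes": volumes,
        # pidfile is only mandatory under uwsgi, so None is allowed here.
        "pidfile": section.get("pidfile", fallback=None),
    }


settings = read_service_monitor(EXAMPLE_CFG)
print(settings)
```

Splitting on commas and stripping whitespace means that "data, data1" and "data,data1" are read the same way.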

Once Harvester is pushing out service metrics, configure the thresholds and alerts with harvester_service_monitoring (https://github.com/PanDAWMS/harvester_service_monitoring). Fill in the XML file below and add it to the configuration directory:

<?xml version="1.0"?>
<instances>
    <instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
        <hostlist>
            <host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
                <contacts>
                    <email>WHO TO NOTIFY 1</email>
                    <email>WHO TO NOTIFY 2</email>
                </contacts>
                <metrics>
                    <metric name="lastsubmittedworker" enable="True">
                        <value>30</value>
                    </metric>
                    <metric name="lastheartbeat" enable="True">
                        <value>30</value>
                    </metric>
                    <metric name="memory" enable="True">
                        <memory_warning>50</memory_warning>
                        <memory_critical>80</memory_critical>
                    </metric>
                    <metric name="cpu" enable="True">
                        <cpu_warning>50</cpu_warning>
                        <cpu_critical>80</cpu_critical>
                    </metric>
                    <metric name="disk" enable="True">
                        <disk_warning>75</disk_warning>
                        <disk_critical>80</disk_critical>
                    </metric>
                </metrics>
            </host>
            <!-- ... YOU CAN ADD MULTIPLE HOSTS -->
        </hostlist>
    </instance>
</instances>
  • lastsubmittedworker and lastheartbeat: example values are 30 (minutes), 60d... You can disable the metric in cases where you don't expect regular worker submission.
  • disk_warning/critical, cpu_warning/critical, memory_warning/critical: thresholds expressed in %, for example 50
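
Before deploying the XML file, it can help to sanity-check the percentage thresholds. The sketch below does so with Python's standard xml.etree.ElementTree. It is not part of harvester_service_monitoring: the function name check_thresholds is hypothetical, and the rules it enforces (values between 0 and 100, warning below critical) are assumptions based on the example above, not documented requirements.

```python
import xml.etree.ElementTree as ET

# Trimmed version of the file above, with placeholder host and ID values.
EXAMPLE_XML = """<instances>
    <instance harvesterid="YOUR HARVESTER ID" instanceisenable="True">
        <hostlist>
            <host hostname="THE HOST RUNNING HARVESTER" hostisenable="True">
                <metrics>
                    <metric name="memory" enable="True">
                        <memory_warning>50</memory_warning>
                        <memory_critical>80</memory_critical>
                    </metric>
                    <metric name="disk" enable="True">
                        <disk_warning>75</disk_warning>
                        <disk_critical>80</disk_critical>
                    </metric>
                </metrics>
            </host>
        </hostlist>
    </instance>
</instances>"""

PERCENT_METRICS = ("memory", "cpu", "disk")


def check_thresholds(xml_text):
    """Return a list of problems found in the percentage metrics (hypothetical helper)."""
    problems = []
    root = ET.fromstring(xml_text)
    for host in root.iter("host"):
        hostname = host.get("hostname")
        for metric in host.iter("metric"):
            name = metric.get("name")
            if name not in PERCENT_METRICS:
                continue  # lastsubmittedworker / lastheartbeat are not percentages
            warning = metric.findtext(f"{name}_warning")
            critical = metric.findtext(f"{name}_critical")
            if warning is None or critical is None:
                problems.append(f"{hostname}: {name} is missing a threshold")
                continue
            warning, critical = float(warning), float(critical)
            if not (0 <= warning <= 100 and 0 <= critical <= 100):
                problems.append(f"{hostname}: {name} thresholds must be within 0-100")
            elif warning >= critical:
                # Assumption: a warning should fire before the critical alert.
                problems.append(f"{hostname}: {name} warning is not below critical")
    return problems


problems = check_thresholds(EXAMPLE_XML)
print(problems or "thresholds look consistent")
```

The check reads only the memory, cpu and disk metrics, so time-based values such as those of lastsubmittedworker are left untouched.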