Lbclient

reguero edited this page Apr 3, 2022 · 1 revision

Collectd based lbclient configuration

You may also use any collectd metric defined on your node for health monitoring by adding lbclient::alias::check entries of type collectd. You may add any number of them. If any one of the conditions is not fulfilled, the machine is removed from the load balancing alias. For instance:

# this example works if there is a swap partition
lbclient::alias::check{ 'check swap space at least 10% free':
  check => {'type' => 'collectd', 'data' => ' [swap-var_swap/percent-free] > 10',},
}
# and this example works if there is a swap file. Note that the name of the collectd metric changes
lbclient::alias::check{ 'check swap file at least 10% free':
  check => {'type' => 'collectd', 'data' => ' [swap-swapfile/percent-free] > 10',},
}

lbclient::alias::check { 'Check root full exception':
  check => {'type' => 'collectd', 'data' => ' [df-root/percent_bytes-free] > 15',},
}

# Checking that the ssh process is there
# The os.release.major fact is a string, so it is compared against '7'
if $facts['os']['release']['major'] == '7' {
  lbclient::alias::check { 'Check ssh':
    check => {'type' => 'collectd', 'data' => '[systemd-sshd/gauge-running]>0',},
  }
} else {
  lbclient::alias::check { 'Check ssh':
    check => {'type' => 'collectd', 'data' => '[processcount-sshd/gauge]>0',},
  }
}

You may also use collectd metrics to define the load metric to be used for load balancing by the lbclient by specifying lbclient::alias::load entries. For instance:

lbclient::alias::load { 'Load balance based on LoadAvg':
  load => {'type' => 'collectd', 'data' => '1 + [load/load-relative] * 12.5 ',},
}

Please note:

  • The collectd metrics may be multiplied by coefficients.
  • Multiple collectd metrics can be specified in the same expression, like 'data' => '[<metric_name_1>] * 2 + [<metric_name_2>]'
  • Constants may be specified using lbclient::alias::load with type => 'constant'.
  • When you specify several load metrics, they are added together.
  • If the sum of the metrics is 0 or a negative number, the node is not considered for the alias, so it is effectively taken out of it.
  • If the sum of the metrics is a fractional number between 0 and 1, the lbclient rounds it to 0, because the load metric is of integer OID type.
  • The last lbclient::alias::load entry in the example shows how to add 1 to the metric value, making sure that the node stays in the alias when the combined metric is between 0 and 1.
  • Components of multivalued metrics can be used. They are supported by specifying <metric_name>:identifier, where the identifier is either the label of the component or an index (starting at 1). For instance, given:
# collectdctl getval xxxx.cern.ch/load/load-relative
shortterm=3.000000e-02
midterm=2.500000e-02
longterm=2.500000e-02

We can specify

lbclient::alias::load {  'Load average, midterm':
  load => {type => 'collectd', data => "[load/load-relative:midterm]*${scale}",},
}

that is equivalent to

lbclient::alias::load {  'Load average, midterm':
  load => {type => 'collectd', data => "[load/load-relative:2]*${scale}",},
}

Please note that when only the metric name is specified, it points to the first element of the multivalued metric, so

lbclient::alias::load {  'Load average, shortterm':
  load => {type => 'collectd', data => "[load/load-relative:shortterm]*${scale}",},
}

is equivalent to

lbclient::alias::load {  'Load average, shortterm':
  load => {type => 'collectd', data => "[load/load-relative]*${scale}",},
}

or to

lbclient::alias::load {  'Load average, shortterm':
  load => {type => 'collectd', data => "[load/load-relative:1]*${scale}",},
}
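The notes above on combining metrics in one expression can be sketched as follows. This is a minimal illustration that reuses only the load/load-relative metric shown above; the resource title and the coefficient 10 are arbitrary values chosen for the example, not recommended settings.

```puppet
# Illustrative sketch: the title and the coefficients are arbitrary.
# Two components of the same multivalued metric are combined in one expression,
# and 1 is added so that a small fractional result does not end up as 0.
lbclient::alias::load { 'Load average, shortterm and midterm combined':
  load => {'type' => 'collectd', 'data' => '1 + [load/load-relative:shortterm] * 10 + [load/load-relative:midterm] * 10',},
}
```

With the collectdctl output shown above (shortterm=0.03, midterm=0.025), the two metric terms add up to 0.03 * 10 + 0.025 * 10 = 0.55. On its own that value lies between 0 and 1 and would be rounded to 0, taking the node out of the alias; the leading 1 brings the result to 1.55, which stays positive.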

Here follows a more complex example with a linear combination of collectd metrics based on the configuration of the messaging service:

#
# Load balancing algorithm
#
$scale = 100 / $::processorcount

lbclient::alias::load {  'Load average, shortterm':
  load => {type => 'collectd', data => "[load/load-relative]*${scale}",},
}

#Including a check to verify the cpu steal is below 10
lbclient::alias::check {  'CPU Steal':
  check => {type => 'collectd', data => "[cpu/percent-steal] < 10",},
}

#Reading the number of octets, and converting to KB 
lbclient::alias::load {  'Network interface read, converted to KB':
  load => {type => 'collectd', data => '[interface-eth0/if_octets:rx] / 1024.' ,},
}

lbclient::alias::load {  'Network interface write, converted to KB':
  load => {type => 'collectd', data => '[interface-eth0/if_octets:tx] / 1024.',},
}

# Note that in this one we have two different values on the same correlation
#First, we have to include the collectd metric for disk rate
include ::collectd::plugin::disk
# Now, check the disk rate. For reading, we will divide by 1000
lbclient::alias::load {  'Read disk operations':
  load => {type => 'collectd', data => '[disk-vda/disk_ops:1] / 1000.', },
}
# And for the write rate, use a different scale factor
lbclient::alias::load {  'Write disk operations':
  load => {type => 'collectd', data => '[disk-vda/disk_ops:2] / 100.', },
}

# We can also define a constant value
lbclient::alias::load {  'Constant value 1':
  load => {type => 'constant', data => '1',},
}

Collectd alarms based lbclient configuration

You may also use any collectd_alarms metric defined on your node for health monitoring by adding lbclient::alias::check entries of type collectd_alarms. You may add any number of them. If any one of the conditions is not fulfilled, the machine is removed from the load balancing alias. For instance:

# Check if the metric [swap-var_swap/percent-free] has the alarm state [okay]
lbclient::alias::check{ 'check swap_percent_free_is_okay':
  check => {'type' => 'collectd_alarms', 'data' => ' [{"swap-var_swap/percent-free": "okay"}]',}
}
# Check if the metric [cpu/percent-idle] has the alarm state [okay] or [unknown]
lbclient::alias::check{ 'check cpu_percent_idle_is_okay_or_unknown':
  check => {'type' => 'collectd_alarms', 'data' => ' [{"cpu/percent-idle": ["okay", "unknown"]}]',},
}
# Check if the metric [interface-eth0/if_dropped] has the alarm state [warning] or [unknown] and the metric [vmem/vmpage_number-writeback] has the alarm state [error] or [unknown]
lbclient::alias::check { 'check_eth0_if_dropped_vmpage_nrw':
  check => {'type' => 'collectd_alarms', 'data' => ' [{"interface-eth0/if_dropped": ["warning", "unknown"]},{"vmem/vmpage_number-writeback":["error", "unknown"]}]',},
}

Please note:

  • The alarm state is case insensitive.
  • The collectd alarm might be in the unknown state if the collectd daemon has been restarted or has not yet acquired enough data.
  • Multiple collectd alarm metric:states pairs can be supplied in a single line.
  • Please always use the following syntax when defining a collectd_alarms check:
[{"<metric_name_w/o_hostname_1>":["<desired_state_1>", "<desired_state_2>"]},{"<metric_name_w/o_hostname_2>":"<desired_state>"}]
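As a sketch of this syntax, the following check mixes a list of accepted states for one metric with a single state for another. Both metric names are taken from the examples above; the resource title is arbitrary, and the upper-case state is there only to illustrate that states are matched case insensitively.

```puppet
# Illustrative sketch: metric names come from the examples above, the title is arbitrary.
# The first entry accepts two states, the second exactly one (written in upper case).
lbclient::alias::check { 'check cpu idle and swap free alarms':
  check => {'type' => 'collectd_alarms', 'data' => '[{"cpu/percent-idle": ["okay", "unknown"]},{"swap-var_swap/percent-free": "OKAY"}]',},
}
```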

Daemon based lbclient configuration

You may also monitor the daemons listening on your node by adding lbclient::alias::check entries of type daemon. You may add any number of them. If any one of the conditions is not fulfilled, the machine is removed from the load balancing alias. For instance:

# This example works if a daemon is listening on the port 25, without taking into account the protocol or host
lbclient::alias::check{ 'Check that there is a daemon listening on the port 25':
  check => {'type' => 'daemon', 'data' => '{"port": 25}',},
}

# For more precision, one could specify the tcp protocol when looking for the port 25
lbclient::alias::check{ 'Check that there is a daemon listening, using tcp, on the port 25':
  check => {'type' => 'daemon', 'data' => '{"port": 25, "protocol": "tcp"}',},
}

# We also support multiple values to be given and evaluate them using an OR condition. In this example, the check will
#  succeed if there is a daemon listening on either port 25 or 26 whilst using TCP or UDP
lbclient::alias::check { 'Check that there is a daemon listening on either port 25 or 26 whilst using TCP or UDP':
  check => {'type' => 'daemon', 'data' => '{"port": [25, 26], "protocol": ["tcp", "udp"]}',},
}

# As a final field, we also support the distinction between IP version levels. In the following example, the check will
#  only succeed if there is any daemon listening on either IPv6 or IPv4 in the port 25 or 26
lbclient::alias::check { 'Check that there is a daemon listening on either port 25 or 26 whilst using IPv6 or IPv4':
  check => {'type' => 'daemon', 'data' => '{"port": [25, 26], "ip": ["ipv6", "ipv4"]}',},
}

# The following aliases can also be used to indicate IPv6 and IPv4 respectively
lbclient::alias::check { 'Check that there is a daemon listening on either port 25 or 26 whilst using IPv6 or IPv4, short aliases':
  check => {'type' => 'daemon', 'data' => '{"port": [25, 26], "ip": ["6", "4"]}',},
}

Please note:

  • All the filter fields support a single or multiple value and are case-insensitive. The check will only take into account the provided fields.
  • The supported filter fields are:
    • Mandatory:
      • port: (integer or string) the port or ports on which the daemon needs to be listening
    • Optional:
      • ip: (string) value that indicates which IP versions the daemon needs to be using (i.e. 'IPv4' or '4' and 'IPv6' or '6')
      • protocol: (string) value that indicates which protocol the daemon needs to be using (i.e. 'TCP' or 'UDP')
      • host: (string) the IP of the desired host that the daemon needs to be serving (e.g. '::' and '127.0.0.1')
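The host field and the string form of port do not appear in the examples above. A minimal sketch, assuming a daemon bound to the loopback address 127.0.0.1 on port 25 (the resource title and the choice of address are illustrative only):

```puppet
# Illustrative sketch: title and address are assumptions made for the example.
# "port" is given as a string, and "host" restricts the check to one listening address.
lbclient::alias::check { 'Check that there is a daemon listening on 127.0.0.1, port 25, using TCP':
  check => {'type' => 'daemon', 'data' => '{"port": "25", "host": "127.0.0.1", "protocol": "tcp"}',},
}
```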

Here follows a more complex example that uses all the supported filter fields:

# Check if there is any daemon that is listening on any IPv6 interface, whilst using TCP or UDP on either port 1122 or 2211

lbclient::alias::check {  'Complex daemon check':
  check => {type => 'daemon', data => '{"port": [1122, 2211], "ip": "ipv6", "host": "::", "protocol": ["tcp", "udp"]}',},
}



Lbclient codes

Negative values of the load metric returned from the lbclient take the node out of the alias. This is because a negative value indicates that at least one of the health monitoring check conditions configured for the lbclient on the node is not true.

The specific meanings of the codes are the following:

  • -1: no_login (/etc/nologin or /etc/iss.nologin present)
  • -6: /tmp dir is full (either blocks or inodes)
  • -7: missing sshd
  • -8: no web server services around
  • -9: no ftp server listening
  • -10: AFS problems
  • -11: no GridFTP daemon listening
  • -12: condition from lemon metric (obsolete)
  • -13: roger appstate != 'production' (when roger is enabled in lbclient::config)
  • -14: health monitoring command giving return code != 0 (when a command is used in lbclient::config)
  • -15: condition from collectd metric
  • -16: provided constant could not be loaded
  • -17: the desired port, protocol and daemon combination was not found to be listening
  • -18: EOS problems
  • -19: The machine is scheduled for a reboot
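To relate the codes to the configuration, the sketch below declares one entry of each type described earlier and notes in comments the code that the list above associates with it. The resource titles are arbitrary, and the comments only restate the list; they are not taken from the lbclient source.

```puppet
# Illustrative sketch: titles are arbitrary, codes are quoted from the list above.

# Condition from a collectd metric: -15 when the condition is not true
lbclient::alias::check { 'Root filesystem at least 15% free':
  check => {'type' => 'collectd', 'data' => '[df-root/percent_bytes-free] > 15',},
}

# Daemon check: -17 when the desired port, protocol and daemon combination is not found
lbclient::alias::check { 'Daemon listening on port 25':
  check => {'type' => 'daemon', 'data' => '{"port": 25}',},
}

# Constant load: -16 when the provided constant could not be loaded
lbclient::alias::load { 'Constant offset of 1':
  load => {'type' => 'constant', 'data' => '1',},
}
```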

Debugging the lbclient

You may run the lbclient anytime on any LB alias member by invoking /usr/local/sbin/lbclient.

This will give an integer number:

  • When the number is positive, it is the value of the load metric.
  • If it is one of the negative numbers specified in the section Lbclient codes, this indicates that at least one of the health monitoring checks configured for the alias yields a false result (for instance, because a monitoring alarm that takes the node out of the alias has been triggered).

For instance, a collectd metric configured to take the node out of the alias when triggered (condition != true) will produce load metric value -15.

You may also run /usr/local/sbin/lbclient with the -d flag and a debug output level, one of [TRACE, WARN, DEBUG, INFO, ERROR, FATAL] (default: FATAL), which gives additional information on how the current value has been produced. For instance, for the most verbose output use /usr/local/sbin/lbclient -d TRACE.