Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

consul-alerts stops working after some time #276

Open
nh2 opened this issue Mar 30, 2020 · 0 comments
Open

consul-alerts stops working after some time #276

nh2 opened this issue Mar 30, 2020 · 0 comments

Comments

@nh2
Copy link
Contributor

nh2 commented Mar 30, 2020

Background: I've been running consul-alerts for years in a cluster of 3 consul servers.

The last restart I've done was ~4 months ago.

Today external website monitoring alerted me that one of my 3 machines was down (the reason being that the underlying VPN excluded it), including 1 of the 3 consul servers. I was surprised that external monitoring told me that, and consul-alerts did not, as that is its main job.

I checked the configured consul-alerts/config/notifiers/log/path on all 3 machines, and the last outputs as recorded in those log files were from 7 days ago, 6 days ago, and 3 days ago respectively.

System logs on all 3 machines:

[root@node-1:~]# systemctl status consul-alerts | cat
● consul-alerts.service
   Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-11-03 02:23:58 UTC; 4 months 26 days ago
 Main PID: 1703 (consul-alerts)
    Tasks: 83 (limit: 4915)
   CGroup: /system.slice/consul-alerts.service
           ├─ 1703 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
           ├─ 1769 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn
           ├─ 1770 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
           └─17052 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn

Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.
[root@node-2:~]# systemctl status consul-alerts | cat
● consul-alerts.service
   Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
   Active: active (running) since Fri 2020-01-17 17:50:03 UTC; 2 months 12 days ago
 Main PID: 21793 (consul-alerts)
    Tasks: 76 (limit: 4915)
   CGroup: /system.slice/consul-alerts.service
           ├─21793 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
           ├─21806 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
           └─21807 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn

Mar 30 16:14:05 node-2 consul-alerts[21793]: 2020/03/30 16:14:05 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:17:12 node-2 consul-alerts[21793]: 2020/03/30 16:17:12 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:20:20 node-2 consul-alerts[21793]: 2020/03/30 16:20:20 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:23:27 node-2 consul-alerts[21793]: 2020/03/30 16:23:27 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:26:34 node-2 consul-alerts[21793]: 2020/03/30 16:26:34 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:29:41 node-2 consul-alerts[21793]: 2020/03/30 16:29:41 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:32:48 node-2 consul-alerts[21793]: 2020/03/30 16:32:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:35:56 node-2 consul-alerts[21793]: 2020/03/30 16:35:56 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:39:03 node-2 consul-alerts[21793]: 2020/03/30 16:39:03 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
Mar 30 16:42:10 node-2 consul-alerts[21793]: 2020/03/30 16:42:10 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 3m0s
[root@node-3:~]# systemctl status consul-alerts | cat
● consul-alerts.service
   Loaded: loaded (/nix/store/xj9xpqylbnifwzxaisllqnq4yrnzw0xg-unit-consul-alerts.service/consul-alerts.service; enabled; vendor preset: enabled)
   Active: active (running) since Sun 2019-11-03 02:24:38 UTC; 4 months 26 days ago
 Main PID: 1592 (consul-alerts)
    Tasks: 76 (limit: 4915)
   CGroup: /system.slice/consul-alerts.service
           ├─1592 /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts start --alert-addr=localhost:9000 --consul-addr=localhost:8500 --watch-checks --watch-events
           ├─1624 consul watch -http-addr localhost:8500 -datacenter dc1 -type event /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch event --alert-addr localhost:9000 --log-level warn
           └─1625 consul watch -http-addr localhost:8500 -datacenter dc1 -type checks /nix/store/x2d3vj5wybgm7h0db40hbfnwmlskmsak-consul-alerts-0.5.0-bin/bin/consul-alerts watch checks --alert-addr localhost:9000 --log-level warn

Mar 19 13:27:24 node-3 consul-alerts[1592]: 2020/03/19 13:27:24 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Mar 19 13:27:36 node-3 consul-alerts[1592]: 2020/03/19 13:27:36 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (No cluster leader), retry in 20s
Mar 30 15:07:48 node-3 consul-alerts[1592]: 2020/03/30 15:07:48 [ERR] consul.watch: Watch (type: checks) errored: Unexpected response code: 500 (rpc error making call: EOF), retry in 5s
Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable.

Note the Warning: Journal has been rotated since unit was started. Log output is incomplete or unavailable. is not important here -- that is what systemd prints for log rotation, but the newer (perhaps relevant) logs are still present here).

In the output we can see that node-1 logged no errors, node-2 was in a retry loop (makes sense given that it is the one that was kicked out of the VPN, so it cannot talk to the Consul server quorum), and node 3 apparently had a short error today, but retried only once because its retry did not grow beyond 5s.

If I log into the consul dashboard (localhost:8500 on either server), I can see that lots of services are marked red.

Yet, consul-alerts today sent me no notifications at all -- neither to Slack (as I have it configured), nor into the configured log file. It usually does notify my.

So the question is: Why did it fail as a watchdog today?

I would have expected it to notify me of the errors shown in the Consul UI shortly after they ocurred.

A systemctl restart consul-alerts on node-1 immediately resulted in a Slack notification, and correct entries in the log file (but only in the log file of node-3!).

So it seems that something got stuck, and only manual restarting revived it (which is of course not good for an automated high-availability watchdog).

I cannot currently tell whether the fault is in consul-alerts, or in consul itself (consul-alerts starts consul watch, so in case that one hung and didn't output anything, it would be consul's fault).

Maybe anybody knows anything?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant