Remove infinite loop from Local Watchdog (LWD) #86

cbouchar · 2020-05-26T13:51:36Z

The fix for bugzilla 1813935 now interrupts the infinite loop issue since the directory
did not exist, the 20_sysinfo plugin got stuck in an infinite loop. In my opinion, the
plugin should timeout after a period of time.

My reasons for this:

There could be another malfunctioning process on the device that won't quit
sending output to the log file and this file monitoring for more messages will
never quit until External Watchdog.
I'm concerned about the size of the journalctl/message file if we let it go for too
long a period of time.
When plugins such as this fails, it's difficult to determine where in the job the
failure has occurred. With LWD plugin, it is especially difficult since restraint
doesn't have another timer fired off to interrupt this so it's not until 1/2 hour
later when EWD kicks in to kill the job.

Solution:
I propose we quit the loop after about 5 minutes. If the timer expires, we will generate
a message reflecting this and report the journalctl/message log with whatever has been
captured up to that point.

danrodrig · 2020-06-01T09:21:25Z

Quoting myself from #76 (comment)

Now, regarding the timeout, we should look at what's the purpose of WatchFile. It's used only in SysRq, and the intent here is to have the output triggered by echo $ACTION > /proc/sysrq-trigger delimited between the log entries "List of $ACTION Tasks" Start and "List of $ACTION Tasks" Stop. So WatchFile is there to wait until output produced by echo $ACTION > /proc/sysrq-trigger is in the log file.
Notice that per how WatchFile is implemented, it expects and idle system, logging wise. The timeout would be needed if the log file keeps growing between the 2 seconds interval. Therefore a proper value for the timeout should consider how long does echo $ACTION > /proc/sysrq-trigger take to complete.

Regarding point 1,

There could be another malfunctioning process on the device that won't quit sending output to the log file and this file monitoring for more messages will never quit until External Watchdog.

Has this happened before?

The purpose of WatchFile is to make sure that the output triggered by /proc/sysrq-trigger is in the log file. Per docs in https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html,

kernel will respond to regardless of whatever else it is doing, unless it is completely locked up.

So, ideally, SysRq could ensure that the output was dumped in a different way. If the output of /proc/sysrq-trigger can be serialized with other messages, a sentinel message could be used for this.
If the only option is to check for log file changing, and the timeout is really needed, a reasonable value should be used, because maybe 5 minutes is way to much, and we want this plugins to be as non intrusive as possible.

In either case, we need better understanding of /proc/sysrq-trigger to decide on a proper solution.

cbouchar · 2020-06-01T13:00:09Z

Quoting myself from #76 (comment)

Now, regarding the timeout, we should look at what's the purpose of WatchFile. It's used only in SysRq, and the intent here is to have the output triggered by echo $ACTION > /proc/sysrq-trigger delimited between the log entries "List of $ACTION Tasks" Start and "List of $ACTION Tasks" Stop. So WatchFile is there to wait until output produced by echo $ACTION > /proc/sysrq-trigger is in the log file.
Notice that per how WatchFile is implemented, it expects and idle system, logging wise. The timeout would be needed if the log file keeps growing between the 2 seconds interval. Therefore a proper value for the timeout should consider how long does echo $ACTION > /proc/sysrq-trigger take to complete.

Regarding point 1,

There could be another malfunctioning process on the device that won't quit sending output to the log file and this file monitoring for more messages will never quit until External Watchdog.

Has this happened before?
I have seen this kind of behavior in other projects where bad situations cause the log file to fill
with repetitive information making it difficult or even possible to see what I needed to see.

The purpose of WatchFile is to make sure that the output triggered by /proc/sysrq-trigger is in the log file. Per docs in https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html,

kernel will respond to regardless of whatever else it is doing, unless it is completely locked up.

I am aware of what 'WatchFile' is doing and it is waiting for completion of sysrq-trigger requests
and it is run 3 times in 20_sysinfo for various output.

So, ideally, SysRq could ensure that the output was dumped in a different way. If the output of /proc/sysrq-trigger can be serialized with other messages, a sentinel message could be used for this.
If the only option is to check for log file changing, and the timeout is really needed, a reasonable value should be used, because maybe 5 minutes is way to much, and we want this plugins to be as non intrusive as possible.

I had proposed 5 minutes as a starting point waiting for opinions. I felt 5 minutes is an improvement
over infinity(EWD). My objective is to have the plugin log the expiry failure so we can leave some
breadcrumbs behind instead of guessing what transpired. I ran my job which is very simple
and I saw (5, 5, 4) seconds between start and stop. WIth this is mind, why don't you propose
a timespan you would be comfortable with keeping in mind my job was very simple? Perhaps
we can make the timer value configurable to override the default though I don't think Martin
would like to do that.

In either case, we need better understanding of /proc/sysrq-trigger to decide on a proper solution.

cbouchar self-assigned this May 26, 2020

cbouchar changed the title ~~Remove infinite loop from lwd~~ Remove infinite loop from Local Watchdog (LWD) May 26, 2020

cbouchar linked a pull request Jun 25, 2020 that will close this issue

Fix: Remove infinite loop in LWD plugin #98

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove infinite loop from Local Watchdog (LWD) #86

Remove infinite loop from Local Watchdog (LWD) #86

cbouchar commented May 26, 2020 •

edited

Loading

danrodrig commented Jun 1, 2020

cbouchar commented Jun 1, 2020

Remove infinite loop from Local Watchdog (LWD) #86

Remove infinite loop from Local Watchdog (LWD) #86

Comments

cbouchar commented May 26, 2020 • edited Loading

danrodrig commented Jun 1, 2020

cbouchar commented Jun 1, 2020

cbouchar commented May 26, 2020 •

edited

Loading