Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove infinite loop from Local Watchdog (LWD) #86

Open
cbouchar opened this issue May 26, 2020 · 2 comments · May be fixed by #98
Open

Remove infinite loop from Local Watchdog (LWD) #86

cbouchar opened this issue May 26, 2020 · 2 comments · May be fixed by #98
Assignees

Comments

@cbouchar
Copy link
Contributor

cbouchar commented May 26, 2020

The fix for bugzilla 1813935 now interrupts the infinite loop issue since the directory
did not exist, the 20_sysinfo plugin got stuck in an infinite loop. In my opinion, the
plugin should timeout after a period of time.

My reasons for this:

  1. There could be another malfunctioning process on the device that won't quit
    sending output to the log file and this file monitoring for more messages will
    never quit until External Watchdog.
  2. I'm concerned about the size of the journalctl/message file if we let it go for too
    long a period of time.
  3. When plugins such as this fails, it's difficult to determine where in the job the
    failure has occurred. With LWD plugin, it is especially difficult since restraint
    doesn't have another timer fired off to interrupt this so it's not until 1/2 hour
    later when EWD kicks in to kill the job.

Solution:
I propose we quit the loop after about 5 minutes. If the timer expires, we will generate
a message reflecting this and report the journalctl/message log with whatever has been
captured up to that point.

@cbouchar cbouchar self-assigned this May 26, 2020
@cbouchar cbouchar changed the title Remove infinite loop from lwd Remove infinite loop from Local Watchdog (LWD) May 26, 2020
@danrodrig
Copy link
Contributor

Quoting myself from #76 (comment)

Now, regarding the timeout, we should look at what's the purpose of WatchFile. It's used only in SysRq, and the intent here is to have the output triggered by echo $ACTION > /proc/sysrq-trigger delimited between the log entries "List of $ACTION Tasks" Start and "List of $ACTION Tasks" Stop. So WatchFile is there to wait until output produced by echo $ACTION > /proc/sysrq-trigger is in the log file.
Notice that per how WatchFile is implemented, it expects and idle system, logging wise. The timeout would be needed if the log file keeps growing between the 2 seconds interval. Therefore a proper value for the timeout should consider how long does echo $ACTION > /proc/sysrq-trigger take to complete.

Regarding point 1,

There could be another malfunctioning process on the device that won't quit sending output to the log file and this file monitoring for more messages will never quit until External Watchdog.

Has this happened before?

The purpose of WatchFile is to make sure that the output triggered by /proc/sysrq-trigger is in the log file. Per docs in https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html,

kernel will respond to regardless of whatever else it is doing, unless it is completely locked up.

So, ideally, SysRq could ensure that the output was dumped in a different way. If the output of /proc/sysrq-trigger can be serialized with other messages, a sentinel message could be used for this.
If the only option is to check for log file changing, and the timeout is really needed, a reasonable value should be used, because maybe 5 minutes is way to much, and we want this plugins to be as non intrusive as possible.

In either case, we need better understanding of /proc/sysrq-trigger to decide on a proper solution.

@cbouchar
Copy link
Contributor Author

cbouchar commented Jun 1, 2020

Quoting myself from #76 (comment)

Now, regarding the timeout, we should look at what's the purpose of WatchFile. It's used only in SysRq, and the intent here is to have the output triggered by echo $ACTION > /proc/sysrq-trigger delimited between the log entries "List of $ACTION Tasks" Start and "List of $ACTION Tasks" Stop. So WatchFile is there to wait until output produced by echo $ACTION > /proc/sysrq-trigger is in the log file.
Notice that per how WatchFile is implemented, it expects and idle system, logging wise. The timeout would be needed if the log file keeps growing between the 2 seconds interval. Therefore a proper value for the timeout should consider how long does echo $ACTION > /proc/sysrq-trigger take to complete.

Regarding point 1,

There could be another malfunctioning process on the device that won't quit sending output to the log file and this file monitoring for more messages will never quit until External Watchdog.

Has this happened before?
I have seen this kind of behavior in other projects where bad situations cause the log file to fill
with repetitive information making it difficult or even possible to see what I needed to see.

The purpose of WatchFile is to make sure that the output triggered by /proc/sysrq-trigger is in the log file. Per docs in https://www.kernel.org/doc/html/latest/admin-guide/sysrq.html,

kernel will respond to regardless of whatever else it is doing, unless it is completely locked up.

I am aware of what 'WatchFile' is doing and it is waiting for completion of sysrq-trigger requests
and it is run 3 times in 20_sysinfo for various output.

So, ideally, SysRq could ensure that the output was dumped in a different way. If the output of /proc/sysrq-trigger can be serialized with other messages, a sentinel message could be used for this.
If the only option is to check for log file changing, and the timeout is really needed, a reasonable value should be used, because maybe 5 minutes is way to much, and we want this plugins to be as non intrusive as possible.

I had proposed 5 minutes as a starting point waiting for opinions. I felt 5 minutes is an improvement
over infinity(EWD). My objective is to have the plugin log the expiry failure so we can leave some
breadcrumbs behind instead of guessing what transpired. I ran my job which is very simple
and I saw (5, 5, 4) seconds between start and stop. WIth this is mind, why don't you propose
a timespan you would be comfortable with keeping in mind my job was very simple? Perhaps
we can make the timer value configurable to override the default though I don't think Martin
would like to do that.

In either case, we need better understanding of /proc/sysrq-trigger to decide on a proper solution.

@cbouchar cbouchar linked a pull request Jun 25, 2020 that will close this issue
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants