[Bug]: Automatically restart background worker scheduler #5091

mkindahl · 2022-12-14T11:13:39Z

What type of bug is this?

Other

What subsystems and features are affected?

Background worker

What happened?

If the scheduler exits for any reason, no more workers will be scheduled, since the scheduler does not restart automatically.

This is because the scheduler is started with `BGW_NEVER_RESTART` as its restart time.
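
To make the effect of the restart time concrete, here is a minimal Python model of the restart decision. It is a sketch, not PostgreSQL code: the function name `next_start_delay` is hypothetical, and the rule is a simplification of the documented behavior (a worker registered with `BGW_NEVER_RESTART`, or one that exits with status 0, is not started again).

```python
# Simplified model of how a restart time affects a background worker.
# Assumption: BGW_NEVER_RESTART mirrors the sentinel value (-1) used by PostgreSQL.
BGW_NEVER_RESTART = -1


def next_start_delay(restart_time, exit_code):
    """Return the seconds to wait before restarting, or None if never restarted.

    Hypothetical helper: a worker is restarted only if it has a real
    restart time and exited with a non-zero status.
    """
    if restart_time == BGW_NEVER_RESTART or exit_code == 0:
        return None
    return restart_time


# The scheduler as described in this issue: an error exit is final.
current_behavior = next_start_delay(BGW_NEVER_RESTART, exit_code=1)

# With a configurable restart time, the same error exit leads to a restart.
wanted_behavior = next_start_delay(30, exit_code=1)
```

With `BGW_NEVER_RESTART` the model returns `None` regardless of the exit status, which matches the symptom reported here: once the scheduler is gone, nothing brings it back.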

TimescaleDB version affected

2.9.0

PostgreSQL version used

14.3

What operating system did you use?

Ubuntu 22.04 x86_64

What installation method did you use?

Source

What platform did you run on?

Not applicable

Relevant log output and stack trace

No response

How can we reproduce the bug?

Wait for the scheduler to exit with an error. This can be triggered by creating duplicate job entries in the `bgw_job` table, for example.


If the scheduler receives an error, it will never restart, since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which prevents all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, with a default of 30 seconds. It also adds some variables that make it possible to shut down the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091

mkindahl · 2023-12-07T08:31:33Z

This turned out to cause an issue over time, since the slots are not released and new schedulers are started by the launcher.

wrightleft · 2024-12-04T15:57:18Z

Any updates on this issue?

We are using bebd1ab in order to work around this issue, but that commit was reverted by d17b582 and no official fix seems to have come forth.

mkindahl · 2024-12-10T14:01:53Z

Having a look at this now, so a quick summary of the situation.

There is a cluster-wide launcher that is responsible for starting database-specific schedulers, which then start jobs as needed.

When the schedulers exit for a later restart, they keep the same slot and will reuse it when restarting, so if the schedulers die, all is fine and they will eventually restart.

However, the launcher currently assumes that there are no schedulers running when it starts, so if the launcher is killed, it will set the state of all the schedulers to "allocated" instead of "started", meaning it will try to start new schedulers rather than pick up the old schedulers.

When starting the new schedulers, new slots will be allocated and the old schedulers will keep the old slots. As a result, we will have multiple schedulers for each database and will eventually exhaust the slots array allocated for background workers.
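
The slot leak described above can be illustrated with a small Python simulation. This is only a model under stated assumptions: the slots array is a plain list with a hypothetical size of 4, and `allocate_slot`, `launcher_start`, and `schedulers_per_database` are illustrative names, not functions from the TimescaleDB source.

```python
class SlotsExhausted(RuntimeError):
    """Raised when every background worker slot is taken."""


def allocate_slot(slots, db):
    """Claim the first free slot for a scheduler of `db` and return its index."""
    for index, owner in enumerate(slots):
        if owner is None:
            slots[index] = db
            return index
    raise SlotsExhausted(f"no free slot for scheduler of {db}")


def launcher_start(slots, databases):
    """Model of the launcher as described: it assumes no scheduler is running,
    so it allocates a fresh slot per database and never reclaims old ones."""
    return {db: allocate_slot(slots, db) for db in databases}


def schedulers_per_database(slots):
    """Count how many slots (that is, schedulers) each database holds."""
    counts = {}
    for owner in slots:
        if owner is not None:
            counts[owner] = counts.get(owner, 0) + 1
    return counts


slots = [None] * 4                       # hypothetical, tiny slots array
launcher_start(slots, ["db1", "db2"])    # initial launcher start
launcher_start(slots, ["db1", "db2"])    # launcher was killed and restarted
duplicates = schedulers_per_database(slots)
```

After the second `launcher_start`, each database holds two slots and the array is full, so one more launcher restart raises `SlotsExhausted` in this model.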

It is straightforward to reproduce with these steps:

1. Apply commit bebd1ab.
2. Set the "exit code" variable to 1 (this makes the scheduler exit with status 1, triggering a restart).
3. Set the restart time to something lower than 30 seconds if you're impatient.
4. Terminate the scheduler(s) using `pg_terminate_backend`.
5. Let them restart.
6. Terminate the launcher using `pg_terminate_backend`.
7. Watch as you get two schedulers (per database, if you had several running in step 4).

In other words, it works fine as long as the launcher is not killed, but if that happens, you will get into a bad situation.

I am looking into ways to either pick up the old schedulers when starting the launcher or make sure that the schedulers are terminated when the launcher is terminated. Either of these approaches should solve the situation.
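
As an illustration of the first approach (picking up old schedulers), the following Python sketch shows a launcher start that adopts an existing slot before allocating a new one. It is a model only: the slots array is a plain list, `launcher_start_adopting` is a hypothetical name, and it says nothing about how an actual fix is implemented.

```python
def launcher_start_adopting(slots, databases):
    """Hypothetical launcher start that reuses a database's existing slot.

    `slots` is a list where each entry is a database name or None (free).
    Returns a mapping from database name to the slot index it ends up with.
    """
    assigned = {}
    for db in databases:
        if db in slots:
            # A scheduler for this database already holds a slot: adopt it.
            assigned[db] = slots.index(db)
        else:
            # No existing scheduler: claim the first free slot.
            # list.index raises ValueError if no slot is free.
            free = slots.index(None)
            slots[free] = db
            assigned[db] = free
    return assigned


slots = [None] * 4                                         # hypothetical size
first = launcher_start_adopting(slots, ["db1", "db2"])     # initial start
second = launcher_start_adopting(slots, ["db1", "db2"])    # after a launcher restart
```

In this model the second start returns the same slot assignment as the first, so repeated launcher restarts neither duplicate schedulers nor consume additional slots.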

If the scheduler receives an error, it will never restart, since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which prevents all jobs from executing. This commit adds the GUC `timescaledb.bgw_scheduler_restart_time`, which can be set to the restart time for the scheduler. It defaults to 60 seconds, which is the default restart interval that PostgreSQL defines for background workers. It also adds `timescaledb.debug_bgw_scheduler_exit_status`, which makes it possible to shut down the scheduler with a non-zero exit status so that the restart functionality can be tested. Fixes timescale#5091

mkindahl added the bug and bgw labels Dec 14, 2022

mkindahl mentioned this issue Oct 13, 2023

Restart scheduler on error #6195

Merged

mkindahl self-assigned this Oct 13, 2023

mkindahl closed this as completed in #6195 Oct 19, 2023

mkindahl reopened this Dec 7, 2023

mkindahl linked a pull request Dec 11, 2024 that will close this issue

Restart scheduler on error #7527

Open

mkindahl linked a pull request Dec 18, 2024 that will close this issue

Restart scheduler on error #7527

Open
