-
Notifications
You must be signed in to change notification settings - Fork 894
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Bug]: Automatically restart background worker scheduler #5091
Comments
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes #5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit fixes the issue by adding a GUC that can be set to the restart time for the scheduler, and set the default to 30 seconds. It also adds some additional variables to be able to shutdown the scheduler with a non-zero exit code, which allows the restart functionality to be tested, as well as tests. Fixes timescale#5091
There turned out to be issue with this over time since the slots are not released and new schedulers are started by the launcher. |
Having a look at this now, so a quick summary of the situation. There is a cluster-wide launcher that is responsible for starting database-specific schedulers that then starts jobs as needed. When the schedulers exit for later restart, they keep the same slot and will re-used it when restarting, so if the schedulers die, all is fine and they will eventually restart. However, the launcher currently assumes that there are no schedulers running when it starts, so if the launcher is killed, it will set the state of all the schedulers to "allocated" instead of "started", meaning it will try to start new schedulers rather than pick up the old schedulers. When starting the new schedulers, new slots will be allocated and the old schedulers will keep the old slots. As a result, we will have multiple scheduler for each database and will also eventually exhaust the slots array allocated for background workers. It is straightforward to reproduce with these steps:
It other words, it works fine as long as the launcher is not killed, but if that happens, you will get into a bad situation. Looking into ways to either pick up the old schedulers when starting the launcher, or making sure that the schedulers are terminated when the launcher is terminated. Either of these approaches should solve the situation. |
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit adds the GUC `timescaledb.bgw_scheduler_restart_time` that can be set to the restart time for the scheduler. It defaults to 60 seconds, which is the default restart interval for background workers PostgreSQL defines. It also adds `timescaledb.debug_bgw_scheduler_exit_status` to be able to shutdown the scheduler with a non-zero exit status, which allows the restart functionality to be tested. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit adds the GUC `timescaledb.bgw_scheduler_restart_time` that can be set to the restart time for the scheduler. It defaults to 60 seconds, which is the default restart interval for background workers PostgreSQL defines. It also adds `timescaledb.debug_bgw_scheduler_exit_status` to be able to shutdown the scheduler with a non-zero exit status, which allows the restart functionality to be tested. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit adds the GUC `timescaledb.bgw_scheduler_restart_time` that can be set to the restart time for the scheduler. It defaults to 60 seconds, which is the default restart interval for background workers PostgreSQL defines. It also adds `timescaledb.debug_bgw_scheduler_exit_status` to be able to shutdown the scheduler with a non-zero exit status, which allows the restart functionality to be tested. Fixes timescale#5091
If the scheduler receives an error, it will never restart again since `bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent all jobs from executing. This commit adds the GUC `timescaledb.bgw_scheduler_restart_time` that can be set to the restart time for the scheduler. It defaults to 60 seconds, which is the default restart interval for background workers PostgreSQL defines. It also adds `timescaledb.debug_bgw_scheduler_exit_status` to be able to shutdown the scheduler with a non-zero exit status, which allows the restart functionality to be tested. Fixes timescale#5091
What type of bug is this?
Other
What subsystems and features are affected?
Background worker
What happened?
If the scheduler exits for any reason, no more workers will be scheduled since the scheduler do not restart automatically.
This is because the scheduler is started with
BGW_NEVER_RESTART
as restart time.TimescaleDB version affected
2.9.0
PostgreSQL version used
14.3
What operating system did you use?
Ubuntu 22.04 x86_64
What installation method did you use?
Source
What platform did you run on?
Not applicable
Relevant log output and stack trace
No response
How can we reproduce the bug?
Wait for the scheduler to exit with an error. This can be triggered creating duplicate job entries in the bgw_job table, for example.
The text was updated successfully, but these errors were encountered: