[Bug]: Automatically restart background worker scheduler #5091

Open
mkindahl opened this issue Dec 14, 2022 · 3 comments · Fixed by #6195 · May be fixed by #7527
Labels: bug, bgw (The background worker subsystem, including the scheduler)

Comments


mkindahl commented Dec 14, 2022

What type of bug is this?

Other

What subsystems and features are affected?

Background worker

What happened?

If the scheduler exits for any reason, no more workers will be scheduled, since the scheduler does not restart automatically.

This is because the scheduler is started with BGW_NEVER_RESTART as its restart time.
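
To make the restart rule concrete, here is a minimal sketch in Python of the decision PostgreSQL documents for `bgw_restart_time`. It is a simplified model, not PostgreSQL or TimescaleDB code: the function name `restart_delay` is hypothetical, and only the constant `BGW_NEVER_RESTART` (defined as -1 in PostgreSQL's `bgworker.h`) is taken from PostgreSQL.

```python
from typing import Optional

# Same value as the constant in PostgreSQL's bgworker.h.
BGW_NEVER_RESTART = -1


def restart_delay(bgw_restart_time: int, exit_code: int) -> Optional[int]:
    """Return the seconds to wait before a restart, or None for no restart.

    Hypothetical helper modelling the documented behaviour: a worker
    registered with BGW_NEVER_RESTART, or one that exits with code 0,
    is unregistered instead of being restarted.
    """
    if bgw_restart_time == BGW_NEVER_RESTART:
        return None
    if exit_code == 0:
        return None
    return bgw_restart_time


# With BGW_NEVER_RESTART, even a failing scheduler is never restarted.
print(restart_delay(BGW_NEVER_RESTART, exit_code=1))  # None
# With a positive restart time, a failing scheduler comes back after a delay.
print(restart_delay(30, exit_code=1))  # 30
```

Under this model, any positive restart time is enough to bring a failed scheduler back, which is what the issue asks for.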

TimescaleDB version affected

2.9.0

PostgreSQL version used

14.3

What operating system did you use?

Ubuntu 22.04 x86_64

What installation method did you use?

Source

What platform did you run on?

Not applicable

Relevant log output and stack trace

No response

How can we reproduce the bug?

Wait for the scheduler to exit with an error. This can be triggered by creating duplicate job entries in the bgw_job table, for example.

mkindahl added the bug and bgw labels Dec 14, 2022
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Oct 13, 2023
If the scheduler receives an error, it will never restart again since
`bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent
all jobs from executing.

This commit fixes the issue by adding a GUC that can be set to the
restart time for the scheduler, and set the default to 30 seconds. It
also adds some additional variables to be able to shutdown the
scheduler with a non-zero exit code, which allows the restart
functionality to be tested.

Fixes timescale#5091
@mkindahl mkindahl self-assigned this Oct 13, 2023
mkindahl added a commit to mkindahl/timescaledb that referenced this issue Oct 13, 2023
If the scheduler receives an error, it will never restart again since
`bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent
all jobs from executing.

This commit fixes the issue by adding a GUC that can be set to the
restart time for the scheduler, and set the default to 30 seconds. It
also adds some additional variables to be able to shutdown the
scheduler with a non-zero exit code, which allows the restart
functionality to be tested, as well as tests.

Fixes timescale#5091
mkindahl added a commit that referenced this issue Oct 19, 2023
If the scheduler receives an error, it will never restart again since
`bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent
all jobs from executing.

This commit fixes the issue by adding a GUC that can be set to the
restart time for the scheduler, and set the default to 30 seconds. It
also adds some additional variables to be able to shutdown the
scheduler with a non-zero exit code, which allows the restart
functionality to be tested, as well as tests.

Fixes #5091
jnidzwetzki pushed a commit to jnidzwetzki/timescaledb that referenced this issue Nov 9, 2023

mkindahl commented Dec 7, 2023

There turned out to be an issue with this over time, since the slots are not released and new schedulers are started by the launcher.

@mkindahl mkindahl reopened this Dec 7, 2023
@wrightleft

Any updates on this issue?

We are using commit bebd1ab to work around this issue, but that commit was reverted by d17b582, and no official fix seems to have followed.


mkindahl commented Dec 10, 2024

I am having a look at this now, so here is a quick summary of the situation.

There is a cluster-wide launcher that is responsible for starting database-specific schedulers, which in turn start jobs as needed.

When the schedulers exit for a later restart, they keep the same slot and will reuse it when restarting, so if the schedulers die, all is fine and they will eventually restart.

However, the launcher currently assumes that no schedulers are running when it starts, so if the launcher is killed, it will set the state of all the schedulers to "allocated" instead of "started", meaning it will try to start new schedulers rather than pick up the old ones.

When the new schedulers are started, new slots will be allocated and the old schedulers will keep the old slots. As a result, we will have multiple schedulers for each database and will eventually exhaust the slots array allocated for background workers.
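
The failure mode described above can be illustrated with a small Python simulation. This is only a toy model under stated assumptions, not the TimescaleDB implementation: the `Cluster` class, its methods, and the fixed-size `slots` list are all hypothetical stand-ins for the launcher, the schedulers, and the background worker slots array.

```python
class SlotsExhausted(RuntimeError):
    """Raised when the shared slot array has no free entry left."""


class Cluster:
    """Toy model: one launcher, one scheduler per database, shared slots."""

    def __init__(self, databases, max_slots):
        self.databases = list(databases)
        # The slot array outlives the launcher, like shared memory would.
        self.slots = [None] * max_slots
        self.launcher_state = {}
        self.start_launcher()

    def _allocate_slot(self, db):
        for i, owner in enumerate(self.slots):
            if owner is None:
                self.slots[i] = db
                return i
        raise SlotsExhausted(f"no free slot for the scheduler of {db}")

    def start_launcher(self):
        # The launcher assumes that no schedulers are running, so every
        # database starts as "allocated" and gets a brand new scheduler.
        self.launcher_state = {db: "allocated" for db in self.databases}
        for db in self.databases:
            self._allocate_slot(db)
            self.launcher_state[db] = "started"

    def restart_scheduler(self, db):
        # A restarting scheduler keeps its slot, so nothing changes here.
        if db not in self.slots:
            raise ValueError(f"no scheduler registered for {db}")

    def schedulers_for(self, db):
        return self.slots.count(db)


cluster = Cluster(["db1", "db2"], max_slots=5)

cluster.restart_scheduler("db1")
after_scheduler_restart = cluster.schedulers_for("db1")  # still 1

cluster.start_launcher()  # the launcher was killed and started again
after_launcher_restart = cluster.schedulers_for("db1")  # now 2

try:
    cluster.start_launcher()  # a second launcher restart runs out of slots
    exhausted = False
except SlotsExhausted:
    exhausted = True

print(after_scheduler_restart, after_launcher_restart, exhausted)
```

In this model a scheduler restart never touches the slot array, while every launcher restart leaks one slot per database until allocation fails.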

It is straightforward to reproduce with these steps:

  1. Apply commit bebd1ab
  2. Set the "exit code" variable to 1 (this will exit the scheduler with 1, triggering a restart)
  3. Set the restart time to something lower than 30 seconds if you're impatient
  4. Terminate the scheduler(s) using pg_terminate_backend.
  5. Let them restart
  6. Terminate the Launcher using pg_terminate_backend.
  7. Observe that you now have two schedulers (per database, if you had several running in step 4).

In other words, it works fine as long as the launcher is not killed, but if that happens, you will get into a bad situation.

I am looking into ways to either pick up the old schedulers when starting the launcher, or to make sure that the schedulers are terminated when the launcher is terminated. Either of these approaches should solve the situation.
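
As a rough illustration of the two options, here is a Python sketch using the same toy view of the slots array as a plain list. Both function names are hypothetical, and this is not necessarily how the eventual fix works; it only shows why either approach keeps one scheduler per database.

```python
def start_launcher_adopting(slots, databases):
    """Option 1 (hypothetical): pick up schedulers that already hold a slot."""
    state = {}
    for db in databases:
        if db not in slots:
            # Only start a new scheduler when none exists for this database.
            free = slots.index(None)  # raises ValueError if no slot is free
            slots[free] = db
        state[db] = "started"
    return state


def stop_launcher_terminating(slots, databases):
    """Option 2 (hypothetical): terminate schedulers when the launcher exits."""
    for i, owner in enumerate(slots):
        if owner in databases:
            slots[i] = None  # the scheduler is gone and its slot is released


databases = ["db1", "db2"]

# Option 1: the launcher restarts while the old schedulers are still running.
slots = ["db1", "db2", None, None]
state = start_launcher_adopting(slots, databases)
adopted_counts = [slots.count(db) for db in databases]

# Option 2: the launcher takes its schedulers down with it, then starts fresh.
# With an empty slot array the adopting start behaves like a plain start.
slots2 = ["db1", "db2", None, None]
stop_launcher_terminating(slots2, databases)
freed = slots2.count(None)
start_launcher_adopting(slots2, databases)
fresh_counts = [slots2.count(db) for db in databases]

print(adopted_counts, freed, fresh_counts)
```

Either way, no slot is leaked across a launcher restart in this model.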

mkindahl added a commit to mkindahl/timescaledb that referenced this issue Dec 11, 2024
If the scheduler receives an error, it will never restart again since
`bgw_restart_time` is set to `BGW_NEVER_RESTART`, which will prevent
all jobs from executing.

This commit adds the GUC `timescaledb.bgw_scheduler_restart_time` that
can be set to the restart time for the scheduler. It defaults
to 60 seconds, which is the default restart interval for background
workers PostgreSQL defines.

It also adds `timescaledb.debug_bgw_scheduler_exit_status` to be able
to shutdown the scheduler with a non-zero exit status, which allows the
restart functionality to be tested.

Fixes timescale#5091
@mkindahl mkindahl linked a pull request Dec 11, 2024 that will close this issue
@mkindahl mkindahl linked a pull request Dec 18, 2024 that will close this issue