Skip to content

Commit

Permalink
Merge pull request #6677 from grondo/issue#6644
Browse files Browse the repository at this point in the history
fix prolog/epilog timeout handling
  • Loading branch information
mergify[bot] authored Mar 4, 2025
2 parents 7c6746e + 13f4f73 commit 621c7b8
Show file tree
Hide file tree
Showing 3 changed files with 36 additions and 52 deletions.
54 changes: 30 additions & 24 deletions doc/man5/flux-config-job-manager.rst
Original file line number Diff line number Diff line change
Expand Up @@ -110,28 +110,31 @@ prolog
supported for the ``[job-manager.prolog]`` table:

command
(optional) An array of strings specifying the command to run. If
(optional, string) An array of strings specifying the command to run. If
``exec.imp`` is set, the the default command is ``["flux-imp",
"run", "prolog"]``, otherwise it is an error if command is not set.
per-rank
(optional) By default the job-manager prolog only runs ``command``
(optional, bool) By default the job-manager prolog only runs ``command``
on rank 0. With ``per-rank=true``, the command will be run on each
rank assigned to the job.
timeout
(optional) A string value in Flux Standard Duration specifying a
(optional, string) A string value in Flux Standard Duration specifying a
timeout for the prolog, after which it is terminated (and a job
exception raised). The default prolog timeout is 30m. To disable
the timeout use ``0`` or ``infinity``.
kill-timeout
(optional) If a job exception is raised during the job prolog and
``cancel-on-exception`` is true, the prolog will be canceled by
sending it a SIGTERM signal. ``kill-timeout`` is the number of
seconds to wait until any nodes with prolog tasks that are still
active will be drained. The drain reason will include the string
"canceled then timed out". The default is 60.
(optional, float) If the prolog times out, or a job exception is raised
during the job prolog and ``cancel-on-exception`` is true, the prolog
will be canceled by sending it a SIGTERM signal. ``kill-timeout``
is a floating point number of seconds to wait until any nodes with
prolog tasks that are still active will be drained. The drain reason
will include the string "timed out" if the prolog timeout was reached,
or "canceled then timed out" if the prolog was canceled after a job
exception then timed out. The default is 60.
cancel-on-exception
(optional) A boolean indicating whether a fatal job exception raised
while the prolog is active terminates the prolog. The default is true.
(optional, bool) A boolean indicating if the prolog should be terminated
when a fatal job exception is raised while the prolog is active. The
default is true.

epilog
(optional) Table of configuration for a job-manager epilog. If
Expand All @@ -141,29 +144,32 @@ epilog
The ``[job-manager.epilog]`` table supports the following keys:

command
(optional) An array of strings specifying the command to run. If
(optional, string) An array of strings specifying the command to run. If
``exec.imp`` is set, the the default command is ``["flux-imp",
"run", "prolog"]``, otherwise it is an error if command is not set.
per-rank
(optional) By default the job-manager epilog only runs ``command``
(optional, bool) By default the job-manager epilog only runs ``command``
on rank 0. With ``per-rank=true``, the command will be run on each
rank assigned to the job.
timeout
(optional) A string value in Flux Standard Duration specifying a
(optional, string) A string value in Flux Standard Duration specifying a
timeout for the epilog, after which it is terminated (and a job
exception raised). By default, the epilog timeout is disabled.
kill-timeout
(optional) If a job exception is raised during the job epilog and
``cancel-on-exception`` is ``true``, then the epilog will be canceled
by sending it a SIGTERM signal. ``kill-timeout`` is the number of
seconds to wait until any nodes with prolog tasks that are still
active will be drained. The drain reason will include the string
"canceled then timed out". The default is 10. (``kill-timeout`` with
the job epilog should only be used for testing purposes)
(optional, float) If the epilog times out, or a job exception is raised
during the job epilog and ``cancel-on-exception`` is true, the epilog
will be canceled by sending it a SIGTERM signal. ``kill-timeout``
is a floating point number of seconds to wait until any nodes with
epilog tasks that are still active will be drained. The drain reason
will include the string "timed out" if the epilog timeout was reached,
or "canceled then timed out" if the epilog was canceled after a job
exception then timed out. The default is 60.
cancel-on-exception
(optional) A boolean indicating whether a fatal job exception raised
while the epilog is active terminates the epilog. The default is true.
(cancel-on-exception is only used with the epilog for testing purposes)
(optional, bool) A boolean indicating if the epilog should be terminated
when a fatal job exception is raised while the epilog is active. The
default is false. (``cancel-on-exception`` should only be used with
the epilog for testing purposes, since users can generate exceptions
on their jobs)

perilog
(optional) Common prolog/epilog configuration keys:
Expand Down
27 changes: 6 additions & 21 deletions src/modules/job-manager/plugins/perilog.c
Original file line number Diff line number Diff line change
Expand Up @@ -210,12 +210,6 @@ static struct perilog_procdesc *perilog_procdesc_create (json_t *o,
errprintf (errp, "command must be an array");
return NULL;
}
if (kill_timeout > 0.) {
if (!prolog) {
errprintf (errp, "kill-timeout not allowed for epilog");
return NULL;
}
}
/* If no command is set but exec.imp is non-NULL then set command to
* [ "$imp_path", "run", "$name" ]
*/
Expand Down Expand Up @@ -773,18 +767,8 @@ static void io_cb (struct bulk_exec *bulk_exec,
}
}


static void start_cb (struct bulk_exec *exec, void *arg)
{
struct perilog_proc *proc = bulk_exec_aux_get (exec, "perilog_proc");
/* Start timeout timer when processes have started
*/
if (proc && proc->timer)
flux_watcher_start (proc->timer);
}

static struct bulk_exec_ops ops = {
.on_start = start_cb,
.on_start = NULL,
.on_exit = NULL,
.on_complete = completion_cb,
.on_error = error_cb,
Expand Down Expand Up @@ -889,8 +873,7 @@ static struct perilog_proc *procdesc_run (flux_t *h,
goto error;
}
proc->timer = w;
/* Note: watcher will be started in bulk-exec start callback
*/
flux_watcher_start (proc->timer);
}
proc->R = json_incref (R);
proc->bulk_exec = bulk_exec;
Expand Down Expand Up @@ -1115,8 +1098,10 @@ static int exception_cb (flux_plugin_t *p,
flux_plugin_arg_t *args,
void *arg)
{
/* On exception, kill any prolog running for this job:
* Follow up with SIGKILL after 10s.
/* On exception, send SIGTERM to any remaining active tasks if
* cancel-on-exception is set (default=true for prolog, false for epilog)
* After kill-timeout, any remaining active tasks will have their nodes
* drained.
*/
struct perilog_proc *proc;
int severity;
Expand Down
7 changes: 0 additions & 7 deletions t/t2274-manager-perilog-per-rank.t
Original file line number Diff line number Diff line change
Expand Up @@ -112,13 +112,6 @@ test_expect_success 'perilog: invalid config is rejected' '
test_debug "cat config.err.$i" &&
grep "1 object.*left unpacked: foo" config.err.$i &&
test_must_fail flux config load <<-EOF 2>config.err.$((i+=1)) &&
[job-manager.epilog]
command = [ "foo" ]
kill-timeout = 5.5
EOF
test_debug "cat config.err.$i" &&
grep "kill-timeout not allowed for epilog" config.err.$i &&
test_must_fail flux config load <<-EOF 2>config.err.$((i+=1)) &&
[job-manager.perilog]
log-ignore = "foo"
EOF
Expand Down

0 comments on commit 621c7b8

Please sign in to comment.