Fix enqueueing of Minion jobs breaking PARALLEL_ONE_HOST_ONLY=1
#6048
base: master
Conversation
* Avoid repeating the same loop twice
* Use better variable names
When restarting openQA jobs, additional Minion jobs are enqueued (for the `git_clone` task; this happens especially often when `git_auto_clone` is enabled). So far this did not happen in a transaction, so the scheduler might see the openQA jobs but not the Minion jobs they are blocked by. This is problematic because the scheduler might assign jobs to a worker too soon. It is especially problematic if this does not happen consistently across a parallel job cluster, as it then breaks the `PARALLEL_ONE_HOST_ONLY=1` feature. Related ticket: https://progress.opensuse.org/issues/169342
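The core idea of the fix is to enqueue the Minion/GRU jobs within the same database transaction that creates the restarted jobs. A minimal sketch of that idea (not the literal patch; `$schema`, `%clones` and `@clone_ids` are assumed to be set up by the surrounding restart code):

```perl
# Sketch only: run job duplication and Minion/GRU enqueuing in one DBIx::Class
# transaction so the scheduler never sees the new openQA jobs without the
# GRU dependencies that block them.
$schema->txn_do(
    sub {
        # ... duplicate the openQA jobs and collect %clones / @clone_ids ...

        # enqueue Minion jobs to clone required Git repositories
        OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
    });
```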
}

# create comments on original jobs
$result_source->schema->resultset('Comments')
  ->create_for_jobs(\@original_job_ids, $comment_text, $comment_user_id, $comments)
  if defined $comment_text;

# enqueue Minion jobs to clone required Git repositories
OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
This would create a minion job for every job, only cluster jobs would be grouped together, right?
That would be way too many jobs in some cases.
The detection of identical git_clone tasks is not perfect, especially if they are created in quick succession.
I know this leads to more attempts to enqueue Minion jobs, but I was hoping that the code for de-duplicating Minion jobs you recently introduced will ensure we don't end up with too many after all.
Note that there will still only be one enqueuing attempt per job cluster. Only if one specifies multiple job IDs explicitly (e.g. using #5971 once it is merged) would we have more enqueuing attempts.
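For illustration, a rough sketch of how such de-duplication could look using Minion's job listing API (this is not the actual openQA implementation; the helper name and the naive args comparison are made up for this example):

```perl
use Mojo::JSON qw(encode_json);

# Hypothetical helper: skip enqueuing when an inactive git_clone job with the
# same arguments is already queued; otherwise enqueue a new one.
sub enqueue_git_clone_unless_pending {
    my ($minion, $args) = @_;
    my $jobs = $minion->jobs({tasks => ['git_clone'], states => ['inactive']});
    while (my $info = $jobs->next) {
        # naive comparison via JSON encoding; good enough for a sketch
        return $info->{id} if encode_json($info->{args}) eq encode_json([$args]);
    }
    return $minion->enqueue(git_clone => [$args]);
}
```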
Otherwise we would really need a "preparing" state.
I suppose it would work like this:
- We create jobs with the initial state "preparing" instead of "scheduled", keeping track of the job IDs.
- We enqueue Minion jobs.
- We set all jobs we haven't created Minion jobs for in step 2 to "scheduled" immediately.
- Before deleting GRU tasks we set related jobs to "scheduled". (This would still not fix the scheduling problem when there's a different set of e.g. download jobs within a parallel cluster - unless we take job dependencies into account here.)
Steps 2 and 3 would happen in one transaction. Step 4 would also happen in one transaction. A rough sketch of this flow follows below.
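Purely to illustrate the proposal (the "preparing" state and the helper functions used here do not exist in openQA; they are placeholders):

```perl
# Steps 1-3 in one transaction (helper names are hypothetical)
$schema->txn_do(
    sub {
        # 1. new jobs start out as "preparing" instead of "scheduled"
        my @job_ids = create_restarted_jobs(state => 'preparing');

        # 2. enqueue the Minion/GRU jobs the new jobs may depend on
        my @blocked_ids = enqueue_blocking_gru_tasks(\@job_ids);

        # 3. jobs without a blocking GRU task can be scheduled right away
        my %blocked = map { $_ => 1 } @blocked_ids;
        set_job_state('scheduled', grep { !$blocked{$_} } @job_ids);
    });

# Step 4, in its own transaction: when a GRU task is deleted, move the jobs it
# was blocking to "scheduled" as well.
```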
But I suppose I can fix the scheduler first while we think about what's best here. With the scheduler fixed we would still have this race condition, but at least job clusters aren't torn apart.
For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for `git_auto_update`. Of course then `git_auto_update` would rely only on the periodic updates for restarted jobs.
I don't understand step 4. Maybe you can explain tomorrow after the daily?
OTOH we could try this PR and see how it works out in practice.
Theoretically we could find out whether any new Minion job is required by keeping track of the Git directories in the `%git_clones` hash while iterating over the jobs to restart. I guess in most cases we will only have one CASEDIR/NEEDLES_DIR or DISTRI. The downside is that we would need to pass an additional `%git_clones_not_yet_enqueued` hash through from the top-level `job_restart`, which doesn't sound nice.
> For the sake of simplicity we could also reduce the number of Minion jobs on job restarts by avoiding Minion jobs for `git_auto_update`. Of course then `git_auto_update` would rely only on the periodic updates for restarted jobs.
That was a requirement though. We did have a complaint from someone who restarted a job and was wondering why it wasn't using the updated code.
btw, I guess this should be:
- OpenQA::App->singleton->gru->enqueue_git_clones(\%clones, \@clone_ids);
+ OpenQA::App->singleton->gru->enqueue_git_clones(\%git_clones, \@clone_ids);
Probably the reason for the test failures