Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid scheduling jobs if not all parallel jobs are ready #6049

Merged
merged 1 commit into from
Nov 8, 2024

Conversation

Martchus
Copy link
Contributor

@Martchus Martchus commented Nov 6, 2024

Some jobs of a parallel cluster can be blocked (by a chained parent or by a pending Gru task) while some jobs can be scheduled. Before this change the scheduler assigns the jobs that can be scheduled which creates a half- scheduled parallel cluster.

This is particularly problematic when PARALLEL_ONE_HOST_ONLY=1 and git_auto_update = yes are used because then repairing half-scheduled clusters is more challenging and the likeliness that a cluster is partially blocked is higher. In theory this is also problematic when jobs within the parallel cluster depend on different asset downloads.

With this change the scheduler skips the whole cluster until all jobs can be scheduled to avoid half-scheduled clusters. Judging by the previous code and its comments this was already intended but the code didn't actually work correctly.

I tested the rewritten version of the code by creating a half-blocked cluster manually and observed the behavior of the scheduler before and after this commit.

Related ticket: https://progress.opensuse.org/issues/169342

t/05-scheduler-dependencies.t Outdated Show resolved Hide resolved
Some jobs of a parallel cluster can be blocked (by a chained parent or by a
pending Gru task) while some jobs can be scheduled. Before this change the
scheduler assigns the jobs that can be scheduled which creates a half-
scheduled parallel cluster.

This is particularly problematic when `PARALLEL_ONE_HOST_ONLY=1` and
`git_auto_update = yes` are used because then repairing half-scheduled
clusters is more challenging and the likeliness that a cluster is partially
blocked is higher. In theory this is also problematic when jobs within the
parallel cluster depend on different asset downloads.

With this change the scheduler skips the whole cluster until all jobs can
be scheduled to avoid half-scheduled clusters. Judging by the previous code
and its comments this was already intended but the code didn't actually
work correctly.

I tested the rewritten version of the code by creating a half-blocked
cluster manually and observed the behavior of the scheduler before and
after this commit.

Related ticket: https://progress.opensuse.org/issues/169342
Copy link

codecov bot commented Nov 6, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.98%. Comparing base (a1c8732) to head (6045c27).
Report is 14 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6049   +/-   ##
=======================================
  Coverage   98.98%   98.98%           
=======================================
  Files         395      395           
  Lines       39436    39441    +5     
=======================================
+ Hits        39036    39041    +5     
  Misses        400      400           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Comment on lines 97 to +98
my $tobescheduled = _to_be_scheduled($j, $scheduled_jobs);
if (!defined $tobescheduled) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correct me but I think _to_be_scheduled_recurse doesnt return undef. shouldnt that be just !$tobescheduled?

Copy link
Contributor Author

@Martchus Martchus Nov 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We call _to_be_scheduled here and that returns undef.

@mergify mergify bot merged commit 2c4a234 into os-autoinst:master Nov 8, 2024
46 checks passed
@Martchus Martchus deleted the fix-scheduler branch November 8, 2024 14:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants