-
Notifications
You must be signed in to change notification settings - Fork 208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid scheduling jobs if not all parallel jobs are ready #6049
Conversation
Some jobs of a parallel cluster can be blocked (by a chained parent or by a pending Gru task) while some jobs can be scheduled. Before this change the scheduler assigns the jobs that can be scheduled which creates a half- scheduled parallel cluster. This is particularly problematic when `PARALLEL_ONE_HOST_ONLY=1` and `git_auto_update = yes` are used because then repairing half-scheduled clusters is more challenging and the likeliness that a cluster is partially blocked is higher. In theory this is also problematic when jobs within the parallel cluster depend on different asset downloads. With this change the scheduler skips the whole cluster until all jobs can be scheduled to avoid half-scheduled clusters. Judging by the previous code and its comments this was already intended but the code didn't actually work correctly. I tested the rewritten version of the code by creating a half-blocked cluster manually and observed the behavior of the scheduler before and after this commit. Related ticket: https://progress.opensuse.org/issues/169342
acd1633
to
6045c27
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #6049 +/- ##
=======================================
Coverage 98.98% 98.98%
=======================================
Files 395 395
Lines 39436 39441 +5
=======================================
+ Hits 39036 39041 +5
Misses 400 400 ☔ View full report in Codecov by Sentry. |
my $tobescheduled = _to_be_scheduled($j, $scheduled_jobs); | ||
if (!defined $tobescheduled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
correct me but I think _to_be_scheduled_recurse
doesnt return undef. shouldnt that be just !$tobescheduled
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We call _to_be_scheduled
here and that returns undef
.
Some jobs of a parallel cluster can be blocked (by a chained parent or by a pending Gru task) while some jobs can be scheduled. Before this change the scheduler assigns the jobs that can be scheduled which creates a half- scheduled parallel cluster.
This is particularly problematic when
PARALLEL_ONE_HOST_ONLY=1
andgit_auto_update = yes
are used because then repairing half-scheduled clusters is more challenging and the likeliness that a cluster is partially blocked is higher. In theory this is also problematic when jobs within the parallel cluster depend on different asset downloads.With this change the scheduler skips the whole cluster until all jobs can be scheduled to avoid half-scheduled clusters. Judging by the previous code and its comments this was already intended but the code didn't actually work correctly.
I tested the rewritten version of the code by creating a half-blocked cluster manually and observed the behavior of the scheduler before and after this commit.
Related ticket: https://progress.opensuse.org/issues/169342