Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[bugfix] Evaluate Slurm job nodelist lazily #3022

Merged
merged 2 commits into from
Nov 8, 2023

Conversation

vkarak
Copy link
Contributor

@vkarak vkarak commented Oct 19, 2023

The nodelist evaluation was being triggered early by logging, when we update the log record with all the test attributes:

for attr, alt_name in check_type.loggable_attrs():
extra_name = alt_name or attr
with suppress_deprecations():
# In case of AttributeError, i.e., the variable is undefined,
# we set the value to None
val = getattr(self.check, attr, None)
if attr in check_type.raw_params:
# Attribute is parameter, so format it
val = check_type.raw_params[attr].format(val)
key = f'check_{extra_name}'
self.extra['__rfm_loggable_attrs__'].append(key)
self.extra[key] = val

The problem is that in some cases Slurm does not populate the nodelist of the job immediately, so we might end with an incomplete node list in the test job. We fix this by evaluating lazily the job's nodelist after the job has finished.

Fixes #3018.

@stefaniereuter could you please check if this fixes your problem?

@codecov
Copy link

codecov bot commented Oct 19, 2023

Codecov Report

Attention: 1 lines in your changes are missing coverage. Please review.

Comparison is base (3e3f4c5) 86.71% compared to head (8f22223) 86.71%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3022   +/-   ##
=======================================
  Coverage   86.71%   86.71%           
=======================================
  Files          61       61           
  Lines       12002    12002           
=======================================
  Hits        10407    10407           
  Misses       1595     1595           
Files Coverage Δ
reframe/core/schedulers/slurm.py 52.92% <0.00%> (ø)

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@stefaniereuter
Copy link

Thank you so much @vkarak. I will try this as soon as our system is back from maintenance.

@vkarak vkarak linked an issue Oct 24, 2023 that may be closed by this pull request
@vkarak
Copy link
Contributor Author

vkarak commented Oct 26, 2023

@stefaniereuter did you manage to try this out?

@stefaniereuter
Copy link

stefaniereuter commented Oct 26, 2023 via email

@vkarak
Copy link
Contributor Author

vkarak commented Nov 8, 2023

@stefaniereuter Did you have a chance to test this? We will merge this PR so as to make it to the 4.4.1 release. If the problem is not fixed, feel free to reopen the issue.

@vkarak vkarak merged commit 7797ef1 into reframe-hpc:master Nov 8, 2023
17 of 18 checks passed
@vkarak vkarak deleted the bugfix/slurm-lazy-eval-nodelist branch November 8, 2023 21:31
@stefaniereuter
Copy link

sorry for the very late response. Yes we were finally able to test this last week and it seem to fix our issue, thank you so much

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

Nodelist not properly logged in perflog
3 participants