-
Notifications
You must be signed in to change notification settings - Fork 106
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[feat] Add the Firecrest slurm scheduler #3046
base: develop
Are you sure you want to change the base?
Conversation
… firecrest_scheduler
Hello @ekouts, Thank you for updating! Cheers! There are no PEP8 issues in this Pull Request!Do see the ReFrame Coding Style Guide Comment last updated at 2024-01-09 08:56:44 UTC |
… firecrest_scheduler
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## develop #3046 +/- ##
===========================================
- Coverage 86.61% 85.43% -1.18%
===========================================
Files 61 61
Lines 12035 12230 +195
===========================================
+ Hits 10424 10449 +25
- Misses 1611 1781 +170 ☔ View full report in Codecov by Sentry. |
reframe/core/pipeline.py
Outdated
self, | ||
job_type, | ||
force_local=False, | ||
clean_up_stage=False, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you explain the purpose of this argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So my issue was that the job is not aware if it's the first to write in the stage directory of the remote filesystem. If it is, we need to cleanup this directory.
reframe/core/schedulers/slurm.py
Outdated
if sys.version_info >= (3, 7): | ||
import firecrest as fc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move the Firecrest backend to a separate file. If you need to use something from Slurm, just import it.
reframe/core/schedulers/slurm.py
Outdated
|
||
|
||
@register_scheduler('slurmfc') | ||
class SlurmFirecrestJobScheduler(SlurmJobScheduler): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It'd help if in the description of the PR you explain how the plugin is meant to work. Why Slurm is required? Isn't Firecrest agnostic to the actual scheduler?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't Firecrest agnostic to the actual scheduler?
Not really. We are submitting slurm scripts through firecrest even if the endpoint and its arguments are "generic". For the hypothetical case where firecrest supports different schedulers and we wanted to support this in reframe we would need to implement something different. That's why I wanted to name is something like Firecrest-Slurm
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to write a better description in the PR.
reframe/core/schedulers/slurm.py
Outdated
client_id = set_mandatory_var("FIRECREST_CLIENT_ID") | ||
client_secret = set_mandatory_var("FIRECREST_CLIENT_SECRET") | ||
token_uri = set_mandatory_var("AUTH_TOKEN_URL") | ||
firecrest_url = set_mandatory_var("FIRECREST_URL") | ||
self._system_name = set_mandatory_var("FIRECREST_SYSTEM") | ||
self._remotedir_prefix = set_mandatory_var('FIRECREST_BASEDIR') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How persistent are those variables? Does it suffice to define them once in the config or are they meant to change on every reframe invocation? If they are persistent, I suggest that you make them sched_options
in the configuration.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How persistent are those variables? Does it suffice to define them once in the config or are they meant to change on every reframe invocation?
Yes, it is enough to define them once in the config. I thought the client ID and secret should come from env vars instead of being hardcoded but I guess we do that in the config anyway.
I suggest that you make them
sched_options
in the configuration.
The thing that I don't like is that these are more per system than per partition https://reframe-hpc.readthedocs.io/en/latest/config_reference.html#config.systems.partitions.sched_options but it could work still.
reframe/core/schedulers/slurm.py
Outdated
params = self.client.parameters() | ||
for p in params['utilities']: | ||
if p['name'] == 'UTILITIES_MAX_FILE_SIZE': | ||
self._max_file_size_utilities = float(p['value'])*1000000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the conversion to float
fail here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm good point. In theory it shouldn't but I will add a try except in case firecrest is not configured properly and returns a non-number
@@ -33,6 +33,7 @@ install_requires = | |||
PyYAML | |||
requests | |||
semver | |||
pyfirecrest; python_version >= '3.7' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment above.
reframe/core/schedulers/slurm.py
Outdated
os.path.relpath(os.getcwd(), job._stage_prefix) | ||
) | ||
|
||
if job._clean_up_stage: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that should be treated transparently to the general Job API. I see that you clean it up between the build and run phase. What if the backend selected a unique remote stage dir for each job? The SSH backend, for example, creates a temporary directory in the remote end without the need to interfere with the job API.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see that you clean it up between the build and run phase.
Hm, actually this shouldn't happen and if it is happening it's a bug 😅 I only clean up in the first submission. If it's a normal/build_only test in the build phase, and for run_only in the run phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if the backend selected a unique remote stage dir for each job? The SSH backend, for example, creates a temporary directory in the remote end without the need to interfere with the job API.
Hm not sure what you mean. Could you explain this a bit more? You are talking about a case where build and run jobs have different dir?
reframe/core/schedulers/slurm.py
Outdated
self._push_artefacts(job) | ||
|
||
intervals = itertools.cycle([1, 2, 3]) | ||
while True: | ||
try: | ||
# Make request for submission | ||
submission_result = self.client.submit( | ||
self._system_name, | ||
os.path.join(job._remotedir, job.script_filename), | ||
local_file=False | ||
) | ||
break | ||
except fc.FirecrestException as e: | ||
stderr = e.responses[-1].json().get('error', '') | ||
error_match = re.search( | ||
rf'({"|".join(self._resubmit_on_errors)})', stderr | ||
) | ||
if not self._resubmit_on_errors or not error_match: | ||
raise | ||
|
||
t = next(intervals) | ||
self.log( | ||
f'encountered a job submission error: ' | ||
f'{error_match.group(1)}: will resubmit after {t}s' | ||
) | ||
time.sleep(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with this implementation is that it is blocking and the submit()
call must non-blocking because otherwise no other job will be able to proceed. Waiting for the artefacts to be uploaded can take quite some time. Also, sleeping in intervals of 1-3s waiting for FC to accept the submission, I think it's too coarse if this were to be done inside the submit()
.
What if you spawn other Python processes to handle the push/submit/pull and from the backend you simply control them, like the SSH backend does. There are two ways to do this:
- The more Pythonic: extend our
_ProcFuture
interface to handlemultiprocess.Process
object. Essentially, make a base future with the common functionality and have the current_ProcFuture
and the_MultiprocessProcFuture
expand from this. - The simplest (maybe): Create a separate simple special purpose python script that would handle the Firecrest connection (push/submit/pull) which you control from the backend in a non-blocking fashion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if you spawn other Python processes to handle the push/submit/pull and from the backend you simply control them, like the SSH backend does.
Yes, this is definitely an important improvement. I skipped it to have an implementation sooner, but you are right. Also there is a whole async version of pyfirecrest that would be a much better solution but I was kinda waiting to drop python 3.6 support in reframe 😅
reframe/core/schedulers/slurm.py
Outdated
job._nodespec = ','.join(m['nodelist'] for m in jobarr_info) | ||
|
||
def wait(self, job): | ||
# Quickly return in case we have finished already |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment is not valid when pulling the artefacts. :-)
reframe/core/schedulers/slurm.py
Outdated
|
||
def wait(self, job): | ||
# Quickly return in case we have finished already | ||
self._pull_artefacts(job) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pulling the artefacts here could also be problematic and blocking general progress. This is essentially called after the compile or run phase have completed. For all other backends, this returns immediately, whereas here it will block waiting for the artefacts to be pulled back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm, where would be a better place to do it? When we realize that the job finished in poll? 🤔
def _upload(local_path, remote_path): | ||
f_size = os.path.getsize(local_path) | ||
if f_size <= self._max_file_size_utilities: | ||
self.client.simple_upload( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to add a log entry here stating that this is synchronous upload?
def _push_artefacts(self, job): | ||
def _upload(local_path, remote_path): | ||
f_size = os.path.getsize(local_path) | ||
if f_size <= self._max_file_size_utilities: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does it make sense to introduce a variable to prefer the external_upload
version? I am wondering about the responsiveness of the framework if the client internet connection is not that great. What do you think?
self.log( | ||
f'Waiting for the uploads, sleeping for {t} sec' | ||
) | ||
time.sleep(t) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think my previous comment about the framework responsiveness makes no sense, right? It will sleep here.
Includes also changes from #3054, as soon it is merged the diff will be a bit smaller.
In order to test it these variables need to be set: