job submissions to wrong partition on Hortense #68

Open
laraPPr opened this issue Jul 7, 2023 · 17 comments

@laraPPr
Collaborator

laraPPr commented Jul 7, 2023

ReFrame submits all tests to the same partition. If ReFrame is started from the cpu_milan partition, all tests found for other partitions are also submitted to cpu_milan. This goes especially wrong when starting from a GPU partition, since all tests meant for the CPU partitions fail immediately.

We have narrowed down the problem to the following parts of the hortense system and the vsc_hortense.py config file:

  • The config file adds the line #SBATCH --partition=cpu_milan to the job script (rfm_job.sh)
  • The environment variable SBATCH_PARTITION=cpu_rome is set by the Hortense cluster module
  • ReFrame submits the jobs with sbatch rfm_job.sh
  • So the SBATCH_PARTITION variable wins, because sbatch lets environment variables override #SBATCH directives in the script

Would it be possible for ReFrame to submit the job with sbatch --partition=cpu_milan rfm_job.sh instead?
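For reference, the sbatch documentation describes the precedence as: command-line options override environment variables, which in turn override #SBATCH directives in the script. Below is a toy model of that rule, not a call into Slurm; the function name effective_partition is made up for illustration.

```python
def effective_partition(cli=None, env=None, script=None):
    """Return the partition sbatch would use, given each possible source.

    Precedence (highest first): command-line option, SBATCH_PARTITION
    environment variable, #SBATCH directive in the job script.
    """
    for value in (cli, env, script):
        if value is not None:
            return value
    return None


# Situation in this issue: the #SBATCH directive loses to the variable.
print(effective_partition(env="cpu_rome", script="cpu_milan"))

# Passing the partition on the command line would override the variable.
print(effective_partition(cli="cpu_milan", env="cpu_rome", script="cpu_milan"))
```

The first call yields cpu_rome and the second cpu_milan, which is why moving the option to the sbatch command line would avoid the conflict.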

A possible workaround might be to use prepare_cmds to set the environment variable SBATCH_PARTITION for every partition in config/vsc_hortense.py.

@boegel
Contributor

boegel commented Jul 7, 2023

@vkarak Thoughts on this? Is it intentional that having $SBATCH_PARTITION set overrules the ReFrame configuration?

@vkarak

vkarak commented Jul 10, 2023

No, it's not intentional; we haven't thought about this type of setup, actually :-) But since this really depends on how the target system is set up, I think that setting prepare_cmds in the configuration of this partition, as @laraPPr suggests, is the right way. I have an additional question: where is the Hortense cluster module loaded? Is it also listed in the partition config? If so, then omitting --partition from the partition's sched_access might be another solution, as setting the partition is already taken care of by the module.

@casparvl
Collaborator

@vkarak Aren't the prepare_cmds only executed on the batch node? They aren't executed on the submit host (i.e. where one is running the reframe command), right?

We just tried something else together with Lara: we unloaded one of their sticky environment modules (the one that sets the SBATCH_PARTITION variable), then ran ReFrame. That works, since there is no longer a conflict between the environment variable and the #SBATCH --partition line that ReFrame puts in the batch script.

The only thing is, I don't know what unintended side effects the unload of that sticky module might have, but @boegel probably knows... :)
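If unloading the whole module turns out to have side effects, one alternative might be to drop only the variable from the environment in which reframe is launched. A minimal sketch, assuming reframe is on the PATH; the helper names env_without and run_reframe are hypothetical and not part of ReFrame:

```python
import os
import subprocess


def env_without(name, base=None):
    """Return a copy of the environment with the variable `name` removed."""
    env = dict(os.environ if base is None else base)
    env.pop(name, None)
    return env


def run_reframe(args):
    """Run `reframe` with SBATCH_PARTITION unset (hypothetical wrapper)."""
    return subprocess.run(
        ["reframe", *args],
        env=env_without("SBATCH_PARTITION"),
        check=False,
    )


# Example call (the checks path is a placeholder):
# run_reframe(["-C", "config/vsc_hortense.py", "-c", "<checks dir>", "-r"])
```

This only removes the variable for the reframe process and the sbatch calls it spawns; anything else the sticky module sets stays in place.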

@vkarak

vkarak commented Jul 13, 2023

> Aren't the prepare_cmds only executed on the batch node? They aren't executed on the submit host (i.e. where one is running the reframe command), right?

Indeed, you are right: prepare_cmds are part of the generated job script, so they run only after sbatch has already picked the partition. I hadn't fully understood the scenario.

Currently, there is no way in ReFrame to pass options directly to the sbatch command. Perhaps we could add support, through a sched_option, for passing all sbatch options on the command line instead of in the script, but that would make debugging a bit more difficult (you would have to look in the ReFrame logs to reproduce the exact submission).

> We just tried something else together with Lara: we unloaded one of their sticky environment modules

Have you also tried the -u option to unload the module (I'm not sure it force unloads it)?

@boegel
Contributor

boegel commented Aug 2, 2023

@vkarak Not being able to add options directly to the sbatch command is actually not a problem, as long as we can modify the environment in which the sbatch command is run (so we can make sure that $SBATCH_PARTITION is either unset, or that it's set correctly).

Is there a way to do that?

Unloading the module (with force) that sets $SBATCH_PARTITION works, but can we specify that in the config file somehow, or does that need to be done before running reframe?

(it's actually the same question twice, sort of...)

@vkarak

vkarak commented Aug 9, 2023

I had another look at it, and indeed you can't cleanly modify the environment in which sbatch runs, especially not per partition. You can only modify, globally, the environment in which ReFrame itself runs. I suggest opening an issue for this, as it requires some thought on how to do it properly.

@boegel
Contributor

boegel commented Jun 14, 2024

reframe-hpc/reframe#2970 was closed just now, so what we need is coming in a ReFrame release soon?

@vkarak

vkarak commented Jun 14, 2024

Yes, it is already merged into develop, if you would like to try it out, and it is scheduled for 4.7.

@laraPPr
Collaborator Author

laraPPr commented Jun 17, 2024

Just tested it, and it is now submitting to the right partition on Hortense. Thanks.

@laraPPr
Collaborator Author

laraPPr commented Jan 3, 2025

@vkarak I'm testing with version 4.7.2. Jobs are now submitted to the right partition, but not with autodetection (when 'remote_detect': True is set): the autodetection jobs are all submitted to the same partition, and the following setting in the config seems to be ignored:

'sched_options': {
   'shed_access_in_submit': True,
},

@vkarak

vkarak commented Jan 9, 2025

@laraPPr Was this regression in 4.7.2 only? Does it work in 4.7.0?

@vkarak

vkarak commented Jan 9, 2025

Could you also try 4.7.1? The only way this could have broken is through the fixes introduced in 4.7.2. Also, could you describe the exact scenario you are trying and what the expected behaviour would be?

@smoors
Collaborator

smoors commented Jan 10, 2025

'sched_options': {
   'shed_access_in_submit': True,
},

As Caspar found out, you have a typo: it should be sched_access_in_submit.
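With the spelling corrected, a partition entry might look roughly like the sketch below. Only the sched_options part comes from this thread; the other keys and values are my assumption of what the Hortense config contains, and further required keys (launcher, environs, ...) are left out.

```python
# Sketch of a single partition entry; not the actual vsc_hortense.py content.
partition = {
    'name': 'cpu_milan',                  # assumed partition name
    'scheduler': 'slurm',
    'access': ['--partition=cpu_milan'],  # options to pass at submission time
    'sched_options': {
        # Correct spelling: "sched", not "shed".
        'sched_access_in_submit': True,
    },
}
```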

@laraPPr
Collaborator Author

laraPPr commented Jan 13, 2025

The problem was indeed the typo. Tested with 4.7.0, 4.7.1 and 4.7.2.
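Since the misspelled key appeared to be ignored without any warning, a small check run over the config could catch this kind of mistake earlier. The sketch below uses Python's difflib; the list of known option names is illustrative and incomplete, and suggest_fixes is a made-up helper, not something ReFrame provides.

```python
import difflib

# Illustrative, incomplete list of scheduler option names.
KNOWN_SCHED_OPTIONS = [
    'sched_access_in_submit',
    'use_nodes_option',
    'job_submit_timeout',
]


def suggest_fixes(sched_options, known=KNOWN_SCHED_OPTIONS):
    """Map each unrecognised key to its closest known spelling (or None)."""
    fixes = {}
    for key in sched_options:
        if key not in known:
            close = difflib.get_close_matches(key, known, n=1)
            fixes[key] = close[0] if close else None
    return fixes


print(suggest_fixes({'shed_access_in_submit': True}))
# {'shed_access_in_submit': 'sched_access_in_submit'}
```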

@laraPPr
Collaborator Author

laraPPr commented Jan 13, 2025

Incoming PR to fix it

@laraPPr
Collaborator Author

laraPPr commented Jan 13, 2025

@laraPPr
Collaborator Author

laraPPr commented Jan 14, 2025

@casparvl can this one be closed (since it is resolved when using ReFrame 4.7.0), or should we wait until the CI is updated to use ReFrame 4.7.0 or newer?
