
Add option to perform a dryrun with some rocoto commands #114

Open
aerorahul opened this issue Dec 9, 2024 · 7 comments

@aerorahul

The ability to perform a dry run without executing the underlying rocoto command would be valuable. One use case: running rocotorun with a dry-run option to obtain the batch card without actually submitting the job, which would let the user visually validate the batch card.

An initial effort toward achieving this has been made here.
Is this the right track?

@christopherwharrop-noaa

Thanks @aerorahul for your report and initial work. The existing design makes a dry-run submission a little tricky. There are some subtleties that need to be accounted for regarding how jobs get submitted within a detached daemon process, and how a job is added to the database before the submission attempt even occurs (so that submit failures, delays, etc. can be tracked).

The tuple returned by submit is expected to be the jobid and the output of the submit command. If the submission succeeds, the jobid is a valid jobid and the output is the usual output of the command, which is parsed to retrieve the jobid. If it fails, the jobid is nil and the output is the error message.

It might be better to create a new method just for dry-run submissions and to add logic in the various boot, run, rewind, etc. commands to handle it based on whether the dry-run option is active. There is a lot of room for improvement in how all of this is designed and handled.
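For illustration, a minimal sketch of that return contract (the method name, the sbatch call, and the regex here are assumptions for the Slurm case, not Rocoto's actual code):

```ruby
# Sketch of the [jobid, output] contract described above; illustrative only.
def submit_sketch(script_path)
  output = `sbatch #{script_path} 2>&1`
  if $?.success?
    jobid = output[/Submitted batch job (\d+)/, 1] # parse the scheduler's jobid
    [jobid, output]                                # success: valid jobid + command output
  else
    [nil, output]                                  # failure: nil jobid + error message
  end
end
```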

@aerorahul

Thanks @christopherwharrop-noaa.
If I create a submit_dryrun, I would have to duplicate the contents of submit, such as the creation of the temporary file. Correct?
I will give it a try, but it seems quite involved and requires an understanding of Rocoto's internal design.
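A hedged sketch of one way to avoid that duplication: keep a single method and branch just before the actual submission, rather than writing a separate submit_dryrun (names here are hypothetical, not Rocoto's internals):

```ruby
require "tempfile"

# Hypothetical sketch: build the script exactly as a normal submit would,
# but return early before anything is handed to the scheduler.
def submit(script_text, dryrun: false)
  file = Tempfile.new(["rocoto_submit", ".sh"])
  file.write(script_text)
  file.close

  # Dry run: no jobid, and the output points at the generated script.
  return [nil, "DRYRUN: submit script written to #{file.path}"] if dryrun

  output = `sbatch #{file.path} 2>&1`
  jobid = output[/Submitted batch job (\d+)/, 1]
  [jobid, output]
end
```

That would keep the temporary-file handling in one place rather than copying it into a second method.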

@christopherwharrop-noaa

Let me think about it more deeply. There might be a simpler way that I'm just not thinking of. Of course, any user can already get the submit script using the -v option, but not as a dry run: a live job submission will be attempted. I understand how that can be detrimental when you are trying to debug, because you don't want a submission that is valid from Slurm's point of view, but wrong, to occur while you are still building the workflow.

@aerorahul

The approach we have been using/brainstorming is inelegant and extremely hacky:

1. rocotorun -v 10 .... or rocotoboot -v 10 ...
2. get the jobid
3. scancel jobid

As you note, it likely breaks the provenance of the Rocoto database, and I am sure there are unintended consequences.

@christopherwharrop-noaa

I think your request for a dry-run feature (or whatever we want to call it) to get the script that Rocoto will submit for a particular task is totally reasonable. I strongly suspect this is something many others would find useful. One other thing, though, is that whatever is implemented has to work for PBSPro and any other supported batch system; realistically, those are the only ones in use right now.

The thing that makes it weird is Rocoto's way of submitting jobs asynchronously in a daemon spawned by the main process. That daemon server process often lives on after the main rocotorun process has terminated, and it is the thing that builds the submit script. I think we just need to make all parts aware of whether a dry run is active or not. Some of the plumbing that happens just before job submission attempts are made needs modification so that, in dry-run mode, it does not do things like store the submit attempt in the database.
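A rough sketch of what that guard might look like (all names here are hypothetical; the real flow involves the detached daemon and the workflow database):

```ruby
# Hypothetical plumbing sketch: every bookkeeping step checks the same flag,
# so a dry run builds and shows the script without touching the database.
def attempt_submission(task, db, batch_system, dryrun: false)
  db.record_submit_attempt(task) unless dryrun   # normal mode: track the attempt

  jobid, output = batch_system.submit(task, dryrun: dryrun)

  if dryrun
    puts output                                  # just report the would-be submission
  elsif jobid
    db.record_jobid(task, jobid)                 # live job: store jobid for tracking
  else
    db.record_submit_failure(task, output)       # failure: store error for retries
  end
end
```

The point being that the daemon-side code and the database writes would all consult the same flag rather than each path deciding on its own.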

@aerorahul

Thanks for explaining the work involved.
I started with Slurm just to get the conversation going and gauge interest.

@aerorahul

@christopherwharrop-noaa In the branch mentioned above, I added the dry run for the other batch systems in the same spirit as Slurm.

If you can help me with the part you mentioned:

> I think we just need to make all parts aware of whether a dry run is active or not. Some of the plumbing that happens just before job submission attempts are made needs modification so that, in dry-run mode, it does not do things like store the submit attempt in the database.

I would greatly appreciate your help.
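For what it's worth, here is a sketch of how the flag could be threaded in from the command line (purely illustrative; the option name and the engine call are assumptions):

```ruby
require "optparse"

# Hypothetical option parsing for a dry-run flag; Rocoto's actual CLI
# handling may look quite different.
options = { dryrun: false }
OptionParser.new do |opts|
  opts.banner = "Usage: rocotorun [options]"
  opts.on("--dryrun", "Build submit scripts without submitting them") do
    options[:dryrun] = true
  end
end.parse!

# The flag would then be handed to whatever drives submissions, e.g.
# engine = WorkflowEngine.new(dryrun: options[:dryrun])
puts "dry-run mode: #{options[:dryrun]}"
```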
