Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use job script on disk for LSF driver and use stdin for PBS #7396

Merged
merged 2 commits into from
Mar 13, 2024

Conversation

berland
Copy link
Contributor

@berland berland commented Mar 8, 2024

Let PBS and LSF make explicit job scripts for test robustness

For OpenPBS the explicit job script only exists in memory and
is provided to qsub via stdin.

For LSF, the bsub command seemingly cannot read stdin, and the
job script is placed on runpath with a unique filename that
will be left behind.

This enables more complex jobs (e.g. redirection) submitted to the
drivers submit() function.

Align Python LSF driver with C-driver in supplying the runpath as the
last argument. This is needed to pass a legacy test. Note that legacy
LSF and legacy Torque drivers differ in this behaviour, the latter does
not supply runpath.

The type of the runpath object supplied is changed from str to
pathlib.Path

Issue
Resolves #7327

Approach
Short description of the approach

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@codecov-commenter
Copy link

codecov-commenter commented Mar 8, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 85.19%. Comparing base (55b9e3a) to head (84d9257).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7396      +/-   ##
==========================================
+ Coverage   84.48%   85.19%   +0.71%     
==========================================
  Files         384      384              
  Lines       22857    22877      +20     
  Branches      884      876       -8     
==========================================
+ Hits        19310    19491     +181     
+ Misses       3435     3277     -158     
+ Partials      112      109       -3     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@berland berland marked this pull request as ready for review March 8, 2024 12:11
@berland berland self-assigned this Mar 8, 2024
@berland berland changed the title Leave job script on disk for PBS driver Leave job script on disk for PBS and LSF drivers Mar 8, 2024
@berland
Copy link
Contributor Author

berland commented Mar 8, 2024

Onprem testing:
pytest --basetemp=/private/havb/pytest-tmp --lsf tests/integration_tests/scheduler -sxv
Azure testing:

export _ERT_TESTS_DEFAULT_QUEUE_NAME=permanent
pytest --basetemp=/private/havb/pytest-tmp --openpbs tests/integration_tests/scheduler  -sxv

@berland berland marked this pull request as draft March 8, 2024 12:43
@berland berland force-pushed the pbs_leave_script_on_file branch 2 times, most recently from 8e13e55 to 5067a06 Compare March 11, 2024 13:36
@berland
Copy link
Contributor Author

berland commented Mar 11, 2024

Some clearification:

  • We do not like leaving temporary files, especially in runpath, where it can become a de-facto API
  • We want the drivers to behave similarly

Currently, the on-prem /global/bin/bsub cannot take the job script on stdin, which leaves the two goals above impossible to attain.

@berland berland marked this pull request as ready for review March 11, 2024 14:33
@berland berland added the release-notes:misc Automatically categorise as miscellaneous change in release notes label Mar 11, 2024
@berland berland force-pushed the pbs_leave_script_on_file branch 2 times, most recently from c019c08 to 50063af Compare March 12, 2024 12:31
@berland berland changed the title Leave job script on disk for PBS and LSF drivers Leave job script on LSF driver and use stdin for PBS Mar 12, 2024
@berland berland changed the title Leave job script on LSF driver and use stdin for PBS Use job script on disk for LSF driver and use stdin for PBS Mar 12, 2024
@berland berland force-pushed the pbs_leave_script_on_file branch 4 times, most recently from cca6304 to ca13b08 Compare March 12, 2024 14:32
@berland
Copy link
Contributor Author

berland commented Mar 12, 2024

Integration tests with --openpbs on Azure and --lsf onprem has been manually verified (error there will not be visible until next komodo build)

f"cd {shlex.quote(str(runpath))}\n"
f"exec -a {shlex.quote(executable)} {executable} {shlex.join(args)}\n"
)
script_path = runpath / "job_script.sh"
Copy link
Contributor

@xjules xjules Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to use the same tempfile.NamedTemporaryFile pattern like in lsf_driver since when runpath=None this will not work or ?
Actually where is job_script.sh used?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch, it is not used any longer after changing this PR to use stdin

pass
if str(tmp_path).startswith("/tmp"):
print(
"Please use --basetemp option to pytest, PBS tests needs a shared disk"
Copy link
Contributor

@xjules xjules Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Please use --basetemp option to pytest, PBS tests needs a shared disk"
"Please use --basetemp option to pytest, the real PBS cluster needs a shared disk"

pass
if str(tmp_path).startswith("/tmp"):
print(
"Please use --basetemp option to pytest, LSF tests needs a shared disk"
Copy link
Contributor

@xjules xjules Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"Please use --basetemp option to pytest, LSF tests needs a shared disk"
"Please use --basetemp option to pytest, the real LSF cluster needs a shared disk"

with open("poly.ert", mode="a+", encoding="utf-8") as f:
f.write("QUEUE_SYSTEM TORQUE\nNUM_REALIZATIONS 2")
f.write("QUEUE_SYSTEM TORQUE\nNUM_REALIZATIONS 2\n")
if queue_name is not None:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if queue_name is not None:
if (queue_name: = os.getenv("_ERT_TESTS_DEFAULT_QUEUE_NAME")) is not None:

Copy link
Contributor

@xjules xjules Mar 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Get one less line also for the other tests since queue_name is used only in the if statements.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixing!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

used a fixture as well.

Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just small comments. It looks very good otherwise! :)

@@ -19,12 +20,19 @@ def mock_openpbs(pytestconfig, monkeypatch, tmp_path):
mock_bin(monkeypatch, tmp_path)


@pytest.mark.timeout(30)
@pytest.fixture()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@berland berland requested a review from xjules March 13, 2024 13:48
For OpenPBS the explicit job script only exists in memory and
is provided to qsub via stdin.

For LSF, the bsub command seemingly cannot read stdin, and the
job script is placed on runpath with a unique filename that
will be left behind.

This enables more complex jobs (e.g. redirection) submitted to the
drivers submit() function.

Align Python LSF driver with C-driver in supplying the runpath as the
last argument. This is needed to pass a legacy test. Note that legacy
LSF and legacy Torque drivers differ in this behaviour, the latter does
not supply runpath.

The type of the runpath object supplied is changed from str to
pathlib.Path
Copy link
Contributor

@xjules xjules left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice job @berland 👍 🚀

@berland berland merged commit eee4d2a into equinor:main Mar 13, 2024
49 of 53 checks passed
@berland berland deleted the pbs_leave_script_on_file branch May 29, 2024 12:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:misc Automatically categorise as miscellaneous change in release notes scheduler
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Integration test for PBS will never work with current driver
3 participants