Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not retry when we get 'Job has finished' from qdel #7373

Merged
merged 1 commit into from
Mar 7, 2024

Conversation

xjules
Copy link
Contributor

@xjules xjules commented Mar 6, 2024

Issue
Resolves #7369

Approach
Do not retry when qdel failed with Job has finished

(Screenshot of new behavior in GUI if applicable)

  • PR title captures the intent of the changes, and is fitting for release notes.
  • Added appropriate release note label
  • Commit history is consistent and clean, in line with the contribution guidelines.
  • Make sure tests pass locally (after every commit!)

When applicable

  • When there are user facing changes: Updated documentation
  • New behavior or changes to existing untested code: Ensured that unit tests are added (See Ground Rules).
  • Large PR: Prepare changes in small commits for more convenient review
  • Bug fix: Add regression test for the bug
  • Bug fix: Create Backport PR to latest release

@xjules xjules added release-notes:skip If there should be no mention of this in release notes scheduler labels Mar 6, 2024
@xjules xjules self-assigned this Mar 6, 2024
assert "qdel executed" in Path(bin_path / "qdel_output").read_text(encoding="utf-8")
if exit_code == QDEL_JOB_HAS_FINISHED:
# the job has been already qdel-ed so no need to retry
assert not os.path.exists(bin_path / "qdel_output")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick

Suggested change
assert not os.path.exists(bin_path / "qdel_output")
assert not (bin_path / "qdel_output").exists()

# the job has been already qdel-ed so no need to retry
assert not os.path.exists(bin_path / "qdel_output")
else:
assert "qdel executed" in Path(bin_path / "qdel_output").read_text(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick

Suggested change
assert "qdel executed" in Path(bin_path / "qdel_output").read_text(
assert "qdel executed" in (bin_path / "qdel_output").read_text(

assert not os.path.exists(bin_path / "qdel_output")
else:
assert "qdel executed" in Path(bin_path / "qdel_output").read_text(
encoding="utf-8"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Is encoding required? Is the default not "utf-8" already?

Copy link
Contributor

@berland berland Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should use the encoding. utf-8 is default on our computers yes, but lets be kind to the world.

See https://pylint.readthedocs.io/en/latest/user_guide/messages/warning/unspecified-encoding.html

Copy link
Contributor

@berland berland Mar 6, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was enforced earlier when we had pylint in CI, now I see that our codebase has deterioated on this after the switch to ruff 😢

Copy link
Contributor

@pinkwah pinkwah Mar 7, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From the PEP it sounds like it only is a problem on Windows. It's what we in the activism business called praxis in our fight against the Microsoft monopoly!

But, joking aside, it is reasonable to specify encoding="utf-8" I guess. I always thought it was defined to be as such in Python 3, now that str is unicode.

It's only Windows, and on Windows it looks like it's one of the Windows codepages, eg. cp1250. Not even Windows uses this, preferring (the still bad) UTF-16 Unicode. Very confusing tbh. https://docs.python.org/3/library/locale.html#locale.getencoding

QSUB_INVALID_CREDENTIAL,
QSUB_PREMATURE_END_OF_MESSAGE,
QSUB_CONNECTION_REFUSED,
]

QDEL_EXIT_CODES = [QDEL_REQUEST_INVALID, QDEL_JOB_HAS_FINISHED]
QDEL_EXIT_CODES_TO_RETRY = [QDEL_REQUEST_INVALID]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: the rest of the Scheduler code uses "returncode" (from asyncio.process.Process's returncode property), not "exit_code".

Also, perhaps this should be inlined with the qdel code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed the groups and inject them directly as individual return_codes 👍

Comment on lines 120 to 121
exit_codes_with_retries: List[int],
exit_codes_without_retries: List[int],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe less verbose, more to the point?

Suggested change
exit_codes_with_retries: List[int],
exit_codes_without_retries: List[int],
*,
retry_for: Iterable[int] = (),
accept: Iterable[int] = (),

Copy link
Contributor

@pinkwah pinkwah left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving with nitpicks

@xjules xjules force-pushed the fix_qdel_retry branch 3 times, most recently from 5ab5f56 to 7776db7 Compare March 6, 2024 15:24
@xjules xjules merged commit 43aa4df into equinor:main Mar 7, 2024
63 of 66 checks passed
@xjules xjules deleted the fix_qdel_retry branch March 7, 2024 14:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release-notes:skip If there should be no mention of this in release notes scheduler
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

Killing simulations cause qdel retries on PBS
3 participants