Job submitted to LSF are not taken down properly when using Terminate experiment #5955

Closed
Tracked by #6270
oyvindeide opened this issue Aug 24, 2023 · 8 comments

@oyvindeide
Collaborator

It seems that sometimes, but not always, terminating an experiment does not kill the jobs running on LSF.

@oyvindeide added the bug label Aug 24, 2023
@oyvindeide changed the title from "Job sumitted to LSF are not taken down properly when using Terminate experiment" to "Job submitted to LSF are not taken down properly when using Terminate experiment" Aug 24, 2023
@xjules
Contributor

xjules commented Oct 2, 2023

We need to confirm whether this has already been fixed; it could also be flakiness.
This might be moved to the #6270 experiment server milestone.

@xjules mentioned this issue Oct 23, 2023
@jonathan-eq self-assigned this Oct 27, 2023
@jonathan-eq
Contributor

Does it not actually kill the jobs on the cluster, or is the GUI not recognizing that the jobs have been killed? What are the symptoms?

@jonathan-eq
Contributor

[diagram]

I created a diagram to better understand what happens behind the scenes when the "Terminate experiment" button is pressed.

@jonathan-eq
Contributor

jonathan-eq commented Nov 1, 2023

Some ideas:

  1. It could be that the event loop has a lot of events queued up, so the job actually finishes before the terminate call can be processed. This raises a second question: why are there so many events? There should only be one job status update per realization, times the number of realizations we are running in parallel. Even then, the batching dispatcher should batch most of the events together, so we would have fewer.
    To try: go thoroughly through everything that could be put on the Qt event loop and log it somewhere, then check whether there are any duplicates (see the instrumentation sketch after this list). A flooded event loop could slow the system down or make the program fail, and it could also be related to the many-realizations problem.

  2. It could also be that the job is not actually terminated on the job runner. The bkill command is invoked without any retry logic, so a failed return code would not be retried (see the retry sketch after this list).
    To try: override the bkill command on the job runner and have it return a non-zero exit code, then run an experiment and cancel it after a couple of jobs. Verify with the bjobs command that the jobs are still running, and compare with the ERT GUI.
    EDIT: After trying this, it seems the jobs keep running after calling bkill if the bkill command returns a non-zero exit code. However, the run dialog still closes.
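
For idea 1, a minimal sketch of one way to instrument the Qt event loop, assuming a PyQt5 application; `EventCounter` and the usage shown in the comments are illustrative, not existing ERT code:

```python
from collections import Counter

from PyQt5.QtCore import QObject


class EventCounter(QObject):
    """Count Qt events by type so duplicate or flooding events become visible."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def eventFilter(self, watched, event):
        self.counts[event.type()] += 1
        return False  # only observe the event, never consume it


# Hypothetical usage: install the filter on the QApplication instance and
# dump the counts after pressing "Terminate experiment":
#   counter = EventCounter()
#   app.installEventFilter(counter)
#   print(counter.counts.most_common(10))
```

For idea 2, a minimal sketch of what retry logic around `bkill` could look like; the function name, retry count, and delay are assumptions for illustration, not ERT's actual job-runner code:

```python
import subprocess
import time


def kill_lsf_job(job_id: str, retries: int = 3, delay: float = 5.0) -> bool:
    """Run bkill for an LSF job id, retrying on a non-zero exit code."""
    for _ in range(retries):
        result = subprocess.run(["bkill", job_id], capture_output=True, text=True)
        if result.returncode == 0:
            return True
        time.sleep(delay)  # back off and retry instead of silently giving up
    return False
```

If `bkill` keeps failing, the caller could at least surface the failure instead of closing the run dialog as if the termination had succeeded.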

@jonathan-eq
Contributor

Marked as blocked until more context and information about the bug is added.

@xjules
Contributor

xjules commented Nov 1, 2023

> Marked as blocked until more context and information about the bug is added.

Could you provide more context on the error, @oyvindeide?

@jonathan-eq
Contributor

This was all the information the user gave, and the user was unable to reproduce it himself. Should this be moved back to the backlog in case it shows up again, moved to Done since we have put work into it (the diagram as a deliverable), or should we just delete it?

@sondreso
Collaborator

sondreso commented Nov 3, 2023

Closing this as we are not able to reproduce it and it has not resurfaced. Will re-open if we notice this again 🙂

@sondreso closed this as not planned Nov 3, 2023
@sondreso removed the blocked label Nov 3, 2023