Compute resource specifications:
Compute resource specifications should be model-specific, and we should translate the requirements to the appropriate settings for the currently used queue system.
In particular, the number of CPU cores and the amount of memory required are inherent to the model, not to the specific queue system that is configured. We should translate these requirements/configurations to the specific queue system when submitting the jobs, allowing the queue system to make decisions about how to best distribute the jobs over the available compute resources. (This means that the Torque setting `NUM_CPUS_PER_NODE` should be deprecated and removed, as it attempts to do this resource allocation manually and leaves it up to the user.)
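A minimal sketch of this separation, assuming a hypothetical `ResourceRequirements` class and per-driver translation functions (the names and exact flags are illustrative, not the actual ERT API):

```python
# Hypothetical sketch: model-specific requirements live in one place and are
# translated into driver-specific submit options. The class, functions, and
# flags below are illustrative, not the actual ERT API.
from dataclasses import dataclass


@dataclass
class ResourceRequirements:
    """What the model needs, independent of any queue system."""

    num_cpu: int
    memory_mb: int


def to_lsf_options(req: ResourceRequirements) -> list[str]:
    # LSF: request cores with -n and memory through an rusage resource string.
    return ["-n", str(req.num_cpu), "-R", f"rusage[mem={req.memory_mb}]"]


def to_torque_options(req: ResourceRequirements) -> list[str]:
    # Torque/PBS: state the requirements and let the scheduler place the job
    # (exact resource syntax varies between Torque versions and sites).
    return ["-l", f"nodes=1:ppn={req.num_cpu},mem={req.memory_mb}mb"]
```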
Improved status updating and tracking:
Currently every job (`JobQueueNode`) interacts with the queue system on its own, updating and tracking the status of a single job submitted to the queue system (it creates one thread per realization, each doing its own polling). This scales poorly as the number of realizations increases, to the point where we will DDoS the queue system when asking for statuses. Mitigating actions exist both in the Torque driver (`qstat_proxy.sh`, which caches the result of the command) and in the LSF driver (an internal caching lookup table in the driver itself).
A better solution here would be to lift the abstraction to a slightly higher level, such that the "Queue manager" can query the status of all the jobs simultaneously. Something along the lines of this:
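(A hypothetical sketch of the idea; `LSFDriver.poll_statuses` and `QueueManager` are illustrative names, not existing ERT interfaces.)

```python
# Hypothetical sketch: one driver-level call returns the status of every
# submitted job, so the queue manager makes a single round-trip per polling
# interval instead of one per realization. Names are illustrative only.
import subprocess


class LSFDriver:
    def poll_statuses(self, job_ids: list[str]) -> dict[str, str]:
        """Ask LSF about all jobs in a single bjobs invocation."""
        result = subprocess.run(
            ["bjobs", "-noheader", "-o", "jobid stat", *job_ids],
            capture_output=True,
            text=True,
            check=True,
        )
        statuses = {}
        for line in result.stdout.splitlines():
            job_id, status = line.split()
            statuses[job_id] = status
        return statuses


class QueueManager:
    def __init__(self, driver: LSFDriver, job_ids: list[str]):
        self._driver = driver
        self._job_ids = job_ids

    def update(self) -> dict[str, str]:
        # One query for all realizations, replacing per-job polling threads.
        return self._driver.poll_statuses(self._job_ids)
```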
Move code to Python:
We should move the current C-code implementation to Python.
NOTE:
The queue driver is performance critical to ERT, and at no point during the refactoring should performance degrade to the point where ERT stops working. This means extra testing is needed to ensure scaling as the number of realizations increases.
List of issues:
- We should make sure that all new code comes with benchmarking/performance tests (see the sketch after this list).
- Investigate whether we can just use `jobs.json` as an input to the queue and let the driver handle execution automatically. This no longer seems relevant, as we still rely on `jobs.json`.
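A hypothetical performance test along these lines, using pytest-benchmark with an illustrative `MockDriver` stand-in:

```python
# Hypothetical performance test using pytest-benchmark, guarding scaling in
# the number of realizations during the refactoring. MockDriver is an
# illustrative stand-in for a real driver.
import pytest


class MockDriver:
    """Answers status queries without touching a real queue system."""

    def poll_statuses(self, job_ids):
        return {job_id: "RUNNING" for job_id in job_ids}


@pytest.mark.parametrize("num_realizations", [10, 100, 1000])
def test_status_polling_scales(benchmark, num_realizations):
    driver = MockDriver()
    job_ids = [str(i) for i in range(num_realizations)]
    statuses = benchmark(driver.poll_statuses, job_ids)
    assert len(statuses) == num_realizations
```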
Definition of Done:
- Migration to Python
- Moving Scheduler to Main
- Drivers
- Dask related
- Other Issues:
  - Terminate experiment #5955
  - Kill realizations: realizations with state `Waiting` are submitted before being stopped #6134