Refactor queue drivers #6270

Closed
75 of 78 tasks
sondreso opened this issue Oct 9, 2023 · 0 comments


sondreso commented Oct 9, 2023

Compute resource specifications:
Compute resource specifications should be model specific, and we should translate the requirements to the appropriate settings for the currently used queue system.

In particular, the number of CPU cores and the amount of memory required are inherent to the model, not to the specific queue system that is configured. We should translate these requirements to the specific queue system when submitting the jobs, allowing the queue system to decide how best to distribute the jobs over the available compute resources. (This means that the Torque setting NUM_CPUS_PER_NODE should be deprecated and removed, as it performs this resource allocation manually and leaves it up to the user.)
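As an illustration, the translation step could look like the sketch below. ResourceSpec, the function names, and the exact submit options are assumptions for illustration, not the actual ERT API or final flag choices:

```python
# Hypothetical sketch: a model-level resource spec translated into
# queue-system-specific submit arguments at submission time.
from dataclasses import dataclass
from typing import List


@dataclass
class ResourceSpec:
    num_cpu: int    # cores required by the model, not by any queue system
    memory_mb: int  # memory required by the model


def to_lsf_args(spec: ResourceSpec) -> List[str]:
    # LSF-style: request cores with -n and memory via a rusage string.
    return ["-n", str(spec.num_cpu), "-R", f"rusage[mem={spec.memory_mb}]"]


def to_torque_args(spec: ResourceSpec) -> List[str]:
    # Torque/PBS-style: a single -l resource list; node placement is left
    # to the scheduler, with no manual NUM_CPUS_PER_NODE.
    return ["-l", f"ncpus={spec.num_cpu},mem={spec.memory_mb}mb"]
```

The point of the design is that the model declares requirements once, and each driver owns only the translation to its own syntax.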

Improved status updating and tracking:
Currently, every job (JobQueueNode) interacts with the queue system on its own, updating and tracking the status of a single job submitted to the queue system (it creates one thread per realization, each doing its own polling). This scales poorly as the number of realizations grows, to the point where we effectively DDoS the queue system when asking for statuses. Mitigating measures exist in both the Torque driver (qstat_proxy.sh, which caches the result of the command) and the LSF driver (an internal lookup table acting as a cache in the driver itself).
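The caching mitigation can be pictured as a short-lived cache in front of the status command, in the spirit of qstat_proxy.sh. This is a sketch only; the class and parameter names are made up:

```python
import time
from typing import Callable, Dict


class CachedStatusQuery:
    """Cache one status-command result for a short TTL so that many
    concurrent per-realization pollers share a single external call."""

    def __init__(self, query: Callable[[], Dict[str, str]], ttl: float = 2.0):
        self._query = query  # e.g. a function wrapping `qstat`
        self._ttl = ttl
        self._cached_at = -float("inf")
        self._cached: Dict[str, str] = {}

    def __call__(self) -> Dict[str, str]:
        now = time.monotonic()
        if now - self._cached_at > self._ttl:
            # Cache expired: issue one real query and remember when.
            self._cached = self._query()
            self._cached_at = now
        return self._cached
```

This reduces the external call rate but keeps the one-thread-per-realization structure, which is why lifting the abstraction (below) is the better fix.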

A better solution would be to lift the abstraction to a slightly higher level, so that the "queue manager" can query the status of all the jobs at once. Something along the lines of this:

import subprocess
from typing import Dict, List

def get_torque_status(jobs: List[JobId]) -> Dict[JobId, JobStatus]:
    # One `qstat` invocation covers every submitted job.
    # (check_output captures stdout; subprocess.call would only
    # return the exit code.)
    status_str = subprocess.check_output(["qstat", *qstat_args], text=True)
    statuses = parse_statuses(status_str)
    return {jobid: statuses[jobid] for jobid in jobs}
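A queue manager built on such a batched query could then run a single polling loop for the whole ensemble instead of one thread per realization. A minimal sketch, where the status strings and the get_status callable are assumptions:

```python
import time
from typing import Callable, Dict, Iterable


def poll_loop(
    jobs: Iterable[str],
    get_status: Callable[[list], Dict[str, str]],
    interval: float = 10.0,
) -> Dict[str, str]:
    # One batched status query per interval for the whole ensemble,
    # instead of one polling thread per realization.
    pending = set(jobs)
    statuses: Dict[str, str] = {}
    while pending:
        statuses = get_status(sorted(pending))
        for jobid, status in statuses.items():
            if status in ("DONE", "FAILED"):
                pending.discard(jobid)
        if pending:
            time.sleep(interval)
    return statuses
```

With this shape, the external call rate depends on the polling interval alone, not on the number of realizations.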

Move code to Python:
We should port the current C implementation of the queue drivers to Python.

NOTE:
The queue drivers are performance critical to ERT, and at no point during the refactoring should performance degrade to the point where ERT stops working. This means extra testing is needed to ensure scaling as the number of realizations increases.

List of issues:

We should make sure that all the new code comes with benchmarking/performance tests.
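One way such a test could guard the scaling property (all names here are hypothetical, not existing ERT test helpers): count external calls per polling round and assert that the batched driver issues exactly one, independent of ensemble size:

```python
from typing import Callable, Dict, List


def count_external_calls(n_jobs: int, driver: Callable) -> int:
    """Run one polling round for n_jobs via `driver` and return how many
    external status commands were issued."""
    calls = {"n": 0}

    def counting_qstat(jobs: List[str]) -> Dict[str, str]:
        calls["n"] += 1
        return {j: "RUNNING" for j in jobs}

    driver([f"job-{i}" for i in range(n_jobs)], counting_qstat)
    return calls["n"]


def batched_driver(jobs, qstat):
    # New pattern: a single status query for the whole ensemble.
    return qstat(jobs)


def per_job_driver(jobs, qstat):
    # Old per-realization pattern: one status query per job.
    return {j: qstat([j])[j] for j in jobs}
```

A benchmark in this style fails loudly if a refactor silently reintroduces per-realization polling.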


Definition of Done:

- Migration to Python
- Moving Scheduler to Main
- Drivers
- Dask related
- Other issues
