Compute resource specifications:
Compute resource specifications should be model-specific, and we should translate the requirements to the appropriate settings for the currently used queue system.
In particular, the number of CPU cores and the amount of memory required are inherent to the model, not to the specific queue system that is configured. We should translate these requirements/configurations to the specific queue system when submitting the jobs, allowing the queue system to make decisions about how to best distribute the jobs over the available compute resources. (This means that the Torque setting `NUM_CPUS_PER_NODE` should be deprecated and removed, as it attempts to do this resource allocation manually and leaves it up to the user.)
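A minimal sketch of this separation, assuming a hypothetical `ResourceRequirements` class and per-driver translation functions (the names and exact flags are illustrative, not the actual ERT API):

```python
# Hypothetical sketch: model-specific requirements live in one place and are
# translated into driver-specific submit options. The class, functions, and
# flags below are illustrative, not the actual ERT API.
from dataclasses import dataclass


@dataclass
class ResourceRequirements:
    """What the model needs, independent of any queue system."""

    num_cpu: int
    memory_mb: int


def to_lsf_options(req: ResourceRequirements) -> list[str]:
    # LSF: request cores with -n and memory through an rusage resource string.
    return ["-n", str(req.num_cpu), "-R", f"rusage[mem={req.memory_mb}]"]


def to_torque_options(req: ResourceRequirements) -> list[str]:
    # Torque/PBS: state the requirements and let the scheduler place the job
    # (exact resource syntax varies between Torque versions and sites).
    return ["-l", f"nodes=1:ppn={req.num_cpu},mem={req.memory_mb}mb"]
```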
Improved status updating and tracking:
Currently every job (`JobQueueNode`) interacts with the queue system on its own, updating and tracking the status of a single job submitted to the queue system (it creates one thread per realization, each doing its own polling). This scales poorly as the number of realizations increases, to the point where we will DDoS the queue system when asking for statuses. Mitigating actions exist both in the Torque driver (`qstat_proxy.sh`, which caches the result of the command) and in the LSF driver (an internal caching lookup table in the driver itself).
A better solution here would be to lift the abstraction to a slightly higher level, such that the "Queue manager" can query the status of all the jobs simultaneously. Something along the lines of this:
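(A hypothetical sketch of the idea; `LSFDriver.poll_statuses` and `QueueManager` are illustrative names, not existing ERT interfaces.)

```python
# Hypothetical sketch: one driver-level call returns the status of every
# submitted job, so the queue manager makes a single round-trip per polling
# interval instead of one per realization. Names are illustrative only.
import subprocess


class LSFDriver:
    def poll_statuses(self, job_ids: list[str]) -> dict[str, str]:
        """Ask LSF about all jobs in a single bjobs invocation."""
        result = subprocess.run(
            ["bjobs", "-noheader", "-o", "jobid stat", *job_ids],
            capture_output=True,
            text=True,
            check=True,
        )
        statuses = {}
        for line in result.stdout.splitlines():
            job_id, status = line.split()
            statuses[job_id] = status
        return statuses


class QueueManager:
    def __init__(self, driver: LSFDriver, job_ids: list[str]):
        self._driver = driver
        self._job_ids = job_ids

    def update(self) -> dict[str, str]:
        # One query for all realizations, replacing per-job polling threads.
        return self._driver.poll_statuses(self._job_ids)
```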
Move code to Python:
We should move the current C-code implementation to Python.
NOTE:
The queue driver is performance critical to ERT, and at no point during the refactoring should performance degrade to the point where ERT stops working. This means extra testing is needed to ensure scaling as the number of realizations increases.
List of issues:
- We should make sure that all new code comes with benchmarking/performance tests (see the sketch after this list).
- Investigate whether we can just use `jobs.json` as an input to the queue and let the driver handle execution automatically. This no longer seems relevant, as we still rely on `jobs.json`.
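A hypothetical performance test along these lines, using pytest-benchmark with an illustrative `MockDriver` stand-in:

```python
# Hypothetical performance test using pytest-benchmark, guarding scaling in
# the number of realizations during the refactoring. MockDriver is an
# illustrative stand-in for a real driver.
import pytest


class MockDriver:
    """Answers status queries without touching a real queue system."""

    def poll_statuses(self, job_ids):
        return {job_id: "RUNNING" for job_id in job_ids}


@pytest.mark.parametrize("num_realizations", [10, 100, 1000])
def test_status_polling_scales(benchmark, num_realizations):
    driver = MockDriver()
    job_ids = [str(i) for i in range(num_realizations)]
    statuses = benchmark(driver.poll_statuses, job_ids)
    assert len(statuses) == num_realizations
```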
Definition of Done:
- Migration to Python
- Moving Scheduler to Main
- Drivers
- Dask related
- Other Issues:
  - Terminate experiment #5955
  - Kill realizations: realizations with state `Waiting` are submitted before being stopped #6134