Developing a Covalent Plugin for HPC Built Around PSI/J #1740
-
Hi @arosen93, thank you for bringing this up! It's a great suggestion and feature request for our project. Using PSI/J as a unified interface to multiple job schedulers has considerable merit, especially given the need to adapt to next-gen schedulers, and it aligns with our aim of building a sustainable, scalable solution for HPC job management.
As for the requirement that PSI/J be installed in a Python environment on the remote machine, I agree that it is a small hurdle. As long as we state this requirement in the documentation, it should be fine. We could also raise an early error on the compute side if the requirement is not met, to guide users through the installation.
Furthermore, building on your initial proposal, I believe it would be very beneficial to introduce subclasses of executors, such as |
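One way the early error on the compute side could look is sketched below. This is only an illustration, not code from Covalent or from the proof-of-concept: the function name check_remote_requirements and the install hint are assumptions (PSI/J is imported as psij and, at the time of writing, published on PyPI as psij-python).

```python
import importlib.util


def check_remote_requirements(
    package: str = "psij",
    install_hint: str = "pip install psij-python",  # assumed PyPI name, adjust if needed
) -> None:
    """Raise a descriptive error if `package` cannot be imported here.

    Intended to run on the remote machine before any job is submitted,
    so the user sees a clear message instead of a late ImportError.
    """
    # find_spec returns None when a top-level module cannot be located.
    if importlib.util.find_spec(package) is None:
        raise RuntimeError(
            f"'{package}' is not installed in the remote Python environment. "
            f"Install it there first, for example with: {install_hint}"
        )
```

Calling check_remote_requirements() at the top of whatever script runs remotely would fail fast with an actionable message.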
-
Summary
It would be worth making a Covalent plugin for HPC batch job schedulers that is built around PSI/J to enable access to both existing and next-gen job schedulers. I am working on a proof-of-concept here but no guarantees.
Background
PSI/J is a Python package developed as part of ExaWorks (funded by the Exascale Computing Project) to serve as a unified interface to various job schedulers on HPC machines. This includes typical schedulers like Slurm, PBS, and LSF, but also forthcoming schedulers like Flux that are going to replace Slurm on several US DOE machines in the upcoming years. The idea, as advertised by ExaWorks, is that workflow packages can begin adopting PSI/J as a standard component in their stack so we aren't all reinventing the wheel by developing custom interfaces to the various job schedulers.
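For anyone who has not used PSI/J, here is a minimal sketch of what the unified interface looks like. It relies on the psij classes JobExecutor, JobSpec, and Job; the function name submit_hello and the /bin/echo job are made up for illustration, and which executor names are available depends on the installed PSI/J version.

```python
def submit_hello(scheduler: str = "local"):
    """Submit a trivial job through PSI/J and wait for it to finish.

    `scheduler` is a PSI/J executor name, e.g. "local", "slurm",
    "lsf", or "flux". Only this string changes between schedulers.
    """
    # Imported lazily so this module still loads where PSI/J is absent.
    from psij import Job, JobExecutor, JobSpec

    executor = JobExecutor.get_instance(scheduler)
    job = Job(JobSpec(executable="/bin/echo", arguments=["hello from", scheduler]))
    executor.submit(job)
    # wait() blocks until the job reaches a final state and returns its status.
    return job.wait()
```

The point is that switching from, say, Slurm to Flux is a one-argument change rather than a new scheduler-specific code path.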
Current State
Currently, the only job scheduler supported by Covalent is Slurm, through the covalent-slurm-plugin. It's possible to add a new plugin for every scheduler, but that's a lot of work and would be extremely difficult to maintain in a robust way.
Proposal
A PSI/J-based Covalent plugin would be worth developing so that Covalent has access to several schedulers and is immediately compatible with new schedulers adopted in the future, provided they are supported by PSI/J.
Aside from having a single unified interface to several job schedulers, there is an added benefit down the road. The next item on the PSI/J development roadmap is support for remote job management. Once that is available, the Covalent plugin could be greatly streamlined by outsourcing this relatively complex task to PSI/J.
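To make the "one plugin, many schedulers" idea concrete, here is a rough sketch of one possible approach until PSI/J handles remote job management itself: the plugin renders a small Python script locally, copies it to the remote machine, and runs it there. This is not how covalent-slurm-plugin or the proof-of-concept is implemented; render_submit_script is a hypothetical helper, and the file transfer and remote execution steps are left out.

```python
import textwrap
from typing import List


def render_submit_script(scheduler: str, executable: str, arguments: List[str]) -> str:
    """Return the source of a Python script that submits one job via PSI/J.

    The scheduler name is just a parameter, so the same template serves
    every scheduler that PSI/J supports.
    """
    return textwrap.dedent(
        f"""\
        from psij import Job, JobExecutor, JobSpec

        executor = JobExecutor.get_instance({scheduler!r})
        job = Job(JobSpec(executable={executable!r}, arguments={arguments!r}))
        executor.submit(job)
        # Print the scheduler's own job ID so the caller can track the job.
        print(job.native_id)
        """
    )
```

For example, render_submit_script("slurm", "/bin/hostname", ["-f"]) and the same call with "flux" produce scripts that differ only in the executor name.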
Limitations
There are two main limitations of this approach:
It requires someone to actually make it. That said, I've already started making this and we'll see if I can get it to the finish line.
It requires PSI/J to be installed in a Python environment on the remote machine, whereas the current covalent-slurm-plugin simply requires the job scheduler to be present. This is a non-issue in my mind because it is simple to pip install PSI/J.
Alternatives
Since Covalent already has first-class support for Dask, one could imagine using dask-jobqueue to create a unified interface to various job schedulers. However, I think there is a major benefit in relying on a lightweight package built and maintained by HPC centers rather than Dask, and there are many complexities associated with spinning up a remote, async Dask cluster as opposed to simply submitting a batch job.