Slurm electron fails when called within a Dask sublattice which is itself called in a Dask lattice #69

Open
jackbaker1001 opened this issue Jun 2, 2023 · 2 comments

Comments

@jackbaker1001

jackbaker1001 commented Jun 2, 2023

Environment

  • Covalent version: 0.220
  • Covalent-Slurm plugin version: 0.16.0
  • Python version: 3.8.16
  • Operating system: MacOS Ventura 13.3.1

What is happening?

When running a Slurm electron within a base (Dask) sublattice, and dispatching that sublattice within a base (Dask) lattice, the dispatch runs on the remote cluster and finishes the job, but then fails when retrieving the result. The traceback reported in the GUI is:

Traceback (most recent call last):
  File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent_dispatcher/_core/runner.py", line 251, in _run_task
    output, stdout, stderr, status = await executor._execute(
  File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 628, in _execute
    return await self.execute(
  File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/covalent/executor/base.py", line 657, in execute
    result = await self.run(function, args, kwargs, task_metadata)
  File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 695, in run
    result, stdout, stderr, exception = await self._query_result(
  File "/Users/jbaker/code/covalent/covalent-slurm-plugin/covalent_slurm_plugin/slurm.py", line 577, in _query_result
    async with aiofiles.open(stderr_file, "r") as f:
  File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/base.py", line 78, in __aenter__
    self._obj = await self._coro
  File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/site-packages/aiofiles/threadpool/__init__.py", line 80, in _open
    f = yield from loop.run_in_executor(executor, cb)
  File "/Users/jbaker/miniconda3/envs/covalent_slurm/lib/python3.8/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
FileNotFoundError: [Errno 2] No such file or directory: '/Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log'

How can we reproduce the issue?

I am using the sshproxy extra requirement and have prepared my Covalent config file as suggested in the root README.md.

Here's a simple workflow to reproduce the above:

import covalent as ct
import numpy as np

executor = ct.executor.SlurmExecutor(
    remote_workdir="<wdir>",
    options={
        "qos": "regular",
        "t": "00:05:00",
        "nodes": 1,
        "C": "gpu",
        "A": "<acc code>",
        "J": "bug_test",
        "ntasks-per-node": 4,
        "gpus-per-task": 1,
        "gpu-bind": "map_gpu:0,1,2,3",
    },
    prerun_commands=[
        "export COVALENT_CONFIG_DIR=<somewhere in scratch>",
        "export COVALENT_CACHE_DIR=<somewhere in scratch>",
        "export SLURM_CPU_BIND=\"cores\"",
        "export OMP_PROC_BIND=spread",
        "export OMP_PLACES=threads",
        "export OMP_NUM_THREADS=1",
    ],
    username="<username>",
    ssh_key_file="<key>",
    cert_file="<cert>",
    address="perlmutter-p1.nersc.gov",
    conda_env="<conda env>",
    use_srun=False,
)

@ct.electron
def get_rand_sum_length(lo, hi):
    np.random.seed(1984)
    return np.random.randint(lo, hi)

# Slurm electron
@ct.electron(executor=executor)
def get_rand_num_slurm(lo, hi):
    np.random.seed(1984)
    return np.random.randint(lo, hi)  

@ct.electron
@ct.lattice
def add_n_random_nums(n, lo, hi):
    np.random.seed(1984)
    sum = 0
    for i in range(n):
        sum += get_rand_num_slurm(lo, hi)
    return sum

@ct.lattice
def random_num_workflow(lo, hi):
    n = get_rand_sum_length(lo, hi)
    sum = add_n_random_nums(n, lo, hi) # sublattice
    return sum

id = ct.dispatch(random_num_workflow)(1, 3)
ct_result = ct.get_result(dispatch_id=id, wait=True)
sum = ct_result.result
print(sum)

What should happen?

The code should run to completion, throwing no error in the GUI, and print an integer.

Any suggestions?

It seems to me that the interaction between the Dask and Slurm executors is not quite right. Either way, the file Covalent is looking for exists in the remote working directory at <wdir>/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log but does not exist at the local path /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/stdout-ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091-0.log. Indeed, in /Users/jbaker/.local/share/covalent/data/ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091/, the stdout files are contained within the node subdirectories.
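For illustration, here is a minimal diagnostic sketch (not part of the workflow; it only uses the dispatch ID from the traceback above and the default local data directory from my machine) showing the mismatch between the path the plugin tries to open and where the logs actually land locally:

from pathlib import Path

# Dispatch ID taken from my failing run (see traceback above).
dispatch_id = "ee3b1f1b-b21b-4bbd-bc95-6bbc012c3091"
data_dir = Path.home() / ".local" / "share" / "covalent" / "data" / dispatch_id

# The path the Slurm plugin tries to open, per the traceback:
expected = data_dir / f"stdout-{dispatch_id}-0.log"
print(expected, "->", expected.exists())  # False on my machine

# Where the stdout/stderr logs actually end up locally (node subdirectories):
for log in sorted(data_dir.rglob("*.log")):
    print(log.relative_to(data_dir))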

@santoshkumarradha
Member

@cjao this seems like an edge case we need to look at, any recommended pattern for this?

@Andrew-S-Rosen
Contributor

Andrew-S-Rosen commented Jul 29, 2023

@jackbaker1001: Was this issue addressed by AgnostiqHQ/covalent#1736?
