
bench_louvain Failing in bench_algos.py Due to CommClosedError #4542

Open
2 tasks done
nv-rliu opened this issue Jul 17, 2024 · 0 comments
Labels: benchmarks, bug (Something isn't working), graph-devops (Issues for the graph-devops team), non-breaking (Non-breaking change)
nv-rliu commented Jul 17, 2024

Version

24.08

Which installation method(s) does this occur on?

Conda, Source

Describe the bug.

The Louvain benchmark in cugraph/benchmarks/cugraph/pytest-based/bench_algos.py fails with a CommClosedError whose root cause is a ConnectionRefusedError raised during Dask cluster teardown (see the log output below).
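For context (illustrative only, not the cugraph code path): ConnectionRefusedError is what the OS raises when a TCP connect targets a port with no listener, which is the state the teardown hits once the workers have died. A minimal stdlib sketch:

```python
import socket

# Reserve an ephemeral port by binding to it, then close the socket so
# nothing is listening there anymore.
probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
probe.bind(("127.0.0.1", 0))
port = probe.getsockname()[1]
probe.close()

# Connecting to the now-unused port is refused by the kernel,
# surfacing as ConnectionRefusedError (errno 111 on Linux).
try:
    socket.create_connection(("127.0.0.1", port), timeout=2)
    refused = False
except ConnectionRefusedError:
    refused = True
print(refused)
```

This mirrors what `Comms.destroy()` runs into: the scheduler/workers it tries to reach are gone, so every connect attempt is refused until the 30 s timeout expires.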

Minimum reproducible example

pytest -v --import-mode=append bench_algos.py::bench_louvain
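The first error in the log below is a UCX bind failure ("Address already in use"). That is the generic EADDRINUSE condition, reproducible with a plain stdlib socket (illustrative only; this does not touch UCX or Dask):

```python
import errno
import socket

# Bind an ephemeral TCP port, then try to bind the same port again
# from a second socket without SO_REUSEADDR.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))          # OS picks a free port
port = first.getsockname()[1]

second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))  # same port while still held
except OSError as exc:
    print(exc.errno == errno.EADDRINUSE)
finally:
    second.close()
    first.close()
```

In the benchmark run this suggests a stale process (or a previous worker) was still holding port 37111 when the new worker tried to bind it.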

Relevant log output

07/17/24-11:32:55.328165306_UTC>>>> NODE 0: ******** STARTING BENCHMARK FROM: ./bench_algos.py::bench_louvain, using 8 GPUs
============================= test session starts ==============================
platform linux -- Python 3.10.14, pytest-8.2.2, pluggy-1.5.0 -- /opt/conda/bin/python3.10
cachedir: .pytest_cache
rapids_pytest_benchmark: 0.0.15
benchmark: 4.0.0 (defaults: timer=time.perf_counter disable_gc=False min_rounds=5 min_time=0.000005 max_time=1.0 calibration_precision=10 warmup=False warmup_iterations=100000)
rootdir: /root/cugraph/benchmarks
configfile: pytest.ini
plugins: rapids-pytest-benchmark-0.0.15, benchmark-4.0.0, cov-5.0.0
collecting ... collected 720 items / 719 deselected / 1 selected

bench_algos.py::bench_louvain[ds:rmat_mg_20_16-mm:False-pa:True] [1721215988.622395] [rno1-m02-c08-dgx1-048:3618260:0]            sock.c:470  UCX  ERROR bind(fd=141 addr=0.0.0.0:37111) failed: Address already in use

Dask client/cluster created using LocalCUDACluster
2024-07-17 05:32:55,418 - distributed.worker - WARNING - Scheduler was unaware of this worker; shutting down.
/opt/conda/lib/python3.10/site-packages/pytest_benchmark/logger.py:46: PytestBenchmarkWarning: Not saving anything, no benchmarks have been run!
  warner(PytestBenchmarkWarning(text))
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-07-17 05:32:57,433 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
2024-07-17 05:32:57,436 - distributed.nanny - ERROR - Worker process died unexpectedly
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,440 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,444 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,449 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Process Dask Worker process (from Nanny):
Process Dask Worker process (from Nanny):
2024-07-17 05:32:57,456 - distributed.nanny - ERROR - Worker process died unexpectedly
2024-07-17 05:32:57,456 - distributed.nanny - ERROR - Worker process died unexpectedly
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/opt/conda/lib/python3.10/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/process.py", line 202, in _run
    target(*args, **kwargs)
  File "/opt/conda/lib/python3.10/site-packages/distributed/nanny.py", line 1019, in _run
    asyncio_run(run(), loop_factory=get_loop_factory())
  File "/opt/conda/lib/python3.10/site-packages/distributed/compatibility.py", line 236, in asyncio_run
    return loop.run_until_complete(main)
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 636, in run_until_complete
    self.run_forever()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 603, in run_forever
    self._run_once()
  File "/opt/conda/lib/python3.10/asyncio/base_events.py", line 1871, in _run_once
    event_list = self._selector.select(timeout)
  File "/opt/conda/lib/python3.10/selectors.py", line 469, in select
    fd_event_list = self._selector.poll(timeout, max_ev)
KeyboardInterrupt

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! KeyboardInterrupt !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
/opt/conda/lib/python3.10/threading.py:324: KeyboardInterrupt
(to show a full traceback on KeyboardInterrupt use --full-trace)
distributed.comm.core.CommClosedError: in <distributed.comm.tcp.TCPConnector object at 0x14909852d9f0>: ConnectionRefusedError: [Errno 111] Connection refused

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/bin/pytest", line 10, in <module>
    sys.exit(console_main())
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 206, in console_main
    code = main()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/config/__init__.py", line 178, in main
    ret: Union[ExitCode, int] = config.hook.pytest_cmdline_main(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 139, in _multicall
    raise exception.with_traceback(exception.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 332, in pytest_cmdline_main
    return wrap_session(config, _main)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/main.py", line 320, in wrap_session
    config.hook.pytest_sessionfinish(
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_hooks.py", line 513, in __call__
    return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_manager.py", line 120, in _hookexec
    return self._inner_hookexec(hook_name, methods, kwargs, firstresult)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 182, in _multicall
    return outcome.get_result()
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_result.py", line 100, in get_result
    raise exc.with_traceback(exc.__traceback__)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/logging.py", line 872, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/terminal.py", line 867, in pytest_sessionfinish
    result = yield
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 167, in _multicall
    teardown.throw(outcome._exception)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/warnings.py", line 140, in pytest_sessionfinish
    return (yield)
  File "/opt/conda/lib/python3.10/site-packages/pluggy/_callers.py", line 103, in _multicall
    res = hook_impl.function(*args)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 110, in pytest_sessionfinish
    session._setupstate.teardown_exact(None)
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 557, in teardown_exact
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/runner.py", line 546, in teardown_exact
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1023, in finish
    raise exceptions[0]
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 1012, in finish
    fin()
  File "/opt/conda/lib/python3.10/site-packages/_pytest/fixtures.py", line 896, in _teardown_yield_fixture
    next(it)
  File "/root/cugraph/benchmarks/cugraph/pytest-based/bench_algos.py", line 230, in dataset
    mg_utils.stop_dask_client(client, cluster)
  File "/opt/conda/lib/python3.10/site-packages/cugraph/testing/mg_utils.py", line 178, in stop_dask_client
    Comms.destroy()
  File "/opt/conda/lib/python3.10/site-packages/cugraph/dask/comms/comms.py", line 216, in destroy
    __instance.destroy()
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
    self.client.run(
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3074, in run
    return self.sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
    return sync(
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 440, in sync
    raise error
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 414, in f
    result = yield future
  File "/opt/conda/lib/python3.10/site-packages/tornado/gen.py", line 766, in run
    value = future.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 2979, in _run
    raise exc
  File "/opt/conda/lib/python3.10/site-packages/distributed/scheduler.py", line 6527, in send_message
    comm = await self.rpc.connect(addr)
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1677, in connect
    return connect_attempt.result()
  File "/opt/conda/lib/python3.10/site-packages/distributed/core.py", line 1567, in _connect
    comm = await connect(
  File "/opt/conda/lib/python3.10/site-packages/distributed/comm/core.py", line 368, in connect
    raise OSError(
OSError: Timed out trying to connect to tcp://127.0.0.1:33921 after 30 s
Exception ignored in: <function Comms.__del__ at 0x14920f451f30>
Traceback (most recent call last):
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 135, in __del__
  File "/opt/conda/lib/python3.10/site-packages/raft_dask/common/comms.py", line 226, in destroy
  File "/opt/conda/lib/python3.10/site-packages/distributed/client.py", line 3074, in run
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 364, in sync
  File "/opt/conda/lib/python3.10/site-packages/distributed/utils.py", line 431, in sync
  File "/opt/conda/lib/python3.10/site-packages/tornado/platform/asyncio.py", line 227, in add_callback
AttributeError: 'NoneType' object has no attribute 'get_running_loop'
07/17/24-12:33:27.207025649_UTC>>>> ERROR: command timed out after 3600 seconds
07/17/24-12:33:27.208473673_UTC>>>> NODE 0: pytest exited with code: 124, run-py-tests.sh overall exit code is: 124
07/17/24-12:33:27.325919843_UTC>>>> NODE 0: remaining python processes: [ 3612421 /usr/bin/python2 /usr/local/dcgm-nvdataflow/DcgmNVDataflowPoster.py ]
07/17/24-12:33:27.350685725_UTC>>>> NODE 0: remaining dask processes: [  ]

Environment details

Observed inside the nightly cugraph MNMG testing containers on draco-rno.
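The teardown traceback shows `stop_dask_client` calling `Comms.destroy()`, whose `client.run(...)` blocks until the 30 s connect timeout because the workers are already dead. One possible mitigation (a hypothetical `safe_stop` helper, not part of cugraph or distributed) is to make each teardown step best-effort so a dead worker cannot wedge the whole pytest session:

```python
import logging

def safe_stop(client, cluster, destroy=None):
    """Best-effort teardown sketch: run each step in order and log-and-continue
    on failure. `client` and `cluster` are assumed to expose .close();
    `destroy` is an optional callable (e.g. Comms.destroy) that may raise
    CommClosedError or a connect timeout if the workers have already died."""
    steps = [s for s in (destroy,
                         getattr(client, "close", None),
                         getattr(cluster, "close", None)) if s is not None]
    for step in steps:
        try:
            step()
        except Exception as exc:
            logging.warning("teardown step %r failed: %s", step, exc)
```

Whether swallowing the error here is acceptable, versus fixing whatever kills the workers in the first place, is a design decision for the maintainers; this only keeps the session from hanging until the 3600 s job timeout.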

Other/Misc.

Code of Conduct

  • I agree to follow cuGraph's Code of Conduct
  • I have searched the open bugs and have found no duplicates for this bug report
@nv-rliu nv-rliu added bug Something isn't working ? - Needs Triage Need team to review and classify labels Jul 17, 2024
@nv-rliu nv-rliu assigned nv-rliu and jnke2016 and unassigned nv-rliu Jul 17, 2024
@nv-rliu nv-rliu added non-breaking Non-breaking change graph-devops Issues for the graph-devops team benchmarks and removed ? - Needs Triage Need team to review and classify labels Jul 17, 2024
@nv-rliu nv-rliu added this to the 24.08 milestone Jul 17, 2024