Deadlock in hello world example with UCX error #39
Comments
See the analysis in "test_sanity fails" (#36); the root cause is the same. It seems to be the interaction between UCP and UCX in the Docker environment on the test machine.
Closed
I set the net devices variable:
The result is the same as without this variable set. Please find the full log:
JIRA: DLIS-7830
The example single_file.py fails with an unreachable UCX error:
The default example configuration of the network interface used for UCX doesn't work on this machine with Docker running.
Reproduction
Use branch https://github.com/triton-inference-server/triton-distributed/tree/piotrm-add-nats-hosts
Start NATS.io server
Start example with default host and port passed:
Expected result
The example sends requests, processes them in the workers, and returns with no error.
Results
Log indicates no request was processed:
Error logs analysis
The most important logs are not printed to the console but written to several log files:
All logs zip: logs.zip
It is necessary to inspect all of them to identify the root cause of the failure.
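Since the relevant messages are spread over several files, a small helper can collect the suspicious lines in one place. This is only a sketch: the `*.log` file pattern and the marker strings are assumptions, not something the example itself defines, and `find_error_lines` is a hypothetical name.

```python
from pathlib import Path

# Assumed markers; adjust to whatever the worker logs actually contain.
ERROR_MARKERS = ("Traceback", "Error", "ERROR")


def find_error_lines(log_dir, markers=ERROR_MARKERS):
    """Return (file name, line number, text) for each line containing a marker."""
    hits = []
    # Walk the directory recursively so per-worker subfolders are covered too.
    for path in sorted(Path(log_dir).rglob("*.log")):
        with path.open(errors="replace") as handle:
            for number, line in enumerate(handle, start=1):
                if any(marker in line for marker in markers):
                    hits.append((path.name, number, line.rstrip()))
    return hits
```

Calling `find_error_lines("logs")` on the unpacked archive (the directory name is an assumption) would list every matching line together with the file it came from.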
Log encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stdout.log:
Log encoder_decoder.0.ab22e895-d51e-11ef-9355-88a4c2b6c3a5.2511.stderr.log:
The above exception was the direct cause of other exceptions.
Network configuration of the Docker instance
Network configuration in Docker:
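To compare the interfaces visible inside the container with the one UCX is told to use, something like the following may help. It is a sketch under assumptions: it presumes the "net devices variable" is UCX's `UCX_NET_DEVICES` environment variable, and the helper names are hypothetical.

```python
import os
import socket


def list_interfaces():
    """Return the names of the network interfaces visible to this process."""
    return [name for _index, name in socket.if_nameindex()]


def env_with_net_device(device, base_env=None):
    """Return a copy of the environment with UCX_NET_DEVICES set to `device`.

    Assumption: UCX_NET_DEVICES is the variable used to pin the interface.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_NET_DEVICES"] = device
    return env
```

The returned dictionary can be passed as `env=` to `subprocess.run` when starting the example, with `device` taken from `list_interfaces()`.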
Python workers analysis
All Python threads stopped while idle:
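A comparable per-thread view can also be captured from inside a worker process. The sketch below uses only the standard library; `dump_thread_stacks` is a hypothetical helper and not part of the example code.

```python
import sys
import threading
import traceback


def dump_thread_stacks():
    """Return a mapping of thread name to its formatted current stack."""
    names = {thread.ident: thread.name for thread in threading.enumerate()}
    stacks = {}
    # sys._current_frames() maps each thread id to its topmost frame.
    for ident, frame in sys._current_frames().items():
        name = names.get(ident, f"unknown-{ident}")
        stacks[name] = "".join(traceback.format_stack(frame))
    return stacks
```

Printing the result while the example hangs shows where each thread is waiting, which helps tell a thread blocked on a request apart from one that is simply idle.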