run distributed training with RDMA reports the libibverbs warning #313
Comments
Can you try adding the path of libibverbs to |
I checked the libibverbs path and found the following:
Do I need to |
I tried to
|
I discussed this question with my colleague. Our conclusion is that the following warning is normal; the key point is that the server and worker print this error: |
Can you make sure that And please paste the complete commands that you used to launch the server processes. |
It looks normal. server:
client:
|
Here is the output of ifconfig:
ibdev2netdev:
ibv_devinfo:
|
Here are my launch scripts for the scheduler and server:
server:
|
It seems that your eth0 does not have an available IP address. You can either set |
|
Does this mean the v0.2.5 source has fixed this issue? Can I build v0.2.5 directly from source without any changes? |
I just checked and you still need to change the source code a little bit. The correct process is: Pull byteps v0.2.5, and change the ps-lite submodule to Apologies for the inconvenience. We will fix this in #316. |
OK, thank you very much, I will try it soon. |
Hello, based on your suggestion, I checked out the commit Here is my build command:
The error is:
|
The output of |
Have you tried tuning the value of |
No. Did your PR not solve this problem? How do I set the value of |
The PR only makes the value configurable.
No need to recompile. Just export the value and then rerun. |
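The reason an export is enough can be illustrated with a minimal sketch of a value read from the environment at process start. This is only an illustration in Python, not the ps-lite code; the variable name `EXAMPLE_TUNABLE` and the default of 1024 are hypothetical placeholders, since the actual name is not shown in this thread.

```python
import os


def read_tunable(name, default):
    """Return an integer setting from the environment, or `default` if unset.

    The value is looked up when the process runs, so changing it only
    requires exporting a new value and restarting, not rebuilding.
    """
    raw = os.environ.get(name)
    if raw is None or raw.strip() == "":
        return default
    return int(raw)


if __name__ == "__main__":
    # EXAMPLE_TUNABLE is a hypothetical name used only for this sketch.
    print(read_tunable("EXAMPLE_TUNABLE", 1024))
```

With this pattern, running `export EXAMPLE_TUNABLE=4096` before launching the process changes the value that the program sees on its next start.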
Do I only need to set the value of |
When I export
|
According to #282 (comment), his problem was solved by changing the order in which the scheduler, worker, and server are started. Can a different execution order cause the RDMA memory region registration error? Furthermore, what is the correct execution order? |
This is caused by not having enough resources for registering the memory buffers. Here are a few things to try: (duplicate of #216 (comment))
|
I tried both suggestions and the problem is still not solved. According to https://www.rdmamojo.com/2012/09/07/ibv_reg_mr/, another possible cause of a failed MR registration is missing write permission (official wording: read only memory cannot be registered with write permissions (either local or remote)). My Docker container runs without root permission, and I don't know whether this error is caused by a lack of write permission. |
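One resource limit that commonly affects memory registration is the locked-memory limit (RLIMIT_MEMLOCK, shown by `ulimit -l`), because registering a memory region pins its pages. Whether this limit was among the suggestions above, or is the cause here, is not shown in the thread; the following is only a minimal sketch for inspecting it from inside the container, and the 64 MiB buffer size is an arbitrary illustrative figure.

```python
import resource


def memlock_limit():
    """Return (soft, hard) RLIMIT_MEMLOCK in bytes; None means unlimited."""
    soft, hard = resource.getrlimit(resource.RLIMIT_MEMLOCK)

    def normalize(value):
        return None if value == resource.RLIM_INFINITY else value

    return normalize(soft), normalize(hard)


def fits_in_limit(nbytes, soft_limit):
    """True if a buffer of `nbytes` fits under the soft limit (None = unlimited)."""
    return soft_limit is None or nbytes <= soft_limit


if __name__ == "__main__":
    soft, hard = memlock_limit()
    print("soft limit:", "unlimited" if soft is None else soft)
    print("hard limit:", "unlimited" if hard is None else hard)
    # 64 MiB is an arbitrary example size, not a value from this thread.
    print("64 MiB buffer fits:", fits_in_limit(64 * 1024 * 1024, soft))
```

If the soft limit is small, large registrations can fail regardless of write permissions, so this check helps separate the two possibilities.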
Can you run this benchmark? https://github.com/bytedance/ps-lite#1-basic-benchmark If it works, then the problem is not related to the permission. |
The output of tests/test_benchmark:
The output of tests/test_ipc_benchmark:
It seems to work well. |
Hello, the above error was solved by increasing the value of
|
(edited due to misread) PS: You can use the test_benchmark to test the 1v1 RDMA performance. |
You misread the ib_send_bw bandwidth unit. The performance is expected. |
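Since the confusion was about units, a small conversion sketch may help. By default ib_send_bw reports bandwidth in MB/sec, while link speeds are usually quoted in Gb/s. Whether the tool's "MB" means 10^6 or 2^20 bytes is an assumption to verify against its documentation, so the sketch takes it as an explicit parameter; the numbers are illustrative and not taken from this thread.

```python
def mb_per_s_to_gbit_per_s(mb_per_s, bytes_per_mb=1_000_000):
    """Convert a megabytes-per-second figure to gigabits per second.

    bytes_per_mb is an assumption: pass 2**20 if the tool reports
    binary megabytes instead of decimal ones.
    """
    bits_per_s = mb_per_s * bytes_per_mb * 8
    return bits_per_s / 1e9


if __name__ == "__main__":
    # Illustrative value only.
    print(mb_per_s_to_gbit_per_s(12500))
    print(mb_per_s_to_gbit_per_s(12500, bytes_per_mb=2**20))
```

A reading of roughly 12,000 MB/sec therefore corresponds to a link running at about 100 Gb/s, not 12 Gb/s.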
@wuyujiji You can check the counters in this folder |
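The folder referred to above is not shown in this thread. As a minimal sketch, assuming the counters are exposed as one integer per file in a directory (on Linux this is typically the layout under `/sys/class/infiniband/<device>/ports/<port>/counters`, which is an assumption here), the following reads them twice and reports which ones changed during a test run. The function names are hypothetical helpers for illustration.

```python
import os


def read_counters(counter_dir):
    """Read every file in `counter_dir` that holds a single integer."""
    counters = {}
    for name in sorted(os.listdir(counter_dir)):
        path = os.path.join(counter_dir, name)
        if not os.path.isfile(path):
            continue
        try:
            with open(path) as handle:
                counters[name] = int(handle.read().strip())
        except (OSError, ValueError):
            # Skip unreadable or non-numeric entries.
            continue
    return counters


def counter_delta(before, after):
    """Return only the counters whose value changed between two snapshots."""
    return {
        name: after[name] - before[name]
        for name in after
        if name in before and after[name] != before[name]
    }
```

Taking one snapshot before the benchmark and one after, then calling `counter_delta(before, after)`, shows which counters moved while traffic was flowing.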
@bobzhuyb Hi, I am not familiar with RDMA. When I test 1-to-1 traffic (test_benchmark.cc), the program finishes quickly. The output is:
When checking In addition, when I run test_ipc_benchmark.cc for about five minutes, In |
I did another experiment: when reducing to 1 worker and 1 server on one machine and running test_ipc_benchmark.cc, the |
|
My system admin checked that the PFC config is enabled.
I am sorry, but I don't know whether the PFC config is enabled. Could you please help me check this? Thanks a lot! |
Describe the bug
Excuse me, following https://github.com/bytedance/byteps/blob/master/docs/step-by-step-tutorial.md, when I run distributed training with RDMA, the scheduler prints the following warning:
Then the server and worker print this error:
Could you please help me? Thanks a lot!