Some questions about starting the server: Check failed: mr ibv_reg_mr failed: Cannot allocate memory #282
Similar issue: #216. We have fixed this (bytedance/ps-lite#30), but it is not merged into master yet. For now you can update the 3rdparty/ps-lite submodule, recompile ps-lite and BytePS, and then try to reduce BYTEPS_RDMA_START_DEPTH or BYTEPS_RDMA_RX_DEPTH.
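For reference, a minimal sketch of reducing those knobs before launching; the values below are illustrative halvings of the defaults reported in the error message, not maintainer recommendations:

export BYTEPS_RDMA_START_DEPTH=64   # default 128
export BYTEPS_RDMA_RX_DEPTH=1024    # default 2048
bpslaunch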
Yeah. But when I have 2 machines, one machine runs 1 server and 1 worker, and the other machine runs 1 worker and 1 scheduler. Maybe I cannot run it this way? But I think in theory I can do this.
From your log, it seems that you were still using the old ps-lite. Can you update your ps-lite submodule to 7e4800feb, and then do the following to start over:
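A plausible set of steps (the exact commands, paths, build flags, and setup.py invocation here are assumptions and may differ from your setup):

cd byteps/3rdparty/ps-lite
git fetch origin && git checkout 7e4800feb
make clean && make -j USE_RDMA=1             # rebuild ps-lite with RDMA support
cd ../..
python3 setup.py clean --all                 # drop stale BytePS build artifacts
BYTEPS_USE_RDMA=1 python3 setup.py install   # rebuild BytePS against the updated ps-lite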
This should make sure that you can update ps-lite correctly.
I followed it and updated my ps-lite, but I still get the error in the scheduler. On one machine:
bpslaunch python example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10
and on the other machine:
bpslaunch python example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10
Can you double check? In 7e4800feb, line 130 is not the associated line as shown in your log...
Sorry to bother again. I checked it, and I get a similar error (I set kRxDepth=256 in rdma_transport.h). The error and the output of free -m are as below: …
Can you show the output of …?
Yeah, thank you! Now I solved it; maybe I need to start the workers and servers in the correct sequence. But in my second experiment the distributed performance is unstable, while the single machine gets about 149.
There might be some resource contention that causes the unstable performance. Is there any other process running on your machines? And can you check the CPU utilization?
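For instance, a quick way to watch per-core load (assuming the sysstat package is available):

mpstat -P ALL 1   # per-core CPU utilization, refreshed every second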
There is no other process running on my machine, but the CPU utilization is also unstable. Have you ever had a problem like this?
We never saw this problem on our platform. Can you also try to bind the processes to specific cores using taskset?
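A minimal pinning sketch (the core ranges are assumptions and should match your machine's topology):

taskset -c 0-7 bpslaunch    # in the shell where DMLC_ROLE=server is exported
taskset -c 8-15 bpslaunch python example/tensorflow/synthetic_benchmark.py --model VGG16 --num-iters 10    # worker shell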
I used taskset to bind the worker and server processes to different cores, but the output is still unstable. Have you ever tested the co-host case, i.e. putting the server and a worker on one machine?
The colocated mode uses shared memory. It works very well in our environment. If you are sure this problem does not happen in single-machine training, would you test the ps-lite IPC benchmark and paste the performance log? Then we may be able to tell whether the problem is in networking.
I tried to run /tests/test_ipc_benchmark, and the representative part of the output is as below (in this case I don't use taskset):
[20:11:17] tests/test_ipc_benchmark.cc:136: Application goodput: 93.3967 Gbps
This is strange; the goodput should not drop that much. If your network is based on RoCE, is PFC enabled? (E.g., have you checked the raw RDMA bandwidth with …?) BTW, the log period you show is quite short. You can reduce the log frequency with …
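One common bandwidth-check tool, as an assumption since it is not named above, is ib_write_bw from perftest (the device name is also an assumption):

ib_write_bw -d mlx5_0 --report_gbits              # on the server node
ib_write_bw -d mlx5_0 --report_gbits <server_ip>  # on the client node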
Yeah, thank you. Now I solved it: the cause was PFC. I had not enabled PFC control, a silly mistake on my part. The performance is great now! Thank you very much!!
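For reference, on Mellanox NICs PFC is typically enabled with mlnx_qos; the interface name and the choice of priority 3 below are assumptions:

mlnx_qos -i ens1f0 --pfc 0,0,0,1,0,0,0,0   # enable PFC on priority 3 only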
Hello @ymjiang
1 scheduler:
docker run \
-e DMLC_ROLE=scheduler \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --net=host byteps/pytorch:latest \
bpslaunch
2 servers:
docker run \
-e DMLC_ROLE=server \
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --net=host byteps/pytorch:latest \
bpslaunch
2 workers:
docker run \
-e DMLC_ROLE=worker \
-e DMLC_WORKER_ID=0 \ # and 1 for the other
-e DMLC_NUM_WORKER=2 \
-e DMLC_NUM_SERVER=2 \
-e DMLC_PS_ROOT_URI=11.0.0.201 \
-e DMLC_PS_ROOT_PORT=9876 \
-e DMLC_INTERFACE=ens1f0 \
-e DMLC_ENABLE_RDMA=ibverbs \
-e BYTEPS_ENABLE_IPC=1 \
--device /dev/infiniband/issm0 --device /dev/infiniband/rdma_cm --device /dev/infiniband/ucm0 --device /dev/infiniband/umad0 --device /dev/infiniband/uverbs0 \
--cap-add IPC_LOCK \
--ulimit memlock=-1 \
-it --rm --runtime=nvidia --net=host byteps/pytorch:latest \
bpslaunch python3 /usr/local/byteps/example/pytorch/benchmark_byteps.py --model resnet50 --num-iters 20

The logs:

scheduler:
BytePS launching scheduler
[05:42:27] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[05:42:27] src/postoffice.cc:19: Creating Van: ibverbs
[05:42:28] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=IPC
[05:42:28] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:43:11] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:43:30] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=IPC
[05:45:06] src/./rdma_van.h:806: OnConnect to Node 2147483647 with Transport=RDMA
[05:45:07] src/./rdma_van.h:234: Connect to Node 9 with Transport=RDMA
[05:45:07] 3rdparty/ps-lite/include/dmlc/logging.h:276: [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]
terminate called after throwing an instance of 'dmlc::Error'
what(): [05:45:07] src/./rdma_transport.h:144: Check failed: mr ibv_reg_mr failed: Cannot allocate memory
You can try to reduce BYTEPS_RDMA_START_DEPTH (default 128) or BYTEPS_RDMA_RX_DEPTH (default 2048)
Stack trace returned 7 entries:
[bt] (0) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x2299c) [0x7fea2a44a99c]
[bt] (1) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x22ddd) [0x7fea2a44addd]
[bt] (2) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x77650) [0x7fea2a49f650]
[bt] (3) /usr/local/lib/python3.6/dist-packages/byteps-0.2.4-py3.6-linux-x86_64.egg/byteps/server/c_lib.cpython-36m-x86_64-linux-gnu.so(+0x7877b) [0x7fea2a4a077b]
[bt] (4) /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xbd66f) [0x7fea29b2866f]
[bt] (5) /lib/x86_64-linux-gnu/libpthread.so.0(+0x76db) [0x7fea2cebc6db]
[bt] (6) /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fea2d1f588f]

server 0:
BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1

server 1:
BytePS launching server
[04:09:19] byteps/server/server.cc:430: BytePS server engine uses 4 threads, consider increasing BYTEPS_SERVER_ENGINE_THREAD for higher performance
[04:09:19] src/postoffice.cc:19: Creating Van: ibverbs
[04:09:20] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[04:09:35] src/./rdma_van.h:893: OnDisconnected from Node 1

worker 0:
BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=IPC
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1 worker 1 BytePS launching worker
[05:45:06] src/postoffice.cc:19: Creating Van: ibverbs
[05:45:06] src/./rdma_van.h:234: Connect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:806: OnConnect to Node 1 with Transport=RDMA
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1
[05:45:07] src/./rdma_van.h:893: OnDisconnected from Node 1

I tried reducing BYTEPS_RDMA_START_DEPTH and BYTEPS_RDMA_RX_DEPTH. Any suggestions?
I want to run 1 worker and 1 server, but when I use the following command to start the server, I get an error. Has anyone met the same error?
export BYTEPS_LOG_LEVEL=INFO
export BYTEPS_ENABLE_IPC=1
export DMLC_ENABLE_RDMA=1
export DMLC_NUM_WORKER=2
export DMLC_ROLE=server
export DMLC_NUM_SERVER=1
export DMLC_INTERFACE=ens6f1
export DMLC_PS_ROOT_URI=172.168.30.25
export DMLC_PS_ROOT_PORT=9000
bpslaunch
The error is as below:
terminate called after throwing an instance of 'dmlc::Error'
what(): [16:23:05] src/./rdma_transport.h:130: Check failed: mr ibv_reg_mr failed: Cannot allocate memory, i=941, kMempoolChunkSize=56
I don't use Docker.
Do I need to use docker pull bytepsimage/tensorflow to start correctly?
Can I start BytePS without Docker?
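One thing worth checking outside Docker (an assumption, not confirmed in this thread): the container commands above pass --ulimit memlock=-1, and ibv_reg_mr returns "Cannot allocate memory" when the locked-memory limit is too low, so the bare-metal equivalent would be:

ulimit -l unlimited   # lift the locked-memory cap before running bpslaunch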