Describe the bug
When I set export BYTEPS_ENABLE_ASYNC=1 and run the demo MNIST training, it produces an incorrect result like this:
BytePS: enable asynchronous training
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Create CheckpointSaverHook.
[12:06:44] src/customer.cc:368: Do not use thread pool for receiving.
[12:06:44] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[12:06:44] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[12:06:44] src/van.cc:357: Bind to role=worker, ip=10.0.0.1, port=21127, is_recovery=0
[12:06:44] src/./zmq_van.h:286: Start ZMQ recv thread
INFO:tensorflow:Graph was finalized.
2019-11-05 12:06:45.071764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 22.38GiB freeMemory: 22.13GiB
2019-11-05 12:06:45.071808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-05 12:06:45.071844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 12:06:45.071858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-11-05 12:06:45.071870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-11-05 12:06:45.072599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21532 MB memory) -> physical GPU (device: 0, name: Tesla P40, pci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/van.cc:306: W[11] is connected to others
INFO:tensorflow:loss = 2.3320873, step = 0
INFO:tensorflow:loss = 2.3230207, step = 0
INFO:tensorflow:loss = 2.3025851, step = 10 (0.428 sec)
INFO:tensorflow:loss = 2.3025851, step = 10 (0.430 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.311 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.312 sec)
It looks like the loss is incorrect and does not change after the first step. Can someone explain this?
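For reference, the run follows the standard BytePS TensorFlow wiring (which mirrors Horovod's). Below is a minimal sketch of that setup, not the exact demo script: the toy model, batch size, step count, and checkpoint path are illustrative, and it assumes BYTEPS_ENABLE_ASYNC=1 plus the usual DMLC_* variables are exported in the environment before each worker is launched (e.g. via bpslaunch).

# Minimal sketch of a BytePS TensorFlow (1.x) MNIST-style run; illustrative, not the demo itself.
import tensorflow as tf
import byteps.tensorflow as bps

bps.init()  # BytePS reads BYTEPS_ENABLE_ASYNC and the DMLC_* settings from the environment

# Toy input pipeline and model so the sketch is self-contained.
(x_train, y_train), _ = tf.keras.datasets.mnist.load_data()
dataset = tf.data.Dataset.from_tensor_slices(
    (tf.cast(x_train.reshape(-1, 784) / 255.0, tf.float32),
     tf.cast(y_train, tf.int64))).shuffle(10000).batch(64).repeat()
images, labels = dataset.make_one_shot_iterator().get_next()

logits = tf.layers.dense(images, 10)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

opt = tf.train.AdamOptimizer(0.001 * bps.size())  # scale the learning rate by worker count
opt = bps.DistributedOptimizer(opt)               # gradients go through BytePS push/pull
global_step = tf.train.get_or_create_global_step()
train_op = opt.minimize(loss, global_step=global_step)

hooks = [
    bps.BroadcastGlobalVariablesHook(0),          # start all workers from the same weights
    tf.train.StopAtStepHook(last_step=2000),
    tf.train.LoggingTensorHook({'loss': loss}, every_n_iter=10),
]
config = tf.ConfigProto()
config.gpu_options.visible_device_list = str(bps.local_rank())

# Only rank 0 writes checkpoints, as in the example scripts.
checkpoint_dir = './checkpoints' if bps.rank() == 0 else None
with tf.train.MonitoredTrainingSession(checkpoint_dir=checkpoint_dir,
                                       hooks=hooks, config=config) as sess:
    while not sess.should_stop():
        sess.run(train_op)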
Thank you for reporting this. We will try to reproduce and figure it out.