
BYTEPS_ENABLE_ASYNC=1 produces an incorrect result #142

Open
bobroute opened this issue Nov 6, 2019 · 1 comment
Labels
bug Something isn't working

Comments

@bobroute

bobroute commented Nov 6, 2019

Describe the bug
When I set

export BYTEPS_ENABLE_ASYNC=1

and run the demo MNIST training, it produces an incorrect result like this:

BytePS: enable asynchronous training
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Create CheckpointSaverHook.
[12:06:44] src/customer.cc:368: Do not use thread pool for receiving.
[12:06:44] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[12:06:44] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[12:06:44] src/van.cc:357: Bind to role=worker, ip=10.0.0.1, port=21127, is_recovery=0
[12:06:44] src/./zmq_van.h:286: Start ZMQ recv thread
INFO:tensorflow:Graph was finalized.
2019-11-05 12:06:45.071764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 22.38GiB freeMemory: 22.13GiB
2019-11-05 12:06:45.071808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-05 12:06:45.071844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 12:06:45.071858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-11-05 12:06:45.071870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-11-05 12:06:45.072599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21532 MB memory) -> physical GPU (device: 0, name: Tesla P40, p
ci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/van.cc:306: W[11] is connected to others
INFO:tensorflow:loss = 2.3320873, step = 0
INFO:tensorflow:loss = 2.3230207, step = 0
INFO:tensorflow:loss = 2.3025851, step = 10 (0.428 sec)
INFO:tensorflow:loss = 2.3025851, step = 10 (0.430 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.311 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.312 sec)

It looks like the loss is incorrect: after the first step it stays fixed at 2.3025851 on both workers and never changes. Can someone explain this?
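One observation that may help with debugging (my own analysis, not from the BytePS team): the stuck value 2.3025851 is exactly ln(10), which is the cross-entropy a 10-class classifier reports when it outputs a uniform distribution over the MNIST classes. That suggests the parameters are effectively never being updated after the first step. A quick sanity check in plain Python:

```python
import math

# Cross-entropy of a uniform prediction over 10 classes:
# -log(1/10) = ln(10), which matches the frozen loss in the log.
num_classes = 10
uniform_cross_entropy = -math.log(1.0 / num_classes)
print(round(uniform_cross_entropy, 7))  # 2.3025851
```

If the loss is pinned at this value, the model is no better than random guessing, so the async updates pushed to the server seem to be lost or never applied.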

@ymjiang
Member

ymjiang commented Nov 6, 2019

Thank you for reporting this. We will try to reproduce and figure it out.

@ymjiang ymjiang added the bug Something isn't working label Nov 6, 2019