
BYTEPS_ENABLE_ASYNC=1 produces an incorrect result #142

Open
bobroute opened this issue Nov 6, 2019 · 1 comment
Labels
bug Something isn't working

Comments

@bobroute

bobroute commented Nov 6, 2019

Describe the bug
When I set

export BYTEPS_ENABLE_ASYNC=1

and run the demo MNIST training, it produces an incorrect result like this:

BytePS: enable asynchronous training
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Create CheckpointSaverHook.
[12:06:44] src/customer.cc:368: Do not use thread pool for receiving.
[12:06:44] src/./zmq_van.h:61: BYTEPS_ZMQ_MAX_SOCKET set to 1024
[12:06:44] src/./zmq_van.h:66: BYTEPS_ZMQ_NTHREADS set to 4
[12:06:44] src/van.cc:357: Bind to role=worker, ip=10.0.0.1, port=21127, is_recovery=0
[12:06:44] src/./zmq_van.h:286: Start ZMQ recv thread
INFO:tensorflow:Graph was finalized.
2019-11-05 12:06:45.071764: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 0 with properties:
name: Tesla P40 major: 6 minor: 1 memoryClockRate(GHz): 1.531
pciBusID: 0000:04:00.0
totalMemory: 22.38GiB freeMemory: 22.13GiB
2019-11-05 12:06:45.071808: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0
2019-11-05 12:06:45.071844: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-11-05 12:06:45.071858: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0
2019-11-05 12:06:45.071870: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N
2019-11-05 12:06:45.072599: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 21532 MB memory) -> physical GPU (device: 0, name: Tesla P40, p
ci bus id: 0000:04:00.0, compute capability: 6.1)
INFO:tensorflow:Running local_init_op.
INFO:tensorflow:Done running local_init_op.
INFO:tensorflow:Saving checkpoints for 0 into ./checkpoints/model.ckpt.
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/./zmq_van.h:286: Start ZMQ recv thread
[12:07:13] src/van.cc:306: W[11] is connected to others
INFO:tensorflow:loss = 2.3320873, step = 0
INFO:tensorflow:loss = 2.3230207, step = 0
INFO:tensorflow:loss = 2.3025851, step = 10 (0.428 sec)
INFO:tensorflow:loss = 2.3025851, step = 10 (0.430 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 20 (0.314 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 30 (0.297 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 40 (0.318 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 50 (0.305 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.311 sec)
INFO:tensorflow:loss = 2.3025851, step = 60 (0.312 sec)

It looks like the loss is incorrect: after the first step it stays fixed at 2.3025851 on both workers and never changes. Can someone explain this?
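One observation that may help with debugging (my own analysis, not from the BytePS team): the stuck value 2.3025851 is exactly ln(10), which is the cross-entropy a 10-class classifier reports when it outputs a uniform distribution over the MNIST classes. That suggests the parameters are effectively never being updated after the first step. A quick sanity check in plain Python:

```python
import math

# Cross-entropy of a uniform prediction over 10 classes:
# -log(1/10) = ln(10), which matches the frozen loss in the log.
num_classes = 10
uniform_cross_entropy = -math.log(1.0 / num_classes)
print(round(uniform_cross_entropy, 7))  # 2.3025851
```

If the loss is pinned at this value, the model is no better than random guessing, so the async updates pushed to the server seem to be lost or never applied.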

@ymjiang
Member

ymjiang commented Nov 6, 2019

Thank you for reporting this. We will try to reproduce and figure it out.

@ymjiang ymjiang added the bug Something isn't working label Nov 6, 2019