Protocol "grpc+verbs": a PS node hit a core dump. #28

wu-yy · 2020-08-04T03:19:38Z

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Linux CentOS 7.6
TensorFlow installed from (source or binary): source
TensorFlow version: 1.15.0
Python version: 2.7
Installed using virtualenv? pip? conda?: No
CUDA/cuDNN version: No
GPU model and memory: No
Worker job number: 50
PS job number: 10
Chief job number: 1

Describe the problem

I ran an experiment using TensorFlow with Verbs for communication across multiple workers, i.e. with the protocol "grpc+verbs". The cluster consists of 50 worker nodes, 10 PS nodes, and 1 chief node. At the end of training, all 50 workers stopped normally, but one of the 10 PS nodes crashed with a core dump. The other 9 PS nodes and the chief node stopped normally.
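For anyone trying to reproduce this, the cluster layout described above can be sketched roughly as follows. This is only an illustration, not the actual training script: the hostnames, the port, and the helper names `build_cluster` and `start_ps` are hypothetical, and it assumes a TensorFlow 1.x build in which the "grpc+verbs" server protocol is available.

```python
import json


def build_cluster(num_workers=50, num_ps=10, num_chief=1, base_port=2222):
    """Return a cluster dict in the shape tf.train.ClusterSpec accepts.

    The hostnames and port are hypothetical placeholders, not taken
    from the report; only the job counts come from the issue.
    """
    def hosts(job, count):
        return ["%s-%d.example.internal:%d" % (job, i, base_port)
                for i in range(count)]

    return {
        "chief": hosts("chief", num_chief),
        "ps": hosts("ps", num_ps),
        "worker": hosts("worker", num_workers),
    }


def start_ps(cluster, task_index):
    """Start one PS task over grpc+verbs and block until it is stopped."""
    # Imported lazily so the cluster layout can be inspected without TensorFlow.
    # Assumption: a TF 1.x build with Verbs support compiled in.
    import tensorflow as tf

    server = tf.train.Server(
        tf.train.ClusterSpec(cluster),
        job_name="ps",
        task_index=task_index,
        protocol="grpc+verbs",
    )
    server.join()


if __name__ == "__main__":
    cluster = build_cluster()
    # Print the number of tasks per job as a quick sanity check.
    counts = {job: len(addrs) for job, addrs in cluster.items()}
    print(json.dumps(counts, sort_keys=True))
```

Each PS process would call something like `start_ps(cluster, task_index)` with its own index from 0 to 9. The crash reported here happened in only one of those ten processes.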

When I used gdb to print the backtrace (bt) from the core dump file, the output for the PS node was as follows:
