Tensorboard callback write to mounted S3 path fails with GPU #5676

Open
@Atharex

Environment information (required)

Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da

--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=9, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='28c43cd82c2f', release='3.10.0-1160.42.2.el7.x86_64', version='#1 SMP Tue Sep 7 14:49:57 UTC 2021', machine='x86_64')
INFO: sys.getwindowsversion(): N/A

--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None

--- check: installed_packages
INFO: installed: tensorboard==2.8.0
INFO: installed: tensorflow-gpu==2.8.0
INFO: installed: tf-estimator-nightly==2.8.0.dev2021122109
INFO: installed: tensorboard-data-server==0.6.1

--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.8.0'

--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.8.0'
INFO: tensorflow.__git_version__: 'v2.8.0-rc1-32-g3f878cff5b6'

--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.8/site-packages/tensorboard_data_server/bin/server'
INFO: failed to check binary version: Command '['/usr/local/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--version']' returned non-zero exit status 1.

--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'

--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]

--- check: readable_fqdn
INFO: socket.getfqdn(): '28c43cd82c2f'

--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: .tensorboard-info directory does not exist

--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.8/site-packages']; bad_roots (0): []

--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.0.0
astunparse==1.6.3
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.2
cloudpickle==2.0.0
cuda-python==11.5.0
cudf==22.4.0
cupy==9.5.0
cupy-cuda115==10.3.1
cycler==0.11.0
Cython==0.29.24
dask==2022.3.0
dask-cudf==22.4.0
distributed==2022.3.0
distro==1.7.0
dlpack==0.1
fastavro==1.4.5
fastrlock==0.8
flatbuffers==2.0
fsspec==2022.3.0
gast==0.5.3
google-auth==2.6.4
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.0
grpcio==1.44.0
h5py==3.6.0
HeapDict==1.0.1
idna==3.3
importlib-metadata==4.11.3
Jinja2==3.1.1
joblib==1.1.0
keras==2.8.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.2
libclang==13.0.0
llvmlite==0.37.0
locket==0.2.1
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.2.1
msgpack==1.0.3
numba==0.54.1
numpy==1.20.0
nvidia-dali-cuda110==1.12.0
nvidia-dali-tf-plugin-cuda110==1.12.0
nvtx==0.2.3
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==21.3
pandas==1.2.4
partd==1.2.0
pathlib2==2.3.7.post1
Pillow==9.0.1
pip==20.2.4
protobuf==3.20.0
psutil==5.9.0
pyarrow @ file:///opt/apache-arrow/python/dist/pyarrow-6.0.1-cp38-cp38-linux_x86_64.whl
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.8
python-dateutil==2.8.2
pytz==2022.1
PyYAML==5.3.1
requests==2.27.1
requests-oauthlib==1.3.1
rmm==21.12.0
rsa==4.8
scikit-build==0.14.1
scikit-learn==1.0.2
scipy==1.8.0
setuptools==61.3.1
six==1.16.0
sortedcontainers==2.4.0
tblib==1.7.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-gpu==2.8.0
tensorflow-io==0.24.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.1.0
toolz==0.11.2
tornado==6.1
typing-extensions==4.1.1
urllib3==1.26.9
verta==0.17.0
Werkzeug==2.1.1
wheel==0.37.1
wrapt==1.14.0
zict==2.1.0
zipp==3.8.0

Issue description

I'm running a TensorFlow training job that writes TensorBoard logs to an rclone-mounted S3 storage location (my own MinIO installation).
The callback code for TensorBoard is as follows:

import os
import time

import tensorflow as tf

# One timestamped log directory per run, under the rclone-mounted bucket.
logdir = os.path.join(TENSORBOARD_PATH, time.strftime("%Y%m%d-%H%M%S"))
file_writer = tf.summary.create_file_writer(os.path.join(logdir, "metrics"))

# Log training parameters (parameter_summaries is a helper defined elsewhere in the project).
parameter_summaries(writer=file_writer,
                    consts=[EPOCHS, BATCH_SIZE, PERIOD, INITAL_EPOCH, LEARNING_RATE, OPTIMIZER_TYPE],
                    names=["epochs", "batch_size", "period", "initial_epoch", "learning_rate",
                           "optimizer_type"],
                    step=0)

# Callbacks: TensorBoard logging, early stopping and periodic weight checkpoints.
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=logdir)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min')
checkpointer = tf.keras.callbacks.ModelCheckpoint(monitor='val_loss', filepath=CHECKPOINTS_PATH,
                                                  save_weights_only=True, period=PERIOD, mode='min')
callbacks_list = [checkpointer, early_stopping, tensorboard]

This code runs in Kubernetes, in a container that mounts the remote location at runtime:

rclone mount minio:${TENSORBOARD_BUCKET} /${TENSORBOARD_BUCKET} --daemon && \
model_training.py
umount /${TENSORBOARD_BUCKET}"]

When it is executed on the CPU, the code and logging run without problems:

Epoch 19/20
1501/1501 [==============================] - 22s 15ms/step - loss: 0.0342 - accuracy: 0.9886 - precision: 1.0000 - recall: 0.9983 - f1_score: 0.9991 - val_loss: 0.0098 - val_accuracy: 0.9972 - val_precision: 1.0000 - val_recall: 0.9993 - val_f1_score: 0.9996
Epoch 20/20
1501/1501 [==============================] - 22s 15ms/step - loss: 0.0323 - accuracy: 0.9899 - precision: 1.0000 - recall: 0.9984 - f1_score: 0.9992 - val_loss: 0.0101 - val_accuracy: 0.9968 - val_precision: 1.0000 - val_recall: 0.9996 - val_f1_score: 0.9998
1501/1501 [==============================] - 8s 4ms/step - loss: 0.0095 - accuracy: 0.9972 - precision: 1.0000 - recall: 0.9998 - f1_score: 0.9999
313/313 [==============================] - 1s 4ms/step - loss: 0.0279 - accuracy: 0.9917 - precision: 1.0000 - recall: 0.9981 - f1_score: 0.9990
2022-04-19 13:28:02.400727: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
uploading part 1
upload complete (custom_modules)
uploading part 1
upload complete (model.pkl)
uploading part 1
upload complete (model_api.json)

But when I use the GPU, the tensorboard logging part fails:

2022-04-20 05:51:15,199 - INFO - model_training.py - Using distributed MirroredStrategy.
2022-04-20 05:51:15.200018: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-20 05:51:16.059450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9652 MB memory:  -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:1a:00.0, compute capability: 7.5
2022-04-20 05:51:16.060150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9650 MB memory:  -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/5
2022-04-20 05:51:37.248551: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8302
2022-04-20 05:51:37.672889: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8302
1427/1501 [===========================>..] - ETA: 0s - loss: 0.5082 - accuracy: 0.8353 - precision: 0.952022-04-20 05:51:58.708918: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at summary_kernels.cc:114 : FAILED_PRECONDITION: /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2; Bad file descriptor
        Failed to flush 5 events to /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2
        Could not flush events file.
    trained_model = trainer.train(train_params)
  File "/usr/local/lib/python3.8/site-packages/mnist_e2e/training/trainer.py", line 115, in train
    model.fit(train_ds.repeat().batch(BATCH_SIZE),
  File "/usr/local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
    raise e.with_traceback(filtered_tb) from None
  File "/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
    raise core._status_to_exception(e) from None  # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.FailedPreconditionError: /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2; Bad file descriptor
        Failed to flush 5 events to /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2
        Could not flush events file. [Op:FlushSummaryWriter]
1501/1501 [==============================] - ETA: 0s - loss: 0.4948 - accuracy: 0.8399 - precision: 0.9560 - recall: 0.8648 - f1_score: 0.9015
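
One way to check whether the mount itself rejects repeated flushes, independently of TensorFlow, might be a small probe run inside the container against a scratch file on the mounted path. This is only a rough stand-in under stated assumptions: `probe_append_flush` is a hypothetical helper, and it uses plain Python file I/O rather than TensorFlow's events writer, so a clean result does not rule out the mount as the cause.

```python
import os


def probe_append_flush(path, n_writes=100, payload=b"x" * 1024):
    """Append `payload` to `path` n_writes times, flushing and fsyncing each time.

    Returns None if every write succeeded, otherwise the OSError that was
    raised (errno EBADF corresponds to "Bad file descriptor").
    """
    try:
        with open(path, "ab") as f:
            for _ in range(n_writes):
                f.write(payload)
                f.flush()             # push Python's buffer to the OS
                os.fsync(f.fileno())  # ask the OS (and the FUSE mount) to persist it
    except OSError as exc:
        return exc
    return None
```

For example, `probe_append_flush("/tensorboard-bucket/probe.bin")` could be called once on a CPU node and once on a GPU node to compare the results; the file name here is illustrative.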

If I run the code on GPU and copy the TensorBoard logs to the bucket afterwards, the workflow works normally. What I would like, though, is to monitor a TensorFlow run while it executes, with TensorBoard reading from the mounted S3 bucket. This works for a CPU run, but with a GPU I guess files are being written to the mounted disk too quickly, or too many at a time? Are there any suggestions for how this could be fixed, or is this approach infeasible?
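
Since copying the logs after the run already works, an incremental version of that copy might be a middle ground: write the event files to local disk and copy them onto the mount at the end of every epoch. The sketch below is untested against this setup and rests on assumptions: `sync_logs` and `LOCAL_LOGDIR` are hypothetical names, and an event file copied mid-run may end in a partially written record.

```python
import shutil
from pathlib import Path


def sync_logs(local_dir, mounted_dir):
    """Copy every file under local_dir into mounted_dir, overwriting older copies.

    Each file is written in one pass, so no file handle stays open on the
    mount between calls. Returns the list of destination paths written.
    """
    local_dir, mounted_dir = Path(local_dir), Path(mounted_dir)
    copied = []
    for src in sorted(local_dir.rglob("*")):
        if src.is_file():
            dst = mounted_dir / src.relative_to(local_dir)
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.copyfile(src, dst)
            copied.append(dst)
    return copied


# Hypothetical wiring into the callbacks above (LOCAL_LOGDIR is an assumed
# local path such as a directory under /tmp; it is not part of the original code):
#   tensorboard = tf.keras.callbacks.TensorBoard(log_dir=LOCAL_LOGDIR)
#   sync = tf.keras.callbacks.LambdaCallback(
#       on_epoch_end=lambda epoch, logs: sync_logs(LOCAL_LOGDIR, logdir))
#   callbacks_list = [checkpointer, early_stopping, tensorboard, sync]
```

With this arrangement the TensorBoard instance reading from the bucket would lag by at most one epoch, and the training process would never flush directly to the mount.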
