Description
Environment information (required)
Diagnostics
Diagnostics output
--- check: autoidentify
INFO: diagnose_tensorboard.py version e43767ef2b648d0d5d57c00f38ccbd38390e38da
--- check: general
INFO: sys.version_info: sys.version_info(major=3, minor=8, micro=9, releaselevel='final', serial=0)
INFO: os.name: posix
INFO: os.uname(): posix.uname_result(sysname='Linux', nodename='28c43cd82c2f', release='3.10.0-1160.42.2.el7.x86_64', version='#1 SMP Tue Sep 7 14:49:57 UTC 2021', machine='x86_64')
INFO: sys.getwindowsversion(): N/A
--- check: package_management
INFO: has conda-meta: False
INFO: $VIRTUAL_ENV: None
--- check: installed_packages
INFO: installed: tensorboard==2.8.0
INFO: installed: tensorflow-gpu==2.8.0
INFO: installed: tf-estimator-nightly==2.8.0.dev2021122109
INFO: installed: tensorboard-data-server==0.6.1
--- check: tensorboard_python_version
INFO: tensorboard.version.VERSION: '2.8.0'
--- check: tensorflow_python_version
INFO: tensorflow.__version__: '2.8.0'
INFO: tensorflow.__git_version__: 'v2.8.0-rc1-32-g3f878cff5b6'
--- check: tensorboard_data_server_version
INFO: data server binary: '/usr/local/lib/python3.8/site-packages/tensorboard_data_server/bin/server'
INFO: failed to check binary version: Command '['/usr/local/lib/python3.8/site-packages/tensorboard_data_server/bin/server', '--version']' returned non-zero exit status 1.
--- check: tensorboard_binary_path
INFO: which tensorboard: b'/usr/local/bin/tensorboard\n'
--- check: addrinfos
socket.has_ipv6 = True
socket.AF_UNSPEC = <AddressFamily.AF_UNSPEC: 0>
socket.SOCK_STREAM = <SocketKind.SOCK_STREAM: 1>
socket.AI_ADDRCONFIG = <AddressInfo.AI_ADDRCONFIG: 32>
socket.AI_PASSIVE = <AddressInfo.AI_PASSIVE: 1>
Loopback flags: <AddressInfo.AI_ADDRCONFIG: 32>
Loopback infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('127.0.0.1', 0))]
Wildcard flags: <AddressInfo.AI_PASSIVE: 1>
Wildcard infos: [(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('0.0.0.0', 0)), (<AddressFamily.AF_INET6: 10>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('::', 0, 0, 0))]
--- check: readable_fqdn
INFO: socket.getfqdn(): '28c43cd82c2f'
--- check: stat_tensorboardinfo
INFO: directory: /tmp/.tensorboard-info
INFO: .tensorboard-info directory does not exist
--- check: source_trees_without_genfiles
INFO: tensorboard_roots (1): ['/usr/local/lib/python3.8/site-packages']; bad_roots (0): []
--- check: full_pip_freeze
INFO: pip freeze --all:
absl-py==1.0.0
astunparse==1.6.3
cachetools==5.0.0
certifi==2021.10.8
charset-normalizer==2.0.12
click==8.1.2
cloudpickle==2.0.0
cuda-python==11.5.0
cudf==22.4.0
cupy==9.5.0
cupy-cuda115==10.3.1
cycler==0.11.0
Cython==0.29.24
dask==2022.3.0
dask-cudf==22.4.0
distributed==2022.3.0
distro==1.7.0
dlpack==0.1
fastavro==1.4.5
fastrlock==0.8
flatbuffers==2.0
fsspec==2022.3.0
gast==0.5.3
google-auth==2.6.4
google-auth-oauthlib==0.4.6
google-pasta==0.2.0
googleapis-common-protos==1.56.0
grpcio==1.44.0
h5py==3.6.0
HeapDict==1.0.1
idna==3.3
importlib-metadata==4.11.3
Jinja2==3.1.1
joblib==1.1.0
keras==2.8.0
Keras-Preprocessing==1.1.2
kiwisolver==1.4.2
libclang==13.0.0
llvmlite==0.37.0
locket==0.2.1
Markdown==3.3.6
MarkupSafe==2.1.1
matplotlib==3.2.1
msgpack==1.0.3
numba==0.54.1
numpy==1.20.0
nvidia-dali-cuda110==1.12.0
nvidia-dali-tf-plugin-cuda110==1.12.0
nvtx==0.2.3
oauthlib==3.2.0
opt-einsum==3.3.0
packaging==21.3
pandas==1.2.4
partd==1.2.0
pathlib2==2.3.7.post1
Pillow==9.0.1
pip==20.2.4
protobuf==3.20.0
psutil==5.9.0
pyarrow @ file:///opt/apache-arrow/python/dist/pyarrow-6.0.1-cp38-cp38-linux_x86_64.whl
pyasn1==0.4.8
pyasn1-modules==0.2.8
pyparsing==3.0.8
python-dateutil==2.8.2
pytz==2022.1
PyYAML==5.3.1
requests==2.27.1
requests-oauthlib==1.3.1
rmm==21.12.0
rsa==4.8
scikit-build==0.14.1
scikit-learn==1.0.2
scipy==1.8.0
setuptools==61.3.1
six==1.16.0
sortedcontainers==2.4.0
tblib==1.7.0
tensorboard==2.8.0
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow-gpu==2.8.0
tensorflow-io==0.24.0
tensorflow-io-gcs-filesystem==0.24.0
termcolor==1.1.0
tf-estimator-nightly==2.8.0.dev2021122109
threadpoolctl==3.1.0
toolz==0.11.2
tornado==6.1
typing-extensions==4.1.1
urllib3==1.26.9
verta==0.17.0
Werkzeug==2.1.1
wheel==0.37.1
wrapt==1.14.0
zict==2.1.0
zipp==3.8.0
Issue description
I'm running a TensorFlow training job that writes TensorBoard logs to an rclone-mounted S3 storage location (my own MinIO installation).
The callback code for TensorBoard is as follows:
logdir = os.path.join(TENSORBOARD_PATH, time.strftime("%Y%m%d-%H%M%S"))
file_writer = tf.summary.create_file_writer(logdir + "/metrics")
# log training metrics
parameter_summaries(writer=file_writer,
                    consts=[EPOCHS, BATCH_SIZE, PERIOD, INITAL_EPOCH, LEARNING_RATE, OPTIMIZER_TYPE],
                    names=["epochs", "batch_size", "period", "initial_epoch", "learning_rate",
                           "optimizer_type"], step=0)
tensorboard = tf.keras.callbacks.TensorBoard(log_dir=logdir)
early_stopping = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=3, mode='min')
checkpointer = tf.keras.callbacks.ModelCheckpoint(monitor='val_loss', filepath=CHECKPOINTS_PATH,
                                                  save_weights_only=True, period=PERIOD, mode='min')
callbacks_list = [checkpointer, early_stopping, tensorboard]
This code runs in Kubernetes, in a container that mounts the remote location for the duration of its runtime:
rclone mount minio:${TENSORBOARD_BUCKET} /${TENSORBOARD_BUCKET} --daemon && \
model_training.py
umount /${TENSORBOARD_BUCKET}"]
When it is executed on the CPU, the code and the logging run without problems:
Epoch 19/20
1501/1501 [==============================] - 22s 15ms/step - loss: 0.0342 - accuracy: 0.9886 - precision: 1.0000 - recall: 0.9983 - f1_score: 0.9991 - val_loss: 0.0098 - val_accuracy: 0.9972 - val_precision: 1.0000 - val_recall: 0.9993 - val_f1_score: 0.9996
Epoch 20/20
1501/1501 [==============================] - 22s 15ms/step - loss: 0.0323 - accuracy: 0.9899 - precision: 1.0000 - recall: 0.9984 - f1_score: 0.9992 - val_loss: 0.0101 - val_accuracy: 0.9968 - val_precision: 1.0000 - val_recall: 0.9996 - val_f1_score: 0.9998
1501/1501 [==============================] - 8s 4ms/step - loss: 0.0095 - accuracy: 0.9972 - precision: 1.0000 - recall: 0.9998 - f1_score: 0.9999
313/313 [==============================] - 1s 4ms/step - loss: 0.0279 - accuracy: 0.9917 - precision: 1.0000 - recall: 0.9981 - f1_score: 0.9990
2022-04-19 13:28:02.400727: W tensorflow/python/util/util.cc:368] Sets are not currently considered sequences, but this may change in the future, so consider avoiding using them.
uploading part 1
upload complete (custom_modules)
uploading part 1
upload complete (model.pkl)
uploading part 1
upload complete (model_api.json)
But when I use the GPU, the tensorboard logging part fails:
2022-04-20 05:51:15,199 - INFO - model_training.py - Using distributed MirroredStrategy.
2022-04-20 05:51:15.200018: I tensorflow/core/platform/cpu_feature_guard.cc:151] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2022-04-20 05:51:16.059450: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 9652 MB memory: -> device: 0, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:1a:00.0, compute capability: 7.5
2022-04-20 05:51:16.060150: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1525] Created device /job:localhost/replica:0/task:0/device:GPU:1 with 9650 MB memory: -> device: 1, name: NVIDIA GeForce RTX 2080 Ti, pci bus id: 0000:68:00.0, compute capability: 7.5
WARNING:tensorflow:`period` argument is deprecated. Please use `save_freq` to specify the frequency in number of batches seen.
Epoch 1/5
2022-04-20 05:51:37.248551: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8302
2022-04-20 05:51:37.672889: I tensorflow/stream_executor/cuda/cuda_dnn.cc:368] Loaded cuDNN version 8302
1427/1501 [===========================>..] - ETA: 0s - loss: 0.5082 - accuracy: 0.8353 - precision: 0.95
2022-04-20 05:51:58.708918: W tensorflow/core/framework/op_kernel.cc:1745] OP_REQUIRES failed at summary_kernels.cc:114 : FAILED_PRECONDITION: /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2; Bad file descriptor
Failed to flush 5 events to /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2
Could not flush events file.
trained_model = trainer.train(train_params)
File "/usr/local/lib/python3.8/site-packages/mnist_e2e/training/trainer.py", line 115, in train
model.fit(train_ds.repeat().batch(BATCH_SIZE),
File "/usr/local/lib/python3.8/site-packages/keras/utils/traceback_utils.py", line 67, in error_handler
raise e.with_traceback(filtered_tb) from None
File "/usr/local/lib/python3.8/site-packages/tensorflow/python/framework/ops.py", line 7186, in raise_from_not_ok_status
raise core._status_to_exception(e) from None # pylint: disable=protected-access
tensorflow.python.framework.errors_impl.FailedPreconditionError: /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2; Bad file descriptor
Failed to flush 5 events to /tensorboard-bucket/mnist-images-e2e/mnist-e2e-11/20220420-055131/train/events.out.tfevents.1650433891.mnist-images-e2e-run-w7s2q-3609785273.158.1.v2
Could not flush events file. [Op:FlushSummaryWriter]
1501/1501 [==============================] - ETA: 0s - loss: 0.4948 - accuracy: 0.8399 - precision: 0.9560 - recall: 0.8648 - f1_score: 0.9015
If I run the code on the GPU and copy the TensorBoard logs over afterwards, the workflow works normally. What I would like, though, is to monitor a TensorFlow run while it executes, through a TensorBoard instance that reads from the mounted remote S3 bucket. This works for a CPU run, but on the GPU I suspect that files are being written to the mounted disk too quickly, or too many at a time. Are there any suggestions for how this could be fixed, or is this approach infeasible?
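For reference, the copy-based variant that works could be made incremental. What follows is a minimal sketch, not something tested on this setup: it assumes TensorBoard events are first written to a local directory and then copied into the mounted bucket path after each epoch. The helper name sync_logs and both directory arguments are hypothetical and not part of the training code above.

```python
import os
import shutil
import tempfile


def sync_logs(local_dir, mounted_dir):
    """Copy every file under local_dir to the same relative path in mounted_dir.

    Existing files in mounted_dir are overwritten with the current local copy.
    Returns the number of files copied.
    """
    copied = 0
    for root, _dirs, files in os.walk(local_dir):
        # Recreate the local sub-directory layout (e.g. "train") on the target.
        rel = os.path.relpath(root, local_dir)
        target_root = os.path.normpath(os.path.join(mounted_dir, rel))
        os.makedirs(target_root, exist_ok=True)
        for name in files:
            # copyfile copies contents only, so no chmod/utime is attempted
            # on the target filesystem.
            shutil.copyfile(os.path.join(root, name),
                            os.path.join(target_root, name))
            copied += 1
    return copied


if __name__ == "__main__":
    # Stand-in directories; in the real job these would be a local log
    # directory and the rclone-mounted bucket path (both assumptions).
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, "local_logs")
        mounted = os.path.join(tmp, "mounted_bucket")
        os.makedirs(os.path.join(local, "train"))
        with open(os.path.join(local, "train", "events.demo"), "wb") as f:
            f.write(b"demo event data")
        print("files copied:", sync_logs(local, mounted))
```

Under these assumptions, the TensorBoard callback would point at the local directory, and the helper could be called from something like tf.keras.callbacks.LambdaCallback(on_epoch_end=lambda epoch, logs: sync_logs(local_logdir, logdir)), where local_logdir is a hypothetical local path. Each file then reaches the mount as one sequential write rather than through repeated flushes of an open events file. Whether that avoids the "Bad file descriptor" error is an open question.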